Abstract
The transition from prebiotic chemistry to living systems requires the emergence of a scheme for enzyme-free genetic replication. Here, we analyze a recently proposed prebiotic replication scenario, the so-called Virtual Circular Genome (VCG) [Zhou et al., RNA 27, 1-11 (2021)]: Replication takes place in a pool of oligomers, where each oligomer contains a subsequence of a circular genome, such that the oligomers encode the full genome collectively. While the sequence of the circular genome may be reconstructed based on long oligomers, short oligomers merely act as replication feedstock. We observe a competition between the predominantly error-free ligation of a short oligomer to a long oligomer and the predominantly erroneous ligation of two long oligomers. Increasing the length of long oligomers and reducing their concentration decreases the fraction of erroneous ligations, enabling high-fidelity replication in the VCG. Alternatively, the problem of erroneous products can be mitigated if only monomers are activated, such that each ligation involves at least one monomer. Surprisingly, in such systems, shorter oligomers are extended by monomers more quickly than long oligomers, a phenomenon which has already been observed experimentally [Ding et al., JACS 145, 7504-7515 (2023)]. Our work provides a theoretical explanation for this behavior, and predicts its dependence on system parameters such as the concentration of long oligomers. Taken together, the VCG constitutes a promising scenario of prebiotic information replication: It could mitigate challenges of in non-enzymatic copying via template-directed polymerization, such as short lengths of copied products and high error rates.
Introduction
In order to delineate possible pathways towards the emergence of life, it is necessary to understand how a chemical reaction network capable of storing and replicating genetic information might arise from prebiotic chemistry. RNA is commonly assumed to play a central role on this path, as it can store information in its sequence and catalyze its own replication [1–3]. While ribozymes capable of replicating strands of their own length have been demonstrated in the laboratory [4], it remains elusive how enzyme-free self-replication might have worked before the emergence of such complex ribozymes.
One possible mechanism is template-directed primer extension [5–11]: In this process, a primer hybridizes to a template and is extended by short oligonucleotides, thereby forming a (complementary) copy of the template strand. Considerable progress has been made in optimizing template-directed primer extension, but challenges remain: (i) The produced copies are likely to be incomplete. So far, at most 12 nt have been successfully added to an existing primer [10]. Moreover, as the pool of primer strands needs to emerge via random polymerization, the primer is likely to hybridize to the template at various positions, and not only to its 3’-end, leaving part of the 3’-end of the template uncopied [12]. (ii) Errors in enzyme-free copying are frequent due to the limited thermodynamic discrimination between correct Watson-Crick pairing and mismatches [13–16] The error probability constrains the length of the genome that can be reliably replicated.
The issue of insufficient thermodynamic discrimination can, in principle, be mitigated by making use of kinetic stalling after the incorporation of a mismatch [15]. By introducing a competition between the reduced polymerization rate and a characteristic timescale of the non-equilibrium environment, it is possible to filter correct sequences from incorrect ones [16]. To address the challenge of incomplete copies, Zhou et al. propose an new scenario of replication, the so-called Virtual Circular Genome (VCG) [17]. In this scenario, genetic information is stored in a pool of oligomers that are shorter than the circular genome they collectively encode: Each oligomer bears a subsequence of the circular genome, such that the collection of all oligomers encodes the full circular genome virtually. Within the pool, each oligomer can act as template or primer [17]. The oligomers hybridize to each other and form complexes that allow for templated ligation of two oligomers, or for the extension of an oligomer by a monomer. Because the sequences of the ligated strands and the template are part of the genome sequence, most of the products should also retain the sequence of the genome. That way, long oligomers encoding the circular genome are produced at the expense of short oligomers [17]. The long strands, in turn, can assemble into catalytically active ribozymes. With a continuous influx of short oligomers, the VCG might allow for continuous replication of the virtually encoded circular genome. Importantly, replication in the VCG is expected to avoid the issue of incomplete copies. Since the genome is circular, it does not matter which part of the genome an oligomer encodes, as long as the sequence is compatible with the genome sequence. An additional feature of the VCG scenario is that replication should be achievable without the need of adding many nucleotides to a primer: Provided the concentration of oligomers decreases exponentially with their length, the concentration of each oligomer in the pool can be doubled by extending each oligomer only by a few nucleotides [17]. The extension of an oligomer by a few nucleotides in a VCG pool has already been demonstrated experimentally [18].
A recent computational study points out that the VCG scenario is prone to loss of genetic information via ”sequence scrambling” [19]: If the genome contains identical sequence motifs at multiple different loci, replication in the VCG will mix the sequences of these loci, thus destroying the initially defined genome. More generally, it is currently unclear, which conditions could prevent genome instability of VCGs, such that their genetic information is retained. Length distribution, sequence composition, oligonucleotide concentration and environmental conditions, such as temperature, all affect the stability of complexes and thus the replication dynamics of the VCG pool. Here, we characterize the replication fidelity and yield of VCG pools using a kinetic model, which explicitly incorporates association and dissociation of RNA strands as well as templated ligation. We study a broad spectrum of prebiotically plausible and experimentally accessible oligomer pools, from pools containing only monomers and long oligomers of a single length to pools including a range of long oligomers with uniform or exponential concentration profile. The length of the included oligomers as well as their concentration are free parameters of our model.
We find that, regardless of the details of the pool, there is a competition between three types of template-directed ligation reactions: (i) Ligations of two short oligomers, e.g., monomers, forming oligomers that are too short to specify a unique locus on the genome. (ii) Ligations of a short oligomer to a long oligomer, which form longer oligomers with sequences that are typically compatible to the genome. (iii) Ligations of two long oligomers, which often create oligomers that are incompatible with the genome sequence. Therefore, the fidelity is primarily determined by the competition between the error-free extension of a long oligomer and the erroneous ligation of two long oligomers. The likeli-hood of the latter reaction can be minimized by lowering the relative concentration of long oligomers, even though this increases the fraction of short oligomers that are used up for ligations of the first type. Consequently, the fidelity can be improved at the expense of reducing the yield. The efficiency, meaning the yield attainable at a fixed high fidelity, thus depends on the length distribution of the oligomers in the pool.
Alternatively, the issue of erroneous ligations is solved if only monomers are activated, such that each ligation incorporates only one monomer at a time, as in the experimental study by Ding et al. [18]. In this case, the VCG concentration can be chosen arbitrarily large without compromising fidelity. Interestingly, our model predicts an unexpected feature: In the limit of high VCG concentrations, the rate of primer extension is higher for shorter oligomers than for long oligomers, even though, intuitively, complexes containing longer oligomers are expected to be more stable and thus more productive. The same behavior was indeed observed experimentally [18]. We provide an explanation for this feature, and discuss its dependence on system parameters such as the length and the concentration of long oligomers in the pool.
Model
The VCG is a replication scenario where a circular genome is stored in a pool of oligomers. The oligomers are shorter than the genome that they encode. Each oligomer bears a subsequence of the circular genome, such that, collectively, the oligomers encode the full genome (Fig. 1A). To set up our model for the dynamics of VCG pools, we specify (i) our choice of circular genomes, (ii) how to map the genomes to pools of oligomers, and (iii) which chemical reactions govern the dynamics.
Circular Genomes
For a given genome length, LG, there are distinct circular genomes (the factor 1/2LG accounts for the freedom to select the starting position and to choose the Watson or the Crick strand as reference sequence). A key property of the genome is its minimal unique motif length, LS, defined as the shortest length such that all possible subsequences of length L ≥ LS appear at most once in the genome. In other words, all subsequences of length L ≥ LS specify a unique locus on the genome. In principle, each circular genome has another length scale, corresponding to the maximal motif length, up to which all possible sequence motifs are contained in the genome. Here, for convenience, we restrict ourselves to genomes in which this length-scale is LS − 1, such that all possible subsequences shorter than LS are contained in the sequence of the genome. Additionally, we choose only unbiased genomes in which all possible subsequences of length L < LS are contained at equal frequency.
Construction of VCG Pools
To specify a VCG pool encoding a genomic sequence (chosen as described above), one must select the subsequences to be included in the pool, as well as their concentrations. We consider unbiased pools, where the concentration of subsequences, c(L), depends only on their length, L, i.e., all subsequences of a given length are included at equal concentration. We refer to the length-dependent concentration profile as the length distribution of the pool. Depending on their length, oligomers fall into two categories (Fig. 1B): (i) short feedstock oligomers and (ii) long VCG oligomers. Feedstock oligomers are oligomers that are shorter than the unique subsequence length LS. Since their sequence appears multiple times on the genome, they do not encode a specific position on the genome. Thus, they serve as feedstock for the elongation of VCG oligomers rather than as information storage. Conversely, the VCG oligomers, which are longer than the unique subsequence length LS, have a unique locus on the circular genome. Collectively, the VCG oligomers enable the reconstruction of the full genome.
The full length distribution, c(L), can be decomposed into the contributions of feedstock oligomers and VCG oligomers,
We assume that both types of oligomers have an exponential length distribution. In our model, the concentration of VCG oligomers can be varied independently of the concentration of feedstock oligomers, and the length scales for the exponential decay ( vs.) may differ between VCG oligomers and feedstock oligomers. Additionally, we can restrict the set of oligomer lengths included in the pool by setting minimal and maximal lengths for feedstock oligomers and for VCG oligomers individually,
For any other oligomer length, the concentrations equal zero. This parametrization includes uniform length distributions as a special case (κF = 0 and κV = 0), and also allows for concentration profiles that are peaked. Peaked length distributions can emerge from the interplay of templated ligation, (de)hybridization and outflux in open systems [20]. We define the total concentration of feedstock oligomers, , as well as the total concentration of VCG oligomers, . Their ratio will turn out to be an important determinant of the VCG dynamics.
(De)hybridization Kinetics
Oligomers can hybridize to each other to form double-stranded complexes, or dehybridize from an existing complex. For simplicity, we do not include self-folding within a strand, which is a reasonable assumption for short oligomers. The stability of a complex is determined by its hybridization energy, with lower hybridization energy indicating greater stability. We use a simplified nearest-neighbor energy model to compute the hybridization energy [20–22]: The total energy equals the sum of the energy contributions of all nearest-neighbor blocks in a given complex (Fig. 1C). The energy contribution associated with a block of two Watson-Crick base pairs (matches) is denoted γ < 0, and dangling end blocks involving one Watson-Crick pair and one free base contribute γ/2. Nearest-neighbor blocks involving mismatches increase the hybridization energy by γMM > 0 per block, thus reducing the stability of the complex. The rate constants of hybridization and dehybridization are related via
where c° = 1 M is the standard concentration, and ΔG is the free energy of hybridization. The association rate constant kon is proportional to the encounter rate constant, kenc = 1/(c°t0). The encounter timescale t0 serves as the elementary time unit of the kinetic model, with all reaction timescales measured relative to it.
Templated Ligation
Two oligomers A and B that are hybridized adjacently to each other on a third oligomer C can produce a new oligomer A-B via templated ligation (Fig. 1D). Depending on the length of A and B, we distinguish three types of ligation reactions (Fig. 1E): (i) F+F ligations, in which two feedstock oligomers ligate, (ii) F+V ligations, where a VCG oligomer is extended by a feedstock oligomer, and (iii) V+V ligations involving two VCG oligomers. The formation of a covalent bond via templated ligation is not spontaneous, but requires the presence of an activation reaction. Usually, these reactions add a leaving group to the 5’-end of the nucleotide which is cleaved during bond formation [10, 23]. We assume that the concentration of activating agent is sufficiently high for the activation to be far quicker than the formation of the covalent bond, such that activation and covalent bond formation can be treated as a single effective reaction. We consider pools in which all oligomers are activated, as well as pools in which only monomers are activated.
Observables
Essentially, templated ligation in the pool produces VCG oligomers at the expense of feed-stock oligomers. To characterize these dynamics, we distinguish between different contributions to the production flux. While the product of an F+V ligation is always a VCG oligomer, F+F ligations can produce feed-stock oligomers or VCG oligomers. In both cases, a produced VCG oligomer can be correct (compatible with the genome) or incorrect (incompatible). V+V ligations always produce VCG oligomers that can be correct or incorrect. We quantify these processes by measuring extension fluxes in units of nucleotides ligated to an existing strand (counting the length of the shorter ligated strand as the number of incorporated bases). In particular, we define the fidelity f as the extension flux resulting in correct VCG oligomers relative to the flux resulting in any VCG oligomer,
In addition, we introduce the yield y as the proportion of total extension flux that produces VCG oligomers,
Efficient replication of the VCG requires both, high fidelity and high yield. Hence, we introduce the efficiency of replication η as the product of fidelity and yield,
In order to discern the contributions of different types of templated ligations (F+F, F+V, V+V) to fidelity, yield, and efficiency, we define the ligation share s of a ligation type,
Results
Replication efficiency reaches a maximum at intermediate concentrations of VCG oligomers
To study the dynamics of VCG pools, we start by picking a exemplary genome of length LG = 16 nt,
5’-AAAGAGGACACGGCAU-3’
3’-UUUCUCCUGUGCCGUA-5’.
In this genome, the minimal unique subsequence length is LS = 3 nt. Consequently, monomers as well as dimers are part of the feedstock and VCG oligomers are at least 3 nt long. We first consider the simplest oligomer pool possible, which includes only monomers and VCG oligomers of a single length LV (Fig. 2A).
These initial VCG pools are evolved in time using a stochastic simulation based on the Gillespie algorithm [20–22]. Since the Gillespie algorithm operates on the level of counts of molecules instead of concentrations, we must assign a volume to each system. We choose volumes in the range 1 μm3 to 10 000 μm3 (Supplementary Material Section S-I). In addition to the volume, the reaction rate constants need to be chosen appropriately: (i) Experimentally determined association rate constants are typically around 106 M−1 s−1 [24–26], which corresponds to an association timescale of t0 ≈ 1 μs. (ii) The timescale of dehybridization is computed via Eq. (3) using the energy contribution γ = −2.5 kBT for a matching nearest-neighbor block and γMM = 25.0 kBT in case of mismatches. The high energy penalty of nearest-neighbor blocks involving mismatches, γMM, is chosen to suppress the formation of mismatches, while the value of γ roughly matches the average energy of all matching nearest-neighbor blocks given by the Turner energy model of RNA hybridization [27]. (iii) For templated ligation, we select a reaction rate constant of . This choice of klig is consistent with ligation rates observed in enzyme-free template-directed primer extension experiments, which range from around 10−6 s−1 [9, 10] to roughly 10−4 s−1 [23], depending on the underlying activation chemistry.
Based on the ligation events observed in the time evolution, we compute the observables introduced above (Model section). Due to the small ligation rate constant, the time evolution can only be simulated for a few ligation time units. Consequently, ligation events are scarce, which leads to high variances in the computed observables. We mitigate this issue by calculating the observables based on the concentration of complexes that are in a productive configuration, even if they do not undergo templated ligation within the time window of the simulation (Supplementary Material Section S-I).
From the ligation events, we obtain the total production flux of VCG oligomers and feedstock oligomers. Our observable y (yield) quantifies the fraction of this total flux directed to VCG oligomers. Fig. 2B shows how the yield depends on the concentration of VCG oligomers, , at a fixed total monomer concentration, . Here, the data points (with error bars) represent the simulation results with the different colors corresponding to different choices of initial VCG oligomer length, LV. We observe that the yield increases monotonically with the concentration of VCG oligomers, and approaches 100% for high . The concentration at which the pool reaches a yield of 50% depends on the oligomer length LV: Shorter VCG oligomers require higher concentrations for high yield. The concentration dependence of the yield can be rationalized by the types of templated ligations that are involved. For low VCG concentration, most templated ligation reactions are dimerizations (1+1) with the VCG oligomers merely acting as templates (Fig. 2C). As dimers are shorter than the unique subsequence length LS, their formation does not contribute to the yield, which explains the low yield in the limit of small . Conversely, for high VCG concentration, most of the templated ligations are F+V or V+V ligations, which produce oligomers of length L ≥ LS, implying high yield.
However, Fig. 2C also shows that the relative contribution of V+V ligation increases with increasing , with a large fraction of them producing incorrect products. This reduction in fidelity, f, causes a trade-off between fidelity and yield, which is visible as a peak at intermediate VCG concentrations in the replication efficiency, η = f · y (Fig. 2D). Using hexamers as an example, Fig. 2E illustrates why V+V ligations are prone to forming incorrect oligomers: In order to ensure that two oligomers only ligate if their sequences are adjacent to each other in the true circular genome, both oligomers need to share an overlap of at least LS nucleotides with the template oligomer. Otherwise, the product oligomers might form sequences incompatible with the true genome. The probability of forming incorrect products is a consequence of the combinatorics of possible subsequences in the VCG. Specifically, for a genome with LG = 16 nt, there are 32 different hexamers. For example, if the left educt hexamer only has one nucleotide of overlap with the template, there are 1/4 · 32 = 8 possible educt oligomers. However, only one out of those 8 hexamers is the correct partner for the right educt hexamer, implying an error probability of 1 − 1/8 = 7/8 (first example in Fig. 2E).
Characterizing replication efficiency using simulation data is computationally expensive. Depending on the parameters, up to a hundred simulations, each taking a few days, are needed to obtain a single data point in Fig. 2B-D to sufficient accuracy. To enable rapid computation of the observables of interest over a wide range of parameter values, we introduce an additional approximate approach that relies on (i) the separation of time-scales between (de)hybridization and templated ligation, and (ii) on a coarse-grained representation of the oligomers in the VCG pool. The curves shown together with the simulation data in Fig. 2 are computed using this “adiabatic approach”. The approach is detailed in Supplementary Material Sections S-II and S-III, and only outlined briefly in the following: Due to the slow rate of ligation, the (de)hybridization equilibrium is typically reached long before the first ligation event happens. To determine the hybridization equilibrium, we assume oligomers of the same length to have identical concentration, regardless of their sequence. The concentration of complexes comprised of multiple oligomers is calculated using the law of mass action: It equals the product of the individual oligomer concentrations divided by the respective dissociation constant Kd, which in turn depends on the hybridization energy, ΔG. Depending on the length of the hybridization region, one or multiple hybridization partners may be possible based on the oligomer sequences present in the pool. This is accounted for by weighting the complex concentrations by combinatorial multiplicities. For each oligomer length, we impose a mass conservation constraint ensuring that the equilibrium concentration of free oligomers and the concentration of oligomers bound in complexes adds up to the total concentration in the pool. Solving the law of mass action together with the mass conservation equations numerically yields the equilibrium concentrations of free VCG and feedstock strands, and , for any given total concentration, and . Knowing the equilibrium concentration of free strands allows us to compute concentrations of productive complexes and thus observables characterizing templated ligation.
The results of the adiabatic approach agree well with the data obtained from the simulations (Fig. 2B-D). For instance, the adiabatic approach predicts the same non-monotoneous concentration dependence of replication efficiency as the simulations: VCG pools containing longer oligomers achieve higher maxima in replication efficiency at lower VCG concentration. While the available simulation data only allows for a qualitative description of this trend, the adiabatic approach enables a quantitative analysis. Fig. 2F (solid line) shows the equilibrium concentration ratio at which replication efficiency is maximal, as a function of the VCG oligomer length, LV. As expected from the qualitative trend, pools containing longer oligomers reach their maximum for lower concentration of VCG oligomers. The shaded area indicates the range of VCG concentrations within which the pool’s efficiency deviates by no more than one percent from its optimum. We observe that this range of close-to-optimal VCG concentrations increases with LV. Thus, pools containing longer oligomers require less fine-tuning of the VCG concentration for replication with high efficiency.
In addition to the numerical results, we utilize the adiabatic approach to study the optimal equilibrium VCG concentration analytically (Supplementary Material Section S-V). We find that, for any choice of LV, replication efficiency reaches its maximum when the fractions of dimerization (1+1) reactions and erroneous V+V ligations are equal (Fig. 2C for LV = 6 nt). This criterion can be used to derive a scaling law for the optimal equilibrium concentration ratio as a function of the oligomer length LV,
which is shown as a dashed line in Fig. 2F (the length-scale ΛF+F is defined in Supplementary Material Section S-V). The optimal equilibrium concentration ratio decreases exponentially with LV, while the hybridization energy γ/2 sets the inverse length-scale of the exponential decay. Analytical estimate and numerical solution agree well, as long as the hybridization is weak and oligomers are sufficiently short. For strong binding and long oligomers, complexes involving more than three strands play a non-negligible role, but such complexes are neglected in the analytical approximation.
Fig. 2G shows how the maximal replication efficiency depends on the VCG oligomer length. Consistent with the qualitative trend observed in Fig. 2D, longer oligomers enable higher maxima in replication efficiency. Regardless of the choice of γ, replication efficiency reaches 100% if LV is sufficiently high. Using Eq. (4), we find the following approximation for the maximal replication efficiency attainable at a given oligomer length (see dashed lines in Fig. 2G),
where η° is a genome-dependent constant (Supplementary Material Section S-VI). This equation provides guidance for the construction of VCG pools with high replication efficiency: Given a target efficiency, the necessary oligomer length and hybridization energy, i.e., temperature, can be calculated. In Fig. 2G, we determine the oligomer length necessary to achieve ηmax = 95% for varying hybridization energies γ. At higher temperature (weaker binding), VCG pools require longer oligomers for replication with high efficiency.
Eq. (5) is not restricted to the example genome of length LG = 16 nt introduced previously, but may be applied to genomes of arbitrary length (provided the genomes follow the construction principles outlined in the method section). In such genomes, the minimal unique subsequence length scales logarithmically with the length of the genome, LS ∼ ln(LG). Based on LG and LS, we can compute the genome-dependent quantities ΛF+F and η° via the combinatorial multiplicities of productive complexes, which allow us to compute the characteristic VCG oligomer length at which the system reaches ηmax = 95%. Just like LS, this characteristic VCG oligomer length also increases logarithmically with the genome length (Fig. 2H), allowing VCG pools with moderate oligomer sizes to encode long circular genomes.
Pools containing VCG oligomers of different lengths are dominated by the long oligomers
In the previous section, we characterized the behavior of pools containing VCG oligomers of a single length. We observed that V+V ligations are error-prone due to insufficient overlap between the educt strands and the template, whereas F+V ligations extend the VCG oligomers with high efficiency. The F+V ligations will gradually broaden the length distribution of the VCG pool, raising the question of how this broadening affects the replication behavior. In principle, introducing multiple oligomer lengths into the VCG pool might even improve the fidelity of V+V ligations, since a long VCG oligomer could serve as a template for the correct ligation of two shorter VCG oligomers.
To analyze this question quantitatively, we first consider the simple case of a VCG pool that only contains monomers, tetramers, and octamers. The concentration of monomers is set to c(1) = 0.1 mM, while the concentrations of the VCG oligomers are varied independently (Fig. 3A). Replication efficiency reaches its maximum at c(8) ≈ 0.1 μM and very low tetramer concentration, c(4) ≈ 7.4 pM, effectively resembling a single-length VCG pool containing only octamers. As shown in Fig. 3B, the maximal efficiency is surrounded by a plateau of close-to-optimal efficiency. The octamer concentration can be varied by more than one order of magnitude without significant change in efficiency. Similarly, adding tetramers does not affect efficiency as long as the tetramer concentration does not exceed the octamer concentration. Fig. 3C illustrates that the plateau of close-to-optimal efficiency coincides with the concentration regime where the ligation of a monomer to an octamer with another octamer acting as template, , is the dominant ligation reaction. For high tetramer concentration and intermediate octamer concentration, templated ligation of tetramers on octamer templates, , surpasses the contribution of ligations (green shaded area in Fig. 3C). The reactions give rise to a ridge of increased efficiency, which represents a local maximum, but is small compared to the plateau of close-to-optimal efficiency (Fig. 3B). Even though reactions of the type produce correct products in most cases, they compete with error-prone ligations like or , which reduce fidelity. Increasing the binding affinity, i.e., lowering γ, enhances the contribution of ligations at the expense of ligations, resulting in an increased local maximum in replication efficiency (Supplementary Material Fig. S9). However, very strong hybridization energy, γ = −5.0 kBT, is necessary for ligations to give rise to similar efficiency as ligations. Thus, in this example, high efficiency replication facilitated by the ligation of short VCG oligomers on long VCG oligomers is in theory possible, but requires unrealistically strong binding affinity. Moreover, this mechanism is only effective if educts and template differ significantly with respect to their length. In case short and long oligomers are similar in size, templated ligation of two short oligomers tends to be error-prone, irrespective of the choice of binding affinity (Supplementary Material Fig. S10 and S11). In that case, F+V ligations remain the most reliable replication mechanism.
We observe similar behavior in VCG pools containing a range of oligomer lengths from to (Fig. 4A), again using γ = −2.5 kBT (typical RNA hybridization energy). We maximize replication efficiency with respect to the concentration of VCG oligomers under the assumption that all VCG oligomers of any length have the same concentration (uniform profile). Efficiency again reaches its maximum when F+V ligations of long VCG oligomers are the predominant type of templated ligation (see Fig. 4D, which shows the relative contribution of different ligation types in VCG pools with fixed nt and varying nt). Adding short VCG oligomers to the pool leaves the relative contributions of the dominant ligation types almost unchanged and increases the fraction of unproductive F+F ligations only marginally. Replication efficiency is governed by the behavior of the long oligomers included in the pool. Decreasing for fixed reduces the achievable maximal replication efficiency only slightly. Conversely, lowering the maximal oligomer length, , while keeping fixed causes the efficiency to decrease significantly (Fig. 4B). This is plausible because, for any , replication efficiency is bounded by the replication efficiency of the longest oligomer in the pool (blue curve in Fig. 4B, which corresponds to the orange curve in Fig. 2G). In addition, for small , short and long VCG oligomers are increasingly similar in length, which gives rise to a broad spectrum of erroneous V+V ligations that compete with the error-free F+V ligations (Fig. 4E).
In a realistic prebiotic setup, the concentration profile of the VCG pool would not be uniform. Depending on the mechanism that produces the pool and its coupling to the non-equilibrium environment, it might have a concentration profile that decreases or increases exponentially with length. We use the parameter κV to control this exponential length dependence (Fig. 4A): For negative κV, the concentration increases as a function of length, while exponentially decaying length distributions have positive κV. We find that replication efficiency is high if the concentration of long VCG oligomers exceeds or at least matches the concentration of short VCG oligomers (see κV ≤ 0 in Fig. 4C). In that case, replication efficiency is dominated by the long oligomers in the pool, since these form the most stable complexes. As the concentration of long oligomers is decreased further (κV > 0), the higher stability of complexes formed by longer oligomers is eventually insufficient to compensate for the reduced concentration of long oligomers. Replication efficiency is then governed by short VCG oligomers, which exhibit lower replication efficiency. In the limit κV → ∞, replication efficiency approaches the replication efficiency of a single-length VCG pool containing only oligomers of length (Fig. 4C and Fig. 2G).
Adding dinucleotides to the feedstock decreases the efficiency of replication
So far, we focused on ensembles that contain solely monomers as feedstock. However, examining the influence of dimers on replication in VCG pools is of interest, since dinucleotides have proven to be interesting candidates for enzyme-free RNA copying [9–11, 23]. For this reason, we study oligomer pools like those illustrated in Fig. 5A: The ensemble contains monomers, dimers and VCG oligomers of a single length, LV. As our default scenario, the dimer concentration is set to 10% of the monomer concentration, corresponding to κF = 2.3, but this ratio can be modified by changing κF.
Fig. 5B compares the replication efficiency of a pool with and without dimers . In both cases, pools that are rich in VCG oligomers exhibit low efficiency of replication. As erroneous V+V ligations are the dominant type of reaction in this limit, the same efficiency is obtained regardless of the presence of dimers. In contrast, when the pool is rich in feedstock, pools with and without dimers behave differently: If only monomers are included, the efficiency approaches zero, as dimerizing monomers do not contribute to the yield, and thus not to the efficiency. However, the presence of dimers enables the ligation of monomers and dimers to form trimers and tetramers, which lead to a non-zero yield. Given the low VCG concentration, dimers are more abundant than VCG oligomers and thus more likely to act as templates. Therefore, educt oligomers can only hybridize to the template with a single nucleotide-long hybridization region. Consequently, many products are incompatible with the circular genome, causing low efficiency.
Replication efficiency reaches its maximum at intermediate VCG concentrations, where replication is dominated by F+V ligations. Notably, the maximal attainable efficiency is significantly lower for ensembles with dimers than without, as dimers increase the number and the stability of complex configurations that can form incorrect products (Fig. 5C). For , ligation products are only incorrect if the overlap between the VCG educt oligomer and template is shorter than the unique subsequence length, LS. For , however, dangling end dimers can cause incorrect products even in case of long overlap of educt oligomer and template (see right column in Fig. 5C). The stability of the latter complexes depends on the length of the VCG oligomers, LV, whereas the stability of complexes facilitating incorrect monomer addition is independent of oligomer length (Fig. 5C).
In the presence of dimers, the length-dependent stability of complexes facilitating correct and incorrect F+V ligations causes a competition, which sets an upper bound on the efficiency of replication (Supplementary Material Section S-VIII),
Here, we introduced effective association constants 𝒦a, which depend differently on the VCG oligomer length, LV. While the effective association constant of complexes enabling incorrect 1+V ligations is length-independent, the effective association constant for incorrect 2+V ligations, , scales exponentially with LV,
The effective association constants for correct 1+V and 2+V ligations also scale exponentially with the oligomer length (Supplementary Material Fig. S12).
In systems without dimers, i.e., κF → ∞, ηmax approaches 100%, which is consistent with the behavior observed in the previous sections. Conversely, in systems containing dimers, the maximal efficiency remains at a value below 100%, which depends on the concentration of dimers in the feedstock. Fig. 5C shows the dependence of maximal replication efficiency on the length of VCG oligomers in pools containing monomers, dimers and VCG oligomers. As LV increases, ηmax converges towards the upper bound defined in Eq. (6) (see dashed line in Fig. 5C).
Monomers preferentially ligate to short VCG oligomers in VCG-rich pools
Efficiency of replication is limited by the same mechanism regardless of the specifics of the underlying VCG pool: Oligomers that have an insufficiently long region of hybridization (shorter than LS) to their template are likely to produce an incorrect product upon ligation. In the previous sections, we minimized the contribution of such ligations by fine-tuning the concentration of VCG oligomers. Alternatively, incorrect templated ligations could be suppressed if only monomers are activated, while longer oligomers remain inactive. This scenario has already been investigated experimentally [18]. In a prebiotic setting, this scenario would emerge, for instance, if activated monomers are produced in the environment by an in situ activation chemistry, and subsequently transported into the compartment containing the VCG, while no activation takes place within the compartment [28, 29].
To explore the replication efficiency of this monomer-only activation scenario, we first consider a pool consisting only of activated monomers as well as non-activated dimers and VCG oligomers of a single length (Fig. 6A). We vary the concentration of VCG oligomers, but keep the feedstock concentrations constant. For small VCG concentrations, we observe low efficiencies (Fig. 6B), as the ligation of two monomers, or one monomer and one dimer, are most likely. Conversely, high facilitates the formation of complexes in which VCG oligomers are extended by monomers, which implies high efficiency (Fig. 6B). Note that, unlike in systems where all oligomers are activated, replication efficiency does not decrease for high VCG concentration, as erroneous V+V ligations are impossible. Instead, perfect replication efficiency (100%) is reached for sufficiently high LV.
While replication efficiency characterizes the relative amount of nucleotides used for the correct elongation of VCG oligomers, we can also analyze which fraction of VCG oligomers is in a monomer extension-competent state,
Here, denotes the equilibrium concentration of all complexes enabling the addition of a monomer to any VCG oligomer. We find that r1+V depends on the VCG concentration qualitatively in the same way as the efficiency: r1+V is small for small VCG concentration, butapproaches a value asymptotically for high VCGconcentration (Fig. 6C). In this limit, almost all VCG oligomers are part of a duplex. Monomers can bind to these duplexes to form complexes allowing for 1+V ligations. Since almost all VCG oligomers are already part of a duplex, increasing further does not increase the fraction of VCG oligomers that can be extended by monomers. Instead, the asymptotic value is determined by the concentration of monomers and their binding affinity Kd(1) to an existing duplex (Supplementary Material Section S-IX),
In contrast to the more complex case considered further below, here, the asymptotic value does not depend on the length of the VCG oligomers. However, the threshold VCG concentration at which scales exponentially with LV (Supplementary Material Section S-IX),
This scaling implies that longer oligomers require exponentially lower VCG concentration to achieve a given ratio r1+V (Fig. 6C), as their greater length allows them to form more stable complexes. This observation implies that pools with longer oligomers will always be more productive than pools with shorter oligomers (at equal VCG concentration).
The behavior becomes more complex in pools containing VCG oligomers of multiple lengths, due to the competitive binding within such heterogeneous pools. To illustrate this, we examine an ensemble containing VCG oligomers ranging from to along with the same feedstock as previously (activated monomers and non-activated dimers, see Fig. 7A). For simplicity, we assume that the length-distribution of VCG oligomers is uniform. We study the fraction of oligomers in a monomer extension-competent state as a function of oligomer length,
where denotes the equilibrium concentration of all complexes that allow monomer-extension of VCG oligomers of length L. At low VCG concentration, longer oligomers are more likely to be extended by monomers than shorter ones (Fig. 7B). This behavior is intuitive, as longer oligomers tend to form more stable complexes, which lead to higher productivity. Surprisingly, increasing the VCG concentration reverses the length-dependence of the productivity, such that short oligomers are more likely to be extended by monomers than long ones (note the three crossings of the curves in Fig. 7B). For example, 8-mers are more likely to undergo primer-extension than 9-mers once the VCG concentration exceeds correct value ≈ 0.5 μM. We derived a semi-analytical expression for the threshold VCG concentrations at which oligomers of two different lengths have equal productivity (see dashed lines in Fig. 7B, Supplementary Material Sections S-X and S-XI).
To understand the mechanism underlying the inversion of productivity, we analyze how different complex types contribute to r1+V(L). We introduce
which denotes the fraction of oligomers of length LE that are in a monomer extension-competent complex configuration that uses an template. The term oligomer of length LT as includes the sum over all possible configurations of complexes with the given lengths. Focusing on 8-mers and 9-mers as an example, we observe that complexes utilizing the 9-mer as template cause the inversion of productivity (top two curves in Fig. 7C): As the VCG concentration increases, the fraction of monomer-extendable 8-mers eventually surpasses the fraction of monomer-extendable 9-mers, i.e., (Fig. 7C). Two factors give rise to this feature: (i) Triplexes of type and have similar stability (Fig. 7E), and (ii) the equilibrium concentration of free 8-mers is higher than that of 9-mers (Fig. 7D). As a result, 8-mers are more likely than 9-mers to hybridize to an existing duplex , and, given the stability of the complex , 8-mers remain bound almost as stably as 9-mers. In summary, short oligomers sequester long oligomers as templates to enhance their monomer-extension rate, while long oligomers cannot make use of short oligomers as templates due to the relative instability of the corresponding complexes.
It is noteworthy that Ding et al. have already observed the inversion of productivity experimentally [18]. In their study, they included activated monomers, activated imidazolium-bridged dinucleotides and oligomers up to a length of 12 nt, and observed that the primer extension rate for short primers is higher than the extension rate of long primers. Even though our model differs from their setup in some aspects (e.g., different circular genome, no bridged dinucleotides) evaluating our model using parameters similar to those of the experimentally studied system predicts inversion of productivity that qualitatively agrees with the experimental findings (Supplementary Material Section S-XII). We therefore assume that the mechanism underlying inversion of productivity described here also applies to the experimental observations.
Discussion and conclusion
While significant progress has been made in understanding the prebiotic formation of (ribo)nucleotides [30–33] and characterizing ribozymes that might play a role in an RNA world [34–37], a convincing scenario bridging the gap between prebiotic chemistry and ribozyme-catalyzed replication is still missing. Here, we studied a scenario proposed by Zhou et al. [17] (the ‘Virtual Circular Genome’, VCG) using theoretical and computational approaches. We analyzed the process whereby template-directed ligation replicates the genomic information that is collectively stored in the VCG oligomers. Our analysis revealed a trade-off between the fidelity and the yield of this process: At low concentration of VCG oligomers, most of the ligations produce oligomers that are too short to specify a unique locus on the genome, resulting in a low yield of replication (Fig. 2B-C). At high VCG concentration, erroneous templated ligations limit the fidelity of replication (Fig. 2C-D). We considered two solutions to this trade-off: (i) a VCG pool composition that optimizes its replication behavior within the bounds of the fidelity-yield trade-off, and (ii) breaking the fidelity-yield trade-off by assuming that only monomers are activated.
The first solution maximizes the yield of replication for fixed fidelity. In pools containing only monomers and VCG oligomers of a single length, replication efficiency can be maximized by increasing the length of VCG oligomers and decreasing their concentration (Fig. 2F-G). This reduces the likelihood of error-prone templated ligation of long oligomers. When the pool contains VCG oligomers of multiple lengths, replication efficiency is typically governed by the longest oligomer in the pool (Fig. 4D). Including dimers as feedstock for the replication increases the error fraction (Fig. 5B-C), as dimers that bind to a template with a dangling end are prone to form an incorrect product (Fig. 5D).
The second solution eliminates the error-prone templated ligation of two VCG oligomers by assuming that only monomers are activated. This enables both fidelity and yield to remain high even at high VCG concentrations (Fig. 6B), effectively breaking the fidelity-yield trade-off. Unsurprisingly, longer VCG oligomers are then more likely to be extended than shorter oligomers at equal concentration, (Fig. 6C). However, this is only true for pools with VCG oligomers of a single length — once multiple VCG oligomer lengths compete with each other, shorter oligomers can be more productive than longer ones (Fig. 7B). This feature, which has also been observed experimentally [18], is caused by an asymmetry in the productive interaction between short and long oligomers (Fig. 7C): While short oligomers can sequester longer oligomers as templates for their extension by a monomer, short oligomers are unlikely to serve as templates for longer oligomers (Fig. 7D-E).
If the VCG scenario is to close the gap between prebiotic chemistry and ribozyme-catalyzed replication, VCG pools need to be capable of replicating (parts of) ribozymes that play a role in the emergence of life. While there are cases of small ribozymes [36] or ribozymes with small active sites, such as the Hammerhead ribozyme [38], ribozymes obtained experimentally via in vitro evolution are typically more than a hundred nucleotides long [4, 39–41]. In principle, long genomes (e.g., 1000 nt long) could be replicated in VCG pools containing relatively short oligomers (e.g., 10 nt long) with high efficiency (Fig. 2H). However, this requires a very special sequence composition of the genome. In our analysis, we assumed a clear separation between feedstock and VCG oligomers, defined by a single length scale LS, where all possible sequences of length L < LS are part of the feedstock and occur in the genome sequence, while any sequence of length L ≥ LS specifies a unique locus on the virtual genome. Such a sequence composition is particularly beneficial for the virtual circular genome scenario. However, the sequence composition of a generic genome sequence is characterized by two distinct length scales LS1 and LS2: (i) The former is defined as the largest length LS1 such that all possible subsequences of length L < LS1 are present in the genome, while (ii) LS2 is the smallest length such that all subsequences of the genome with length L ≥ LS2 are unique, i.e., they appear only once in the genome. These two length scales coincide for the specific sequence composition assumed by our analysis, but LS2 could be significantly longer than LS1 in prebiotically interesting genomes containing ribozyme sequences. Note that LS1 will generally scale logarithmically with the genome length, LS1 ∼ ln LG, while LS2 is constrained by the functional requirements of the biomolecules encoded by the genome.
For efficient replication of VCG pools, the second length scale, LS2, of the genome composition will determine the required length LV of the VCG oligomers. Our analysis showed that replication is efficient, if the VCG oligomers are significantly (more than two-fold) longer than LS, which is equal to LS2 for our idealized genome composition. We expect the interplay of the competing template-directed ligation processes in VCG pools in our idealized scenario to be transferable to the general case, such that genomes with arbitrary LS2 can also be efficiently replicated, if the oligomers in the pool are more than two-fold longer than LS2. These sufficiently long oligomers enable the formation of productive complexes, in which the hybridization regions are typically longer than LS2, leading to mostly correct products. On the contrary, if the characteristic length of oligomers in the VCG pool is too short, sequence scrambling leads to a loss of the information of the VCG [19].
Based on this reasoning, it is interesting to note that the virtual circular genome scenario creates a selection pressure for prebiotic genomes to reduce their LS2: A VCG pool encoding a genome containing multiple copies of certain long sequence motifs would replicate less efficiently than a pool for which a shorter motif length already specifies a unique locus on the genome. This comparison assumes all other properties of the two systems being the same, in particular the length distribution of the oligomers in the two pools. As our analysis shows, this length distribution is the key factor determining the replication efficiency for a given genome sequence. The length distribution, in turn, is determined by an interplay between the chemical kinetics of the system and the molecular transport controlled by its physical environment: For example, templated ligation in an open system with continuous in- and outflux of oligomers can lead to a non-monotoneous length distribution with a concentration peak at a characteristic oligomer length, Lc, which is set by the competition between dehybridization and outflux [20]. Via this emergent length-scale, the environment exerts a selection pressure on replication in the VCG scenario: While a VCG pool replicates its genome with high efficiency if the environment preferentially forms long oligomers, Lc > LS2, replication in environments with small characteristic oligomer length is expected to eliminate repeating sequence motifs that are longer than Lc. Mutational errors could replace long repeated sequence motifs by alternative sequences with shorter unique motifs, which may be functionally equivalent.
Given the broad range of prebiotically plausible non-equilibrium environments [42], it is reasonable to expect that some environments provide the required conditions for efficient replication. The constraints formulated in this work can help to guide the search for self-replicating VCG pools, in the vast space of possible concentration profiles and non-equilibrium environments.
Acknowledgements
We thank T. Göppel, M. Battipede, J. C. Espinoza Campos, J. Harth-Kitzerow for stimulating discussions. This work was supported by the Deutsche Forschungs-gemeinschaft (DFG, German Research Foundation) via the CRC/TRR 392 Molecular Evolution (Project-ID 521256690), and under Germany’s Excellence Strategy (EXC-2094-390783311, ORIGINS).
Supplementary Material
S-I. Computing replication observables based on the kinetic simulation
We simulate the dynamics of VCG pools using a kinetic simulation that is based on the Gillespie algorithm. In the simulation, oligomers can hybridize to each other to form complexes, or dehybridize from an existing complex. Moreover, two oligomers can undergo templated ligation if they are hybridized adjacent to each other on a third oligomer. At each time t, the state of the system is determined by a list of all single-stranded oligomers and complexes as well as their respective copy number. We refer to the state of the system at the time t as the ensemble of compounds ℰt. Given the copy numbers, the rates ri of all possible chemical reactions i ∈ ℐ can be computed. To evolve the system in time, we need to perform two steps: (i) We sample the waiting time until the next reaction, τ, from an exponential distribution with mean , and update the simulation time, t → t + τ. (ii) We pick which reaction to perform by sampling from a categorical distribution. Here, the probability to pick reaction i equals .The copy numbers are updated according to the sampled reaction, yielding ℰt+τ. Steps (i) and (ii) are repeated until the simulation time t reaches the desired final time, tfinal. A more detailed explanation of the kinetic simulation is presented in [20, 21].
Our goal is to compute observables characterizing replication in the VCG scenario based on the full kinetic simulation. For clarity, we focus on one particular observable (yield) for the derivation. The results for other observables are stated directly, as their derivations follow analogously. Recall the definition of the yield introduced in the main text,
As we are interested in the initial replication performance of the VCG, we compute the yield based on the ligation events that take place until the characteristic timescale of ligations t0. We would like to compute the yield based on the templated ligation events that we observe in the simulation. Unfortunately, for reasonable system parameters, it is impossible to simulate the system long enough to observe sufficiently many ligation events to compute y to reasonable accuracy. For example, for a VCG pool containing monomers at a total concentration of mM and VCG oligomers of length L = 8 at a total concentration of , it would take about 1700 hours of simulation time to reach t = 5·1012 t0. Multiple such runs would be needed to estimate the mean and the variance of the observables of interest, rendering this approach unfeasible.
Instead, we compute the replication observables based on the copy number of complexes that could potentially perform a templated ligation, i.e., complexes, in which two strands are hybridized adjacent to each other, such that they could form a covalent bond. It can be shown analytically that the number of potentially productive complexes is a good approximation for the number of incorporated nucleotides. The number of incorporated nucleotides can be computed as the integral over the ligation flux, weighted by the number of nucleotides that are added in each templated ligation reaction,
Here, N (C) denotes the copy number of the complex C in the pool ℰt. Le,1 and Le,2 denote the lengths of the oligomers that undergo ligation, and 𝟙 is an indicator function which enforces that only complexes in a configuration that allows for templated ligation contribute to the reaction flux. As only few ligation events are expected to happen until τlig, it is reasonable to assume that the ensembles ℰt do not change significantly during t ∈ [0, τlig]. Therefore, the integration over time may be interpreted as a multiplication by τlig,
where ⟨…⟩ denotes the average over realizations of the ensembles ℰt within the time interval t ∈[τeq, τlig]. Note that, at this point, we made the additional assumption that no templated ligations are taking place between [0, τeq]. This assumption is reasonable, as (i) the equilibration process is very short compared to the characteristic timescale of ligation, and (ii) the number of complexes that might allow for templated ligation during equilibration is lower than in the equilibrium (we start the simulation with an ensemble of single-stranded oligomers), implying that the rate of templated ligation is small.
In order to compute the average over different realizations of ensembles ℰ, we need to sample a set of uncorrelated ensembles that have reached the hybridization equilibrium. For this purpose, we run a full kinetic simulation. The simulation starts with a pool containing only single-stranded oligomers, and reaches the (de)hybridization equilibrium after a time τeq. We identify this timescale of equilibration by fitting an exponential function to the total hybridization energy of all complexes in the system, ΔGtot (Fig. S1A). In the set of ensembles used to evaluate the average in Eq. (S1), we only include ensembles for time t > τeq to ensure that the ensembles have reached (de)hybridization equilibrium. To ensure that the ensembles are uncorrelated, we require that the time between two ensembles that contribute to the average is at least τcorr. The correlation time, τcorr, is determined via an exponential fit to the autocorrelation function of ΔGtot (Fig. S1B). Besides computing the expectation value (Eq. S1), we are also interested in the “uncertainty” of this expectation value, i.e., in the standard deviation of the sample mean σ⟨X⟩. (We use X as a short-hand notation for ∑C∈ℰ N (C) min(Le,1, Le,2) 𝟙(C allows for templated ligation).) The standard deviation of the sample mean, σ⟨X⟩, is related to the standard deviation of X, σX, by the number of samples, σ⟨X⟩ = (Ns)−1/2σX. Moreover, based on the van-Kampen system size expansion, we expect the standard deviation of X to be proportional to V −1/2. Thus, σ⟨X⟩ ∝ (NsV)−1/2.
Using Eq. (S1) (as well as an analogous expression for the number of nucleotides that are incorporated in VCG oligomers), the yield can be expressed as
The additional condition 𝟙(Le,1 + Le,2≥LS) in the numerator ensures that the product oligomer is long enough to be counted as a VCG oligomer, i.e., at least LS nucleotides long. Analogously, the expression for the fidelity of replication reads
Multiplying fidelity and yield results in the efficiency of replication,
The ligation share of a particular type of templated ligation s(type), that is, the relative contribution of this templated-ligation type to the nucleotide extension flux, can be represented in a similar form as the other observables,
As all observables are expressed as the ratio of two expectation values, Ƶ= ⟨X⟩ / ⟨Y⟩, we can compute the uncertainty of the observables via Gaussian error propagation,
Since the variances, and , as well as the covariance, , are proportional to (NsV)−1, the standard deviation of the observable mean, σZ, scales with the inverse square root of the number of samples and the system volume, i.e., σZ ∝(NsV)−1/2. Therefore, the variance of the computed observable can be reduced by either increasing the system volume or increasing the number of samples used for averaging. Both approaches incur the same computational cost: (i) Increasing the number of samples, Ns, requires running the simulation for a longer duration, with the additional runtime scaling linearly with the number of samples. (ii) Similarly, the additional runtime needed due to increased system volume, V, also scales linearly with V (Fig. S2). One update step in the simulation always takes roughly the same amount of runtime, but the change in simulation time per update step depends on the total rate of all reactions in the system. The total rate is dominated by the association reactions, and their rate is proportional to the volume. Therefore, the change in simulation time per update step is proportional to V −1. The runtime, which is necessary to reach the same simulation time in a system with volume V as in a system with volume 1, is a factor of V longer in the larger system. With this in mind, it makes no difference whether the variance is reduced by increasing the volume or the number of samples. For practical reasons (post-processing of the simulations is then less memory- and time-consuming), we opt to choose moderate number of samples, but slightly higher system volumes to compute the observables of interest. The simulation parameters (length of oligomers, concentrations, hybridization energy, volume, number of samples, characteristic timescales) used to obtain the results presented in the main text (Main Text Fig. 2) are summarized in Tab. SI.
S-II. Coarse-grained representation of complexes in the adiabatic approach
It is computationally expensive to evaluate the replication observables via the full kinetic simulation. For this reason, we develop an adiabatic approach, which allows us to compute the replication observables, provided that templated ligation is far slower than (de)hybridization. The adiabatic approach relies on a coarse-grained representation of the oligomers in the pool, which is introduced in this section.
Single Strands
In the coarse-grained description, oligomers of identical length are assumed to have equal concentration, irrespective of their sequence. This assumption is justified for two reasons: (i) We initialize the VCG pool without sequence bias, i.e., all oligomers compatible with the genome sequence are included at equal concentration. (ii) Hybridization energy in our simplified energy model (and therefore also the stability of complexes) only depends on the length of the hybridization site, not on its sequence, provided there is no mismatch. Each coarse-grained oligomer is uniquely identified by its length L, and it represents a group of oligomers with ℭ (L) distinct sequences. We refer to the number ℭ (L) as the combinatorial multiplicity of the coarse-grained oligomer. The value of ℭ (L) depends on the choice of the encoded genome. By construction (see main text), we assume that all possible oligomer sequences of length L < LS are included in the genome. For L ≥ LS, only a subset of all possible 4L sequences is included, but no sequence is repeated multiple times across the genome. Therefore, the combinatorial multiplicity equals
Two strands can form a duplex by hybridizing to each other. We refer to the bottom oligomer as ‘Crick” strand C and to the top oligomer as “Watson” strand W. A duplex is uniquely characterized by the lengths of the oligomers, LC and LW, as well as their relative alignment (Fig. S3A). The alignment index i denotes the position of the Watson strand with respect to the Crick strand. As there needs to be at least one nucleotide of overlap between the strands for a duplex to exist, the alignment index needs lie in the interval i ∈ [ − (LW −1), LC − 1]. Using the alignment index, we can also determine if the duplex has a left (or right) dangling end. The corresponding indicator variables are called dl (or dr),
Moreover, the length of the hybridization region Lo can be computed via
The hybridization energy of the duplex depends on the length of the hybridization region as well as on the existence/absence of dangling ends. For a hybridization site of length Lo, there are Lo − 1 nearest-neighbor energy blocks each of which contributes γ to the energy. Moreover, each dangling end contributes to the energy,
To compute the combinatorial multiplicity for a duplex with fixed LC, LW and alignment index i, we need to multiply the combinatorial multiplicity of the Crick strand by the number of possible hybridization partners. We assume that a hybridization partner is possible if its sequence is perfectly complementary to the lower strand within the hybridization region, whereas hybridization partners with mismatches are not accounted for. This is sensible as long as the energetic penalty for mismatches in the full kinetic simulation is sufficiently large to suppress mismatches. The number of possible hybridization partners is determined by the length of the overlap region Lo: If Lo ≥ LS, the pool contains only one oligomer sequence that can act as hybridization partner by construction of the genome. For shorter hybridization regions, multiple hybridization partners might be possible. Their number is set by the combinatorial multiplicity of the Watson oligomer divided by the combinatorial multiplicity of the hybridization region,
To avoid double-counting, we only account for complexes in which the Crick strand is at least as long as the Watson strand, LC ≥ LW, and multiply ℭ (LW, LC, i) by 1/2 if LW = LC.
Triplexes
Triplexes, i.e., complexes comprised of three strands, are uniquely characterized by the length of the three oligomers, LC, LW,1, LW,2, as well as their respective alignment (Fig. S3B). The alignment index i denotes the position of strand W1 relative to strand C. Analogously, j denotes the relative position of W2 relative to oligomer C. Two strands that are hybridized to each other need to have a hybridization region of at least one nucleotide. Moreover, the strands W1 and W2 must not occupy the same position on the template strand C. Taking both requirements together, the alignment indices fall within the intervals,
A triplex may have a dangling end not only on its left or right end, but also in the gap between strands W1 and W2. Three boolean variables are necessary to denote the presence/absence of the respective dangling ends,
The length of the two hybridization regions are given by
The hybridization energy depends on the length of the overlap regions as well as on the existence of dangling ends: As in the duplex, each overlap region of length Lo,1 (or Lo,2) comprises Lo,1 −1 (or Lo,2 −1) nearest neighbor blocks, each of which contributes γ to the total energy. Moreover, every dangling end contributes γ/2. Note that the presence of a gap between strands W1 and W2, i.e., dm = 1, implies that there are two dangling ends, one for W1 and another for W2. Gaps in between two complexes contribute γ/2 per each dangling end, adding up to γ. If there is no gap between the strands, i.e., dm = 0, there are no dangling end contributions, but a new full nearest neighbor block emerges, which contributes γ to the energy. Therefore, the total energy reads,
The combinatorial multiplicity of a triplex is computed in the same way as for the duplex: The combinatorial multiplicity of the strand C is multiplied by the number of possible hybridization partners W1 and W2. Again, the number of possible partners is set by the length of the hybridization regions,
We use 𝟙 to denote the indicator function which returns 1 in case the condition in the bracket is fulfilled and zero otherwise. As all triplexes are asymmetric, there is no need to introduce a symmetry correction factor.
Tetraplexes
The largest complexes to be accounted for in our coarse-grained adiabatic approach are tetraplexes, i.e., complexes comprised of four strands. We need to distinguish three types of such complexes: (i) 3-1 tetraplexes, (ii) left-tilted 2-2 tetraplexes and (iii) right-tilted 2-2 tetraplexes. In 3-1 tetraplexes, three Watson strands are hybridized to one Crick strand (Fig. S4), whereas in 2-2 tetraplexes, two Watson strands are hybridized to two Crick strands (Fig. S5 and Fig. S6).
3-1 Tetraplexes
Fig. S4 depicts a typical 3-1 tetraplex. Such a tetraplex is uniquely characterized by the length of its oligomers, LC, LW,1, LW,2, LW,3, as well as their relative position to each other denoted by the alignment indices i, j, and k. All positions within the triplex are measured relative to the left end of the C strand. Any W strand needs to have at least one nucleotide of overlap with the C strand, but two W strands must never occupy the same position on the C strand. Consequently, the alignment indices fall within the intervals,
There are two dangling ends (left and right) and potentially two gaps between the W strands: one gap between W1 and W2 and another one between W2 and W3. The following boolean variables indicate the presence/absence of the respective dangling ends,
The length of the hybridization regions is given by
Following the same reasoning as in the case of triplexes, the energy equals
Similarly, the combinatorial multiplicity of 3-1 tetraplexes is constructed using the same reasoning as in the case of triplexes,
As 3-1 tetraplexes are not symmetric under rotation, no symmetry correction of the combinatorial multiplicity is necessary.
Left-Tilted 2-2 Tetraplexes
A 2-2 tetraplex is comprised of two C strands and two W strands. We call a 2-2 tetraplex left-tilted if strand W1 is connected to strand W2 via strand C1 (Fig. S5A). The lengths of the oligomers are called LW,1, LW,2, LC,1 and LC,2. The positions of the strands relative to each other are governed by the alignment indices. All positions are measured relative to the position of the left end of strand C1. The alignment indices may take on the following values,
The complex can have dangling ends on the right and on the left end of the complex; the presence of these dangling ends is indicated by the boolean variables dl and dr. Moreover, two gaps are possible: There might be a gap between strands W1 and W2, or a gap between C1 and C2. The respective boolean variables read
We refer to the hybridization region of strand W1 and C1 as overlap region 1, to the hybridization region of strand W2 and C1 as overlap region 2 and to the hybridization region of strand W2 and C2 as overlap region 3. The length of these hybridization regions is computed via
Given the length of the hybridization region as well as the presence/absence of dangling ends, we can compute the hybridization energy,
The combinatorial multiplicity of a left-tilted 2-2 tetraplex is constructed using the same reasoning as in the case of a 3-1 tetraplex,
To prevent double-counting the same tetraplex, we include either the complex or its rotated representation in the container of possible complexes, but not both. If the complex is symmetric under rotation, we multiply the combinatorial multiplicity by 1/2. Given a left-tilted 2-2 tetraplex (LC,1, LC,2, LW,1, LW,2, i, j, k), we can compute the corresponding rotated tetraplex by applying a linear map. The mapping of the oligomers lengths is illustrated in Fig. S5B. We see that strand C2 after rotation corresponds to strand W1 before rotation, implying that . The same reasoning can be applied to all strands leading to the following map,
In order to compute the map of the alignment indices under rotation, we need to express the relative positions of the strands with respect to the position of strand C1 after rotation, which corresponded to W2 before rotation. For example, |irot| corresponds to the number of nucleotides by which strand C2 (before rotation) protrudes beyond strand W2 (before rotation). Expressed in terms of variables before rotation, this distance may be written as k +LC,2 jLW,2. Analogous relations can be derived for all alignment indices,
Right-tilted 2-2 Tetraplexes
A 2-2 tetraplex is called right-tilted if strand W1 is connected to strand W2 via strand C2 (Fig. S6). As in the case of the left-tilted 2-2 tetraplex, the oligomer lengths are again called LW,1, LW,2, LC,1 and LC,2, but the values of the alignment indices that are possible for the right-tilted tetraplex differ from the ones of the left-tilted tetraplex,
Note that the range of i is chosen such that at least one nucleotides of strand W1 always extends to the right beyond the end of strand C1, allowing for a hybridization region between strand C2 and W1. The boolean variables denoting the presence/absence of dangling ends read
The length of the overlap regions is given by
Like in the case of left-tilted 2-2 tetraplexes, the total hybridization energy is computed via
and the combinatorial multiplicity via
We include either the tetraplex or its rotated representation in the list of possible complexes to avoid double-counting of tetraplexes. Moreover, the combinatorial multplicity is divided by 2, if the tetraplex is symmetric under rotation. It turns out that the rotation map for the right-tilted 2-2 tetraplex is identical to the one of the left-tilted 2-2 tetraplex,
S-III. Numerical solution of the (de)hybridization equilibrium in the adiabatic approach
Based on the list of complexes constructed in the previous section, we can compute the equilibrium concentration of strands and complexes reached in the (de)hybridization equilibrium. In the following, we denote the concentration of a coarse-grained oligomer with length L as c(L), and the concentration of an oligomer with length L and known sequence as cs(L). Recall that we assumed that all sequences of a given length that are compatible with the circular genome are equally likely in the pool (Section S-II). Thus, the concentration of the coarse-grained oligomer and the concentration of an oligomer with specified sequence are related by the combinatorial multiplicity,
In order to compute the concentration of a complex based on the concentration of single strands, we make use of the law of mass action. The concentration of a specific sequence realization of a complex is computed as the product of concentrations of the strands forming the complex divided by the dissociation constant Kd,
Here, is the vector denoting the lengths of the strands comprising the complex, and are the alignment indices. The dissociation constant is set by the hybridization energy, ∆G, of the complex,
where c° = 1 M is the standard concentration. Just as in the case of single strands, the concentration of the sequence-independent coarse-grained complex is related to the concentration of a complex with specific sequence realization via the combinatorial prefactor,
It can be useful to combine the combinatorial multiplicity and the dissociation constant into a single effective association constant,
Note that the effective association constant including the combinatorial multiplicity is denoted by 𝒦a (in curly font), while the association constant without combinatorial multiplicity is denoted (in regular font).
In the adiabatic approach, we study the behavior of the system on timescales that are long enough for the system to reach the (de)hybridization equilibrium, but too short for templated ligation events to take place. Therefore, the length of the oligomers are expected not to change throughout the equilibration process, and we need to introduce a separate mass conservation law for each coarse-grained oligomer, i.e., each oligomer length, that is included in the pool. For each length, the concentration of single-stranded coarse-grained oligomers of length L and the concentration of coarse-grained oligomers of length L bound in a complex need to add up to their total concentration ctot(L) set by the initial condition,
In this equation, s.t. denotes the summation over all complexes that contain at least one strand of length L. Note that we referred to the total concentration of oligomers of length L as c(L) in the main text, but for clarity we use ctot(L) in the supplementary material.
Combining the mass conservation requirement with the mass action law gives a set of coupled polynomial equations. The number of equations equals the number of distinct oligomer lengths included in the pool. The polynomial equations are of degree 4, as tetraplexes are the largest complexes to be accounted for and their concentrations equals the product of the four strands comprising the complex. We determine the equilibrium concentrations by finding the root of this set of fourth-degree polynomial equations using the Levenberg-Marquardt algorithm.
S-IV. Computing replication observables based on the adiabatic approach
As we are not modeling templated ligation events explicitely in the adiabatic approach, we compute replication observables based on the equilibrium concentration of complexes that are in a configuration which allows for a templated ligation reaction to happen. Templated ligations are possible if two strands in the complex are adjacent to each other, i.e., there is no gap in between two oligomers that are hybridized to the same template strand (Fig. S7). Recall that the absence of a gap between two oligomers in the complex implies that the dangling end indicator variable dm = 0 (Section S-II). The length of the product strand P is equal to the sum of the lengths of the two educt strands E1 and E2, LP = LE,1 + LE,2. We can use the information about the length of the product strand to compute the yield of replication. By definition (see main text), the yield equals the fraction of nucleotides used to form VCG oligomers, i.e., strands longer than LS,
We can express this quantity in terms of the equilibrium concentration of complexes facilitating templated ligation,
denotes the summation over all complexes, in which (i) the strands E1 and E2 are adjacent to each other, i.e., dm = 0, and (ii) the length of the product Le,1 + Le,2 ≥ LS. We multiply the equilibrium concentration by the length of the shorter educt strand to account for the number of incorporated nucleotides in line with the definition of the yield, Eq. (S2a).
In order to compute the fidelity of replication, we need to distinguish between product oligomers sequences that are compatible with the genomes (correct sequences) and sequences that are incompatible with the genome (false sequences). As we do not know about the details of the sequences due to the coarse-grained representation of the complexes, we need to invoke a combinatorial argument to determine the fraction of correct products. To this end, we compare the number of product sequences that might be produced in a complex of given oligomer lengths and relative oligomer position, to the number of correct products. The combinatorial multiplicity of the products that could be produced by a complex of given configuration is set by the combinatorial multiplicity of the possible templates, ℭ (Lo,1 + Lo,2), multiplied by the multiplicity of the educt strands hybridizing to the template with given lengths of the hybridization regions, Lo,1 and Lo,2,
The multiplicity of correct products equals the combinatorial multiplicity of strands that have the same length as the product,
This implies that the probability for a complex of given shape to form a correct product is given by
Using this probability, we can compute the fidelity of replication,
as well as the replication efficiency,
S-V. Scaling law of the equilibrium concentration ratio that maximizes replication efficiency
We consider a pool containing monomers and VCG oligomers of a single length. As the feedstock only contains monomers, the total monomer concentration and the total feedstock concentration are identical, . Similarly, the total VCG concentration equals the total concentration of oligomers of length . In such a system, the replication efficiency can be expressed as
where the concentrations denote the equilibrium concentration of all complexes facilitating the indicated type of templated ligation. Note that we do not include dimerizations, i.e., F+F ligations, in the numerator as they do not contribute to the yield. Moreover, the V+V ligations are multiplied by the length of the oligomers LV, as each V+V extends an existing oligomer by LV nucleotides.
Assuming that complexes comprised of three strands (triplexes) are the dominant type of complexes, we can express the equilibrium concentrations as product of the equilibrium concentrations of the free strands and the effective association constants,
Here, denotes the equilibrium concentration of a VCG oligomer with a specific sequence; the equilibrium concentration of all VCG oligomers for any sequence is computed by multiplying with the combinatorial multiplicity, . The same holds for . Eq. (S2) allows us to write the replication efficiency as,
where we introduced the ratio of equilibrium concentrations, . Maximizing Eq. (S3) with respect to rs yields,
The effective association constant are computed using the combinatorial rules outlined in Section S-III. We find that is similar to , deviating at most by 10%. Moreover,
and,
These observations allow us to obtain an significantly simplified approximate expression for ,
We used the identity in the last step. To derive the scaling of as a function of LV, we need to analyze how the effective association constants depend on LV.
scales linearly with LV. In F+F ligations, an VCG oligomer acts as template facilitating the ligation of two monomers. Independent of the length of the VCG oligomer, the two monomers always have the same binding affinity to their template. However, the number of possible positions that the two adjacent monomers can have on the template scales linearly in template length, causing the effective association constant to depend linearly on LV,
scales exponentially with LV. As illustrated in Fig. 2E in the main text, the length of the hybridization region equals L in complexes facilitating incorrect V+V ligations. This leads to exponential scaling of with LV, as the hybridization energy is proportional to the length of the hybridization site. The number of complexes facilitating incorrect product formation is only a function of the unique subsequence length LS, but does not depend on the oligomer length LV. Therefore, there is no multiplicative prefactor proportional LV,
Fig. S8 shows the LV-dependence of the effective association constants. The circles represent the effective association constants derived based on the combinatoric rules discussed in Section S-III, while the solid lines show the length-dependent scaling computed via Eq. (S4) and (S5).
Given the scaling of and , we find that the optimal ratio of equilibrium concentrations equals
We obtain ropt (Fig. 2) from by multiplying with the appropriate combinatorial multiplicity,
S-VI. Characteristic length-scale of VCG oligomers enabling high replication efficiency
Starting from Eq. S3, the optimal efficiency of replication can be expressed as
As (for sufficiently high LV), we can expand in this small argument,
To understand the scaling of ηopt as function of LV, we need to know the scaling of in LV. A typical complex allowing for a F+V ligation is depicted in Fig. 2E in the main text. The length of the overlap Lo region between template and educt VCG oligomer can vary; it is at least one nucleotide, and at most LV − 1 nucleotides (one base pairing position needs to remain available for the monomer). We obtain the effective association constant by summing the contributions of all of these configurations,
Therefore, we use the following scaling ansatz for ,
Combining the scaling laws for , and with Eq. S6, we find
S-VII. Enhanced replication efficiency in multi-length VCG pools
We investigate if a pool containing a range of VCG oligomer lengths exhibits higher replication efficiency than single-length pools. To this end, we study a VCG pool comprised of monomers (with fixed concentration), as well as tetramers and octamers (with variable concentration). For moderate binding affinity, γ = −2.5 kBT, the pool reaches optimal replication efficiency, if the tetramer concentration is smaller than the octamer concentration (Main Text Fig. 3), with extensions of octamers by monomers (1+8) being the most dominant type of templated ligation reactions. For strong binding affinity, γ = −5.0 kBT, templated ligations of tetramers on octamer templates eventually achieve the same replication efficiency as (1+8) ligations (Fig. S9). The mechanism relies on a significant difference in the length of the template (octamers) and educt strands (tetramers). If longer educt strands (heptamers) are chosen, highest efficiency is achieved in the parameter regime, in which VCG oligomers are extended by monomers, regardless of the choice of hybridization energy per matching nearest-neighbor block, γ (Fig. S10 and Fig. S11).
S-VIII. Dimer-related upper bound of replication efficiency
In a VCG pool comprised of monomers, dimers and VCG oligomers of single length, the replication efficiency can be expressed as
provided the pool operates in the concentration regime where all types of ligations other than F+V ligations are negligible. In this equation, stands for the equilibrium concentration of all complexes allowing for the tem-plated ligation of oligomers of length L to oligomers of length M if templated by oligomers of length denotes the same concentration under the additional constraint that the product oligomers need to be correct.
Complexes allowing for templated ligation need to involve at least three oligomers, but may be comprised of more strands. For the following analytical derivation, we restrict ourselves to triplexes, while the full numerical solution (see continuous lines in Main Text Fig. 5C) also includes tetraplexes. If only triplexes are accounted for, each equilibrium concentration may be written as a product of the effective association constant and the equilibrium concentrations of the involved oligomers,
Using this expression, the equation for the replication efficiency may be simplified,
The effective association constants in the denominator may be expressed as the sum of the association constants of correct and false products,
In the following, we simplify the notation by referring to as 1+V ligations, and to as 2+V ligations.
The effective association constants differ with respect to their dependence on LV: While and increase exponentially in LV, is length-independent (Fig. S12). Introducing the parametrizations,
we can express η as
Taking the limit LV → ∞ yields and upper bound for the replication efficiency,
Provided that the ratio of equilibrium concentrations of dimers and monomers is comparable to the ratio of total concentration of dimers and monomers, ceq(2)/ceq(1) ≈ ctot(2)/ctot(1), we can write
Therefore, the upper bound of the replication efficiency may be expressed as
For the genome that is analyzed in the main text, . Comparison of the analytical approach (involving only triplexes) to the numerical solution (involving tetraplexes) reveals that the contribution of tetraplexes is indeed negligible (Main Text Fig. 5)
S-IX. Asymptotic fraction of oligomers undergoing extension by monomers
In pools containing only activated monomers, as well as non-activated dimers and VCG oligomers of a single length, we observe that the fraction of oligomers that can be extended by a monomer reaches an asymptotic value for high VCG concentrations. Notably, this asymptotic value is independent of the oligomer length.
The fraction of oligomers that are in a monomer extension-competent state can be computed based on the equilibrium concentration of complexes facilitating the extension of an oligomer by a monomer,
Note that, in principle, the monomer can be added to the 5’- or the 3’-end of the oligomer, leading to two contributions in the numerator, which are identical due to symmetry, .
Just like in the previous section, denotes the equilibrium concentration of all complexes allowing for the templated ligation of oligomers of length L to oligomers of length M if templated by oligomers of length N. We assume that triplexes are the dominant type of complexes facilitating templated ligation, and write the equilibrium concentration as,
Therefore, the fraction of monomer-extendable oligomers reads,
In the limit of high VCG concentration, almost all VCG oligomers form duplexes. Hence, the equilibrium concentration of free oligomers can be approximated by,
implying,
As almost all monomers are free in solution, their equilibrium concentration may be approximated by the total concentration of monomers, .
From this representation of , it is not clear yet why should be independent of LV. In order to derive an LV-independent expression for , we need to express the association constant of triplexes, , in terms of the duplex association constant, . As illustrated in Fig. S13, the effective association constant of the triplex may be obtained by multiplying the effective association constant of the duplex by the binding affinity of the monomer. We need to distinguish the contributions of duplexes based on the length of their hybridization region: If the two strands in the duplex form a hybridization region that is as long as the oligomers, there is no free base pair for the monomer to hybridize; such duplexes must be excluded in the computation of the triplex association constant.
Similarly, monomers that hybridize to the end of a strand have a different binding affinity than monomers hybridizing to the center of an oligomer (Fig. S13A vs. Fig. S13B-C). All in all, the effective triplex association constant reads,
The notation implies that only complex configurations with a specific overlap length Lo are included. The full effective duplex association constant may be expressed as
We write the duplex association constants in terms of the binding energy due to their overlap length,
Therefore, the asymptotic ratio of monomer-extended oligomers reads,
Addition of a single base pair will decrease the energy by the energy of a half nearest neighbor block, i.e., by γ/2. Thus, we assign Kd(1) = exp (−|γ|/2), and find,
S-X. Analytical solution for equilibrium concentrations in multi-length VCG pools
We consider pools comprised of monomers, dimers and VCG oligomers of multiple lengths. For simplicity, we restrict ourselves to profiles with a uniform distribution of VCG oligomers, such that all VCG oligomers have the same concentration. Assuming that almost all oligomer mass is contained in single strands and duplexes, the mass conservation equations read
Note that there is a mass conservation equation for each oligomer length individually, i.e., we are dealing with a set of multiple coupled quadratic equations. We can make the system of equations dimensionless by introducing a dimensionless concentration ,
This rescaling is chosen as it is the solution to the (approximative) mass conservation equation in the high concentration limit,
under the assumption that . The latter assumption is reasonable given the concentration profile is uniform. We expect to be of order 1 in the high concentration limit.
Writing the full mass conservation equation in terms of the dimensionless concentration yields,
We drop all ratios of total concentrations, as we assume that the total concentration is the same for each oligomer length of VCG oligomer (uniform concentration profile),
By introducing the dimensionless prefactors,
we can rewrite the mass conservation equation,
To solve for , we make the assumption that the dimensionless equilibrium concentration is close to 1 for any length L. For this reason, we may assume that . This assumption is crucial, as it allows us to de-couple the quadratic equations. Therefore, the mass conservation reads,
We can use this representation to compute the equilibrium concentrations recursively: We start the recursion with the assumption , and compute with this assumption,
In the first recursion step, we compute , and solve the mass conservation equation for the equilibrium concentration,
This scheme can be applied until the approximated values of match the true values (obtained via numerical root finding) sufficiently well,
Note that the equilibrium concentration obtained after the (i + 1)-th iteration step depends on the total concentration ctot(L) via the rescaling prefactor as well as via αL. It turns out that the approximation converges after about five iteration steps; the relative error between the true (numerically computed) equilibrium concentration and the approximation is already below 1% in the first iteration step.
S-XI. Threshold concentration for the inversion of productivity
In VCG pools that contain VCG oligomers of multiple lengths as well as activated monomers, we observe “inversion of productivity”: The fraction of short VCG oligomers in a monomer extension-capable complex configuration exceeds that of long oligomers. However, this is only the case for sufficiently high concentration of VCG oligomers. The threshold concentration that is necessary for oligomers of length M to exceed the monomer-extension fraction of oligomers of length L (assuming M < L) is set by the condition r1+V(L) = r1+V(M). To compute r1+V(L), we need to account for all possible complexes, in which an oligomer L can be extended by a monomers,
As we are considering a uniform concentration profile, all VCG oligomers have the same total concentration, i.e.,ctot(L) = ctot(M). Thus, we can express the condition for the threshold concentration as
We combine this condition with the analytical approximation of the equilibrium concentration up to the first iteration step derived in Section S-X, cf. Eqs. S9-S11. As the equilibrium concentrations ceq(L), ceq(M), ceq(N), ceq(P) depend on the total concentration of VCG oligomers ctot(L) = ctot(M) = ctot(N) = ctot(P), this yields an semi-analytical criterion for the threshold concentration , at which the fraction of monomer-extended M -mers exceeds the fraction of monomer-extended L-mers.
S-XII. Inversion of productivity using parameters of system studied by Ding et al
In their experimental study, Ding et al. focus on a genome of length LG = 12 nt. In the VCG pool encoding the genome, Ding et al. include (Tab. S1 in [18])
dimers with 11 different sequences,
trimers with 20 different sequences, and
tetramers up to 12-mers with 24 different sequences each.
Every oligomer sequence is included with a concentration of 1 μM (so-called 1x profile), adding up to a total concentration of . The feedstock for the replication is comprised of activated imidazolium-bridged homo-dinucleotides with a total concentration of 20 mM. Over a timescale of around 4 h, these homo-dinucleotides react with each other to form 6 additional imidazolium-bridged hetero-dinucleotides as well as activated mononucleotides (Fig. S1 in [18]).
We construct a genome that mimics the properties of the genome used by Ding et al., but obeys our genome construction principles outlined in the method section of the main text. To this end, we consider a genome of length LG = 12 nt and a minimal unique subsequence length of LS = 3 nt. This implies that our VCG pool contains
dimers with 16 different sequences, and
trimers up to 12-mers with 24 different sequences.
Our model does not include imidazolium-bridged dinucleotides explicitely. Instead, we assume that the concentration of activated mononucleotides in our model equals the total concentration of activated homo-dinucleotides (20 mM) used in the experimental study. The total concentration of non-activated oligomers is treated as a free parameters.
We find that the system exhibits inversion of productivity: For the entire considered range of concentration of non-activated oligomers, 10-mers are more likely to be extended by a monomer than 12-mers (Fig. S15). Moreover, 8-mers are more productive than 12-mers (provided the concentration of non-activated oligomers exceeds roughly 0.3 μM), and more productive than 10-mers (provided ). Thus, for the experimentally used concentration of non-activated oligomers (, see vertical dashed line in Fig. S15), 8-mers are more productive than 10-mers, which are in turn more productive than 12-mers. Unlike in the experimental system, 6-mers are less likely than 8-mers and 10-mers to be extended by a monomer. We suppose that this difference can be attributed to the differences between the experimental setup and our theoretical model: (i) We choose a (slightly) different genome than Ding et al.. (ii) We model imidazolium-bridged dinucleotides as mononucleotides, as imidazolium-bridged dinucleotides only incorporate one mononucleotide at a time, just like activated mononucleotides. However, dinucleotides bind more stably to an existing complex than mononucleotides, which will affect the fraction of monomer-extended oligomers predicted.
References
- [1]RNA evolution and the origins of lifeNature 338
- [2]The origins of the RNA worldCold Spring Harb Perspect Biol 4
- [3]The RNA world: molecular cooperation at the origins of lifeNat. Rev. Genet 16
- [4]In-ice evolution of RNA polymerase ribozyme activityNat. Chem 5
- [5]The effect of leaving groups on binding and reactivity in enzyme-free copying of DNA and RNANucleic Acids Res 44:5504–5514
- [6]A highly reactive imidazolium-bridged dinucleotide intermediate in nonenzymatic rna primer extensionJ. Am. Chem. Soc 138:11996–12002
- [7]Nonenzymatic copying of rna templates containing all four letters is catalyzed by activated oligonucleotideseLife 5https://doi.org/10.7554/elife.17756
- [8]Enzyme-free replication with two or four basesAngew. Chem. Int. Ed. 57:8911–8915
- [9]Enzyme-free ligation of dimers and trimers to RNA primersNucleic Acids Res 47
- [10]Enzyme-free copying of 12 bases of RNA with dinucleotidesAngew. Chem. Int. Ed. 61
- [11]Prolinyl nucleotides drive enzyme-free genetic copying of RNAAngew. Chem. Int. Ed. 62
- [12]An optimal degree of physical and chemical heterogeneity for the origin of life?Philos. Trans. R. Soc. B 366:2894–2901
- [13]Templating efficiency of naked dnaProc. Natl. Acad. Sci. USA 107:12074–12079
- [14]The prebiotic evolutionary advantage of transferring genetic information from rna to dnaNucleic Acids Res 39:8135–8147
- [15]Effect of stalling after mismatches on the error catastrophe in nonenzymatic nucleic acid replicationJ. Am. Chem. Soc 132:5880–5885
- [16]A kinetic error filtering mechanism for enzymefree copying of nucleic acid sequencesbiorXiv https://doi.org/10.1101/2021.08.06.455386
- [17]The virtual circular genome model for primordial RNA replicationRNA 27
- [18]Experimental tests of the virtual circular genome model for nonenzymatic RNA replicationJ. Am. Chem. Soc 145
- [19]Computer simulations of template-directed RNA synthesis driven by temperature cycling in diverse sequence mixturesPLoS Comput. Biol 18
- [20]Self-assembly of informational polymers by templated ligationPhys. Rev. X 11
- [21]Thermodynamic and kinetic sequence selection in enzyme-free polymer self-assembly inside a nonequilibrium RNA reactorLife 12
- [22]Emergence of homochirality via template-directed ligation in an RNA reactorPRX Life 2
- [23]A kinetic model of nonen-zymatic RNA polymerization by cytidine-5’-phosphoro-2-aminoimidazolideBiochemistry 56
- [24]Kinetics of renaturation of DNAJ. Mol. Biol 31
- [25]Proton NMR study of the base-pairing reactions of d(GGAATTCC): salt effects on the equilibria and kinetics of strand associationBiochemistry 30
- [26]Theoretical analysis of the kinetics of DNA hybridization with gel-immobilized oligonucleotidesBiophys. J 71
- [27]Incorporating chemical modification constraints into a dynamic programming algorithm for prediction of RNA secondary structureProc. Natl. Acad. Sci. USA 101
- [28]Cyclophospholipids enable a protocellular life cycleACS Nano 17:23772–23783
- [29]Template-based information transfer in chemically fueled dynamic combinatorial librariesNat. Chem https://doi.org/10.1038/s41557-024-01570-5
- [30]Synthesis of activated pyrimidine ribonucleotides in prebiotically plausible conditionsNature 459:239–242
- [31]Synthesis of carbohydrates in mineral-guided prebiotic cyclesJ. Am. Chem. Soc 133
- [32]Asphaltwater, and the prebiotic synthesis of ribose, ribonucleo-sides, and RNA, Acc. Chem. Res 45
- [33]A high-yielding, strictly regioselective prebiotic purine nucleoside formation pathwayScience 352
- [34]Freeze–thaw cycles as drivers of complex ribozyme assemblyNature Chemistry 7:502–508
- [35]Ribozyme-catalysed rna synthesis using triplet building blockseLife 7
- [36]Mapping a systematic ribozyme fitness landscape reveals a frustrated evolutionary network for self-aminoacylating RNAJ. Am. Chem. Soc 141
- [37]An rna polymerase ribozyme that synthesizes its own ancestorProc. Natl. Acad. Sci. USA 117:2906–2913
- [38]The hammerhead ribozymeProg. Mol. Biol. Transl. Sci Elsevier :1–23
- [39]RNA-catalyzed RNA polymerization: Accurate and general RNA-templated primer extensionScience 292
- [40]Improved polymerase ribozyme efficiency on hydrophobic assembliesRNA 14
- [41]Ribozyme-catalyzed transcription of an active ribozymeScience American Association for the Advancement of Science
- [42]Physical non-equilibria for prebiotic nucleic acid chemistryNat. Rev. Phys 5:185–195
Article and author information
Author information
Version history
- Preprint posted:
- Sent for peer review:
- Reviewed Preprint version 1:
Copyright
© 2025, Ludwig Burger & Ulrich Gerland
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.
Metrics
- view
- 1
- downloads
- 0
- citations
- 0
Views, downloads and citations are aggregated across all versions of this paper published by eLife.