Toward stable replication of genomic information in pools of RNA molecules
eLife Assessment
This important theoretical study examines the possibility of encoding genomic information in a collective of short overlapping strands (e.g., the Virtual Circular Genome (VCG) model). The study presents convincing theoretical arguments, simulations and comparisons to experimental data to point at potential features and limitations of such distributed collective encoding of information. The work should be of relevance to colleagues interested in molecular information processing and to those interested in pre-Central Dogma or prebiotic models of self-replication.
https://doi.org/10.7554/eLife.104043.3.sa0Important: Findings that have theoretical or practical implications beyond a single subfield
- Landmark
- Fundamental
- Important
- Valuable
- Useful
Convincing: Appropriate and validated methodology in line with current state-of-the-art
- Exceptional
- Compelling
- Convincing
- Solid
- Incomplete
- Inadequate
During the peer-review process the editor and reviewers write an eLife Assessment that summarises the significance of the findings reported in the article (on a scale ranging from landmark to useful) and the strength of the evidence (on a scale ranging from exceptional to inadequate). Learn more about eLife Assessments
Abstract
The transition from prebiotic chemistry to living systems requires the emergence of a scheme for enzyme-free genetic replication. Here, we analyze a recently proposed prebiotic replication scenario, the so-called Virtual Circular Genome (VCG) [Zhou et al., RNA 27, 1-11 (2021)]: Replication takes place in a pool of oligomers, where each oligomer contains a subsequence of a circular genome, such that the oligomers encode the full genome collectively. While the sequence of the circular genome may be reconstructed based on long oligomers, monomers and short oligomers merely act as replication feedstock. We observe a competition between the predominantly error-free ligation of a feedstock molecule to a long oligomer and the predominantly erroneous ligation of two long oligomers. Increasing the length of long oligomers and reducing their concentration decreases the fraction of erroneous ligations, enabling high-fidelity replication in the VCG. Alternatively, the formation of erroneous products can be suppressed if each ligation involves at least one monomer, while ligations between two long oligomers are effectively prevented. This kinetic discrimination (favoring monomer incorporation over oligomer–oligomer ligation) may be an intrinsic property of the activation chemistry, or can be externally imposed by selectively activating only monomers in the pool. Surprisingly, under these conditions, shorter oligomers are extended by monomers more quickly than long oligomers, a phenomenon that has already been observed experimentally [Ding et al., JACS 145, 7504-7515 (2023)]. Our work provides a theoretical explanation for this behavior and predicts its dependence on system parameters such as the concentration of long oligomers. Taken together, the VCG constitutes a promising scenario of prebiotic information replication: It could mitigate challenges in non-enzymatic copying via template-directed polymerization, such as short lengths of copied products and high error rates.
Introduction
In order to delineate possible pathways toward the emergence of life, it is necessary to understand how a chemical reaction network capable of storing and replicating genetic information might arise from prebiotic chemistry. RNA is commonly assumed to play a central role on this path, as it can store information in its sequence and catalyze its own replication (Higgs and Lehman, 2015; Joyce, 1989; Robertson and Joyce, 2012). While ribozymes capable of replicating strands of their own length have been demonstrated in the laboratory (Attwater et al., 2013), it remains elusive how enzyme-free self-replication might have worked before the emergence of such complex ribozymes.
One possible mechanism is template-directed primer extension (Ding et al., 2022; Kervio et al., 2016; Leveau et al., 2022; Walton and Szostak, 2016; Welsch et al., 2023). In this process, a primer hybridizes to a template and is extended by short oligonucleotides, thereby forming a (complementary) copy of the template strand. Considerable progress has been made in optimizing template-directed primer extension, but challenges remain: (i) The produced copies are likely to be incomplete. So far, at most 12 nt have been successfully added to an existing primer (Leveau et al., 2022). Moreover, as the pool of primer strands needs to emerge via random polymerization, the primer is likely to hybridize to the template at various positions, and not only to its 3′-end, leaving part of the 3′-end of the template uncopied (Szostak, 2011). (ii) Errors in enzyme-free copying are frequent due to the limited thermodynamic discrimination between correct Watson-Crick pairing and mismatches (Kervio et al., 2010; Leu et al., 2013; Leu et al., 2011). While some activation chemistries (relying on bridged dinucleotides) have been shown to exhibit improved fidelity (Duzdevich et al., 2021), the error probability still constrains the length of the genome that can be reliably replicated.
The issue of insufficient thermodynamic discrimination can, in principle, be mitigated by making use of kinetic stalling after the incorporation of a mismatch (Leu et al., 2013; Rajamani et al., 2010). By introducing a competition between the reduced polymerization rate and a characteristic timescale of the non-equilibrium environment, it is possible to filter correct sequences from incorrect ones (Göppel et al., 2021). To address the challenge of incomplete copies, Zhou et al. propose a new scenario of replication, the so-called Virtual Circular Genome (VCG) (Zhou et al., 2021). In this scenario, genetic information is stored in a pool of oligomers that are shorter than the circular genome they collectively encode: Each oligomer bears a subsequence of the circular genome, such that the collection of all oligomers encodes the full circular genome virtually. Within the pool, each oligomer can act as a template or primer (Zhou et al., 2021). The oligomers hybridize to each other and form complexes that allow for templated ligation of two oligomers, or for the extension of an oligomer by a monomer. Because the sequences of the ligated strands and the template are part of the genome, most of the products should also retain the sequence of the genome. That way, long oligomers encoding the circular genome can be produced at the expense of short oligomers (Zhou et al., 2021). The long strands, in turn, can assemble into catalytically active ribozymes. With a continuous influx of short oligomers, the VCG might allow for continuous replication of the virtually encoded circular genome. Importantly, replication in the VCG is expected to avoid the issue of incomplete copies. Since the genome is circular, it does not matter which part of the genome an oligomer encodes, as long as the sequence is compatible with the genome sequence. An additional feature of the VCG scenario is that replication should be achievable without the need of adding many nucleotides to a primer: Provided the concentration of oligomers decreases exponentially with their length, the concentration of each oligomer in the pool can be doubled by extending each oligomer only by a few nucleotides (Zhou et al., 2021). The extension of an oligomer by a few nucleotides in a VCG pool has already been demonstrated experimentally (Ding et al., 2023).
A recent computational study points out that the VCG scenario is prone to loss of genetic information via ‘sequence scrambling’ (Chamanian and Higgs, 2022). If the genome contains identical sequence motifs at multiple different loci, replication in the VCG will mix the sequences of these loci, thus destroying the initially defined genome. It is currently unclear which conditions could prevent this genome instability of VCGs, such that their genetic information is retained. Length distribution, sequence composition, oligonucleotide concentration, and environmental conditions, such as temperature, all affect the stability of complexes and thus the replication dynamics of the VCG pool. Here, we characterize the replication fidelity and yield of VCG pools using a kinetic model, which explicitly incorporates association and dissociation of RNA strands as well as templated ligation. We study a broad spectrum of prebiotically plausible and experimentally accessible oligomer pools, from pools containing only monomers and long oligomers of a single length to pools including a range of long oligomers with uniform or exponential concentration profile. The length of the included oligomers as well as their concentration is a free parameter of our model.
We find that, regardless of the pool composition, three competing types of template-directed ligation reactions emerge: (i) ligations between two short oligomers (or monomers), producing products too short to specify a unique genomic locus, (ii) ligations between a short and a long oligomer, typically generating longer products compatible with the genome sequence, and (iii) ligations between two long oligomers, which often yield sequences incompatible with the genome. These erroneous ligations of type (iii) are a key driver of sequence scrambling, as they covalently link oligomers originating from non-adjacent genomic loci, effectively ‘mixing’ distant regions of the genome. Fidelity is primarily determined by the competition between the correct extension of a long oligomer and the erroneous ligation of two long oligomers. The likelihood of the latter can be reduced by decreasing the relative abundance of long oligomers, even though this increases the frequency of unproductive ligations between short oligomers. As a result, fidelity can be improved at the cost of reduced yield. The efficiency, meaning the yield attainable at a fixed high fidelity, thus depends on the length distribution of the oligomers in the pool.
Alternatively, the issue of erroneous ligations is mitigated if ligations of long oligomers are kinetically suppressed, such that each ligation incorporates only one monomer at a time, as in the experimental study by Ding et al., 2023. In this case, the VCG concentration can be chosen arbitrarily large without compromising fidelity. Interestingly, our model predicts an unexpected feature: In the limit of high VCG concentrations, short oligomers are more likely to be extended than long oligomers, even though, intuitively, complexes containing longer oligomers are expected to be more stable and thus more productive. The same behavior was indeed observed experimentally (Ding et al., 2023). We provide an explanation for this feature and discuss its dependence on system parameters such as the length and the concentration of long oligomers in the pool.
Results
Modelling approach
In the VCG scenario, a circular genome is stored in a pool of oligomers, with each oligomer shorter than the genome it helps encode. Each oligomer bears a subsequence of the circular genome, such that, collectively, the oligomers encode the full genome (Figure 1A). As the spontaneous emergence of such VCG pools is expected to be rare (Chamanian and Higgs, 2022), our study focuses on the conditions under which an existing VCG pool can reliably replicate. We therefore begin with a known genome and an associated VCG pool, without addressing the question of origin. To set up our model of VCG dynamics, we specify (i) the circular genome used, (ii) the procedure by which the genome is mapped to a set of oligomers, and (iii) the chemical reactions governing the system’s evolution.
Model.
(A) In the Virtual Circular Genome (VCG) scenario, a circular genome (depicted in green) as well as its sequence complement are encoded in a pool of oligomers (depicted in blue and orange). Collectively, the pool of oligomers encodes the whole sequence of the circular genome. Depending on their length, two types of oligomers can be distinguished: Long VCG oligomers specify a unique locus on the genome, while feedstock molecules (monomers or short oligomers) are too short to do so. (B) The length distribution of oligomers included in the VCG pool is assumed to be exponential. The concentration of feedstock and VCG oligomers as well as their respective length scales of exponential decay and can be varied independently. The set of included oligomer lengths can be restricted via , and , . (C) The hybridization energy of complexes is computed using a simplified nearest-neighbor model: Each full block comprised of two base pairs (depicted in pink) contributes , while dangling end blocks (depicted in blue) contribute . (D) Oligomers form complexes via hybridization reactions, or dehybridize from an existing complex. The ratio of hybridization and dehybridization rate is governed by the hybridization energy (Equation 1). If two oligomers are adjacent to each other in a complex, they can undergo templated ligation. (E) Based on the length of the reacting oligomers, we distinguish three types of templated ligation: Ligation of two feedstock molecules (F+F), ligation of a feedstock molecule to a VCG oligomer (F+V) and ligation of two VCG oligomers (V+V).
Circular genomes
For a given genome length, , there are distinct circular genomes (the factor accounts for the freedom to select the starting position and to choose the Watson or the Crick strand as reference sequence). A key property of the genome is its unique motif length, , defined as the shortest length such that all possible subsequences of length appear at most once in the genome. In other words, all subsequences of length specify a unique locus on the genome. In addition, each circular genome has another length scale, corresponding to the maximal motif length, up to which all possible motifs are contained in the genome. We refer to this length as the exhaustive coverage length, . We typically analyze unbiased genomes in which all possible subsequences of length are contained at equal frequency.
Construction of VCG pools
To specify a VCG pool that encodes a genomic sequence, one must select which subsequences are included in the pool at which concentrations. We consider unbiased pools, where the concentration of subsequences, , depends only on their length, , that is all subsequences of a given length are included at equal concentration. We refer to the length-dependent concentration profile as the length distribution of the pool. Depending on their length, oligomers fall into two categories (Figure 1B): (i) short feedstock molecules (monomers and oligomers) and (ii) long VCG oligomers. Feedstock oligomers are oligomers that are shorter than the unique motif length . Since their sequence appears multiple times on the genome, they do not encode a specific position on the genome. Thus, they serve as feedstock for the elongation of VCG oligomers rather than as information storage. Conversely, VCG oligomers, which are longer than the unique motif length , have a unique locus on the circular genome. Collectively, the VCG oligomers enable the reconstruction of the full genome. The full-length distribution, , can be decomposed into the contributions of feedstock and VCG oligomers,
We assume that both and follow an exponential length distribution. In our model, the concentration of VCG oligomers can be varied independently of the concentration of feedstock, and the length scales for the exponential decay ( vs. ) may differ between feedstock and VCG oligomers. Additionally, we can restrict the set of oligomer lengths included in the pool by setting minimal and maximal lengths for feedstock and VCG oligomers individually,
For any other oligomer length, the concentrations equal zero. This parameterization includes uniform length distributions as a special case ( and ), and also allows for concentration profiles that are peaked. Peaked length distributions can emerge from the interplay of templated ligation, (de)hybridization, and outflux in open systems (Rosenberger et al., 2021). We define the total concentration of feedstock, , as well as the total concentration of VCG oligomers, . Their ratio will turn out to be an important determinant of the VCG dynamics.
(De)hybridization kinetics
Oligomers can hybridize to each other to form double-stranded complexes, or dehybridize from an existing complex. For simplicity, we do not include self-folding within a strand, which is a reasonable assumption for short oligomers. The stability of a complex is determined by its hybridization energy, with lower hybridization energy indicating greater stability. We use a simplified nearest-neighbor energy model to compute the hybridization energy (Göppel et al., 2022; Laurent et al., 2024; Rosenberger et al., 2021): The total energy equals the sum of the energy contributions of all nearest-neighbor blocks in a given complex (Figure 1C). The energy contribution associated with a block of two Watson-Crick base pairs (matches) is denoted , and dangling end blocks involving one Watson-Crick pair and one free base contribute . Nearest-neighbor blocks with mismatches increase the hybridization energy by per block, thus reducing the stability of the complex. The rate constants of hybridization and dehybridization are related via
where is the standard concentration, is the Boltzmann factor, and is the free energy of hybridization. The association rate constant is proportional to the encounter rate constant, . The encounter timescale serves as the elementary time unit of the kinetic model, with all reaction timescales measured relative to it.
Templated ligation
Two oligomers A and B that are hybridized adjacently to each other on a third oligomer C can produce a new oligomer A-B via templated ligation (Figure 1D). Depending on the length of A and B, we distinguish three types of ligation reactions (Figure 1E): (i) F+F ligations, in which two feedstock molecules ligate, (ii) F+V ligations, where a VCG oligomer is extended by a feedstock molecule, and (iii) V+V ligations involving two VCG oligomers. The formation of a covalent bond via templated ligation is not spontaneous but requires the presence of an activation reaction. Usually, these reactions add a leaving group to the 5′-end of the nucleotide, which is cleaved during bond formation (Kervio et al., 2016; Walton and Szostak, 2016). We assume that the concentration of activating agent is sufficiently high for the activation to be far quicker than the formation of the covalent bond, such that activation and covalent bond formation can be treated as a single effective reaction. When not otherwise stated, we assume that all possible templated ligation reactions occur with the same rate constant .
Observables
Templated ligation in the pool forms longer oligomers at the expense of shorter oligomers and monomers. While the product of an F+V ligation (or V+V ligation) is always a VCG oligomer, F+F ligations can produce feedstock or VCG oligomers. In both cases, a produced VCG oligomer can be correct (compatible with the genome) or incorrect (incompatible). We quantify these processes by measuring extension fluxes in units of nucleotides ligated to an existing strand (counting the length of the shorter ligated strand as the number of incorporated bases). In particular, we define the fidelity as the extension flux resulting in correct VCG oligomers relative to the flux resulting in any VCG oligomer,
In addition, we introduce the yield as the proportion of total extension flux that produces VCG oligomers,
Efficient replication of the VCG requires both high fidelity and high yield. Hence, we introduce the efficiency of replication as the product of fidelity and yield,
Moreover, we define the ligation share of a ligation type, which allows us to discern the contributions of different types of templated ligations (F+F, F+V, V+V) to fidelity, yield, and efficiency,
Replication efficiency reaches a maximum at intermediate concentrations of VCG oligomers
We begin our analysis of the dynamics of VCG pools with an exemplary genome of length ,
5 ′-AAAGAGGACACGGCAU-3′
3 ′-UUUCUCCUGUGCCGUA-5′.
This genome contains all possible monomers and dimers with equal frequency, ensuring that all motifs up to are represented. Identifying a unique address on this genome requires at least three nucleotides. Therefore, the unique motif length is , and VCG oligomers need to be at least 3 nt long. Further below, we also explore genomes of different lengths , as well as varying characteristic sequence length scales and (genome construction detailed in the Methods section).
Based on the genome, we construct the initial oligomer pool. As a first step, we focus on a simple scenario in which the pool contains only monomers (serving as feedstock) and VCG oligomers of a single, defined length. The VCG pools are evolved in time using a stochastic simulation based on the Gillespie algorithm (Göppel et al., 2022; Laurent et al., 2024; Rosenberger et al., 2021). Since the Gillespie algorithm operates on the level of counts of molecules instead of concentrations, we must assign a volume to each system (in the range to , see Methods). Besides the volume, we also need to choose the reaction rate constants appropriately: (i) The association time is the fundamental time unit in our kinetic model, and all other times are expressed relative to . Experimentally determined association rate constants are typically around (Ashwood et al., 2023; Braunlin and Bloomfield, 1991; Todisco et al., 2024a; Wetmur and Davidson, 1968). For the purpose of estimating absolute timescales, we therefore assume a constant association timescale of in the following. (ii) The timescale of dehybridization is computed via Equation 1 using the energy contribution for a matching nearest-neighbor block and in case of mismatches. The high energy penalty of nearest-neighbor blocks involving mismatches, , is chosen to suppress the formation of mismatches, while the value of roughly matches the average energy of all matching nearest-neighbor blocks given by the Turner energy model of RNA hybridization (Mathews et al., 2004). (iii) For templated ligation, we select a reaction rate constant of . This choice of is consistent with ligation rates measured in enzyme-free template-directed primer extension experiments, which range from around (Leveau et al., 2022; Sosson et al., 2019) to roughly (Walton and Szostak, 2017), depending on the underlying activation chemistry. This indicates that, for sufficiently short oligomers (up to about 11 nt long), hybridization and dehybridization occur much faster than ligation, so binding equilibrates before ligation takes place.
Based on the ligation events observed in the simulation, we compute the observables introduced above. Due to the small ligation rate constant, it is computationally unfeasible to simulate the time evolution for more than a few ligation time units. Consequently, ligation events are scarce, which leads to high variances in the computed observables. We mitigate this issue by calculating the observables based on the concentration of complexes that are in a productive configuration, even if they do not undergo templated ligation within the time window of the simulation (Methods).
The observable (yield) quantifies the fraction of this total flux directed to producing VCG oligomers. Figure 2B shows how the yield depends on the concentration of VCG oligomers, , at a fixed total monomer concentration, . Here, the data points (with error bars) represent the simulation results with the different colors corresponding to different choices of initial VCG oligomer length, . We observe that the yield increases monotonically with the concentration of VCG oligomers and approaches 100% for high . The concentration at which the pool reaches a yield of 50% depends on the oligomer length : Shorter VCG oligomers require higher concentrations for high yield. The concentration dependence of the yield can be rationalized by the types of templated ligations that are involved. For low VCG concentration, most templated ligation reactions are dimerizations (1+1) with the VCG oligomers merely acting as templates (Figure 2C). As dimers are shorter than the unique motif length , their formation does not contribute to the yield, which explains the low yield in the limit of small . Conversely, for high VCG concentration, most of the templated ligations are F+V or V+V ligations, which produce oligomers of length , implying high yield.
Replication performance of VCG pools containing VCG oligomers of a single length (single-length VCG pools).
(A) The pool contains a fixed concentration of monomers, , as well as VCG oligomers of a single length, , at variable concentration (the VCG oligomers cover all possible subsequences of the genome and its complement at equal concentration). (B) The yield increases as a function of , because dimerizations become increasingly unlikely for high VCG concentrations. (C) The ligation share of different ligation types depends on the total VCG concentration: In the low concentration limit, dimerization (F+F) dominates; for intermediate concentrations, F+V ligations reach their maximum, while, for high concentrations, a substantial fraction of reactions are V+V ligations. The panel depicts the behavior for . (D) Replication efficiency is limited by the small yield for small . In the limit of high , replication efficiency decreases due to the growing number of error-prone V+V ligations. Maximal replication efficiency is reached at intermediate VCG concentration. (E) V+V ligations are prone to the formation of incorrect products due to the short overlap between educt strand and template. In general, the probability of correct product formation, , depends on the choice of circular genome and as well as its mapping to the VCG pool. The probabilities listed here refer to a VCG pool with , and . (F) The optimal equilibrium concentration ratio of free VCG strands to free feedstock strands, which maximizes replication efficiency, decays as a function of length (continuous line). The analytical scaling law (dashed line, Equation 2) captures this behavior. The window of close-to-optimal replication, within which efficiency deviates no more than 1% from its optimum (shaded areas), increases with facilitating reliable replication without fine-tuning to match the optimal concentration ratio. (G) Maximal replication efficiency, which is attained at the optimal VCG concentration depicted in panel E, increases as a function of and approaches a plateau of 100%. For high efficiency, Equation 3 provides a good approximation of the length-dependence of (dashed lines). The oligomer length at which replication efficiency equals 95% is determined using Equation 3 (vertical dotted lines). (H) The unique motif length, increases logarithmically with the length of the genome, . The length of VCG oligomers, at which the optimal replication efficiency reaches 95% (computed using Equation 3) exhibits the same logarithmic dependence on .
Figure 2C also shows that the relative contribution of V+V ligations increases with increasing , with a large fraction of them producing incorrect products (denoted V+V,f). This reduces the fidelity of replication, , and leads to a trade-off between fidelity and yield, which causes replication efficiency, , to reach a maximum at intermediate VCG concentrations (Figure 2D). Using hexamers as an example, Figure 2E illustrates why V+V ligations are prone to forming incorrect oligomers: In order to ensure that two oligomers only ligate if their sequences are adjacent to each other in the true circular genome, both oligomers need to share an overlap of at least nucleotides with the template oligomer. Otherwise, the hybridization region is too short to identify the locus of the oligomer uniquely, and two oligomers from non-adjacent loci might ligate. The probability of forming incorrect products is a consequence of the combinatorics of possible subsequences in the VCG. Specifically, for a genome with , there are 32 different hexamers. For example, if the left educt hexamer only has one nucleotide of overlap with the template, there are possible educt oligomers. However, only one out of those eight hexamers is the correct partner for the right educt hexamer, implying an error probability of (first example in Figure 2E).
Characterizing replication efficiency via full simulations is computationally expensive. Depending on parameters, obtaining a single data point in Figure 2B-D can require hundreds of simulations, each lasting several days. To explore a broad parameter space more easily, we introduce an approximate adiabatic method that (i) assumes ligation is much slower than any hybridization or dehybridization event, and (ii) relies on a coarse-grained sequence-independent representation of oligomers. Details are provided in the Methods section. In brief, because ligation is rare, we first compute the equilibrium distribution of free and bound oligomers. Oligomers of the same length share a common concentration, and complex concentrations are determined via the mass action law using length-dependent dissociation constants. Combining the mass action law with a mass conservation constraint for each oligomer length allows us to compute the equilibrium concentrations of free VCG and feedstock strands, and , given total concentrations and . With these equilibrium values, we determine the concentrations of productive complexes and thus obtain the desired ligation‐based observables without running full stochastic simulations.
The results of the adiabatic approach agree well with the simulation data (Figure 2B-D), supporting that the replication efficiency depends non-monotonously on the concentration of VCG oligomers, with a maximum at intermediate concentration. While the available simulation data only allows for a qualitative characterization of this trend, the adiabatic approach enables a quantitative analysis. For instance, we use the adiabatic approach to determine the equilibrium concentration ratio at which replication efficiency is maximal as a function of the VCG oligomer length (solid lines in Figure 2F). As expected from the qualitative trend observed in the simulation, pools containing longer oligomers reach their maximum for lower concentration of VCG oligomers. The shaded area indicates the range of VCG concentrations within which the pool’s efficiency deviates by no more than one percent from its optimum. We observe that this range of close-to-optimal VCG concentrations increases with . Thus, pools containing longer oligomers require less fine-tuning of the VCG concentration for replication with high efficiency.
In addition to the numerical results, we utilize the adiabatic approach to study the optimal equilibrium VCG concentration analytically (Appendix 1). We find that, for any choice of , replication efficiency reaches its maximum when the fractions of dimerization (1+1) reactions and erroneous V+V ligations are equal (Figure 2C for ). This criterion can be used to derive a scaling law for the optimal equilibrium concentration ratio as a function of the oligomer length ,
which is shown as dashed lines in Figure 2F (the length-scale is defined in Appendix 1). The optimal equilibrium concentration ratio decreases exponentially with , while the hybridization energy sets the inverse length scale of the exponential decay. Analytical estimate and numerical solution agree well, as long as the hybridization is weak and oligomers are sufficiently short. For strong binding and long oligomers, complexes involving more than three strands play a non-negligible role, but such complexes are neglected in the analytical approximation.
Figure 2G shows how the maximal replication efficiency depends on the VCG oligomer length. Consistent with the qualitative trend observed in Figure 2D, longer oligomers enable higher maxima in replication efficiency. Regardless of the choice of , replication efficiency reaches 100% if is sufficiently high. Starting from Equation 2, we find the following approximation for the maximal replication efficiency attainable at a given oligomer length (dashed lines in Figure 2G),
where is a genome-dependent constant (Appendix 1). This equation can provide guidance for the construction of VCG pools with high replication efficiency: Given a target efficiency, the necessary oligomer length and hybridization energy, that is temperature, can be calculated. In Figure 2G, we determine the oligomer length necessary to achieve for varying hybridization energies . At higher temperatures (weaker binding), VCG pools require longer oligomers to replicate with high efficiency.
Equation 3 is not restricted to the specific example genome of length , but applies more generally to genomes of arbitrary length. Any genome of length can contain at most distinct motifs. Consequently, the minimum length required to specify a unique address on the genome equals . By the same logic, the longest motif length for which all possible sequences can be exhaustively represented within the genome is . Both of these characteristic lengths scale logarithmically with genome size. For genomes where is taken to be maximal and minimal, we find that the characteristic VCG oligomer length required for replication with high efficiency () also scales logarithmically with genome length (Figure 2H). Across genome lengths, the offset between and is roughly constant.
In genomes with different choices of and , the characteristic VCG oligomer length required for efficient replication, , is primarily determined by the unique motif length . Specifically, the length of VCG oligomers, , must exceed , regardless of the value of . This is shown for genomes of length in Appendix 2, where we generate genomes with specified length scales and using a Metropolis–Hastings algorithm and analyze their replication efficiency. Intuitively, the required VCG oligomer length is set by , because defines the minimal hybridization region needed to ensure correct ligation via sequence-specific recognition. The precise offset between and depends on genome-specific features, such as the frequency distribution of motifs with lengths between and . If this distribution is nearly uniform (noting that at least one motif must repeat, or else it would constitute a unique address), then will be close to (Appendix 2—figure 1B). In contrast, strongly biased motif distributions require larger to achieve reliable replication (Appendix 2—figure 1A), though even in this case, typically exceeds by only a few nucleotides.
Replication in multi-length VCG pools is dominated by the longest oligomers
In the previous section, we characterized the behavior of pools containing VCG oligomers of a single length. We observed that V+V ligations are error-prone due to insufficient overlap between the educt strands and the template, whereas F+V ligations extend the VCG oligomers with high efficiency. The F+V ligations will gradually broaden the length distribution of the VCG pool, raising the question of how this broadening affects the replication behavior. In principle, introducing multiple oligomer lengths into the VCG pool might even improve the fidelity of V+V ligations, since a long VCG oligomer could serve as a template for the correct ligation of two shorter VCG oligomers.
To analyze this question quantitatively, we first consider the simple case of a VCG pool that only contains monomers, tetramers, and octamers. The concentration of monomers is set to , while the concentrations of the VCG oligomers are varied independently (Figure 3A). Replication efficiency reaches its maximum at and very low tetramer concentration, , effectively resembling a single-length VCG pool containing only octamers. As shown in Figure 3B, the maximal efficiency is surrounded by a plateau of close-to-optimal efficiency. The octamer concentration can be varied by more than one order of magnitude without significant change in efficiency. Similarly, adding tetramers does not affect efficiency as long as the tetramer concentration does not exceed the octamer concentration. Figure 3C illustrates that the plateau of close-to-optimal efficiency coincides with the concentration regime where the ligation of a monomer to an octamer with another octamer acting as template, , is the dominant ligation reaction. For high tetramer concentration and intermediate octamer concentration, templated ligation of tetramers on octamer templates, , surpasses the contribution of ligations (green shaded area in Figure 3C). The reactions give rise to a ridge of increased efficiency, which, however, is small compared to the plateau of close-to-optimal efficiency (Figure 3B). Even though reactions of the type produce correct products in most cases, they compete with error-prone ligations like or , which reduce fidelity. Increasing the binding affinity, that is lowering , enhances the contribution of ligations at the expense of ligations, resulting in an increased local maximum in replication efficiency (Figure 3—figure supplement 1). However, very strong hybridization energy, , is necessary for ligations to give rise to similar efficiency as ligations. Thus, in this example, high efficiency replication facilitated by the ligation of short VCG oligomers on long VCG oligomers is in theory possible, but requires unrealistically strong binding affinity. Moreover, this mechanism is only effective if educts and template differ significantly with respect to their length. If the involved VCG oligomers are too similar in size, templated ligation of two short oligomers tends to be error-prone, irrespective of the choice of binding affinity (Figure 3—figure supplement 2 and Figure 3—figure supplement 3). In that case, F+V ligations remain the most reliable replication mechanism.
Replication performance of pools containing VCG oligomers of two different lengths.
(A) The pool contains a fixed concentration of monomers, , as well as tetramers and octamers at variable concentration. The hybridization energy per nearest-neighbor block is . (B) Replication efficiency reaches its maximum for and significantly lower tetramer concentration, . Efficiency remains close to maximal on a plateau around the maximum spanning almost two orders of magnitude in tetramer and octamer concentration. In addition, efficiency exhibits a ridge of increased efficiency for high tetramer concentration and intermediate octamer concentration. (C) Complexes that facilitate templated ligation are grouped by the length of the template and the educts, . We distinguish complexes producing correct (labeled ‘c’) and false products (labeled ‘f’). For each relevant type of complex, we highlight the region in the concentration plane where it contributes most significantly, that is at least 20% of the total ligation flux. The plateau of high efficiency is dominated by the ligation of monomers to octamers, whereas the ridge of increased efficiency is due to the correct ligation of two tetramers templated by an octamer.
We observe similar behavior in VCG pools containing a range of oligomer lengths from to , rather than just tetramers and octamers. For a uniform VCG concentration profile (Figure 4A, dark gray), the maximal replication efficiency (at the optimal VCG concentration) is attained when F+V ligations involving long VCG oligomers dominate the templated ligation (see Figure 4D showing the contribution of different ligation types in pools with varying at fixed ). Importantly, the maximal replication efficiency is always bounded by the efficiency of the longest VCG oligomer in the pool, independent of the presence or length of shorter oligomers (blue curve in Figure 4B; identical to the orange curve in Figure 2G). Including short VCG oligomers has minimal effect on the dominant ligation types and only slightly increases the proportion of unproductive F+F ligations. Consequently, reducing while keeping fixed leads to only a modest decline in maximal efficiency. In contrast, decreasing while holding constant causes a substantial reduction in efficiency (Figure 4B), because the longest oligomer in the pool sets an upper bound on replication efficiency. Moreover, at low , short and long oligomers become more similar in length, giving rise to a spectrum of erroneous V+V ligations that compete with the productive F+V ligations (Figure 4E).
Replication performance of multi-length VCG pools.
(A) The pool contains a fixed concentration of monomers, , as well as long oligomers in the range at variable concentration . The length dependence of the concentration profile is assumed to be uniform (for panels B, D, and E) or exponential (for panel C); its steepness is set by the parameter . (B) If the length distribution is uniform, reducing decreases the maximal efficiency, whereas increasing increases it. Pools containing a range of oligomer lengths are always outperformed by single-length VCGs (blue curve). (C) Assuming an exponential length distribution of VCG oligomers allows us to tune from a poorly-performing regime (dominated by oligomers of length ) to a well-performing regime (dominated by oligomers of length ). In the limit , approaches the replication efficiency of single-length pools containing only oligomers of length (dashed lines). (D) For high , replication is dominated by primer extension of the long oligomers in the VCG (here ). In this limit, addition of shorter oligomers leaves the dominant F+V ligations almost unchanged. (E) Reducing for fixed increases the fraction of unproductive (i.e. dimerization) or erroneous ligation reactions.
In a realistic prebiotic scenario, the concentration profile of the VCG pool would not be uniform. Depending on the mechanism producing the pool and its coupling to the non-equilibrium environment, it might have a concentration profile that decreases or increases exponentially with length. We use the parameter to control this exponential length dependence (Figure 4A): For negative , the concentration increases as a function of length, while exponentially decaying length distributions have positive . We find that replication efficiency is high if the concentration of long VCG oligomers exceeds or at least matches the concentration of short VCG oligomers ( in Figure 4C). In that case, replication efficiency is dominated by the long oligomers in the pool, since these form the most stable complexes. As the concentration of long oligomers is decreased further (), the higher stability of complexes formed by longer oligomers is eventually insufficient to compensate for the reduced concentration of long oligomers. Replication efficiency is then governed by short VCG oligomers, which exhibit lower replication efficiency. In the limit , replication efficiency approaches the replication efficiency of a single-length VCG pool containing only oligomers of length (Figure 4C and Figure 2G).
Adding dinucleotides to the feedstock decreases replication efficiency
So far, we focused on ensembles that contain solely monomers as feedstock. However, examining the influence of dimers on replication in VCG pools is of interest, since dinucleotides have proven to be interesting candidates for enzyme-free RNA copying (Leveau et al., 2022; Sosson et al., 2019; Walton and Szostak, 2016). For this reason, we study oligomer pools like those illustrated in Figure 5A: The ensemble contains monomers, dimers, and VCG oligomers of a single length, . As our default scenario, the dimer concentration is set to 10% of the monomer concentration, corresponding to , but this ratio can be modified by changing .
Replication performance of single-length VCG pools containing monomers and dimers as feedstock.
(A) The pool contains a fixed total concentration of feedstock, , partitioned into monomers and dimers, as well as VCG oligomers of a single length, . The proportion of monomers and dimers can be adjusted via , and the concentration of the VCG oligomers is a free parameter, . (B) Replication efficiency exhibits a maximum at intermediate VCG concentration in systems with (dashed blue curve) and without dimers (solid blue curve). The presence of dimers reduces replication efficiency significantly, as they enhance the ligation share of incorrect F+V ligations (dashed green curve). The panel depicts the behavior for and . (C) Optimal replication efficiency increases as a function of oligomer length, , and asymptotically approaches a plateau (dashed lines, Equation 4). The value of this plateau, , is determined by the competition between correct and false 2+V reactions, both of which grow exponentially with . Thus, depends on the relative concentration of the dimers in the pool: the more dimers are included, the lower is . (D) Erroneous 1+V ligations are possible if the educt oligomer has a short overlap region with the template. The hybridization energy for such configurations is small and independent of the length of the VCG oligomers (left). While 2+V ligations may produce incorrect products via the same mechanism (middle), incorrect product can also be caused by complexes in which two VCG oligomers hybridize perfectly to each other, but the dimer has a dangling end. The stability of these complexes increases exponentially with oligomer length (right).
Figure 5B compares the replication efficiency of a pool with () and without dimers (). In both cases, pools that are rich in VCG oligomers exhibit low efficiency of replication. As erroneous V+V ligations are the dominant type of reaction in this limit, all pools achieve the same efficiency regardless of the presence of dimers. In contrast, when the pool is rich in feedstock (small ), pools with and without dimers behave differently: If only monomers are included, the efficiency approaches zero, as dimerizing monomers do not contribute to the yield, and thus not to the efficiency. However, the presence of dimers enables the ligation of monomers and dimers to form trimers and tetramers, which lead to a non-zero yield. Given the low VCG concentration, the ligations are likely to proceed using dimers as template. As a consequence, educt oligomers can only hybridize to the template with a single-nucleotide-long hybridization region, leading to frequent formation of incorrect products and, consequently, low efficiency.
Replication efficiency reaches its maximum at intermediate VCG concentrations, where replication is dominated by F+V ligations. Notably, the maximal attainable efficiency is significantly lower for pools with dimers than without, as dimers increase the number and the stability of complex configurations that can form incorrect products (Figure 5C). Without dimers, ligation products are only incorrect if the overlap between the VCG educt oligomer and template is shorter than the unique motif length, . With dimers, however, dangling end dimers can cause incorrect products even in case of long overlap of educt oligomer and template (right column in Figure 5D). The stability of the latter complexes depends on the length of the VCG oligomers, , whereas the stability of complexes facilitating incorrect monomer addition is independent of oligomer length (Figure 5D).
In the presence of dimers, the length-dependent stability of complexes allowing for correct and incorrect F+V ligations causes a competition, which sets an upper bound on the efficiency of replication (Appendix 3),
Here, we introduced effective association constants , which depend differently on the VCG oligomer length, . While the effective association constant of complexes enabling incorrect 1+V ligations is length-independent, the effective association constant for incorrect 2+V ligations, , scales exponentially with ,
The effective association constants for correct 1+V and 2+V ligations also scale exponentially with the oligomer length (Appendix 3—figure 1).
In systems without dimers, that is , approaches 100%, which is consistent with the behavior observed in the previous sections. Conversely, in systems containing dimers, the maximal efficiency remains at a value below 100%, which depends on the concentration of dimers in the feedstock. Figure 5C shows the dependence of maximal replication efficiency on the length of VCG oligomers in pools containing monomers, dimers, and VCG oligomers. As increases, converges towards the upper bound defined in Equation 4 (dashed line in Figure 5C).
Error-prone ligation of VCG oligomers can be kinetically suppressed
In all scenarios considered so far, the efficiency of replication is limited by a common mechanism, regardless of the specifics of the VCG pool: Ligations involving two oligomers that hybridize to the template over a region shorter than are prone to generate incorrect products. In previous sections, we minimized these erroneous ligations by fine-tuning the concentration and length of VCG oligomers. However, such control may become unnecessary if the typically error-prone ligation of two oligomers (V+V) is kinetically suppressed. Kinetic suppression can be an intrinsic property of the activation chemistry: Templated ligation of two oligomers can be several orders of magnitude slower than the extension of an oligomer by a single monomer (Ding et al., 2022; Prywes et al., 2016). As a result, V+V ligations are disfavored purely by their slower kinetics. In addition, it is conceivable that only monomers are chemically activated while longer oligomers remain inactive, which would further reduce the likelihood of erroneous ligations. This scenario has already been explored experimentally (Ding et al., 2023). In natural environments, it could occur, for instance, when activated monomers are produced externally and then diffuse into compartments containing the VCG but lacking internal activation pathways (Kriebisch et al., 2024; Toparlak et al., 2023).
Within our model, we capture the kinetic suppression by introducing two different rates of ligation, for ligations involving a monomer and for ligations involving no monomer, allowing for kinetic discrimination between these processes. We explore the resulting replication efficiency in the limit of perfect kinetic discrimination () where only monomers are reactive for ligation. We first consider a pool where the reactive monomers are mixed with VCG oligomers of a single length as well as non-reactive dimers (Figure 6A). We vary the concentration of VCG oligomers, but keep the feedstock concentrations constant. For small VCG concentrations, we observe low efficiencies (Figure 6B), as the ligation of two monomers, or one monomer and one dimer, is most likely. Conversely, high facilitates the formation of complexes in which VCG oligomers are extended by monomers, which implies high efficiency (Figure 6B). Note that, unlike in systems where all oligomers are reactive, replication efficiency does not decrease for high VCG concentration, as erroneous V+V ligations are impossible. Instead, perfect replication efficiency (100%) is reached for sufficiently high .
Replication performance of single-length VCG pools with kinetic suppression of ligation between oligomers.
(A) The pool contains reactive monomers alongside non-reactive dimers and VCG oligomers of a single length. The concentrations of monomers and dimers are fixed, and , adding up to a total feedstock concentration of , while the concentration of VCG oligomers, , is varied. (B) Unlike in pools in which all ligation processes occur, replication efficiency does not decrease at high VCG concentration if ligations that do not involve monomers are kinetically suppressed. Instead, replication efficiency approaches an asymptotic value of 100%, as erroneous V+V ligations are impossible. (C) The fraction of oligomers that are in a monomer-extension-competent state depends on the total concentration of VCG oligomers. At low VCG concentration, most oligomers are single-stranded, and extension of oligomers by monomers is scarce. At high VCG concentration, approaches the asymptotic value (grey dashed line, Figure 7.). In this limit, almost all oligomers form duplexes, which facilitate monomer addition upon hybridization of a monomer. Thus, the asymptotic fraction of oligomers that gets extended by monomers is not determined by the oligomer length, but by the binding affinity of monomers to existing duplexes. Conversely, the threshold concentration at which depends on oligomer length (colored dashed lines): Longer oligomers reach higher at lower VCG concentration.
While replication efficiency characterizes the relative amount of nucleotides used for the correct elongation of VCG oligomers, it is also interesting to analyze which fraction of VCG oligomers is in a monomer-extension-competent state,
Here, denotes the equilibrium concentration of all complexes enabling the addition of a monomer to any VCG oligomer. We find that depends on the VCG concentration qualitatively in the same way as the efficiency: is small for small VCG concentration but approaches a value asymptotically for high VCG concentration (Figure 6C). In this limit, almost all VCG oligomers are part of a duplex. Monomers can bind to these duplexes to form complexes allowing for 1+V ligations. Since almost all VCG oligomers are already part of a duplex, increasing further does not increase the fraction of VCG oligomers that can be extended by monomers. Instead, the asymptotic value is determined by the concentration of monomers and their binding affinity to an existing duplex (Appendix 4),
While the asymptotic value does not depend on the length of the VCG oligomers, the threshold VCG concentration at which scales exponentially with (Appendix 4),
This scaling implies that longer oligomers require exponentially lower VCG concentration to achieve a given ratio (Figure 6C), as their greater length allows them to form more stable complexes. This observation implies that pools with longer oligomers will always be more productive than pools with shorter oligomers (at equal VCG concentration).
The behavior becomes more complex in pools containing VCG oligomers of multiple lengths, due to the competitive binding within such heterogeneous pools. To illustrate this, we examine an ensemble containing VCG oligomers ranging from to , along with the same feedstock as previously (reactive monomers and non-reactive dimers, see Figure 7A). For simplicity, we assume that the length distribution of VCG oligomers is uniform. We study the fraction of oligomers in a monomer-extension-competent state as a function of oligomer length,
Replication performance of multi-length VCG pools with kinetic suppression of ligation between oligomers.
(A) The pool contains reactive monomers as well as non-reactive dimers and VCG oligomers. The concentrations of monomers and dimers are fixed, and , adding up to a total feedstock concentration of , while the total concentration of VCG oligomers, , is varied. All VCG oligomers are assumed to have the same concentration. (B) At low VCG concentration, long oligomers are more likely in a monomer-extension-competent state than short oligomers, whereas at high VCG concentration, the trend reverses and short oligomers are more likely to be extended by monomers (‘productivity inversion’). The threshold concentration at which a short oligomer starts to outperform a longer oligomer depends on the lengths of the compared oligomers (dashed lines). (C) The mechanism underlying productivity inversion can be understood based on the pair-wise competition of different VCG oligomers, for example 8-mers vs. 9-mers. Over the entire range of VCG concentrations, complexes with 8-mer templates have a lower relative equilibrium concentration than complexes with 9-mer templates (bottom two curves vs. top two curves). As the concentration of VCG oligomers is increased, ligations of type exceed ligations of type , that is the fraction of 8-mers that are extended by monomers using a 9-mer as a template exceeds the fraction of extended 9-mers. (D) The equilibrium concentration of free oligomer decreases with increasing . For longer oligomers, the equilibrium fraction of free oligomers is lower, as they can form more stable complexes with longer hybridization sites. (E) Complexes in which 8-mers serve as template are less stable than complexes with 9-mer templates, explaining why complexes with 8-mer templates are more abundant than complexes with 9-mer templates (see panel C). Complexes with 9-mer template have similar stability regardless of the length of the educt oligomer, that is and are similarly stable. This similar stability, together with the higher concentration of free 8-mers compared to 9-mers (see panel D), is the reason why the fraction of monomer-extended 8-mers exceeds the one of 9-mers (see panel C).
where denotes the equilibrium concentration of all complexes that allow monomer-extension of VCG oligomers of length . At low VCG concentration, longer oligomers are more likely to be extended by monomers than shorter ones (Figure 7B). This behavior is intuitive, as longer oligomers tend to form more stable complexes, which lead to higher productivity. Surprisingly, increasing the VCG concentration reverses the length-dependence of the productivity, such that short oligomers are more likely to be extended by monomers than long ones (note the three crossings of the curves in Figure 7B). For example, 8-mers are more likely to undergo primer extension than 9-mers once the VCG concentration exceeds . We derived a semi-analytical expression for the threshold VCG concentrations at which oligomers of two different lengths have equal productivity (dashed lines in Figure 7B, Appendix 5 and Appendix 6).
To understand the mechanism underlying this inversion of productivity, we analyze how different complex types contribute to . We introduce
which denotes the fraction of oligomers of length that are in a monomer-extension-competent complex configuration that uses an oligomer of length as template. The term includes the sum over all possible configurations of complexes with the given lengths. Focusing on 8-mers and 9-mers as an example, we observe that complexes utilizing the 9-mer as template are responsible for the inversion of productivity (top two curves in Figure 7C): As the VCG concentration increases, the fraction of monomer-extendable 8-mers eventually surpasses the fraction of monomer-extendable 9-mers, that is (Figure 7C). Two factors give rise to this feature: (i) Ternary complexes of type and have similar stability (Figure 7E), and (ii) the equilibrium concentration of free 8-mers is higher than that of 9-mers (Figure 7D). As a result, 8-mers are more likely than 9-mers to hybridize to an existing duplex , and, given the stability of the complex , 8-mers remain bound almost as stably as 9-mers. In summary, short oligomers sequester long oligomers as templates to enhance their monomer-extension rate, while long oligomers cannot make use of short oligomers as templates due to the relative instability of the corresponding complexes.
It is noteworthy that Ding et al. have already observed productivity inversion experimentally (Ding et al., 2023). In their study, they included activated monomers, activated imidazolium-bridged dinucleotides and oligomers up to a length of 12 nt and observed that the primer extension rate for short primers is higher than the extension rate of long primers. Even though our model differs from their setup in some aspects (e.g. different circular genome, no bridged dinucleotides), evaluating our model using parameters similar to those of the experimentally studied system predicts productivity inversion that qualitatively agrees with the experimental findings (Appendix 7). We therefore assume that the mechanism underlying productivity inversion described here also applies to the experimental observations.
Discussion
While significant progress has been made in understanding the prebiotic formation of ribonucleotides (Becker et al., 2016; Benner et al., 2012; Kim et al., 2011; Powner et al., 2009) and characterizing ribozymes that might play a role in an RNA world (Attwater et al., 2018; Mutschler et al., 2015; Pressman et al., 2019; Tjhung et al., 2020), a convincing scenario bridging the gap between prebiotic chemistry and ribozyme-catalyzed replication is still missing. Here, we studied a scenario proposed by Zhou et al., 2021 (the ‘Virtual Circular Genome’, VCG) using theoretical and computational approaches. We analyzed the process whereby template-directed ligation replicates the genomic information that is collectively stored in the VCG oligomers. Our analysis revealed a trade-off between the fidelity and the yield of this process: At low concentration of VCG oligomers, most of the ligations produce oligomers that are too short to specify a unique locus on the genome, resulting in a low yield of replication (Figure 2B-C). At high VCG concentration, erroneous templated ligations cause sequence scrambling and consequently limit the fidelity of replication (Figure 2C-D). We considered two solutions to these issues: (i) a VCG pool composition that optimizes its replication behavior within the bounds of the fidelity-yield trade-off, and (ii) breaking the fidelity-yield trade-off given that error-prone ligations can be kinetically suppressed.
The first solution maximizes the yield of replication for fixed fidelity. In pools containing only monomers and VCG oligomers of a single length, replication efficiency can be maximized by increasing the length of VCG oligomers and decreasing their concentration (Figure 2F-G). This reduces the likelihood of error-prone templated ligation of long oligomers. When the pool contains VCG oligomers of multiple lengths, replication efficiency is typically governed by the longest oligomer in the pool (Figure 4D). Including dimers as feedstock for the replication increases the error fraction (Figure 5B-C), as dimers that bind to a template with a dangling end are prone to form an incorrect product (Figure 5D).
The second solution eliminates the error-prone templated ligation of two VCG oligomers by suppressing them kinetically, for example by assuring that only monomers are chemically activated. This enables both fidelity and yield to remain high at high VCG concentrations (Figure 6B), effectively breaking the fidelity-yield trade-off. Longer VCG oligomers are then more likely to be extended than shorter oligomers at equal concentration (Figure 6C). However, this is only true for pools with VCG oligomers of a single length — once multiple VCG oligomer lengths compete with each other, shorter oligomers can be more productive than longer ones (Figure 7B). This feature, which has also been observed experimentally (Ding et al., 2023), is caused by an asymmetry in the productive interaction between short and long oligomers (Figure 7C): While short oligomers can sequester longer oligomers as templates for their extension by a monomer, short oligomers are unlikely to serve as templates for longer oligomers (Figure 7D-E).
As we intended to study the pathways responsible for sequence scrambling and to explore possible mitigation strategies, we based our analysis on a coarse-grained model that neglects some experimental details. First, we assumed that a complex instantaneously dehybridizes if it contains a non-complementary base pair, whereas in reality, short duplexes can tolerate a limited number of mismatches (Todisco et al., 2024a). While such mismatches can facilitate incorrect hybridization and introduce additional replication errors, we expect this effect to be moderate: Mismatches preferentially occur near the ends of the hybridized region, where their destabilizing effect on binding is weakest (Todisco et al., 2024a). However, such terminal mismatches have also been shown to significantly reduce ligation rates (Rajamani et al., 2010 Leu et al., 2013), which in turn limits the likelihood of forming incorrect products.
Second, we simplified the hybridization dynamics by assuming that all oligomers bind to each other at equal rates, and that dehybridization rates are determined by the hybridization energy computed via a nearest-neighbor model. However, recent work has shown that hybridization to a gap flanked by two oligomers proceeds more slowly than binding to an unoccupied template. Moreover, the resulting nicked complexes (two oligomers hybridized adjacently on a template) are more stable than predicted by standard nearest-neighbor models due to enhanced stacking interactions at the nick site (Todisco et al., 2024b). While this added stability is not expected to affect overall replication efficiency of the VCG (since all productive complexes, correct or incorrect, contain a nick), it can impact the kinetics of the system. In particular, the extended lifetime of such complexes may challenge the adiabatic approximation used in much of our analysis, which assumes ligation is always slower than hybridization and dehybridization.
Third, we do not model the activation chemistry explicitly, but instead assume that all monomers (and, depending on the scenario, also oligomers) are always reactive. As a result, some activated intermediates that are known to form in experiments, such as imidazolium-bridged dinucleotides (Walton and Szostak, 2016), are not modeled. Nonetheless, we include aspects of activation chemistry in a coarse-grained manner. Specifically, to capture the experimentally observed difference in reactivity between monomer incorporation and templated ligation of oligomers under amino-imidazolium activation, we introduce two distinct ligation rate constants. With this approach, we describe the experimental setup well enough to qualitatively reproduce features observed in experiments, for example, the preferential extension of shorter oligomers by monomers in pools containing VCG oligomers of varying lengths (Ding et al., 2023).
The VCG scenario was proposed to close the gap between prebiotic chemistry and ribozyme-catalyzed replication. To this end, VCG pools need to be capable of replicating (parts of) ribozymes that play a role in the emergence of life. While there are cases of small ribozymes (Pressman et al., 2019) or ribozymes with small active sites (e.g. the Hammerhead ribozyme Scott, 2013), ribozymes obtained experimentally via in vitro evolution are often more than a hundred nucleotides long (Attwater et al., 2013; Johnston et al., 2001; Müller and Bartel, 2008; Wochner et al., 2011). Remarkably, our model suggests that the VCG scenario enables high-fidelity replication of long genomes, even in pools containing relatively short VCG oligomers. For a genome of length , a sequence of at least nucleotides is required to uniquely specify a position within the genome. If the oligomers in the pool exceed this length by about three nucleotides, accurate replication becomes feasible (Figure 2H). For example, genomes of length () can be replicated in VCG pools containing oligomers. However, equals only for a restricted set of genome sequences; more generally, . In such cases, reliable replication requires correspondingly longer oligomers. While the precise margin between oligomer length and depends on genome-specific features (particularly the motif distribution at sub- scales), it typically amounts to only a few additional nucleotides (Appendix 2).
It is noteworthy that replication in the VCG scenario imposes a selection pressure on prebiotic genomes to reduce their unique motif length, . A circular genome requiring many nucleotides to specify a unique locus (high ) replicates less efficiently than one with a shorter , assuming all other properties of the VCG pool (particularly the oligomer length distribution) remain identical. This length distribution arises from the interplay between the chemical kinetics and molecular transport governed by the physical environment. For instance, templated ligation in an open system with continuous oligomer influx and outflux can produce a non-monotonic length distribution, with a concentration peak at a characteristic oligomer length , determined by the interplay between dehybridization and outflux (Rosenberger et al., 2021). Through this emergent length scale, the environment shapes replication in the VCG scenario. If the environment facilitates long oligomers (), replication proceeds efficiently. Conversely, in environments with a small , repeating motifs longer than are selected against. In such cases, mutational errors may replace long repeated motifs with functionally equivalent sequences composed of shorter unique motifs, thereby increasing replication efficiency.
Given the broad range of prebiotically plausible non-equilibrium environments (Ianeselli et al., 2023), it is reasonable to expect that some environments provide the required conditions for efficient replication. The constraints formulated in this work can help to guide the search for self-replicating oligomer pools, in the vast space of possible concentration profiles and non-equilibrium environments.
Methods
Constructing circular genomes
In the Virtual Circular Genome (VCG) scenario, genomes are encoded in a pool of oligomers. The encoded genomes are assumed to be circular sequences of length , containing both the original sequence and its reverse complement. Each genome is characterized by two fundamental length scales that reflect different aspects of motif distribution along the sequence. The minimal unique motif length, , is defined as the shortest subsequence length for which all motifs of length appear at most once in the genome. In contrast, the exhaustive coverage length, , denotes the largest motif length for which all possible motifs are present within the genome. Since only distinct motifs can be encoded in a genome (including its complement), cannot exceed
Similarly, for a motif to be uniquely addressable on the genome, its length must be at least
We note that and are essentially the same length scale (differing by at most one nucleotide).
The characteristic length scales and impose constraints on how motifs are distributed. For example, when , all motifs of length must appear at most once, while at least one motif of length must occur more than once. To quantify the motif distribution, we introduce the motif entropy,
where denotes the relative frequency of motif across the genome and its reverse complement. Motif entropy ranges from zero (a homogeneous sequence with only one motif) to a maximum value that depends on the subsequence length,
For motif length to qualify as the unique motif length , its entropy must be maximal, , while must be smaller than its respective maximum, .
The correspondence between characteristic length scales and motif entropies provides a way to construct genome sequences with specified motif characteristics. By treating the entropy function as an effective ‘Hamiltonian’ , we can generate genome sequences through Metropolis–Hastings sampling. In our implementation, each update step in the Metropolis-Hastings algorithm involves either a single-nucleotide mutation or a cut-and-paste operation that relocates a segment of the genome to a new position (the cut-and-paste operation is 10 times more likely than the single nucleotide mutation). The acceptance criterion follows the standard Metropolis rule: modifications that reduce the Hamiltonian are always accepted, while increases in ‘energy’ are accepted with probability . Here, is an effective temperature chosen to be small compared to the typical energy to ensure convergence to the minimum. Simulations are either run until a predefined entropy target is reached or until the energy converges to a plateau.
To generate genomes with and , we minimize the Hamiltonian
Starting from a random sequence, we perform 10,000 Metropolis–Hastings steps at an inverse temperature to construct genome sequences of lengths (listed in Results) and (listed in Supplementary file 1). To explore genomes with and , we initialize the algorithm from the maximally diverse genomes (, ) and then reduce entropy across the range . This is done by minimizing the Hamiltonian
via two different sampling protocols. In the first protocol, the simulation is terminated as soon as the genome reaches the desired values of and . The resulting motif distributions on the intermediate length scales () remain close to uniform, with only minor biases sufficient to enforce the length-scale constraints. In the second protocol, entropy minimization continues beyond the point at which the target values are achieved, leading to more strongly biased motif distributions on intermediate length scales. These construction strategies allow us to systematically tune genome complexity and motif structure, enabling controlled investigations of how the characteristic length scales influence replication dynamics (see Appendix 2 for details).
Computing replication observables based on the kinetic simulation
We simulate the dynamics of VCG pools using a kinetic simulation that is based on the Gillespie algorithm. In the simulation, oligomers can hybridize to each other to form complexes or dehybridize from an existing complex. Moreover, two oligomers can undergo templated ligation if they are hybridized adjacent to each other on a third oligomer. At each time , the state of the system is determined by a list of all single-stranded oligomers and complexes as well as their respective copy number. We refer to the state of the system at the time as the ensemble of compounds . Given the copy numbers, the rates of all possible chemical reactions can be computed. To evolve the system in time, we need to perform two steps: (i) We sample the waiting time until the next reaction, , from an exponential distribution with mean , and update the simulation time, . (ii) We pick which reaction to perform by sampling from a categorical distribution. Here, the probability to pick reaction equals . The copy numbers are updated according to the sampled reaction, yielding . Steps (i) and (ii) are repeated until the simulation time reaches the desired final time, . A more detailed explanation of the kinetic simulation is presented in Göppel et al., 2022; Rosenberger et al., 2021.
Our goal is to compute observables characterizing replication in the VCG scenario based on the full kinetic simulation. In the following derivation, we focus on one particular observable (yield) for clarity. The results for other observables are stated directly, as their derivations follow analogously. Recall the definition of the yield introduced in the Results section,
As we are interested in the initial replication performance of the VCG, we compute the yield based on the ligation events that take place until the characteristic timescale of ligations . In principle, we would like to compute the yield based on the templated ligation events that we observe in the simulation. Unfortunately, for reasonable system parameters, it is impossible to simulate the system long enough to observe sufficiently many ligation events to compute to reasonable accuracy. For example, for a VCG pool containing monomers at a total concentration of and VCG oligomers of length at a total concentration of , it would take about 1700 hr of simulation time to reach (Figure 8). Multiple such runs would be needed to estimate the mean and the variance of the observables of interest, rendering this approach unfeasible.
Simulation runtime of the full kinetic simulation for a VCG pool that includes monomers and VCG oligomers of length .
The total concentration of feedstock monomers equals , while the total concentration of VCG oligomers is . The energy contribution per matching nearest-neighbor block is set to . The volume of the system is varied, and the time evolution is simulated until . The runtime of the simulation scales linearly with the volume of the system.
Instead, we compute the replication observables based on the copy number of complexes that could potentially perform a templated ligation, that is complexes in which two strands are hybridized adjacent to each other, such that they could form a covalent bond. We can show analytically that the number of productive complexes is a good approximation for the number of incorporated nucleotides: The number of incorporated nucleotides can be computed as the integral over the ligation flux, weighted by the number of nucleotides that are added in each templated ligation reaction,
Here, denotes the copy number of the complex C in the pool . and denote the lengths of the oligomers that undergo ligation, and is an indicator function which enforces that only complexes in a ligation-competent configuration contribute to the reaction flux. As only a few ligation events are expected to happen until , it is reasonable to assume that the ensembles do not change significantly during . Therefore, the integration over time may be interpreted as a multiplication by ,
where denotes the average over realizations of the ensembles within the time interval . This average corresponds to the average number of complexes in a ligation-competent configuration. Note that, at this point, we made the additional assumption that no templated ligations are taking place between . This assumption is reasonable, as (i) the equilibration process is very short compared to the characteristic timescale of ligation, and (ii) the number of complexes that might allow for templated ligation during equilibration is lower than in equilibrium (we start the simulation with an ensemble of single-stranded oligomers). Both aspects imply that the rate of templated ligation is negligible during the interval .
In order to compute the average over different realizations of ensembles (as required in Equation 6), we need to sample a set of uncorrelated ensembles that have reached the hybridization equilibrium, which can be done using the full kinetic simulation. The simulation starts with a pool containing only single-stranded oligomers and reaches the (de)hybridization equilibrium after a time . We identify this timescale of equilibration by fitting an exponential function to the total hybridization energy of all complexes in the system, (Figure 9A). In the set of ensembles used to evaluate the average in Equation 6, we only include ensembles for time to ensure that the ensembles have reached (de)hybridization equilibrium. To ensure that the ensembles are uncorrelated, we require that the time between two ensembles that contribute to the average is at least . The correlation time, , is determined via an exponential fit to the autocorrelation function of (Figure 9B). Besides computing the expectation value (Equation 6), we are also interested in the ‘uncertainty’ of this expectation value, that is in the standard deviation of the sample mean . (We use as a short-hand notation for ). The standard deviation of the sample mean, , is related to the standard deviation of , , by the number of samples, . Moreover, based on the van-Kampen system size expansion, we expect the standard deviation of to be proportional to , such that .
Characteristic timescales in the kinetic simulation.
(A) The equilibration timescale is determined based on the total hybridization energy of all strands in the pool, . By fitting an exponential function to , we obtain a characteristic timescale (vertical dotted line), which is then used to calculate the equilibration time as (vertical dashed line). The horizontal dashed line shows the total hybridization energy expected in (de)hybridization equilibrium according to the coarse-grained adiabatic approach (Methods). (B) The correlation timescale is determined based on the autocorrelation of . We obtain (vertical dashed line) by fitting an exponential function to the autocorrelation. In both panels, we show simulation data obtained for a VCG pool containing monomers and VCG oligomers with a concentration of as well as oligomers of length with a concentration of .
Using Equation 6 (as well as an analogous expression for the number of nucleotides that are incorporated in VCG oligomers), the yield can be expressed as
The additional condition in the numerator ensures that the product oligomer is long enough to be counted as a VCG oligomer, that is at least nucleotides long. Analogously, the expression for the fidelity of replication reads
Multiplying fidelity and yield results in the efficiency of replication,
The ligation share of a particular type of templated ligation , that is, the relative contribution of this templated-ligation type to the nucleotide extension flux, can be represented in a similar form as the other observables,
As all observables are expressed as the ratio of two expectation values, , we can compute the uncertainty of the observables via Gaussian error propagation,
Since the variances, and , as well as the covariance, , are proportional to , the standard deviation of the observable mean, , scales with the inverse square root of the number of samples and the system volume, that is . Therefore, the variance of the computed observable can be reduced by either increasing the system volume or increasing the number of samples used for averaging. Both approaches incur the same computational cost: (i) Increasing the number of samples, , requires running the simulation for a longer duration, with the additional runtime scaling linearly with the number of samples. (ii) Similarly, the additional runtime needed due to increased system volume, , also scales linearly with (Figure 8). One update step in the simulation always takes roughly the same amount of runtime, but the change in simulation time per update step depends on the total rate of all reactions in the system. The total rate is dominated by the association reactions, and their rate is proportional to the volume. Therefore, the change in simulation time per update step is proportional to . The runtime, which is necessary to reach the same simulation time in a system with volume as in a system with volume 1, is a factor of longer in the larger system. With this in mind, it makes no difference whether the variance is reduced by increasing the volume or the number of samples. For practical reasons (post-processing of the simulations is less memory- and time-consuming), we opt to choose a moderate number of samples, but slightly higher system volumes to compute the observables of interest. The simulation parameters (length of oligomers, concentrations, hybridization energy, volume, number of samples, characteristic timescales) used to obtain the results presented in Figure 2 are summarized in Table 1.
Input parameters and resulting observables (yield and efficiency) from the full kinetic simulation of replication in pools containing monomers and VCG oligomers of a single length . The observables (yield and efficiency) listed in this table are shown in Figure 2.
| VCG oligo. length | conc. ratio | volume | equilibration time | correlation time | number of samples | yield | efficiency |
|---|---|---|---|---|---|---|---|
| 6 | 1.0 ⋅ 10−4 | 5.0 ⋅ 104 | 3.4 ⋅ 106 | 1.9 ⋅ 106 | 3805 | 0.04 ± 0.01 | 0.04 ± 0.01 |
| 6 | 1.0 ⋅ 10−3 | 5.0 ⋅ 103 | 1.2 ⋅ 107 | 2.6 ⋅ 106 | 3264 | 0.38 ± 0.02 | 0.36 ± 0.02 |
| 6 | 3.3 ⋅ 10−3 | 8.0 ⋅ 102 | 1.3 ⋅ 107 | 2.7 ⋅ 106 | 5400 | 0.68 ± 0.02 | 0.64 ± 0.02 |
| 6 | 1.0 ⋅ 10−2 | 9.1 ⋅ 101 | 1.4 ⋅ 107 | 2.7 ⋅ 106 | 5440 | 0.87 ± 0.01 | 0.77 ± 0.03 |
| 6 | 3.3 ⋅ 10−2 | 9.1 ⋅ 100 | 1.3 ⋅ 107 | 2.4 ⋅ 106 | 6170 | 0.96 ± 0.01 | 0.63 ± 0.03 |
| 7 | 1.0 ⋅ 10−4 | 3.9 ⋅ 104 | 1.7 ⋅ 108 | 2.6 ⋅ 107 | 784 | 0.33 ± 0.05 | 0.33 ± 0.05 |
| 7 | 1.0 ⋅ 10−3 | 7.6 ⋅ 102 | 1.9 ⋅ 108 | 4.0 ⋅ 107 | 2041 | 0.87 ± 0.02 | 0.81 ± 0.05 |
| 7 | 3.3 ⋅ 10−3 | 7.7 ⋅ 101 | 1.9 ⋅ 108 | 3.3 ⋅ 107 | 2980 | 0.95 ± 0.01 | 0.87 ± 0.04 |
| 7 | 1.0 ⋅ 10−2 | 1.1 ⋅ 101 | 1.9 ⋅ 108 | 2.6 ⋅ 107 | 3465 | 0.99 ± 0.01 | 0.81 ± 0.05 |
| 7 | 3.3 ⋅ 10−2 | 1.7 ⋅ 100 | 1.9 ⋅ 108 | 3.1 ⋅ 107 | 3235 | 0.99 ± 0.04 | 0.73 ± 0.05 |
| 8 | 1.0 ⋅ 10−4 | 6.3 ⋅ 103 | 2.5 ⋅ 109 | 1.1 ⋅ 108 | 466 | 0.81 ± 0.05 | 0.81 ± 0.05 |
| 8 | 1.0 ⋅ 10−3 | 9.9 ⋅ 101 | 1.9 ⋅ 109 | 3.6 ⋅ 108 | 615 | 0.99 ± 0.01 | 0.99 ± 0.01 |
| 8 | 3.3 . 10-3 | 1.6 ⋅ 101 | 1.0 ⋅ 109 | 2.2 ⋅ 108 | 1100 | 0.95 ± 0.03 | 0.95 ± 0.03 |
| 8 | 1.0 . 10-2 | 3.8 ⋅ 100 | 5.6 ⋅ 108 | 1.4 ⋅ 108 | 1700 | 1.00 ± 0.01 | 0.93 ± 0.05 |
| 8 | 3.3 . 10-2 | 0.9 ⋅ 100 | 4.9 ⋅ 108 | 7.4 ⋅ 107 | 3195 | 1.00 ± 0.03 | 0.82 ± 0.05 |
Coarse-grained representation of complexes in the adiabatic approach
To characterize the replication performance of VCG pools across a broad range of system parameters, we developed an adiabatic approach that enables faster computation of replication observables than full kinetic simulations. In this method, we compute the equilibrium concentrations of complexes in productive configurations under the assumption that templated ligation is much slower than (de)hybridization. The approach relies on a coarse-grained representation of the oligomers in the pool, which we introduce in this section.
Single strands
In the coarse-grained description, oligomers of identical length are assumed to have equal concentration, irrespective of their sequence. This assumption is justified for two reasons: (i) We initialize the VCG pool without sequence bias, that is all oligomers compatible with the genome sequence are included at equal concentration. (ii) Hybridization energy in our simplified energy model (and therefore also the stability of complexes) only depends on the length of the hybridization site, not on its sequence, such that oligomers of equal length have similar probabilities of being free or bound in complexes. Under this assumption, each coarse-grained oligomer is uniquely identified by its length , and it represents a group of oligomers with distinct sequences. We refer to the number as the combinatorial multiplicity of the coarse-grained oligomer. The value of depends on the choice of the encoded genome. We assume that by construction, all possible oligomer sequences of length are included in the genome (see Methods for the definition of and ). For , only a subset of all possible sequences is included, but no sequence is repeated multiple times across the genome. Therefore, the combinatorial multiplicity equals
Duplexes
Two strands can form a duplex by hybridizing to each other. We refer to the bottom oligomer as ‘Crick’ strand C and to the top oligomer as ‘Watson’ strand W. A duplex is uniquely characterized by the lengths of the oligomers, and , as well as their relative alignment (Figure 10A). The alignment index denotes the position of the Watson strand with respect to the Crick strand. As there needs to be at least one nucleotide of overlap between the strands for a duplex to exist, the alignment index needs to be in the interval . Using the alignment index, we can also determine if the duplex has a left (or right) dangling end. The corresponding indicator variables are called (or ),
Schematic representation of complexes considered in the adiabatic approach.
(A) A duplex is comprised of two strands, which we refer to as W (Watson) and C (Crick) strands. The relative position of the strands is characterized by the alignment index ; for the depicted duplex, . The length of the hybridization region is called . (B) A ternary complex contains three strands. By convention, we denote the two strands that are on the same ‘side’' of the complex as W1 and W2, and the complementary strand as C. The alignment indices and denote the positions of W1 and W2 relative to C. For the depicted complex, and . The length of the hybridization regions is called and .
Moreover, the length of the hybridization region can be computed via
The hybridization energy of the duplex depends on the length of the hybridization region as well as on the existence/absence of dangling ends. For a hybridization site of length , there are nearest-neighbor energy blocks each of which contributes to the energy. Moreover, each dangling end contributes to the energy,
To compute the combinatorial multiplicity for a duplex with fixed , and alignment index , we need to multiply the combinatorial multiplicity of the Crick strand by the number of possible hybridization partners. We assume that a hybridization partner is possible if its sequence is perfectly complementary to the lower strand within the hybridization region, whereas hybridization partners with mismatches are not accounted for. This is sensible as long as the energetic penalty for mismatches in the full kinetic simulation is sufficiently large to suppress mismatches. The number of possible hybridization partners is determined by the length of the overlap region : If , the pool contains only one oligomer sequence that can act as hybridization partner by construction of the genome. For shorter hybridization regions, multiple hybridization partners might be possible. Their number is set by the combinatorial multiplicity of the Watson oligomer divided by the combinatorial multiplicity of the hybridization region,
To avoid double-counting, we only account for complexes in which the Crick strand is at least as long as the Watson strand, , and multiply by if .
Ternary complexes
Ternary complexes, that is complexes comprised of three strands, are uniquely characterized by the length of the three oligomers, , as well as their respective alignment (Figure 10B). The alignment index denotes the position of strand W1 relative to strand C. Analogously, denotes the relative position of W2 relative to oligomer C. Two strands that are hybridized to each other need to have a hybridization region of at least one nucleotide. Moreover, the strands W1 and W2 must not occupy the same position on the template strand C. Taking both requirements together, the alignment indices fall within the intervals,
A ternary complex may have a dangling end not only on its left or right end, but also in the gap between strands W1 and W2. Three boolean variables are necessary to denote the presence/absence of the respective dangling ends,
The length of the two hybridization regions is given by
The hybridization energy depends on the length of the overlap regions as well as on the existence of dangling ends: As in the duplex, each overlap region of length (or ) comprises (or ) nearest neighbor blocks, each of which contributes to the total energy. Moreover, every dangling end contributes . Note that the presence of a gap between strands W1 and W2, that is , implies that there are two dangling ends, one for W1 and another for W2. Gaps in between two complexes contribute per each dangling end, adding up to . If there is no gap between the strands, that is , there are no dangling end contributions, but a new full nearest neighbor block emerges, which contributes to the energy. Therefore, the total energy reads,
The combinatorial multiplicity of a ternary complex is computed in the same way as for the duplex: The combinatorial multiplicity of the strand C is multiplied by the number of possible hybridization partners W1 and W2. Again, the number of possible partners is set by the length of the hybridization regions,
We use to denote the indicator function which returns 1 in case the condition in the bracket is fulfilled and zero otherwise. As all ternary complexes are asymmetric, there is no need to introduce a symmetry correction factor.
Quaternary complexes
The largest complexes to be accounted for in our coarse-grained adiabatic approach are quaternary complexes, that is complexes comprised of four strands. We need to distinguish three types of such complexes: (i) 3-1 quaternary complexes, (ii) left-tilted 2-2 quaternary complexes, and (iii) right-tilted 2-2 quaternary complexes. In 3-1 quaternary complexes, three Watson strands are hybridized to one Crick strand (Figure 11), whereas in 2-2 quaternary complexes, two Watson strands are hybridized to two Crick strands (Figure 12 and Figure 13).
Schematic representation of a 3-1 quaternary complex.
Three strands (referred to as Watson strands W1, W2, and W3) hybridize to a single template strand (Crick strand C). The positions relative to the left end of the C strand are given by the alignment indices , and ; here, . The length of the overlap regions is denoted as , and .
Schematic representation of a left-tilted 2-2 quaternary complex.
(A) Two Watson strands (W1 and W2) are hybridized to two Crick strands (C1 and C2). Both Watson strands are hybridized to the left Crick strand C1, whereas only W2 is hybridized to the right Crick strand C2. The alignment indices and denote the position of the strands relative to the left end of C1; here, , and . The length of the hybridization regions is called , , and . (B) Rotating the schematic representation of a left-tilted 2-2 quaternary complex by produces an alternative representation of the same complex, which is again a left-tilted 2-2 complex. The panel depicts the rotated complex representation (variables with superscript ‘rot’) as well as the non-rotated representation (variables without superscript). There is a unique linear mapping between non-rotated and rotated representation, for example C2 after rotation always corresponds to W1 before rotation.
Schematic representation of a right-tilted 2-2 quaternary complex.
(A) Two Watson strands (W1 and W2) are hybridized to two Crick strands (C1 and C2). Unlike in the left-tilted 2-2 quaternary complex, both Watson strands are hybridized to the right Crick strand C2, whereas only W1 is hybridized to the left Crick strand C1. The alignment indices , and denote the positions of the strands relative to C1; here, , , and . The length of the overlap regions is called and . (B) Rotating the schematic representation of a right-tilted 2-2 quaternary complex by produces an alternative representation of the same complex, which is again a right-tilted 2–2 complex. The panel depicts the rotated complex representation (variables with superscript ‘rot’) as well as the non-rotated representation (variables without superscript). There is a unique linear mapping between non-rotated and rotated representation, for example C2 after rotation always corresponds to W1 before rotation. The mapping is identical for left- and right-tilted 2–2 quaternary complexes.
3-1 Quaternary complexes
Figure 11 depicts a typical 3-1 quaternary complex. Such a complex is uniquely characterized by the length of its oligomers, , as well as their relative position to each other denoted by the alignment indices , and . All positions within the ternary complex are measured relative to the left end of the C strand. Any W strand needs to have at least one nucleotide of overlap with the C strand, but two W strands must never occupy the same position on the C strand. Consequently, the alignment indices fall within the intervals,
There are two dangling ends (left and right) and potentially two gaps between the W strands: one gap between W1 and W2 and another one between W2 and W3. The following boolean variables indicate the presence/absence of the respective dangling ends,
The length of the hybridization regions is given by
Following the same reasoning as in the case of ternary complexes, the energy equals
Similarly, the combinatorial multiplicity of 3-1 quaternary complexes is constructed using the same reasoning as in the case of ternary complexes,
As 3-1 quaternary complexes are not symmetric under rotation, no symmetry correction of the combinatorial multiplicity is necessary.
Left-tilted 2-2 quaternary complexes
A 2-2 quaternary complex is comprised of two C strands and two W strands. We call a 2-2 complex left-tilted if strand W1 is connected to strand W2 via strand C1 (Figure 12A). The lengths of the oligomers are called , and . The positions of the strands relative to each other are governed by the alignment indices. All positions are measured relative to the position of the left end of strand C1. The alignment indices may take on the following values,
The complex can have dangling ends on the right and on the left end of the complex; the presence of these dangling ends is indicated by the boolean variables and . Moreover, two gaps are possible: There might be a gap between strands W1 and W2, or a gap between C1 and C2. The respective boolean variables read
We refer to the hybridization region of strand W1 and C1 as overlap region 1, to the hybridization region of strand W2 and C1 as overlap region 2 and to the hybridization region of strand W2 and C2 as overlap region 3. The length of these hybridization regions is computed via
Given the length of the hybridization region as well as the presence/absence of dangling ends, we can compute the hybridization energy,
The combinatorial multiplicity of a left-tilted 2-2 quaternary complex is constructed using the same reasoning as in the case of a 3-1 complex,
To prevent double-counting the same quaternary complex, we include either the complex or its rotated representation in the container of possible complexes, but not both. If the complex is symmetric under rotation, we multiply the combinatorial multiplicity by . Given a left-tilted 2-2 quaternary complex , we can compute the corresponding rotated complex by applying a linear map. The mapping of the oligomer lengths is illustrated in Figure 12B. We see that strand C2 after rotation corresponds to strand W1 before rotation, implying that . The same reasoning can be applied to all strands leading to the map,
In order to compute the map of the alignment indices under rotation, we need to express the relative positions of the strands with respect to the position of strand C1 after rotation, which corresponded to W2 before rotation. For example, corresponds to the number of nucleotides by which strand C2 (before rotation) protrudes beyond strand W2 (before rotation). Expressed in terms of variables before rotation, this distance may be written as . Analogous relations can be derived for all alignment indices,
Right-tilted 2-2 quaternary complexes
A 2-2 quaternary complex is called right-tilted if strand W1 is connected to strand W2 via strand C2 (Figure 13). As in the case of the left-tilted 2-2 complex, the oligomer lengths are again called and , but the values of the alignment indices that are possible for the right-tilted quaternary complex differ from the ones of the left-tilted complex,
Note that the range of is chosen such that at least one nucleotide of strand W1 always extends to the right beyond the end of strand C1, allowing for a hybridization region between strand C2 and W1. The boolean variables denoting the presence/absence of dangling ends read
The length of the overlap regions is given by
Like in the case of left-tilted 2-2 quaternary complexes, the total hybridization energy is computed via
and the combinatorial multiplicity via
We include either the quaternary complex or its rotated representation in the list of possible complexes to avoid double-counting. Moreover, the combinatorial multiplicity is divided by 2 if the complex is symmetric under rotation. It turns out that the rotation map for the right-tilted 2-2 quaternary complex is identical to the one of the left-tilted 2-2 complex,
Numerical solution of the (De)hybridization equilibrium in the adiabatic approach
Based on the list of complexes constructed in the previous section, we can compute the equilibrium concentration of strands and complexes reached in the (de)hybridization equilibrium. In the following, we denote the concentration of a coarse-grained oligomer with length as , and the concentration of an oligomer with length and known sequence as . Recall that we assumed that all sequences of a given length that are compatible with the circular genome are equally likely in the pool. Thus, the concentration of the coarse-grained oligomer and the concentration of an oligomer with specified sequence are related by the combinatorial multiplicity,
In order to compute the concentration of a complex based on the concentration of single strands, we make use of the law of mass action. The concentration of a specific sequence realization of a complex is computed as the product of concentrations of the strands forming the complex divided by the dissociation constant ,
Here, is the vector denoting the lengths of the strands comprising the complex, and are the alignment indices. The dissociation constant is set by the hybridization energy, , of the complex,
where is the standard concentration. Just as in the case of single strands, the concentration of the sequence-independent coarse-grained complex is related to the concentration of a complex with specific sequence realization via the combinatorial prefactor,
It can be useful to combine the combinatorial multiplicity and the dissociation constant into a single effective association constant,
Note that the effective association constant including the combinatorial multiplicity is denoted by (in curly font), while the association constant without combinatorial multiplicity is denoted (in regular font).
In the adiabatic approach, we study the behavior of the system on timescales that are long enough for the system to reach the (de)hybridization equilibrium, but too short for templated ligation events to take place. Therefore, the length of the oligomers is expected not to change throughout the equilibration process, and we need to introduce a separate mass conservation law for each coarse-grained oligomer, that is each oligomer length, that is included in the pool. For each length, the concentration of single-stranded coarse-grained oligomers of length and the concentration of coarse-grained oligomers of length bound in a complex need to add up to their total concentration set by the initial condition,
In this equation, denotes the summation over all complexes that contain at least one strand of length . Note that we referred to the total concentration of oligomers of length as in the main text, but for clarity, we use in the following.
Combining the mass conservation requirement with the mass action law gives a set of coupled polynomial equations. The number of equations equals the number of distinct oligomer lengths included in the pool. The polynomial equations are of degree 4, as quaternary complexes are the largest complexes to be accounted for and their concentrations equal the product of the four strands comprising the complex. We determine the equilibrium concentrations by finding the root of this set of fourth-degree polynomial equations using the Levenberg-Marquardt algorithm.
Computing replication observables based on the adiabatic approach
As we are not modeling templated ligation events explicitly in the adiabatic approach, we compute replication observables based on the equilibrium concentration of complexes that are in a configuration which allows for a templated ligation reaction to happen. Templated ligations are possible if two strands in the complex are adjacent to each other, that is there is no gap in between two oligomers that are hybridized to the same template strand (Figure 14). Recall that the absence of a gap between two oligomers in the complex implies that the dangling end indicator variable . The length of the product strand P is equal to the sum of the lengths of the two educt strands E1 and E2, . We can use the information about the length of the product strand to compute the yield of replication. By definition, the yield equals the fraction of nucleotides used to form VCG oligomers, that is strands that are at least long,
Schematic representation of a complex that allows templated ligation.
The strands E1 and E2 are adjacent to each other, such that a covalent bond can form between their ends. The length of the product strand, , is set by the length of the educt strands, and . The likelihood for the complex to form a product oligomer whose sequence is compatible with the true circular genome, , is determined by the length of the educts and the length of their hybridization region with the template. The parts of the complex shown with hatching do not affect .
We can express this quantity in terms of the equilibrium concentration of complexes facilitating templated ligation,
denotes the summation over all complexes, in which (i) the strands E1 and E2 are adjacent to each other, that is , and (ii) the length of the product . We multiply the equilibrium concentration by the length of the shorter educt strand to account for the number of incorporated nucleotides in line with the definition of the yield (Equation 7).
In order to compute the fidelity of replication, we need to distinguish between product oligomer sequences that are compatible with the genomes (correct sequences) and sequences that are incompatible with the genome (false sequences). As we do not know about the details of the sequences due to the coarse-grained representation of the complexes, we need to invoke a combinatorial argument to determine the fraction of correct products. To this end, we compare the number of product sequences that might be produced in a complex of given oligomer lengths and alignment indices to the number of correct products associated with the same complex configuration. The combinatorial multiplicity of the products that could be produced by a complex of given configuration is set by the combinatorial multiplicity of the possible templates, , multiplied by the multiplicity of the educt strands hybridizing to the template with given lengths of the hybridization regions, and ,
The multiplicity of correct products equals the combinatorial multiplicity of strands that have the same length as the product,
This implies that the probability for a complex of given shape to form a correct product is given by
Using this probability, we can compute the fidelity of replication,
as well as the replication efficiency,
Appendix 1
Scaling of replication efficiency with oligomer length in single-length VCG pools
We consider a pool containing monomers and VCG oligomers of a single length. As the feedstock only contains monomers, the total monomer concentration and the total feedstock concentration are identical, . Similarly, the total VCG concentration equals the total concentration of oligomers of length , . In systems of this type, the replication efficiency can be expressed as
where the concentrations denote the equilibrium concentration of all complexes facilitating the indicated type of templated ligation. Note that we do not include F+F ligations (i.e. dimerizations) in the numerator as they do not contribute to the yield. Moreover, the V+V ligations are multiplied by the length of the oligomers , as each V+V extends an existing oligomer by nucleotides.
Assuming that complexes comprised of three strands are the dominant type of complexes, we can express the equilibrium concentrations as a product of the equilibrium concentrations of the free strands and the effective association constants,
Here, denotes the equilibrium concentration of a VCG oligomer with a specific sequence; the equilibrium concentration of all VCG oligomers for any sequence is computed by multiplying with the combinatorial multiplicity, (analogously, , see Methods for details). Equation 8 allows us to write the replication efficiency as
where we introduced the ratio of equilibrium concentrations, . Maximizing Equation 9 with respect to yields
The effective association constants are computed using the combinatorial rules outlined in the Methods section. We find that is similar to , deviating at most by 10%. Moreover,
and,
These observations allow us to obtain a significantly simplified approximate expression for ,
We used the identity in the last step. To derive the scaling of as a function of , we need to analyze how the effective association constants depend on .
scales linearly with . In F+F ligations, a VCG oligomer acts as a template facilitating the ligation of two monomers. Independent of the length of the VCG oligomer, the two monomers always have the same binding affinity to their template. However, the number of possible positions that the two adjacent monomers can have on the template scales linearly in template length, causing the effective association constant to depend linearly on ,
(10)scales exponentially with . As illustrated in Figure 2E, the length of the hybridization region in complexes facilitating incorrect V+V ligations equals . As the hybridization energy is proportional to the length of the hybridization site, , the effective association constant scales exponentially with . The number of complexes facilitating incorrect product formation is a function of the unique subsequence length , but does not depend on the oligomer length . Therefore, there is no additional dependence of on ,
(11)
Appendix 1—figure 1 shows the -dependence of the effective association constants. The circles represent the effective association constants derived based on the combinatorial rules discussed in the Methods section, while the solid lines show the length-dependent scaling computed via Equation 10 and Equation 11.
Effective association constants of complexes facilitating F+F ligations (A), false V+V ligations (B), and F+V ligations (C).
The dots depict the effective association constants derived based on the combinatorial rules presented in the Methods section, and the solid lines represent the respective scaling laws introduced in Equations 10 and 11. Different colors correspond to different hybridization energies per matching nearest neighbor block γ.
Given the scaling of and , we find that the optimal ratio of equilibrium concentrations equals
We obtain (Figure 2) from by multiplying with the appropriate combinatorial multiplicity,
Combining the expression for with Equation 9, the optimal efficiency of replication can be expressed as
As (for sufficiently high ), we can expand in this small argument,
To understand the scaling of as function of , we need to characterize the scaling of with . A typical complex allowing for a F+V ligation is depicted in Figure 5D. The length of the overlap region between template and educt VCG oligomer can vary; it is at least one nucleotide, and at most nucleotides (one base pairing position needs to remain available for the monomer). We obtain the effective association constant by summing the contributions of all of these configurations,
Therefore, we use the following scaling ansatz for ,
Combining the scaling laws for , , and with Equation 12, we find
Appendix 2
Dependence of replication efficiency on the sequence of the genome
The coarse-grained representation used in the adiabatic approach applies only to a specific class of genomes, namely those in which the exhaustive coverage length is maximal and the unique motif length is minimal. This corresponds to genomes that satisfy and (see Methods for details).
To analyze replication behavior beyond this limited case, we developed a fully sequence-resolved extension of the adiabatic approach. Rather than using a coarse-grained view of oligomers, this method considers each distinct strand sequence as a separate chemical species. For example, to study a genome of length that is encoded in a pool containing monomers and octamers, a total of 132 individual oligomers (four monomers and 128 distinct octamers) must be accounted for. This contrasts sharply with the coarse-grained scenario, which involves only two variables (total monomer and total octamer concentrations). Starting from the single-stranded oligomers, all possible complexes involving up to three strands are enumerated, and their hybridization free energies are calculated using the same energetic model applied in the coarse-grained framework. In the aforementioned example, this results in 351,200 distinct sequence-resolved complexes, compared to just 135 complexes in the coarse-grained model, highlighting the increased computational demands in terms of memory and runtime. The hybridization equilibrium is computed by solving the algebraic system defined by the law of mass action and mass conservation. The procedure mirrors that of the coarse-grained approach (Methods), with the key difference that combinatorial prefactors are no longer required: these are inherently encoded in the full enumeration of unique sequences and their binding configurations.
For our further analysis, we fix the genome length at and systematically vary the motif length scales and . Genomes with desired values of and are constructed as described in the Methods section: Starting from a genome with maximal motif entropy (, ), we reduce the motif entropy on intermediate length scales to achieve the target characteristics. Supplementary file 1 provides an overview of all sampled genome sequences. We consider two limiting cases for motif distributions between and . In the weakly biased case, the motif distribution (for motifs of length ) is nearly uniform: most motifs of length appear exactly once, except for a few motifs that occur twice. This leads to genomes where , but the deviation from maximal entropy is minimal. In the strongly biased case, many motifs appear multiple times, resulting in a much less uniform motif distribution.
Each genome is mapped to a VCG pool containing monomers and oligomers of variable length . Assuming all oligomers are chemically activated, we compute the replication efficiency of each pool as a function of the relative oligomer concentration, for different values of and across genome types (Appendix 2—figure 1). As expected, replication efficiency exhibits a maximum at intermediate oligomer concentrations. The maximum efficiency achieved depends on both the VCG oligomer length and the genome’s motif properties. In general, longer oligomers enable more efficient replication. However, at a fixed oligomer length, genomes with higher unique motif lengths (or lower ) replicate with lower efficiency, implying that these genomes require longer oligomers for successful replication. For example, a VCG pool with oligomers of length can replicate a genome with and at 97% efficiency (see solid green curve in Appendix 2—figure 1B). In contrast, replicating a genome with and requires oligomers of at least to reach comparable efficiency (see dotted green curve in Appendix 2—figure 1C).
Replication efficiency as a function of the concentration of VCG oligomers for different choices of genomes and varying VCG oligomer length.
All genomes are long and include all motifs up to length , but differ with respect to their minimal unique subsequence length : (A) , (B) , and (C) (shown as dotted lines). For comparison, every panel shows the replication efficiency of a genome with (solid line). Different colors are used to distinguish different VCG oligomer lengths. Under otherwise identical conditions (e.g. identical oligomer length), replication proceeds with lower efficiency in genomes with higher unique subsequence length .
Across all tested genomes, the oligomer length required to achieve efficiency (denoted ) consistently exceeds the unique motif length (Appendix 2—figure 2). However, the difference depends on the structure of the motif distribution in the intermediate length range . Genomes with strongly biased motif distributions require longer oligomers (larger , see Appendix 2—figure 2A), whereas the replication behavior of weakly biased genomes closely mirrors that of the fully entropic genome (, ) (Appendix 2—figure 2B). In realistic prebiotic settings, genomes are likely to fall somewhere between these extremes, depending on the strength of biases introduced during their emergence.
Maximal replication efficiency as a function of the oligomer length for different genomes (all long).
Regardless of the genome, the oligomer length needs to exceed to enable replication with high efficiency (e.g., higher than 95%). The difference between the genome length required for high efficiency replication and the unique motif length depends on the motif distribution on intermediate length scales : Genomes with strong bias require longer oligomers (A) than genomes with weak bias (B) (see Supplementary file 1 for the genomes and their motif entropies).
Appendix 3
Influence of dimers on replication efficiency in single-length VCG pools
In a VCG pool comprised of monomers, dimers, and VCG oligomers of a single length, the replication efficiency can be expressed as
provided the pool operates in the concentration regime where all types of ligations other than F+V ligations are negligible. In this equation, denotes the equilibrium concentration of all complexes allowing for the templated ligation of oligomers of length to oligomers of length if templated by oligomers of length . denotes the same concentration under the additional constraint that the product oligomers need to be correct, that is the sequence of the product needs to be part of the pre-defined genome sequence.
Complexes allowing for templated ligation need to involve at least three oligomers but may be comprised of more strands. For the following analytical derivation, we restrict ourselves to ternary complexes, while the full numerical solution (see continuous lines in Figure 5C) also includes quaternary complexes. If only ternary complexes are accounted for, each equilibrium concentration may be written as a product of the effective association constant and the equilibrium concentrations of the involved oligomers,
Using this expression, the equation for the replication efficiency may be simplified,
The effective association constants in the denominator can be expressed as the sum of the association constants of correct and false products,
Effective association constants of complexes facilitating 1+V ligations (A) and 2+V ligations (B).
In the following, we simplify the notation by referring to as 1+V ligations, and to as 2+V ligations. The effective association constants differ with respect to their dependence on : While , and increase exponentially with , is length-independent (Appendix 3—figure 1). Introducing the parametrizations,
we can express as
Taking the limit yields an upper bound for the replication efficiency,
Provided that the ratio of equilibrium concentrations of dimers and monomers is comparable to the ratio of total concentration of dimers and monomers, , we can write
Therefore, the upper bound of the replication efficiency reads
For the genome that is analyzed in the main text, . Comparison of the analytical approach (involving only ternary complexes) to the numerical solution (involving quaternary complexes) reveals that the contribution of quaternary complexes is indeed negligible (Figure 5).
Appendix 4
Fraction of VCG oligomers extended by monomers in single-length VCG pools with kinetically suppressed V+V ligation
In pools containing monomers, dimers, and VCG oligomers of a single length, in which V+V ligations are kinetically suppressed (e.g. by only activating monomers), we observe that the fraction of oligomers that can be extended by a monomer reaches an asymptotic value for high VCG concentrations. Notably, this asymptotic value is independent of the oligomer length.
The fraction of oligomers that are in a monomer extension-competent state can be computed based on the equilibrium concentration of complexes facilitating the extension of an oligomer by a monomer,
Note that, in principle, the monomer can be added to the - or the -end of the oligomer, leading to two contributions in the numerator, which are identical due to symmetry, . Here, denotes the equilibrium concentration of all complexes allowing for the templated ligation of oligomers of length to oligomers of length if templated by oligomers of length . We assume that ternary complexes are the dominant type of complexes facilitating templated ligation, allowing us to express the equilibrium concentrations as
Therefore, the fraction of monomer-extendable oligomers reads
In the limit of high VCG concentration, almost all VCG oligomers form duplexes. Hence, the equilibrium concentration of free oligomers is well-approximated by
implying,
As almost all monomers are free in solution, their equilibrium concentration can be approximated by the total concentration of monomers, .
From this representation of , it is not clear yet why should be independent of . In order to derive an -independent expression for , we need to express the association constant of ternary complexes, , in terms of the duplex association constant, .
As illustrated in Appendix 4—figure 1, the effective association constant of the ternary complex may be obtained by multiplying the effective association constant of the duplex by the binding affinity of the monomer. We need to distinguish the contributions of duplexes based on the length of their hybridization region: If the two strands in the duplex form a hybridization region that is as long as the oligomers, there is free position for the monomer to hybridize to; such duplexes must be excluded from the computation of the association constant of ternary complexes. Moreover, it is necessary to take into account that monomers hybridizing to the end of a strand have different binding affinity than monomers hybridizing to the center of an oligomer (Appendix 4—figure 1A and Appendix 4—figure 1B-C). All in all, the effective association constant for ternary complexes reads
The notation implies that only complex configurations with a specific overlap length are included.
To proceed further, we express effective duplex association constant as a sum of effective duplex association constants with constrained overlap length ,
For each term contributing to the sum, the length of the hybridization region (overlap length) is known. Therefore, each contributing term can be evaluated by computing the energy associated with the known overlap length,
Therefore, the asymptotic ratio of monomer-extended oligomers equals
Addition of a single base pair will decrease the energy by the energy of a half nearest neighbor block, that is by . Thus, we assign , and find
The effective association constants of reactive ternary complexes can be computed based on the effective association constant of the duplexes.
(A) If the hybridization region of the two VCG oligomers is nucleotides long, the monomer hybridizes to the end (start) of the template strand. As the template has no dangling end, the energy contribution of the hybridizing monomer is . (B) and (C) for hybridization regions that are shorter than , but at least 1 nucleotide long, the energy contribution due to the added nucleotide is . In all cases, the factor 2 accounts for the two possible positions at which a monomer might be added.
Appendix 5
Analytical solution for equilibrium concentrations in multi-length VCG pools
We consider pools comprised of monomers, dimers, and VCG oligomers of multiple lengths. For simplicity, we restrict ourselves to concentration profiles with a uniform distribution of VCG oligomers, such that all VCG oligomers have the same concentration. Assuming that almost all oligomer mass is contained in single strands and duplexes, the mass conservation equations read
Note that there is a mass conservation equation for each oligomer length individually, that is we are dealing with a set of multiple coupled quadratic equations. We can make the system of equations dimensionless by introducing a dimensionless concentration,
Rescaling by is helpful, as this expression is the solution to the approximative mass conservation equation in the high concentration limit,
under the assumption that . The latter assumption is reasonable given the concentration profile is uniform. As a consequence of the rescaling, we expect to be of order 1 in the high concentration limit.
Writing the full mass conservation equation in terms of the dimensionless concentration yields,
We drop all ratios of total concentrations, as we assume that the total concentration is the same for each oligomer length of VCG oligomer (uniform concentration profile),
By introducing the dimensionless prefactors,
we can rewrite the mass conservation equation,
To solve for , we make the assumption that the dimensionless equilibrium concentration is close to 1 for any length . For this reason, we may assume that . This assumption is crucial, as it allows us to de-couple the quadratic equations. Therefore, the mass conservation reads,
We can use this representation to compute the equilibrium concentrations recursively: We start the recursion with the assumption , and compute with this assumption,
Next, we compute , and solve the mass conservation equation for the equilibrium concentration (first recursion step),
This scheme can be applied until the approximated values of match the true values (obtained via numerical root finding) sufficiently well,
Note that the equilibrium concentration obtained after the -th iteration step depends on the total concentration via the rescaling prefactor as well as via . It turns out that the approximation converges after about five iteration steps; the relative error between the true (numerically computed) equilibrium concentration and the approximation is already below 1% in the first iteration step.
Comparison between approximate (analytical) and exact (numerical) solution of the (de)hybridization equilibrium in multi-length VCG pools.
(A) The approximation converges after roughly five iteration steps. The relative error drops below 1% in the first iteration step already. (B) Equilibrium concentrations as a function of total VCG concentration. Continuous lines show the numerical solution, while the dotted and dashed lines depict the approximation obtained in the zeroth or first iteration step, respectively. The feedstock concentration is fixed, .
Appendix 6
Threshold concentration for productivity inversion in multi-length VCG pools with kinetically suppressed V+V ligations
In VCG pools that contain VCG oligomers of multiple lengths as well as activated monomers, we observe ‘productivity inversion’: Short oligomers are more likely than long oligomers to be part of a complex capable of extending the oligomer by a monomer. However, productivity inversion can be only observed for sufficiently high concentration of VCG oligomers. The threshold concentration that is necessary for oligomers of length to exceed the monomer-extension fraction of oligomers of length (assuming ) is set by the condition . To compute , we need to account for all possible complexes, in which an oligomer can be extended by a monomer,
As we are considering a uniform concentration profile, all VCG oligomers have the same total concentration, that is . Thus, we can express the condition for the threshold concentration as
We combine this condition with the analytical approximation of the equilibrium concentration up to the first iteration step derived in Appendix 5 (Equations 14 and 15). As the equilibrium concentrations depend on the total concentration of VCG oligomers , this yields a semi-analytical criterion for the threshold concentration , at which the fraction of monomer-extended -mers exceeds the fraction of monomer-extended -mers.
Appendix 7
Productivity inversion in the experimental study by Ding et al.
In their experimental study, Ding et al. focus on a genome of length . In the VCG pool encoding the genome, Ding et al. include dimers with 11 different sequences, trimers with 20 different sequences, and tetramers up to 12-mers with 24 different sequences each (Tab. S1 in Ding et al., 2023).
Every oligomer sequence is included with a concentration of (so-called 1x profile), adding up to a total concentration of . The feedstock for the replication is comprised of activated imidazolium-bridged homo-dinucleotides with a total concentration of . Over a timescale of around , these homo-dinucleotides react with each other to form 6 additional imidazolium-bridged hetero-dinucleotides as well as activated mono-nucleotides (Fig. S1 in Ding et al., 2023).
We construct a genome that mimics the properties of the genome used by Ding et al., but obeys our genome construction principles outlined in the Methods section. To this end, we consider a genome of length and a minimal unique subsequence length of . This implies that our VCG pool contains dimers with 16 different sequences and trimers up to 12-mers with 24 different sequences.
Our model does not include imidazolium-bridged dinucleotides explicitly. Instead, we assume that the concentration of activated mononucleotides in our model equals the total concentration of activated homo-dinucleotides () used in the experimental study. The total concentration of non-activated oligomers is treated as a free parameter.
We find that the system exhibits inversion of productivity: Over the entire studied concentration range, 10-mers are more likely to be extended by monomers than 12-mers (Appendix 7—figure 1). Moreover, 8-mers are more productive than 12-mers (provided the concentration of non-activated oligomers exceeds roughly ), and more productive than 10-mers (provided ). Thus, for the experimentally used concentration of non-activated oligomers (, see vertical dashed line in Appendix 7—figure 1), 8-mers are more productive than 10-mers, which are in turn more productive than 12-mers. Unlike in the experimental system, 6-mers are less likely than 8-mers and 10-mers to be extended by a monomer. We suppose that this difference can be attributed to the differences between the experimental setup and our theoretical model: (i) We choose a (slightly) different genome than Ding et al. (ii) We model imidazolium-bridged dinucleotides as mononucleotides, as imidazolium-bridged dinucleotides only incorporate one mononucleotide at a time, just like activated mononucleotides. However, dinucleotides bind more stably to an existing complex than mononucleotides, which will affect the fraction of monomer-extended oligomers predicted.
Replication performance of multi-length VCG pools, in which only the monomers are activated.
The pool includes activated monomers , as well as non-activated oligomers of length up to with variable total concentration . The system exhibits inversion of productivity: 10-mers are more likely to be in a monomer-extension competent state than 12-mers, 8-mers are more likely to be extended by monomers than 10-mers (for provided ). For the experimentally used concentration (vertical dashed line), 8-mers are more productive than 10-mers, and those are more productive than 12-mers. However, unlike in the experimental system, 6-mers are less productive than 8-mers and 10-mers.
Data availability
This manuscript presents a computational study, employing both stochastic simulations and a deterministic coarse-grained approximation (the adiabatic approach). Means and standard deviations from the stochastic simulations are shown in Figure 2 and listed in Appendix 2 - Table 1. Data from the adiabatic approach are shown in Figures 2-7 and can be reproduced using the code available at https://github.com/gerland-group/VirtualCircularGenome (copy archived at Burger, 2025). This repository also includes the code for sampling circular genomes via a Metropolis-Hastings-type algorithm.
References
-
Thermodynamics and kinetics of DNA and RNA dinucleotide hybridization to gaps and overhangsBiophysical Journal 122:3323–3339.https://doi.org/10.1016/j.bpj.2023.07.009
-
In-ice evolution of RNA polymerase ribozyme activityNature Chemistry 5:1011–1018.https://doi.org/10.1038/nchem.1781
-
Asphalt, water, and the prebiotic synthesis of ribose, ribonucleosides, and RNAAccounts of Chemical Research 45:2025–2034.https://doi.org/10.1021/ar200332w
-
SoftwareVirtualCircularGenome, version swh:1:rev:f3c7e7e7d8e0494de89be1316f880c159f155bddSoftware Heritage.
-
Experimental tests of the virtual circular genome model for nonenzymatic RNA replicationJournal of the American Chemical Society 145:7504–7515.https://doi.org/10.1021/jacs.3c00612
-
The RNA World: molecular cooperation at the origins of lifeNature Reviews. Genetics 16:7–17.https://doi.org/10.1038/nrg3841
-
Physical non-equilibria for prebiotic nucleic acid chemistryNature Reviews Physics 5:185–195.https://doi.org/10.1038/s42254-022-00550-3
-
The effect of leaving groups on binding and reactivity in enzyme-free copying of DNA and RNANucleic Acids Research 44:5504–5514.https://doi.org/10.1093/nar/gkw476
-
Synthesis of carbohydrates in mineral-guided prebiotic cyclesJournal of the American Chemical Society 133:9457–9468.https://doi.org/10.1021/ja201769f
-
The prebiotic evolutionary advantage of transferring genetic information from RNA to DNANucleic Acids Research 39:8135–8147.https://doi.org/10.1093/nar/gkr525
-
Cascade of reduced speed and accuracy after errors in enzyme-free copying of nucleic acid sequencesJournal of the American Chemical Society 135:354–366.https://doi.org/10.1021/ja3095558
-
Enzyme-free copying of 12 bases of RNA with dinucleotidesAngewandte Chemie 61:e202203067.https://doi.org/10.1002/anie.202203067
-
Freeze-thaw cycles as drivers of complex ribozyme assemblyNature Chemistry 7:502–508.https://doi.org/10.1038/nchem.2251
-
Mapping a systematic ribozyme fitness landscape reveals a frustrated evolutionary network for self-aminoacylating RNAJournal of the American Chemical Society 141:6213–6223.https://doi.org/10.1021/jacs.8b13298
-
Effect of stalling after mismatches on the error catastrophe in nonenzymatic nucleic acid replicationJournal of the American Chemical Society 132:5880–5885.https://doi.org/10.1021/ja100780p
-
The origins of the RNA worldCold Spring Harbor Perspectives in Biology 4:a003608.https://doi.org/10.1101/cshperspect.a003608
-
Self-assembly of informational polymers by templated ligationPhysical Review X 11:031055.https://doi.org/10.1103/PhysRevX.11.031055
-
The hammerhead ribozymeProgress in Molecular Biology and Translational Science 120:1–23.https://doi.org/10.1016/B978-0-12-381286-5.00001-9
-
Enzyme-free ligation of dimers and trimers to RNA primersNucleic Acids Research 47:3836–3845.https://doi.org/10.1093/nar/gkz160
-
An optimal degree of physical and chemical heterogeneity for the origin of life?Philosophical Transactions of the Royal Society B 366:2894–2901.https://doi.org/10.1098/rstb.2011.0140
-
Transient states during the annealing of mismatched and bulged oligonucleotidesNucleic Acids Research 52:2174–2187.https://doi.org/10.1093/nar/gkae091
-
RNA complexes with nicks and gaps: thermodynamic and kinetic effects of coaxial stacking and dangling endsJournal of the American Chemical Society 146:18083–18094.https://doi.org/10.1021/jacs.4c05115
-
Cyclophospholipids enable a protocellular life cycleACS Nano 17:23772–23783.https://doi.org/10.1021/acsnano.3c07706
-
A highly reactive imidazolium-bridged dinucleotide intermediate in nonenzymatic RNA primer extensionJournal of the American Chemical Society 138:11996–12002.https://doi.org/10.1021/jacs.6b07977
-
Prolinyl nucleotides drive enzyme-free genetic copying of RNAAngewandte Chemie 62:e202307591.https://doi.org/10.1002/anie.202307591
-
Kinetics of renaturation of DNAJournal of Molecular Biology 31:349–370.https://doi.org/10.1016/0022-2836(68)90414-2
Article and author information
Author details
Funding
Deutsche Forschungsgemeinschaft (CRC 392 Project-ID 521256690)
- Ulrich Gerland
Deutsche Forschungsgemeinschaft (ORIGINS EXC-2094-390783311)
- Ulrich Gerland
The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.
Acknowledgements
We thank Paul Higgs and members of the Gerland group for stimulating discussions. This work was supported by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) via the CRC/TRR 392 Molecular Evolution (Project-ID 521256690), and under Germany’s Excellence Strategy (EXC-2094-390783311, ORIGINS).
Version history
- Preprint posted:
- Sent for peer review:
- Reviewed Preprint version 1:
- Reviewed Preprint version 2:
- Version of Record published:
Cite all versions
You can cite all versions using the DOI https://doi.org/10.7554/eLife.104043. This DOI represents all versions, and will always resolve to the latest one.
Copyright
© 2025, Burger and Gerland
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.
Metrics
-
- 784
- views
-
- 24
- downloads
-
- 1
- citation
Views, downloads and citations are aggregated across all versions of this paper published by eLife.
Citations by DOI
-
- 1
- citation for Reviewed Preprint v2 https://doi.org/10.7554/eLife.104043.2