Introduction

In order to delineate possible pathways towards the emergence of life, it is necessary to understand how a chemical reaction network capable of storing and replicating genetic information might arise from prebiotic chemistry. RNA is commonly assumed to play a central role on this path, as it can store information in its sequence and catalyze its own replication [13]. While ribozymes capable of replicating strands of their own length have been demonstrated in the laboratory [4], it remains elusive how enzyme-free self-replication might have worked before the emergence of such complex ribozymes.

One possible mechanism is template-directed primer extension [511]: In this process, a primer hybridizes to a template and is extended by short oligonucleotides, thereby forming a (complementary) copy of the template strand. Considerable progress has been made in optimizing template-directed primer extension, but challenges remain: (i) The produced copies are likely to be incomplete. So far, at most 12 nt have been successfully added to an existing primer [10]. Moreover, as the pool of primer strands needs to emerge via random polymerization, the primer is likely to hybridize to the template at various positions, and not only to its 3’-end, leaving part of the 3’-end of the template uncopied [12]. (ii) Errors in enzyme-free copying are frequent due to the limited thermodynamic discrimination between correct Watson-Crick pairing and mismatches [1316] The error probability constrains the length of the genome that can be reliably replicated.

The issue of insufficient thermodynamic discrimination can, in principle, be mitigated by making use of kinetic stalling after the incorporation of a mismatch [15]. By introducing a competition between the reduced polymerization rate and a characteristic timescale of the non-equilibrium environment, it is possible to filter correct sequences from incorrect ones [16]. To address the challenge of incomplete copies, Zhou et al. propose an new scenario of replication, the so-called Virtual Circular Genome (VCG) [17]. In this scenario, genetic information is stored in a pool of oligomers that are shorter than the circular genome they collectively encode: Each oligomer bears a subsequence of the circular genome, such that the collection of all oligomers encodes the full circular genome virtually. Within the pool, each oligomer can act as template or primer [17]. The oligomers hybridize to each other and form complexes that allow for templated ligation of two oligomers, or for the extension of an oligomer by a monomer. Because the sequences of the ligated strands and the template are part of the genome sequence, most of the products should also retain the sequence of the genome. That way, long oligomers encoding the circular genome are produced at the expense of short oligomers [17]. The long strands, in turn, can assemble into catalytically active ribozymes. With a continuous influx of short oligomers, the VCG might allow for continuous replication of the virtually encoded circular genome. Importantly, replication in the VCG is expected to avoid the issue of incomplete copies. Since the genome is circular, it does not matter which part of the genome an oligomer encodes, as long as the sequence is compatible with the genome sequence. An additional feature of the VCG scenario is that replication should be achievable without the need of adding many nucleotides to a primer: Provided the concentration of oligomers decreases exponentially with their length, the concentration of each oligomer in the pool can be doubled by extending each oligomer only by a few nucleotides [17]. The extension of an oligomer by a few nucleotides in a VCG pool has already been demonstrated experimentally [18].

A recent computational study points out that the VCG scenario is prone to loss of genetic information via ”sequence scrambling” [19]: If the genome contains identical sequence motifs at multiple different loci, replication in the VCG will mix the sequences of these loci, thus destroying the initially defined genome. More generally, it is currently unclear, which conditions could prevent genome instability of VCGs, such that their genetic information is retained. Length distribution, sequence composition, oligonucleotide concentration and environmental conditions, such as temperature, all affect the stability of complexes and thus the replication dynamics of the VCG pool. Here, we characterize the replication fidelity and yield of VCG pools using a kinetic model, which explicitly incorporates association and dissociation of RNA strands as well as templated ligation. We study a broad spectrum of prebiotically plausible and experimentally accessible oligomer pools, from pools containing only monomers and long oligomers of a single length to pools including a range of long oligomers with uniform or exponential concentration profile. The length of the included oligomers as well as their concentration are free parameters of our model.

We find that, regardless of the details of the pool, there is a competition between three types of template-directed ligation reactions: (i) Ligations of two short oligomers, e.g., monomers, forming oligomers that are too short to specify a unique locus on the genome. (ii) Ligations of a short oligomer to a long oligomer, which form longer oligomers with sequences that are typically compatible to the genome. (iii) Ligations of two long oligomers, which often create oligomers that are incompatible with the genome sequence. Therefore, the fidelity is primarily determined by the competition between the error-free extension of a long oligomer and the erroneous ligation of two long oligomers. The likeli-hood of the latter reaction can be minimized by lowering the relative concentration of long oligomers, even though this increases the fraction of short oligomers that are used up for ligations of the first type. Consequently, the fidelity can be improved at the expense of reducing the yield. The efficiency, meaning the yield attainable at a fixed high fidelity, thus depends on the length distribution of the oligomers in the pool.

Alternatively, the issue of erroneous ligations is solved if only monomers are activated, such that each ligation incorporates only one monomer at a time, as in the experimental study by Ding et al. [18]. In this case, the VCG concentration can be chosen arbitrarily large without compromising fidelity. Interestingly, our model predicts an unexpected feature: In the limit of high VCG concentrations, the rate of primer extension is higher for shorter oligomers than for long oligomers, even though, intuitively, complexes containing longer oligomers are expected to be more stable and thus more productive. The same behavior was indeed observed experimentally [18]. We provide an explanation for this feature, and discuss its dependence on system parameters such as the length and the concentration of long oligomers in the pool.

Model

The VCG is a replication scenario where a circular genome is stored in a pool of oligomers. The oligomers are shorter than the genome that they encode. Each oligomer bears a subsequence of the circular genome, such that, collectively, the oligomers encode the full genome (Fig. 1A). To set up our model for the dynamics of VCG pools, we specify (i) our choice of circular genomes, (ii) how to map the genomes to pools of oligomers, and (iii) which chemical reactions govern the dynamics.

Model. A In the Virtual Circular Genome (VCG) model, a circular genome (depicted in green) as well as its sequence complement are encoded in a pool of oligomers (depicted in blue and orange). Collectively, the pool of oligomers encodes the whole sequence of the circular genome. Depending on their length, two types of oligomers can be distinguished: Feedstock oligomers (depicted in blue) are too short to specify a unique locus on the genome, while long VCG oligomers (depicted in orange) do. B The length-distribution of oligomers included in the VCG pool is assumed to be exponential. The concentration of feedstock oligomers and VCG oligomers as well as their respective length scales of exponential decay and can be varied independently. The set of included oligomer lengths can be restricted via and . C The hybridization energy of complexes is computed using a simplified nearest-neighbor model: Each full block comprised of two base pairs (depicted in pink) contributes γ, while dangling end blocks (depicted in blue) contribute γ/2. D Oligomers form complexes via hybridization reactions, or dehybridize from an existing complex. The ratio of hybridization and dehybridization rate is governed by the hybridization energy. If two oligomers are adjacent to each other in a complex, they can undergo templated ligation. E Based on the length of the reacting oligomers, we distinguish three types of templated ligation: Ligation of two feedstock oligomers (F+F), ligation of feedstock oligomer to VCG oligomer (F+V) and ligation of two VCG oligomers (V+V).

Circular Genomes

For a given genome length, LG, there are distinct circular genomes (the factor 1/2LG accounts for the freedom to select the starting position and to choose the Watson or the Crick strand as reference sequence). A key property of the genome is its minimal unique motif length, LS, defined as the shortest length such that all possible subsequences of length LLS appear at most once in the genome. In other words, all subsequences of length LLS specify a unique locus on the genome. In principle, each circular genome has another length scale, corresponding to the maximal motif length, up to which all possible sequence motifs are contained in the genome. Here, for convenience, we restrict ourselves to genomes in which this length-scale is LS − 1, such that all possible subsequences shorter than LS are contained in the sequence of the genome. Additionally, we choose only unbiased genomes in which all possible subsequences of length L < LS are contained at equal frequency.

Construction of VCG Pools

To specify a VCG pool encoding a genomic sequence (chosen as described above), one must select the subsequences to be included in the pool, as well as their concentrations. We consider unbiased pools, where the concentration of subsequences, c(L), depends only on their length, L, i.e., all subsequences of a given length are included at equal concentration. We refer to the length-dependent concentration profile as the length distribution of the pool. Depending on their length, oligomers fall into two categories (Fig. 1B): (i) short feedstock oligomers and (ii) long VCG oligomers. Feedstock oligomers are oligomers that are shorter than the unique subsequence length LS. Since their sequence appears multiple times on the genome, they do not encode a specific position on the genome. Thus, they serve as feedstock for the elongation of VCG oligomers rather than as information storage. Conversely, the VCG oligomers, which are longer than the unique subsequence length LS, have a unique locus on the circular genome. Collectively, the VCG oligomers enable the reconstruction of the full genome.

The full length distribution, c(L), can be decomposed into the contributions of feedstock oligomers and VCG oligomers,

We assume that both types of oligomers have an exponential length distribution. In our model, the concentration of VCG oligomers can be varied independently of the concentration of feedstock oligomers, and the length scales for the exponential decay ( vs.) may differ between VCG oligomers and feedstock oligomers. Additionally, we can restrict the set of oligomer lengths included in the pool by setting minimal and maximal lengths for feedstock oligomers and for VCG oligomers individually,

For any other oligomer length, the concentrations equal zero. This parametrization includes uniform length distributions as a special case (κF = 0 and κV = 0), and also allows for concentration profiles that are peaked. Peaked length distributions can emerge from the interplay of templated ligation, (de)hybridization and outflux in open systems [20]. We define the total concentration of feedstock oligomers, , as well as the total concentration of VCG oligomers, . Their ratio will turn out to be an important determinant of the VCG dynamics.

(De)hybridization Kinetics

Oligomers can hybridize to each other to form double-stranded complexes, or dehybridize from an existing complex. For simplicity, we do not include self-folding within a strand, which is a reasonable assumption for short oligomers. The stability of a complex is determined by its hybridization energy, with lower hybridization energy indicating greater stability. We use a simplified nearest-neighbor energy model to compute the hybridization energy [2022]: The total energy equals the sum of the energy contributions of all nearest-neighbor blocks in a given complex (Fig. 1C). The energy contribution associated with a block of two Watson-Crick base pairs (matches) is denoted γ < 0, and dangling end blocks involving one Watson-Crick pair and one free base contribute γ/2. Nearest-neighbor blocks involving mismatches increase the hybridization energy by γMM > 0 per block, thus reducing the stability of the complex. The rate constants of hybridization and dehybridization are related via

where c° = 1 M is the standard concentration, and ΔG is the free energy of hybridization. The association rate constant kon is proportional to the encounter rate constant, kenc = 1/(c°t0). The encounter timescale t0 serves as the elementary time unit of the kinetic model, with all reaction timescales measured relative to it.

Templated Ligation

Two oligomers A and B that are hybridized adjacently to each other on a third oligomer C can produce a new oligomer A-B via templated ligation (Fig. 1D). Depending on the length of A and B, we distinguish three types of ligation reactions (Fig. 1E): (i) F+F ligations, in which two feedstock oligomers ligate, (ii) F+V ligations, where a VCG oligomer is extended by a feedstock oligomer, and (iii) V+V ligations involving two VCG oligomers. The formation of a covalent bond via templated ligation is not spontaneous, but requires the presence of an activation reaction. Usually, these reactions add a leaving group to the 5’-end of the nucleotide which is cleaved during bond formation [10, 23]. We assume that the concentration of activating agent is sufficiently high for the activation to be far quicker than the formation of the covalent bond, such that activation and covalent bond formation can be treated as a single effective reaction. We consider pools in which all oligomers are activated, as well as pools in which only monomers are activated.

Observables

Essentially, templated ligation in the pool produces VCG oligomers at the expense of feed-stock oligomers. To characterize these dynamics, we distinguish between different contributions to the production flux. While the product of an F+V ligation is always a VCG oligomer, F+F ligations can produce feed-stock oligomers or VCG oligomers. In both cases, a produced VCG oligomer can be correct (compatible with the genome) or incorrect (incompatible). V+V ligations always produce VCG oligomers that can be correct or incorrect. We quantify these processes by measuring extension fluxes in units of nucleotides ligated to an existing strand (counting the length of the shorter ligated strand as the number of incorporated bases). In particular, we define the fidelity f as the extension flux resulting in correct VCG oligomers relative to the flux resulting in any VCG oligomer,

In addition, we introduce the yield y as the proportion of total extension flux that produces VCG oligomers,

Efficient replication of the VCG requires both, high fidelity and high yield. Hence, we introduce the efficiency of replication η as the product of fidelity and yield,

In order to discern the contributions of different types of templated ligations (F+F, F+V, V+V) to fidelity, yield, and efficiency, we define the ligation share s of a ligation type,

Results

Replication efficiency reaches a maximum at intermediate concentrations of VCG oligomers

To study the dynamics of VCG pools, we start by picking a exemplary genome of length LG = 16 nt,

  • 5’-AAAGAGGACACGGCAU-3’

  • 3’-UUUCUCCUGUGCCGUA-5’.

In this genome, the minimal unique subsequence length is LS = 3 nt. Consequently, monomers as well as dimers are part of the feedstock and VCG oligomers are at least 3 nt long. We first consider the simplest oligomer pool possible, which includes only monomers and VCG oligomers of a single length LV (Fig. 2A).

Replication performance of VCG pools containing VCG oligomers of a single length (single-length VCG pools). A The pool contains a fixed concentration of monomers, mM, as well as VCG oligomers of a single length, LV, at variable concentration (the VCG oligomers cover all possible subsequences of the genome and its complement at equal concentration). B The yield increases as a function of , because dimerizations become increasingly unlikely for high VCG concentrations. C The ligation share of different ligation types depends on the total VCG concentration: In the low concentration limit, dimerization (F+F) dominates; for intermediate concentrations, F+V ligations reach their maximum, while, for high concentrations, a substantial fraction of reactions are V+V ligations. The panel depicts the behavior for LV = 6 nt. D Replication efficiency is limited by the small yield for small . In the limit of high , replication efficiency decreases due to the growing number of error-prone V+V ligations. Maximal replication efficiency is reached at intermediate VCG concentration. E V+V ligations are prone to the formation of incorrect products due to the short overlap between educt strand and template. In general, the probability of correct product formation, pcorr, depends on the choice of circular genome and as well as its mapping to the VCG pool. The probabilities listed here refer to a VCG pool with LG = 16 nt and LS = 3 nt. F The optimal equilibrium concentration ratio of free VCG strands to free feedstock strands, which maximizes replication efficiency, decays as a function of length (continuous line). The analytical scaling law (Eq. (4), dashed line) captures this behavior. The window of close-to-optimal replication, within which efficiency deviates no more than 1% from its optimum (shaded areas), increases with LV, facilitating reliable replication without fine-tuning to match the optimal concentration ratio. G Maximal replication efficiency, which is attained at the optimal VCG concentration depicted in panel E, increases as a function of LV and approaches a plateau of 100%. For high efficiency, Eq. (5) provides a good approximation of the length-dependence of ηmax (dashed lines). The oligomer length at which replication efficiency equals 95% is determined using Eq. (5) (vertical dotted lines). H By construction, the unique subsequence length, LS, increases logarithmically as a function of the length of the genome, LG. The length of VCG oligomers, LV, at which the optimal replication efficiency reaches 95% (computed via Eq. 5) exhibits the same logarithmic dependence on LG.

These initial VCG pools are evolved in time using a stochastic simulation based on the Gillespie algorithm [2022]. Since the Gillespie algorithm operates on the level of counts of molecules instead of concentrations, we must assign a volume to each system. We choose volumes in the range 1 μm3 to 10 000 μm3 (Supplementary Material Section S-I). In addition to the volume, the reaction rate constants need to be chosen appropriately: (i) Experimentally determined association rate constants are typically around 106 M−1 s−1 [2426], which corresponds to an association timescale of t0 ≈ 1 μs. (ii) The timescale of dehybridization is computed via Eq. (3) using the energy contribution γ = −2.5 kBT for a matching nearest-neighbor block and γMM = 25.0 kBT in case of mismatches. The high energy penalty of nearest-neighbor blocks involving mismatches, γMM, is chosen to suppress the formation of mismatches, while the value of γ roughly matches the average energy of all matching nearest-neighbor blocks given by the Turner energy model of RNA hybridization [27]. (iii) For templated ligation, we select a reaction rate constant of . This choice of klig is consistent with ligation rates observed in enzyme-free template-directed primer extension experiments, which range from around 10−6 s−1 [9, 10] to roughly 10−4 s−1 [23], depending on the underlying activation chemistry.

Based on the ligation events observed in the time evolution, we compute the observables introduced above (Model section). Due to the small ligation rate constant, the time evolution can only be simulated for a few ligation time units. Consequently, ligation events are scarce, which leads to high variances in the computed observables. We mitigate this issue by calculating the observables based on the concentration of complexes that are in a productive configuration, even if they do not undergo templated ligation within the time window of the simulation (Supplementary Material Section S-I).

From the ligation events, we obtain the total production flux of VCG oligomers and feedstock oligomers. Our observable y (yield) quantifies the fraction of this total flux directed to VCG oligomers. Fig. 2B shows how the yield depends on the concentration of VCG oligomers, , at a fixed total monomer concentration, . Here, the data points (with error bars) represent the simulation results with the different colors corresponding to different choices of initial VCG oligomer length, LV. We observe that the yield increases monotonically with the concentration of VCG oligomers, and approaches 100% for high . The concentration at which the pool reaches a yield of 50% depends on the oligomer length LV: Shorter VCG oligomers require higher concentrations for high yield. The concentration dependence of the yield can be rationalized by the types of templated ligations that are involved. For low VCG concentration, most templated ligation reactions are dimerizations (1+1) with the VCG oligomers merely acting as templates (Fig. 2C). As dimers are shorter than the unique subsequence length LS, their formation does not contribute to the yield, which explains the low yield in the limit of small . Conversely, for high VCG concentration, most of the templated ligations are F+V or V+V ligations, which produce oligomers of length LLS, implying high yield.

However, Fig. 2C also shows that the relative contribution of V+V ligation increases with increasing , with a large fraction of them producing incorrect products. This reduction in fidelity, f, causes a trade-off between fidelity and yield, which is visible as a peak at intermediate VCG concentrations in the replication efficiency, η = f · y (Fig. 2D). Using hexamers as an example, Fig. 2E illustrates why V+V ligations are prone to forming incorrect oligomers: In order to ensure that two oligomers only ligate if their sequences are adjacent to each other in the true circular genome, both oligomers need to share an overlap of at least LS nucleotides with the template oligomer. Otherwise, the product oligomers might form sequences incompatible with the true genome. The probability of forming incorrect products is a consequence of the combinatorics of possible subsequences in the VCG. Specifically, for a genome with LG = 16 nt, there are 32 different hexamers. For example, if the left educt hexamer only has one nucleotide of overlap with the template, there are 1/4 · 32 = 8 possible educt oligomers. However, only one out of those 8 hexamers is the correct partner for the right educt hexamer, implying an error probability of 1 − 1/8 = 7/8 (first example in Fig. 2E).

Characterizing replication efficiency using simulation data is computationally expensive. Depending on the parameters, up to a hundred simulations, each taking a few days, are needed to obtain a single data point in Fig. 2B-D to sufficient accuracy. To enable rapid computation of the observables of interest over a wide range of parameter values, we introduce an additional approximate approach that relies on (i) the separation of time-scales between (de)hybridization and templated ligation, and (ii) on a coarse-grained representation of the oligomers in the VCG pool. The curves shown together with the simulation data in Fig. 2 are computed using this “adiabatic approach”. The approach is detailed in Supplementary Material Sections S-II and S-III, and only outlined briefly in the following: Due to the slow rate of ligation, the (de)hybridization equilibrium is typically reached long before the first ligation event happens. To determine the hybridization equilibrium, we assume oligomers of the same length to have identical concentration, regardless of their sequence. The concentration of complexes comprised of multiple oligomers is calculated using the law of mass action: It equals the product of the individual oligomer concentrations divided by the respective dissociation constant Kd, which in turn depends on the hybridization energy, ΔG. Depending on the length of the hybridization region, one or multiple hybridization partners may be possible based on the oligomer sequences present in the pool. This is accounted for by weighting the complex concentrations by combinatorial multiplicities. For each oligomer length, we impose a mass conservation constraint ensuring that the equilibrium concentration of free oligomers and the concentration of oligomers bound in complexes adds up to the total concentration in the pool. Solving the law of mass action together with the mass conservation equations numerically yields the equilibrium concentrations of free VCG and feedstock strands, and , for any given total concentration, and . Knowing the equilibrium concentration of free strands allows us to compute concentrations of productive complexes and thus observables characterizing templated ligation.

The results of the adiabatic approach agree well with the data obtained from the simulations (Fig. 2B-D). For instance, the adiabatic approach predicts the same non-monotoneous concentration dependence of replication efficiency as the simulations: VCG pools containing longer oligomers achieve higher maxima in replication efficiency at lower VCG concentration. While the available simulation data only allows for a qualitative description of this trend, the adiabatic approach enables a quantitative analysis. Fig. 2F (solid line) shows the equilibrium concentration ratio at which replication efficiency is maximal, as a function of the VCG oligomer length, LV. As expected from the qualitative trend, pools containing longer oligomers reach their maximum for lower concentration of VCG oligomers. The shaded area indicates the range of VCG concentrations within which the pool’s efficiency deviates by no more than one percent from its optimum. We observe that this range of close-to-optimal VCG concentrations increases with LV. Thus, pools containing longer oligomers require less fine-tuning of the VCG concentration for replication with high efficiency.

In addition to the numerical results, we utilize the adiabatic approach to study the optimal equilibrium VCG concentration analytically (Supplementary Material Section S-V). We find that, for any choice of LV, replication efficiency reaches its maximum when the fractions of dimerization (1+1) reactions and erroneous V+V ligations are equal (Fig. 2C for LV = 6 nt). This criterion can be used to derive a scaling law for the optimal equilibrium concentration ratio as a function of the oligomer length LV,

which is shown as a dashed line in Fig. 2F (the length-scale ΛF+F is defined in Supplementary Material Section S-V). The optimal equilibrium concentration ratio decreases exponentially with LV, while the hybridization energy γ/2 sets the inverse length-scale of the exponential decay. Analytical estimate and numerical solution agree well, as long as the hybridization is weak and oligomers are sufficiently short. For strong binding and long oligomers, complexes involving more than three strands play a non-negligible role, but such complexes are neglected in the analytical approximation.

Fig. 2G shows how the maximal replication efficiency depends on the VCG oligomer length. Consistent with the qualitative trend observed in Fig. 2D, longer oligomers enable higher maxima in replication efficiency. Regardless of the choice of γ, replication efficiency reaches 100% if LV is sufficiently high. Using Eq. (4), we find the following approximation for the maximal replication efficiency attainable at a given oligomer length (see dashed lines in Fig. 2G),

where η° is a genome-dependent constant (Supplementary Material Section S-VI). This equation provides guidance for the construction of VCG pools with high replication efficiency: Given a target efficiency, the necessary oligomer length and hybridization energy, i.e., temperature, can be calculated. In Fig. 2G, we determine the oligomer length necessary to achieve ηmax = 95% for varying hybridization energies γ. At higher temperature (weaker binding), VCG pools require longer oligomers for replication with high efficiency.

Eq. (5) is not restricted to the example genome of length LG = 16 nt introduced previously, but may be applied to genomes of arbitrary length (provided the genomes follow the construction principles outlined in the method section). In such genomes, the minimal unique subsequence length scales logarithmically with the length of the genome, LS ∼ ln(LG). Based on LG and LS, we can compute the genome-dependent quantities ΛF+F and η° via the combinatorial multiplicities of productive complexes, which allow us to compute the characteristic VCG oligomer length at which the system reaches ηmax = 95%. Just like LS, this characteristic VCG oligomer length also increases logarithmically with the genome length (Fig. 2H), allowing VCG pools with moderate oligomer sizes to encode long circular genomes.

Pools containing VCG oligomers of different lengths are dominated by the long oligomers

In the previous section, we characterized the behavior of pools containing VCG oligomers of a single length. We observed that V+V ligations are error-prone due to insufficient overlap between the educt strands and the template, whereas F+V ligations extend the VCG oligomers with high efficiency. The F+V ligations will gradually broaden the length distribution of the VCG pool, raising the question of how this broadening affects the replication behavior. In principle, introducing multiple oligomer lengths into the VCG pool might even improve the fidelity of V+V ligations, since a long VCG oligomer could serve as a template for the correct ligation of two shorter VCG oligomers.

To analyze this question quantitatively, we first consider the simple case of a VCG pool that only contains monomers, tetramers, and octamers. The concentration of monomers is set to c(1) = 0.1 mM, while the concentrations of the VCG oligomers are varied independently (Fig. 3A). Replication efficiency reaches its maximum at c(8) ≈ 0.1 μM and very low tetramer concentration, c(4) ≈ 7.4 pM, effectively resembling a single-length VCG pool containing only octamers. As shown in Fig. 3B, the maximal efficiency is surrounded by a plateau of close-to-optimal efficiency. The octamer concentration can be varied by more than one order of magnitude without significant change in efficiency. Similarly, adding tetramers does not affect efficiency as long as the tetramer concentration does not exceed the octamer concentration. Fig. 3C illustrates that the plateau of close-to-optimal efficiency coincides with the concentration regime where the ligation of a monomer to an octamer with another octamer acting as template, , is the dominant ligation reaction. For high tetramer concentration and intermediate octamer concentration, templated ligation of tetramers on octamer templates, , surpasses the contribution of ligations (green shaded area in Fig. 3C). The reactions give rise to a ridge of increased efficiency, which represents a local maximum, but is small compared to the plateau of close-to-optimal efficiency (Fig. 3B). Even though reactions of the type produce correct products in most cases, they compete with error-prone ligations like or , which reduce fidelity. Increasing the binding affinity, i.e., lowering γ, enhances the contribution of ligations at the expense of ligations, resulting in an increased local maximum in replication efficiency (Supplementary Material Fig. S9). However, very strong hybridization energy, γ = −5.0 kBT, is necessary for ligations to give rise to similar efficiency as ligations. Thus, in this example, high efficiency replication facilitated by the ligation of short VCG oligomers on long VCG oligomers is in theory possible, but requires unrealistically strong binding affinity. Moreover, this mechanism is only effective if educts and template differ significantly with respect to their length. In case short and long oligomers are similar in size, templated ligation of two short oligomers tends to be error-prone, irrespective of the choice of binding affinity (Supplementary Material Fig. S10 and S11). In that case, F+V ligations remain the most reliable replication mechanism.

Replication performance of VCG pools containing VCG oligomers of multiple lengths (multi-length VCG pools). A The pool contains a fixed concentration of monomers, , as well as tetramers and octamers at variable concentration. The hybridization energy per nearest-neighbor block is γ = 2.5 kBT. B Replication efficiency reaches its maximum for c(8) ≈ 0.1 μM and significantly lower tetramer concentration, c(4) ≈ 7.4 pM. Efficiency remains close-to-maximal on a plateau around the maximum spanning almost two orders of magnitude in tetramer and octamer concentration. In addition, efficiency exhibits a ridge of increased efficiency for high tetramer concentration and intermediate octamer concentration. C The concentrations of complexes that facilitate templated ligation are grouped by the length of the template and the educts, . We distinguish complexes producing correct (labeled ”c”) and false products (labeled ”f”). For each relevant type of complex, we highlight the region of concentration where it contributes most significantly, i.e., at least 20% of the total ligation flux. The plateau of high efficiency is dominated by the ligation of monomers to octamers, whereas the ridge of increased efficiency is due to the correct ligation of two tetramers templated by an octamer.

We observe similar behavior in VCG pools containing a range of oligomer lengths from to (Fig. 4A), again using γ = −2.5 kBT (typical RNA hybridization energy). We maximize replication efficiency with respect to the concentration of VCG oligomers under the assumption that all VCG oligomers of any length have the same concentration (uniform profile). Efficiency again reaches its maximum when F+V ligations of long VCG oligomers are the predominant type of templated ligation (see Fig. 4D, which shows the relative contribution of different ligation types in VCG pools with fixed nt and varying nt). Adding short VCG oligomers to the pool leaves the relative contributions of the dominant ligation types almost unchanged and increases the fraction of unproductive F+F ligations only marginally. Replication efficiency is governed by the behavior of the long oligomers included in the pool. Decreasing for fixed reduces the achievable maximal replication efficiency only slightly. Conversely, lowering the maximal oligomer length, , while keeping fixed causes the efficiency to decrease significantly (Fig. 4B). This is plausible because, for any , replication efficiency is bounded by the replication efficiency of the longest oligomer in the pool (blue curve in Fig. 4B, which corresponds to the orange curve in Fig. 2G). In addition, for small , short and long VCG oligomers are increasingly similar in length, which gives rise to a broad spectrum of erroneous V+V ligations that compete with the error-free F+V ligations (Fig. 4E).

Replication performance of multi-length VCG pools. A The pool contains a fixed concentration of monomers, , as well as long oligomers in the range at variable concentration . The length dependence of the concentration profile is assumed to be uniform (for panels B, D, and E) or exponential (for panel C); its steepness is set by the parameter κV. B If the length distribution is uniform, reducing decreases the maximal efficiency, whereas increasing increases it. Pools containing a range of oligomer lengths are always outperformed by single-length VCGs (blue curve). C Assuming an exponential length distribution of VCG oligomers allows to tune from a poorly-performing regime (dominated by oligomers of length ) to a well-performing regime (dominated by oligomers of length ). In the limit κV → ∞, ηmax approaches the replication efficiency of single-length pools containing only oligomers of length (dashed lines). D Especially for high , replication is dominated by primer extension of the long oligomers in the VCG, here . In this limit, addition of shorter oligomers leaves the dominant F+V ligations almost unchanged. E Reducing for fixed nt gives rise to increased significance of erroneous ligation reactions.

In a realistic prebiotic setup, the concentration profile of the VCG pool would not be uniform. Depending on the mechanism that produces the pool and its coupling to the non-equilibrium environment, it might have a concentration profile that decreases or increases exponentially with length. We use the parameter κV to control this exponential length dependence (Fig. 4A): For negative κV, the concentration increases as a function of length, while exponentially decaying length distributions have positive κV. We find that replication efficiency is high if the concentration of long VCG oligomers exceeds or at least matches the concentration of short VCG oligomers (see κV ≤ 0 in Fig. 4C). In that case, replication efficiency is dominated by the long oligomers in the pool, since these form the most stable complexes. As the concentration of long oligomers is decreased further (κV > 0), the higher stability of complexes formed by longer oligomers is eventually insufficient to compensate for the reduced concentration of long oligomers. Replication efficiency is then governed by short VCG oligomers, which exhibit lower replication efficiency. In the limit κV → ∞, replication efficiency approaches the replication efficiency of a single-length VCG pool containing only oligomers of length (Fig. 4C and Fig. 2G).

Adding dinucleotides to the feedstock decreases the efficiency of replication

So far, we focused on ensembles that contain solely monomers as feedstock. However, examining the influence of dimers on replication in VCG pools is of interest, since dinucleotides have proven to be interesting candidates for enzyme-free RNA copying [911, 23]. For this reason, we study oligomer pools like those illustrated in Fig. 5A: The ensemble contains monomers, dimers and VCG oligomers of a single length, LV. As our default scenario, the dimer concentration is set to 10% of the monomer concentration, corresponding to κF = 2.3, but this ratio can be modified by changing κF.

Replication performance of single-length VCG pools containing monomers and dimers as feedstock. A The pool contains a fixed total concentration of feedstock oligomers, , partitioned into monomers and dimers, as well as VCG oligomers of a single length, LV. The proportion of monomers and dimers can be adjusted via κF, and the concentration of the VCG oligomers is a free parameter,. B Replication efficiency exhibits a maximum at intermediate VCG concentration in systems with (dashed blue curve) and without dimers (solid blue curve). The presence of dimers reduces replication efficiency significantly, as they enhance the ligation share of incorrect F+V ligations (dashed green curve). The panel depicts the behavior for LV = 7 nt and κF = 2.3. C Optimal replication efficiency increases as a function of oligomer length, LV, and asymptotically approaches a plateau (dashed lines, Eq. (6). The value of this plateau, , is determined by the competition between correct and false 2+V reactions, both of which grow exponentially with LV. Thus, depends on the relative concentration of the dimers in the pool: the more dimers are included, the lower is . D Erroneous 1+V ligations are possible if the educt oligomer has a short overlap region with the template. The hybridization energy for such configurations is small, and independent of the length of the VCG oligomers (left). 2+V ligations may produce incorrect products via the same mechanism (middle). In addition, erroneous 2+V ligations can be caused by complexes in which two VCG oligomers hybridize perfectly to each other, but the dimer has a dangling end. The stability of these complexes increases exponentially with oligomer length (right).

Fig. 5B compares the replication efficiency of a pool with and without dimers . In both cases, pools that are rich in VCG oligomers exhibit low efficiency of replication. As erroneous V+V ligations are the dominant type of reaction in this limit, the same efficiency is obtained regardless of the presence of dimers. In contrast, when the pool is rich in feedstock, pools with and without dimers behave differently: If only monomers are included, the efficiency approaches zero, as dimerizing monomers do not contribute to the yield, and thus not to the efficiency. However, the presence of dimers enables the ligation of monomers and dimers to form trimers and tetramers, which lead to a non-zero yield. Given the low VCG concentration, dimers are more abundant than VCG oligomers and thus more likely to act as templates. Therefore, educt oligomers can only hybridize to the template with a single nucleotide-long hybridization region. Consequently, many products are incompatible with the circular genome, causing low efficiency.

Replication efficiency reaches its maximum at intermediate VCG concentrations, where replication is dominated by F+V ligations. Notably, the maximal attainable efficiency is significantly lower for ensembles with dimers than without, as dimers increase the number and the stability of complex configurations that can form incorrect products (Fig. 5C). For , ligation products are only incorrect if the overlap between the VCG educt oligomer and template is shorter than the unique subsequence length, LS. For , however, dangling end dimers can cause incorrect products even in case of long overlap of educt oligomer and template (see right column in Fig. 5C). The stability of the latter complexes depends on the length of the VCG oligomers, LV, whereas the stability of complexes facilitating incorrect monomer addition is independent of oligomer length (Fig. 5C).

In the presence of dimers, the length-dependent stability of complexes facilitating correct and incorrect F+V ligations causes a competition, which sets an upper bound on the efficiency of replication (Supplementary Material Section S-VIII),

Here, we introduced effective association constants 𝒦a, which depend differently on the VCG oligomer length, LV. While the effective association constant of complexes enabling incorrect 1+V ligations is length-independent, the effective association constant for incorrect 2+V ligations, , scales exponentially with LV,

The effective association constants for correct 1+V and 2+V ligations also scale exponentially with the oligomer length (Supplementary Material Fig. S12).

In systems without dimers, i.e., κF → ∞, ηmax approaches 100%, which is consistent with the behavior observed in the previous sections. Conversely, in systems containing dimers, the maximal efficiency remains at a value below 100%, which depends on the concentration of dimers in the feedstock. Fig. 5C shows the dependence of maximal replication efficiency on the length of VCG oligomers in pools containing monomers, dimers and VCG oligomers. As LV increases, ηmax converges towards the upper bound defined in Eq. (6) (see dashed line in Fig. 5C).

Monomers preferentially ligate to short VCG oligomers in VCG-rich pools

Efficiency of replication is limited by the same mechanism regardless of the specifics of the underlying VCG pool: Oligomers that have an insufficiently long region of hybridization (shorter than LS) to their template are likely to produce an incorrect product upon ligation. In the previous sections, we minimized the contribution of such ligations by fine-tuning the concentration of VCG oligomers. Alternatively, incorrect templated ligations could be suppressed if only monomers are activated, while longer oligomers remain inactive. This scenario has already been investigated experimentally [18]. In a prebiotic setting, this scenario would emerge, for instance, if activated monomers are produced in the environment by an in situ activation chemistry, and subsequently transported into the compartment containing the VCG, while no activation takes place within the compartment [28, 29].

To explore the replication efficiency of this monomer-only activation scenario, we first consider a pool consisting only of activated monomers as well as non-activated dimers and VCG oligomers of a single length (Fig. 6A). We vary the concentration of VCG oligomers, but keep the feedstock concentrations constant. For small VCG concentrations, we observe low efficiencies (Fig. 6B), as the ligation of two monomers, or one monomer and one dimer, are most likely. Conversely, high facilitates the formation of complexes in which VCG oligomers are extended by monomers, which implies high efficiency (Fig. 6B). Note that, unlike in systems where all oligomers are activated, replication efficiency does not decrease for high VCG concentration, as erroneous V+V ligations are impossible. Instead, perfect replication efficiency (100%) is reached for sufficiently high LV.

Replication performance of single-length VCG pools in which only the monomers are activated. A The pool contains activated monomers alongside non-activated dimers and VCG oligomers of a single length. The concentrations of monomers and dimers are fixed, c(1) = 0.091 mM and c(2) = 9.1 μM, adding up to a total feedstock concentration of , while the concentration of VCG oligomers, , can be varied. B Unlike in pools in which all oligomers are activated, replication efficiency does not decrease at high VCG concentration if only monomers are activated, as erroneous V+V ligations are impossible. Instead, replication efficiency approaches an asymptotic value of 1. C The fraction of oligomers that are in a monomer extension-competent state depends on the total concentration of VCG oligomers. At low VCG concentration, most oligomers are single-stranded, and extension of oligomers by monomers is scarce. At high VCG concentration, r1+V approaches the asymptotic value (Eq. (7), grey dashed line). In this limit, almost all oligomers form duplexes, which facilitate monomer addition upon hybridization of a monomer. Thus, the asymptotic fraction of oligomers that gets extended by monomers is not determined by the oligomer length, but by the binding affinity of monomers to existing duplexes. Conversely, the threshold concentration at which depends on oligomer length (colored dashed lines): Longer oligomers reach higher r1+V at lower VCG concentration.

While replication efficiency characterizes the relative amount of nucleotides used for the correct elongation of VCG oligomers, we can also analyze which fraction of VCG oligomers is in a monomer extension-competent state,

Here, denotes the equilibrium concentration of all complexes enabling the addition of a monomer to any VCG oligomer. We find that r1+V depends on the VCG concentration qualitatively in the same way as the efficiency: r1+V is small for small VCG concentration, butapproaches a value asymptotically for high VCGconcentration (Fig. 6C). In this limit, almost all VCG oligomers are part of a duplex. Monomers can bind to these duplexes to form complexes allowing for 1+V ligations. Since almost all VCG oligomers are already part of a duplex, increasing further does not increase the fraction of VCG oligomers that can be extended by monomers. Instead, the asymptotic value is determined by the concentration of monomers and their binding affinity Kd(1) to an existing duplex (Supplementary Material Section S-IX),

In contrast to the more complex case considered further below, here, the asymptotic value does not depend on the length of the VCG oligomers. However, the threshold VCG concentration at which scales exponentially with LV (Supplementary Material Section S-IX),

This scaling implies that longer oligomers require exponentially lower VCG concentration to achieve a given ratio r1+V (Fig. 6C), as their greater length allows them to form more stable complexes. This observation implies that pools with longer oligomers will always be more productive than pools with shorter oligomers (at equal VCG concentration).

The behavior becomes more complex in pools containing VCG oligomers of multiple lengths, due to the competitive binding within such heterogeneous pools. To illustrate this, we examine an ensemble containing VCG oligomers ranging from to along with the same feedstock as previously (activated monomers and non-activated dimers, see Fig. 7A). For simplicity, we assume that the length-distribution of VCG oligomers is uniform. We study the fraction of oligomers in a monomer extension-competent state as a function of oligomer length,

where denotes the equilibrium concentration of all complexes that allow monomer-extension of VCG oligomers of length L. At low VCG concentration, longer oligomers are more likely to be extended by monomers than shorter ones (Fig. 7B). This behavior is intuitive, as longer oligomers tend to form more stable complexes, which lead to higher productivity. Surprisingly, increasing the VCG concentration reverses the length-dependence of the productivity, such that short oligomers are more likely to be extended by monomers than long ones (note the three crossings of the curves in Fig. 7B). For example, 8-mers are more likely to undergo primer-extension than 9-mers once the VCG concentration exceeds correct value ≈ 0.5 μM. We derived a semi-analytical expression for the threshold VCG concentrations at which oligomers of two different lengths have equal productivity (see dashed lines in Fig. 7B, Supplementary Material Sections S-X and S-XI).

Replication performance of multi-length VCG pools in which only the monomers are activated. A The pool contains activated monomers as well as non-activated dimers and VCG oligomers. The concentrations of monomers and dimers are fixed, c(1) = 0.091 mM and c(2) = 9.1 μM, adding up to a total feedstock concentration of , while the total concentration of VCG oligomers, , can be varied. All VCG oligomers are assumed to have the same concentration. B At low VCG concentration, the fraction of oligomers that are in a monomer extension-competent state is higher for long oligomers than for short oligomers, while, at high VCG concentration, monomers preferentially ligate to short oligomers (“inversion of productivity”). The threshold concentrations at which a short oligomer starts to outperform a longer oligomer depend on the lengths of the compared oligomers (dashed lines). C The mechanism underlying inversion of productivity can be understood based on the pair-wise competition of different VCG oligomers, e.g., 8-mers vs. 9-mers. Over the entire range of VCG concentrations, complexes with 8-mer templates have a lower relative equilibrium concentration than complexes with 9-mer templates (bottom two curves vs. top two curves). However, as the concentration of VCG oligomers is increased, the extension fraction of 8-mers that are extended by monomers using a 9-mer as a template exceeds the fraction of extended 9-mers. D The equilibrium concentration of free oligomer decreases with increasing . Long oligomers have a lower equilibrium concentration ratio of free oligomers, as they can form more stable complexes with longer hybridization sites. E Complexes in which 8-mers serve as template are less stable than complexes with 9-mer templates, explaining the higher relative equilibrium concentration of the latter complex type (see panel C). Complexes with 9-mer template have similar stability regardless of the length of the educt oligomer, i.e., and are similarly stable. This similar stability together with the higher concentration of free 8-mers compared to 9-mers (see panel D) is the reason why the fraction of monomer-extended 8-mers exceeds the one of 9-mers (see panel C).

To understand the mechanism underlying the inversion of productivity, we analyze how different complex types contribute to r1+V(L). We introduce

which denotes the fraction of oligomers of length LE that are in a monomer extension-competent complex configuration that uses an template. The term oligomer of length LT as includes the sum over all possible configurations of complexes with the given lengths. Focusing on 8-mers and 9-mers as an example, we observe that complexes utilizing the 9-mer as template cause the inversion of productivity (top two curves in Fig. 7C): As the VCG concentration increases, the fraction of monomer-extendable 8-mers eventually surpasses the fraction of monomer-extendable 9-mers, i.e., (Fig. 7C). Two factors give rise to this feature: (i) Triplexes of type and have similar stability (Fig. 7E), and (ii) the equilibrium concentration of free 8-mers is higher than that of 9-mers (Fig. 7D). As a result, 8-mers are more likely than 9-mers to hybridize to an existing duplex , and, given the stability of the complex , 8-mers remain bound almost as stably as 9-mers. In summary, short oligomers sequester long oligomers as templates to enhance their monomer-extension rate, while long oligomers cannot make use of short oligomers as templates due to the relative instability of the corresponding complexes.

It is noteworthy that Ding et al. have already observed the inversion of productivity experimentally [18]. In their study, they included activated monomers, activated imidazolium-bridged dinucleotides and oligomers up to a length of 12 nt, and observed that the primer extension rate for short primers is higher than the extension rate of long primers. Even though our model differs from their setup in some aspects (e.g., different circular genome, no bridged dinucleotides) evaluating our model using parameters similar to those of the experimentally studied system predicts inversion of productivity that qualitatively agrees with the experimental findings (Supplementary Material Section S-XII). We therefore assume that the mechanism underlying inversion of productivity described here also applies to the experimental observations.

Discussion and conclusion

While significant progress has been made in understanding the prebiotic formation of (ribo)nucleotides [3033] and characterizing ribozymes that might play a role in an RNA world [3437], a convincing scenario bridging the gap between prebiotic chemistry and ribozyme-catalyzed replication is still missing. Here, we studied a scenario proposed by Zhou et al. [17] (the ‘Virtual Circular Genome’, VCG) using theoretical and computational approaches. We analyzed the process whereby template-directed ligation replicates the genomic information that is collectively stored in the VCG oligomers. Our analysis revealed a trade-off between the fidelity and the yield of this process: At low concentration of VCG oligomers, most of the ligations produce oligomers that are too short to specify a unique locus on the genome, resulting in a low yield of replication (Fig. 2B-C). At high VCG concentration, erroneous templated ligations limit the fidelity of replication (Fig. 2C-D). We considered two solutions to this trade-off: (i) a VCG pool composition that optimizes its replication behavior within the bounds of the fidelity-yield trade-off, and (ii) breaking the fidelity-yield trade-off by assuming that only monomers are activated.

The first solution maximizes the yield of replication for fixed fidelity. In pools containing only monomers and VCG oligomers of a single length, replication efficiency can be maximized by increasing the length of VCG oligomers and decreasing their concentration (Fig. 2F-G). This reduces the likelihood of error-prone templated ligation of long oligomers. When the pool contains VCG oligomers of multiple lengths, replication efficiency is typically governed by the longest oligomer in the pool (Fig. 4D). Including dimers as feedstock for the replication increases the error fraction (Fig. 5B-C), as dimers that bind to a template with a dangling end are prone to form an incorrect product (Fig. 5D).

The second solution eliminates the error-prone templated ligation of two VCG oligomers by assuming that only monomers are activated. This enables both fidelity and yield to remain high even at high VCG concentrations (Fig. 6B), effectively breaking the fidelity-yield trade-off. Unsurprisingly, longer VCG oligomers are then more likely to be extended than shorter oligomers at equal concentration, (Fig. 6C). However, this is only true for pools with VCG oligomers of a single length — once multiple VCG oligomer lengths compete with each other, shorter oligomers can be more productive than longer ones (Fig. 7B). This feature, which has also been observed experimentally [18], is caused by an asymmetry in the productive interaction between short and long oligomers (Fig. 7C): While short oligomers can sequester longer oligomers as templates for their extension by a monomer, short oligomers are unlikely to serve as templates for longer oligomers (Fig. 7D-E).

If the VCG scenario is to close the gap between prebiotic chemistry and ribozyme-catalyzed replication, VCG pools need to be capable of replicating (parts of) ribozymes that play a role in the emergence of life. While there are cases of small ribozymes [36] or ribozymes with small active sites, such as the Hammerhead ribozyme [38], ribozymes obtained experimentally via in vitro evolution are typically more than a hundred nucleotides long [4, 3941]. In principle, long genomes (e.g., 1000 nt long) could be replicated in VCG pools containing relatively short oligomers (e.g., 10 nt long) with high efficiency (Fig. 2H). However, this requires a very special sequence composition of the genome. In our analysis, we assumed a clear separation between feedstock and VCG oligomers, defined by a single length scale LS, where all possible sequences of length L < LS are part of the feedstock and occur in the genome sequence, while any sequence of length LLS specifies a unique locus on the virtual genome. Such a sequence composition is particularly beneficial for the virtual circular genome scenario. However, the sequence composition of a generic genome sequence is characterized by two distinct length scales LS1 and LS2: (i) The former is defined as the largest length LS1 such that all possible subsequences of length L < LS1 are present in the genome, while (ii) LS2 is the smallest length such that all subsequences of the genome with length LLS2 are unique, i.e., they appear only once in the genome. These two length scales coincide for the specific sequence composition assumed by our analysis, but LS2 could be significantly longer than LS1 in prebiotically interesting genomes containing ribozyme sequences. Note that LS1 will generally scale logarithmically with the genome length, LS1 ∼ ln LG, while LS2 is constrained by the functional requirements of the biomolecules encoded by the genome.

For efficient replication of VCG pools, the second length scale, LS2, of the genome composition will determine the required length LV of the VCG oligomers. Our analysis showed that replication is efficient, if the VCG oligomers are significantly (more than two-fold) longer than LS, which is equal to LS2 for our idealized genome composition. We expect the interplay of the competing template-directed ligation processes in VCG pools in our idealized scenario to be transferable to the general case, such that genomes with arbitrary LS2 can also be efficiently replicated, if the oligomers in the pool are more than two-fold longer than LS2. These sufficiently long oligomers enable the formation of productive complexes, in which the hybridization regions are typically longer than LS2, leading to mostly correct products. On the contrary, if the characteristic length of oligomers in the VCG pool is too short, sequence scrambling leads to a loss of the information of the VCG [19].

Based on this reasoning, it is interesting to note that the virtual circular genome scenario creates a selection pressure for prebiotic genomes to reduce their LS2: A VCG pool encoding a genome containing multiple copies of certain long sequence motifs would replicate less efficiently than a pool for which a shorter motif length already specifies a unique locus on the genome. This comparison assumes all other properties of the two systems being the same, in particular the length distribution of the oligomers in the two pools. As our analysis shows, this length distribution is the key factor determining the replication efficiency for a given genome sequence. The length distribution, in turn, is determined by an interplay between the chemical kinetics of the system and the molecular transport controlled by its physical environment: For example, templated ligation in an open system with continuous in- and outflux of oligomers can lead to a non-monotoneous length distribution with a concentration peak at a characteristic oligomer length, Lc, which is set by the competition between dehybridization and outflux [20]. Via this emergent length-scale, the environment exerts a selection pressure on replication in the VCG scenario: While a VCG pool replicates its genome with high efficiency if the environment preferentially forms long oligomers, Lc > LS2, replication in environments with small characteristic oligomer length is expected to eliminate repeating sequence motifs that are longer than Lc. Mutational errors could replace long repeated sequence motifs by alternative sequences with shorter unique motifs, which may be functionally equivalent.

Given the broad range of prebiotically plausible non-equilibrium environments [42], it is reasonable to expect that some environments provide the required conditions for efficient replication. The constraints formulated in this work can help to guide the search for self-replicating VCG pools, in the vast space of possible concentration profiles and non-equilibrium environments.

Acknowledgements

We thank T. Göppel, M. Battipede, J. C. Espinoza Campos, J. Harth-Kitzerow for stimulating discussions. This work was supported by the Deutsche Forschungs-gemeinschaft (DFG, German Research Foundation) via the CRC/TRR 392 Molecular Evolution (Project-ID 521256690), and under Germany’s Excellence Strategy (EXC-2094-390783311, ORIGINS).

Supplementary Material

S-I. Computing replication observables based on the kinetic simulation

We simulate the dynamics of VCG pools using a kinetic simulation that is based on the Gillespie algorithm. In the simulation, oligomers can hybridize to each other to form complexes, or dehybridize from an existing complex. Moreover, two oligomers can undergo templated ligation if they are hybridized adjacent to each other on a third oligomer. At each time t, the state of the system is determined by a list of all single-stranded oligomers and complexes as well as their respective copy number. We refer to the state of the system at the time t as the ensemble of compounds ℰt. Given the copy numbers, the rates ri of all possible chemical reactions i ∈ ℐ can be computed. To evolve the system in time, we need to perform two steps: (i) We sample the waiting time until the next reaction, τ, from an exponential distribution with mean , and update the simulation time, t → t + τ. (ii) We pick which reaction to perform by sampling from a categorical distribution. Here, the probability to pick reaction i equals .The copy numbers are updated according to the sampled reaction, yielding ℰt+τ. Steps (i) and (ii) are repeated until the simulation time t reaches the desired final time, tfinal. A more detailed explanation of the kinetic simulation is presented in [20, 21].

Our goal is to compute observables characterizing replication in the VCG scenario based on the full kinetic simulation. For clarity, we focus on one particular observable (yield) for the derivation. The results for other observables are stated directly, as their derivations follow analogously. Recall the definition of the yield introduced in the main text,

As we are interested in the initial replication performance of the VCG, we compute the yield based on the ligation events that take place until the characteristic timescale of ligations t0. We would like to compute the yield based on the templated ligation events that we observe in the simulation. Unfortunately, for reasonable system parameters, it is impossible to simulate the system long enough to observe sufficiently many ligation events to compute y to reasonable accuracy. For example, for a VCG pool containing monomers at a total concentration of mM and VCG oligomers of length L = 8 at a total concentration of , it would take about 1700 hours of simulation time to reach t = 5·1012 t0. Multiple such runs would be needed to estimate the mean and the variance of the observables of interest, rendering this approach unfeasible.

Instead, we compute the replication observables based on the copy number of complexes that could potentially perform a templated ligation, i.e., complexes, in which two strands are hybridized adjacent to each other, such that they could form a covalent bond. It can be shown analytically that the number of potentially productive complexes is a good approximation for the number of incorporated nucleotides. The number of incorporated nucleotides can be computed as the integral over the ligation flux, weighted by the number of nucleotides that are added in each templated ligation reaction,

Here, N (C) denotes the copy number of the complex C in the pool ℰt. Le,1 and Le,2 denote the lengths of the oligomers that undergo ligation, and 𝟙 is an indicator function which enforces that only complexes in a configuration that allows for templated ligation contribute to the reaction flux. As only few ligation events are expected to happen until τlig, it is reasonable to assume that the ensembles ℰt do not change significantly during t ∈ [0, τlig]. Therefore, the integration over time may be interpreted as a multiplication by τlig,

where ⟨…⟩ denotes the average over realizations of the ensembles t within the time interval t ∈[τeq, τlig]. Note that, at this point, we made the additional assumption that no templated ligations are taking place between [0, τeq]. This assumption is reasonable, as (i) the equilibration process is very short compared to the characteristic timescale of ligation, and (ii) the number of complexes that might allow for templated ligation during equilibration is lower than in the equilibrium (we start the simulation with an ensemble of single-stranded oligomers), implying that the rate of templated ligation is small.

In order to compute the average over different realizations of ensembles ℰ, we need to sample a set of uncorrelated ensembles that have reached the hybridization equilibrium. For this purpose, we run a full kinetic simulation. The simulation starts with a pool containing only single-stranded oligomers, and reaches the (de)hybridization equilibrium after a time τeq. We identify this timescale of equilibration by fitting an exponential function to the total hybridization energy of all complexes in the system, ΔGtot (Fig. S1A). In the set of ensembles used to evaluate the average in Eq. (S1), we only include ensembles for time t > τeq to ensure that the ensembles have reached (de)hybridization equilibrium. To ensure that the ensembles are uncorrelated, we require that the time between two ensembles that contribute to the average is at least τcorr. The correlation time, τcorr, is determined via an exponential fit to the autocorrelation function of ΔGtot (Fig. S1B). Besides computing the expectation value (Eq. S1), we are also interested in the “uncertainty” of this expectation value, i.e., in the standard deviation of the sample mean σX. (We use X as a short-hand notation for ∑C∈ℰ N (C) min(Le,1, Le,2) 𝟙(C allows for templated ligation).) The standard deviation of the sample mean, σX, is related to the standard deviation of X, σX, by the number of samples, σX = (Ns)−1/2σX. Moreover, based on the van-Kampen system size expansion, we expect the standard deviation of X to be proportional to V −1/2. Thus, σX ∝ (NsV)−1/2.

Using Eq. (S1) (as well as an analogous expression for the number of nucleotides that are incorporated in VCG oligomers), the yield can be expressed as

Characteristic timescales in the kinetic simulation. A The equilibration timescale is determined based on the total hybridization energy of all strands in the pool, ΔGtot. By fitting an exponential function to ΔGtot, we obtain a characteristic timescale, τ * (vertical dotted line), which is then used to calculate the equilibration time as τeq = 5τ * (vertical dashed line). The horizontal dashed line shows the total hybridization energy expected in (de)hybridization equilibrium according to the coarse-grained adiabatic approach (Sections S-II and S-III). B The correlation timescale is determined based on the autocorrelation of ΔGtot. We obtain τcorr (vertical dashed line) by fitting an exponential function to the autocorrelation. In both panels, we show simulation data obtained for a VCG pool containing monomers and VCG oligomers with a concentration of as well as oligomers of length L = 8 nt with a concentration of .

The additional condition 𝟙(Le,1 + Le,2LS) in the numerator ensures that the product oligomer is long enough to be counted as a VCG oligomer, i.e., at least LS nucleotides long. Analogously, the expression for the fidelity of replication reads

Multiplying fidelity and yield results in the efficiency of replication,

The ligation share of a particular type of templated ligation s(type), that is, the relative contribution of this templated-ligation type to the nucleotide extension flux, can be represented in a similar form as the other observables,

As all observables are expressed as the ratio of two expectation values, Ƶ= ⟨X⟩ / ⟨Y⟩, we can compute the uncertainty of the observables via Gaussian error propagation,

Since the variances, and , as well as the covariance, , are proportional to (NsV)−1, the standard deviation of the observable mean, σZ, scales with the inverse square root of the number of samples and the system volume, i.e., σZ ∝(NsV)−1/2. Therefore, the variance of the computed observable can be reduced by either increasing the system volume or increasing the number of samples used for averaging. Both approaches incur the same computational cost: (i) Increasing the number of samples, Ns, requires running the simulation for a longer duration, with the additional runtime scaling linearly with the number of samples. (ii) Similarly, the additional runtime needed due to increased system volume, V, also scales linearly with V (Fig. S2). One update step in the simulation always takes roughly the same amount of runtime, but the change in simulation time per update step depends on the total rate of all reactions in the system. The total rate is dominated by the association reactions, and their rate is proportional to the volume. Therefore, the change in simulation time per update step is proportional to V −1. The runtime, which is necessary to reach the same simulation time in a system with volume V as in a system with volume 1, is a factor of V longer in the larger system. With this in mind, it makes no difference whether the variance is reduced by increasing the volume or the number of samples. For practical reasons (post-processing of the simulations is then less memory- and time-consuming), we opt to choose moderate number of samples, but slightly higher system volumes to compute the observables of interest. The simulation parameters (length of oligomers, concentrations, hybridization energy, volume, number of samples, characteristic timescales) used to obtain the results presented in the main text (Main Text Fig. 2) are summarized in Tab. SI.

Simulation runtime of the full kinetic simulation for a VCG pool that includes monomers and VCG oligomers of lenght L = 8. The total concentration of feedstock monomers equals , while the total concentration of VCG oligomers equals . The energy contribution per matching nearest-neighbor block equals γ = 2.5 kBT. The volume of the system is varied, and the time-evolution is simulated until t = 5.0 · 107t0. The runtime of the simulation scales linearly with the volume of the system.

System parameters used to compute the replication observables yield, y, and replication efficiency, η, based on the kinetic simulation. The computed observables are shown in Fig. 2 in the main text.

S-II. Coarse-grained representation of complexes in the adiabatic approach

It is computationally expensive to evaluate the replication observables via the full kinetic simulation. For this reason, we develop an adiabatic approach, which allows us to compute the replication observables, provided that templated ligation is far slower than (de)hybridization. The adiabatic approach relies on a coarse-grained representation of the oligomers in the pool, which is introduced in this section.

Single Strands

In the coarse-grained description, oligomers of identical length are assumed to have equal concentration, irrespective of their sequence. This assumption is justified for two reasons: (i) We initialize the VCG pool without sequence bias, i.e., all oligomers compatible with the genome sequence are included at equal concentration. (ii) Hybridization energy in our simplified energy model (and therefore also the stability of complexes) only depends on the length of the hybridization site, not on its sequence, provided there is no mismatch. Each coarse-grained oligomer is uniquely identified by its length L, and it represents a group of oligomers with ℭ (L) distinct sequences. We refer to the number ℭ (L) as the combinatorial multiplicity of the coarse-grained oligomer. The value of ℭ (L) depends on the choice of the encoded genome. By construction (see main text), we assume that all possible oligomer sequences of length L < LS are included in the genome. For LLS, only a subset of all possible 4L sequences is included, but no sequence is repeated multiple times across the genome. Therefore, the combinatorial multiplicity equals

Schematic representation of complexes considered in the adiabatic approach. A A duplex is comprised of two strands, which we refer to as W (Watson) and C (Crick). The relative position of the strands is characterized by alignment index i; for the depicted duplex, i = −2. The length of the hybridization region is called Lo. B A triplex contains three strands. By convention, we denote the two strands that are on the same “side” of the complex as W1 and W2, and the complementary strand as C. The alignment indices i and j denote the positions of W1 and W2 relative to C. For the depicted triplex, i = −2 and j = 3. The length of the hybridization regions are called Lo,1 and Lo,2.

Two strands can form a duplex by hybridizing to each other. We refer to the bottom oligomer as ‘Crick” strand C and to the top oligomer as “Watson” strand W. A duplex is uniquely characterized by the lengths of the oligomers, LC and LW, as well as their relative alignment (Fig. S3A). The alignment index i denotes the position of the Watson strand with respect to the Crick strand. As there needs to be at least one nucleotide of overlap between the strands for a duplex to exist, the alignment index needs lie in the interval i ∈ [ − (LW −1), LC − 1]. Using the alignment index, we can also determine if the duplex has a left (or right) dangling end. The corresponding indicator variables are called dl (or dr),

Moreover, the length of the hybridization region Lo can be computed via

The hybridization energy of the duplex depends on the length of the hybridization region as well as on the existence/absence of dangling ends. For a hybridization site of length Lo, there are Lo − 1 nearest-neighbor energy blocks each of which contributes γ to the energy. Moreover, each dangling end contributes to the energy,

To compute the combinatorial multiplicity for a duplex with fixed LC, LW and alignment index i, we need to multiply the combinatorial multiplicity of the Crick strand by the number of possible hybridization partners. We assume that a hybridization partner is possible if its sequence is perfectly complementary to the lower strand within the hybridization region, whereas hybridization partners with mismatches are not accounted for. This is sensible as long as the energetic penalty for mismatches in the full kinetic simulation is sufficiently large to suppress mismatches. The number of possible hybridization partners is determined by the length of the overlap region Lo: If LoLS, the pool contains only one oligomer sequence that can act as hybridization partner by construction of the genome. For shorter hybridization regions, multiple hybridization partners might be possible. Their number is set by the combinatorial multiplicity of the Watson oligomer divided by the combinatorial multiplicity of the hybridization region,

To avoid double-counting, we only account for complexes in which the Crick strand is at least as long as the Watson strand, LCLW, and multiply ℭ (LW, LC, i) by 1/2 if LW = LC.

Triplexes

Triplexes, i.e., complexes comprised of three strands, are uniquely characterized by the length of the three oligomers, LC, LW,1, LW,2, as well as their respective alignment (Fig. S3B). The alignment index i denotes the position of strand W1 relative to strand C. Analogously, j denotes the relative position of W2 relative to oligomer C. Two strands that are hybridized to each other need to have a hybridization region of at least one nucleotide. Moreover, the strands W1 and W2 must not occupy the same position on the template strand C. Taking both requirements together, the alignment indices fall within the intervals,

A triplex may have a dangling end not only on its left or right end, but also in the gap between strands W1 and W2. Three boolean variables are necessary to denote the presence/absence of the respective dangling ends,

The length of the two hybridization regions are given by

The hybridization energy depends on the length of the overlap regions as well as on the existence of dangling ends: As in the duplex, each overlap region of length Lo,1 (or Lo,2) comprises Lo,1 −1 (or Lo,2 −1) nearest neighbor blocks, each of which contributes γ to the total energy. Moreover, every dangling end contributes γ/2. Note that the presence of a gap between strands W1 and W2, i.e., dm = 1, implies that there are two dangling ends, one for W1 and another for W2. Gaps in between two complexes contribute γ/2 per each dangling end, adding up to γ. If there is no gap between the strands, i.e., dm = 0, there are no dangling end contributions, but a new full nearest neighbor block emerges, which contributes γ to the energy. Therefore, the total energy reads,

The combinatorial multiplicity of a triplex is computed in the same way as for the duplex: The combinatorial multiplicity of the strand C is multiplied by the number of possible hybridization partners W1 and W2. Again, the number of possible partners is set by the length of the hybridization regions,

We use 𝟙 to denote the indicator function which returns 1 in case the condition in the bracket is fulfilled and zero otherwise. As all triplexes are asymmetric, there is no need to introduce a symmetry correction factor.

Tetraplexes

The largest complexes to be accounted for in our coarse-grained adiabatic approach are tetraplexes, i.e., complexes comprised of four strands. We need to distinguish three types of such complexes: (i) 3-1 tetraplexes, (ii) left-tilted 2-2 tetraplexes and (iii) right-tilted 2-2 tetraplexes. In 3-1 tetraplexes, three Watson strands are hybridized to one Crick strand (Fig. S4), whereas in 2-2 tetraplexes, two Watson strands are hybridized to two Crick strands (Fig. S5 and Fig. S6).

3-1 Tetraplexes

Schematic representation of a 3-1 tetraplex. Three strands (in the following referred to as Watson strands W1, W2, and W3), hybridize to a single template strand (Crick strand C). The positions relative to the left end of the C strand are given by the alignment indices i, j, and k; here, i = −2, j = 2, k = 5. The length of the overlap regions are denoted Lo,1, Lo,2 and Lo,3.

Fig. S4 depicts a typical 3-1 tetraplex. Such a tetraplex is uniquely characterized by the length of its oligomers, LC, LW,1, LW,2, LW,3, as well as their relative position to each other denoted by the alignment indices i, j, and k. All positions within the triplex are measured relative to the left end of the C strand. Any W strand needs to have at least one nucleotide of overlap with the C strand, but two W strands must never occupy the same position on the C strand. Consequently, the alignment indices fall within the intervals,

There are two dangling ends (left and right) and potentially two gaps between the W strands: one gap between W1 and W2 and another one between W2 and W3. The following boolean variables indicate the presence/absence of the respective dangling ends,

The length of the hybridization regions is given by

Following the same reasoning as in the case of triplexes, the energy equals

Similarly, the combinatorial multiplicity of 3-1 tetraplexes is constructed using the same reasoning as in the case of triplexes,

As 3-1 tetraplexes are not symmetric under rotation, no symmetry correction of the combinatorial multiplicity is necessary.

Left-Tilted 2-2 Tetraplexes

A 2-2 tetraplex is comprised of two C strands and two W strands. We call a 2-2 tetraplex left-tilted if strand W1 is connected to strand W2 via strand C1 (Fig. S5A). The lengths of the oligomers are called LW,1, LW,2, LC,1 and LC,2. The positions of the strands relative to each other are governed by the alignment indices. All positions are measured relative to the position of the left end of strand C1. The alignment indices may take on the following values,

Schematic representation of a left-tilted 2-2 tetraplex. A Two Watson strands (W1 and W2) are hybridized to two Crick strands (C1 and C2). Both Watson strands are hybridized to the left Crick strand C1, whereas only W2 is hybridized to the right Crick strand C2. The alignment indices i, j and k denote the position of the strands relative to the left end of C1; here, i = −2, j = 3 and k = 6. The length of the hybridization regions are called Lo,1, Lo,2 and Lo,3. B Rotating the schematic representation of a left-tilted 2-2 tetraplex by 180° produces an alternative representation of the same complex, which is again a left-tilted 2-2 tetraplex. The panel depicts the rotated tetraplex representation (variables with superscript “rot”) as well as the un-rotated representation (variables without superscript). There is a unique linear mapping between un-rotated and rotated representation, e.g., C2 after rotation always corresponds to W1 before rotation.

The complex can have dangling ends on the right and on the left end of the complex; the presence of these dangling ends is indicated by the boolean variables dl and dr. Moreover, two gaps are possible: There might be a gap between strands W1 and W2, or a gap between C1 and C2. The respective boolean variables read

We refer to the hybridization region of strand W1 and C1 as overlap region 1, to the hybridization region of strand W2 and C1 as overlap region 2 and to the hybridization region of strand W2 and C2 as overlap region 3. The length of these hybridization regions is computed via

Given the length of the hybridization region as well as the presence/absence of dangling ends, we can compute the hybridization energy,

The combinatorial multiplicity of a left-tilted 2-2 tetraplex is constructed using the same reasoning as in the case of a 3-1 tetraplex,

To prevent double-counting the same tetraplex, we include either the complex or its rotated representation in the container of possible complexes, but not both. If the complex is symmetric under rotation, we multiply the combinatorial multiplicity by 1/2. Given a left-tilted 2-2 tetraplex (LC,1, LC,2, LW,1, LW,2, i, j, k), we can compute the corresponding rotated tetraplex by applying a linear map. The mapping of the oligomers lengths is illustrated in Fig. S5B. We see that strand C2 after rotation corresponds to strand W1 before rotation, implying that . The same reasoning can be applied to all strands leading to the following map,

In order to compute the map of the alignment indices under rotation, we need to express the relative positions of the strands with respect to the position of strand C1 after rotation, which corresponded to W2 before rotation. For example, |irot| corresponds to the number of nucleotides by which strand C2 (before rotation) protrudes beyond strand W2 (before rotation). Expressed in terms of variables before rotation, this distance may be written as k +LC,2 jLW,2. Analogous relations can be derived for all alignment indices,

Schematic representation of a right-tilted 2-2 tetraplex. A Two Watson strands (W1 and W2) are hybridized to two Crick strands (C1 and C2). Unlike in the left-tilted 2-2 tetraplex, both Watson strands are hybridized to the right Crick strand C2, whereas only W1 is hybridized to the left Crick strand C1. The alignment indices i, j, and k denote the positions of the strands relative to C1; here, i = 1, k = 3, and j = 6. The length of the overlap regions are called Lo,1, Lo,2 and Lo,3. B Rotating the schematic representation of a right-tilted 2-2 tetraplex by 180° produces an alternative representation of the same complex, which is again a right-tilted 2-2 tetraplex. The panel depicts the rotated tetraplex representation (variables with superscript “rot”) as well as the un-rotated representation (variables without superscript). There is a unique linear mapping between un-rotated and rotated representation, e.g., C2 after rotation always corresponds to W1 before rotation. The mapping is identical for left- and right-tilted 2-2 tetraplexes.

Right-tilted 2-2 Tetraplexes

A 2-2 tetraplex is called right-tilted if strand W1 is connected to strand W2 via strand C2 (Fig. S6). As in the case of the left-tilted 2-2 tetraplex, the oligomer lengths are again called LW,1, LW,2, LC,1 and LC,2, but the values of the alignment indices that are possible for the right-tilted tetraplex differ from the ones of the left-tilted tetraplex,

Note that the range of i is chosen such that at least one nucleotides of strand W1 always extends to the right beyond the end of strand C1, allowing for a hybridization region between strand C2 and W1. The boolean variables denoting the presence/absence of dangling ends read

The length of the overlap regions is given by

Like in the case of left-tilted 2-2 tetraplexes, the total hybridization energy is computed via

and the combinatorial multiplicity via

We include either the tetraplex or its rotated representation in the list of possible complexes to avoid double-counting of tetraplexes. Moreover, the combinatorial multplicity is divided by 2, if the tetraplex is symmetric under rotation. It turns out that the rotation map for the right-tilted 2-2 tetraplex is identical to the one of the left-tilted 2-2 tetraplex,

S-III. Numerical solution of the (de)hybridization equilibrium in the adiabatic approach

Based on the list of complexes constructed in the previous section, we can compute the equilibrium concentration of strands and complexes reached in the (de)hybridization equilibrium. In the following, we denote the concentration of a coarse-grained oligomer with length L as c(L), and the concentration of an oligomer with length L and known sequence as cs(L). Recall that we assumed that all sequences of a given length that are compatible with the circular genome are equally likely in the pool (Section S-II). Thus, the concentration of the coarse-grained oligomer and the concentration of an oligomer with specified sequence are related by the combinatorial multiplicity,

In order to compute the concentration of a complex based on the concentration of single strands, we make use of the law of mass action. The concentration of a specific sequence realization of a complex is computed as the product of concentrations of the strands forming the complex divided by the dissociation constant Kd,

Here, is the vector denoting the lengths of the strands comprising the complex, and are the alignment indices. The dissociation constant is set by the hybridization energy, ∆G, of the complex,

where c° = 1 M is the standard concentration. Just as in the case of single strands, the concentration of the sequence-independent coarse-grained complex is related to the concentration of a complex with specific sequence realization via the combinatorial prefactor,

It can be useful to combine the combinatorial multiplicity and the dissociation constant into a single effective association constant,

Note that the effective association constant including the combinatorial multiplicity is denoted by 𝒦a (in curly font), while the association constant without combinatorial multiplicity is denoted (in regular font).

In the adiabatic approach, we study the behavior of the system on timescales that are long enough for the system to reach the (de)hybridization equilibrium, but too short for templated ligation events to take place. Therefore, the length of the oligomers are expected not to change throughout the equilibration process, and we need to introduce a separate mass conservation law for each coarse-grained oligomer, i.e., each oligomer length, that is included in the pool. For each length, the concentration of single-stranded coarse-grained oligomers of length L and the concentration of coarse-grained oligomers of length L bound in a complex need to add up to their total concentration ctot(L) set by the initial condition,

In this equation, s.t. denotes the summation over all complexes that contain at least one strand of length L. Note that we referred to the total concentration of oligomers of length L as c(L) in the main text, but for clarity we use ctot(L) in the supplementary material.

Combining the mass conservation requirement with the mass action law gives a set of coupled polynomial equations. The number of equations equals the number of distinct oligomer lengths included in the pool. The polynomial equations are of degree 4, as tetraplexes are the largest complexes to be accounted for and their concentrations equals the product of the four strands comprising the complex. We determine the equilibrium concentrations by finding the root of this set of fourth-degree polynomial equations using the Levenberg-Marquardt algorithm.

S-IV. Computing replication observables based on the adiabatic approach

Schematic representation of a complex that allows for templated ligation. The strands E1 and E2 are adjacent to each other, such that a covalent bond can form between their ends. The length of the product strand, LP, is set by the length of the educt strands, Le,1 and Le,2. The likelihood for the complex to form a product oligomer whose sequence is compatible with the true circular genome, pcorr, is determined by the length of the educts and the length of their hybridization region with the template. The parts of the complex that are depicted with hatching do not affect pcorr.

As we are not modeling templated ligation events explicitely in the adiabatic approach, we compute replication observables based on the equilibrium concentration of complexes that are in a configuration which allows for a templated ligation reaction to happen. Templated ligations are possible if two strands in the complex are adjacent to each other, i.e., there is no gap in between two oligomers that are hybridized to the same template strand (Fig. S7). Recall that the absence of a gap between two oligomers in the complex implies that the dangling end indicator variable dm = 0 (Section S-II). The length of the product strand P is equal to the sum of the lengths of the two educt strands E1 and E2, LP = LE,1 + LE,2. We can use the information about the length of the product strand to compute the yield of replication. By definition (see main text), the yield equals the fraction of nucleotides used to form VCG oligomers, i.e., strands longer than LS,

We can express this quantity in terms of the equilibrium concentration of complexes facilitating templated ligation,

denotes the summation over all complexes, in which (i) the strands E1 and E2 are adjacent to each other, i.e., dm = 0, and (ii) the length of the product Le,1 + Le,2LS. We multiply the equilibrium concentration by the length of the shorter educt strand to account for the number of incorporated nucleotides in line with the definition of the yield, Eq. (S2a).

In order to compute the fidelity of replication, we need to distinguish between product oligomers sequences that are compatible with the genomes (correct sequences) and sequences that are incompatible with the genome (false sequences). As we do not know about the details of the sequences due to the coarse-grained representation of the complexes, we need to invoke a combinatorial argument to determine the fraction of correct products. To this end, we compare the number of product sequences that might be produced in a complex of given oligomer lengths and relative oligomer position, to the number of correct products. The combinatorial multiplicity of the products that could be produced by a complex of given configuration is set by the combinatorial multiplicity of the possible templates, ℭ (Lo,1 + Lo,2), multiplied by the multiplicity of the educt strands hybridizing to the template with given lengths of the hybridization regions, Lo,1 and Lo,2,

The multiplicity of correct products equals the combinatorial multiplicity of strands that have the same length as the product,

This implies that the probability for a complex of given shape to form a correct product is given by

Using this probability, we can compute the fidelity of replication,

as well as the replication efficiency,

S-V. Scaling law of the equilibrium concentration ratio that maximizes replication efficiency

We consider a pool containing monomers and VCG oligomers of a single length. As the feedstock only contains monomers, the total monomer concentration and the total feedstock concentration are identical, . Similarly, the total VCG concentration equals the total concentration of oligomers of length . In such a system, the replication efficiency can be expressed as

where the concentrations denote the equilibrium concentration of all complexes facilitating the indicated type of templated ligation. Note that we do not include dimerizations, i.e., F+F ligations, in the numerator as they do not contribute to the yield. Moreover, the V+V ligations are multiplied by the length of the oligomers LV, as each V+V extends an existing oligomer by LV nucleotides.

Assuming that complexes comprised of three strands (triplexes) are the dominant type of complexes, we can express the equilibrium concentrations as product of the equilibrium concentrations of the free strands and the effective association constants,

Here, denotes the equilibrium concentration of a VCG oligomer with a specific sequence; the equilibrium concentration of all VCG oligomers for any sequence is computed by multiplying with the combinatorial multiplicity, . The same holds for . Eq. (S2) allows us to write the replication efficiency as,

where we introduced the ratio of equilibrium concentrations, . Maximizing Eq. (S3) with respect to rs yields,

The effective association constant are computed using the combinatorial rules outlined in Section S-III. We find that is similar to , deviating at most by 10%. Moreover,

and,

These observations allow us to obtain an significantly simplified approximate expression for ,

We used the identity in the last step. To derive the scaling of as a function of LV, we need to analyze how the effective association constants depend on LV.

  1. scales linearly with LV. In F+F ligations, an VCG oligomer acts as template facilitating the ligation of two monomers. Independent of the length of the VCG oligomer, the two monomers always have the same binding affinity to their template. However, the number of possible positions that the two adjacent monomers can have on the template scales linearly in template length, causing the effective association constant to depend linearly on LV,

  2. scales exponentially with LV. As illustrated in Fig. 2E in the main text, the length of the hybridization region equals L in complexes facilitating incorrect V+V ligations. This leads to exponential scaling of with LV, as the hybridization energy is proportional to the length of the hybridization site. The number of complexes facilitating incorrect product formation is only a function of the unique subsequence length LS, but does not depend on the oligomer length LV. Therefore, there is no multiplicative prefactor proportional LV,

Fig. S8 shows the LV-dependence of the effective association constants. The circles represent the effective association constants derived based on the combinatoric rules discussed in Section S-III, while the solid lines show the length-dependent scaling computed via Eq. (S4) and (S5).

Given the scaling of and , we find that the optimal ratio of equilibrium concentrations equals

Effective association constants of complexes facilitating F+F ligations (A), false V+V ligations (B), and F+V ligations (C). The dots depict the effective association constants derived based on the combinatoric rules presented in S-III, the solid lines represent the respective scaling laws introduced in Eq. (S4-S7). Different colors correspond to different hybridization energies per matching nearest neighbor block γ.

We obtain ropt (Fig. 2) from by multiplying with the appropriate combinatorial multiplicity,

S-VI. Characteristic length-scale of VCG oligomers enabling high replication efficiency

Starting from Eq. S3, the optimal efficiency of replication can be expressed as

As (for sufficiently high LV), we can expand in this small argument,

To understand the scaling of ηopt as function of LV, we need to know the scaling of in LV. A typical complex allowing for a F+V ligation is depicted in Fig. 2E in the main text. The length of the overlap Lo region between template and educt VCG oligomer can vary; it is at least one nucleotide, and at most LV − 1 nucleotides (one base pairing position needs to remain available for the monomer). We obtain the effective association constant by summing the contributions of all of these configurations,

Therefore, we use the following scaling ansatz for ,

Combining the scaling laws for , and with Eq. S6, we find

S-VII. Enhanced replication efficiency in multi-length VCG pools

We investigate if a pool containing a range of VCG oligomer lengths exhibits higher replication efficiency than single-length pools. To this end, we study a VCG pool comprised of monomers (with fixed concentration), as well as tetramers and octamers (with variable concentration). For moderate binding affinity, γ = −2.5 kBT, the pool reaches optimal replication efficiency, if the tetramer concentration is smaller than the octamer concentration (Main Text Fig. 3), with extensions of octamers by monomers (1+8) being the most dominant type of templated ligation reactions. For strong binding affinity, γ = −5.0 kBT, templated ligations of tetramers on octamer templates eventually achieve the same replication efficiency as (1+8) ligations (Fig. S9). The mechanism relies on a significant difference in the length of the template (octamers) and educt strands (tetramers). If longer educt strands (heptamers) are chosen, highest efficiency is achieved in the parameter regime, in which VCG oligomers are extended by monomers, regardless of the choice of hybridization energy per matching nearest-neighbor block, γ (Fig. S10 and Fig. S11).

Length-modulated enhanced ligation in pools containing tetramers and octamers for strong binding affinity, γ = −5.0 kBT. The replication efficiency reaches its maximum in the concentration regime that supports templated ligation of tetramers on octamer templates.

S-VIII. Dimer-related upper bound of replication efficiency

In a VCG pool comprised of monomers, dimers and VCG oligomers of single length, the replication efficiency can be expressed as

provided the pool operates in the concentration regime where all types of ligations other than F+V ligations are negligible. In this equation, stands for the equilibrium concentration of all complexes allowing for the tem-plated ligation of oligomers of length L to oligomers of length M if templated by oligomers of length denotes the same concentration under the additional constraint that the product oligomers need to be correct.

Length-modulated enhanced ligation in pools containing heptamers and octamers for weak binding affinity, γ = −2.5 kBT. The replication efficiency reaches its maximum in the concentration regime that is dominated by the addition of monomers to the VCG oligomers.

Length-modulated enhanced ligation in pools containing heptamers and octamers for strong binding affinity, γ = −5.0 kBT. The replication efficiency reaches its maximum in the concentration regime that is dominated by the addition of monomers to the VCG oligomers.

Complexes allowing for templated ligation need to involve at least three oligomers, but may be comprised of more strands. For the following analytical derivation, we restrict ourselves to triplexes, while the full numerical solution (see continuous lines in Main Text Fig. 5C) also includes tetraplexes. If only triplexes are accounted for, each equilibrium concentration may be written as a product of the effective association constant and the equilibrium concentrations of the involved oligomers,

Using this expression, the equation for the replication efficiency may be simplified,

The effective association constants in the denominator may be expressed as the sum of the association constants of correct and false products,

In the following, we simplify the notation by referring to as 1+V ligations, and to as 2+V ligations.

Effective association constants of complexes facilitating 1+V ligations (A) and 2+V ligations (B).

The effective association constants differ with respect to their dependence on LV: While and increase exponentially in LV, is length-independent (Fig. S12). Introducing the parametrizations,

we can express η as

Taking the limit LV → ∞ yields and upper bound for the replication efficiency,

Provided that the ratio of equilibrium concentrations of dimers and monomers is comparable to the ratio of total concentration of dimers and monomers, ceq(2)/ceq(1) ≈ ctot(2)/ctot(1), we can write

Therefore, the upper bound of the replication efficiency may be expressed as

For the genome that is analyzed in the main text, . Comparison of the analytical approach (involving only triplexes) to the numerical solution (involving tetraplexes) reveals that the contribution of tetraplexes is indeed negligible (Main Text Fig. 5)

S-IX. Asymptotic fraction of oligomers undergoing extension by monomers

In pools containing only activated monomers, as well as non-activated dimers and VCG oligomers of a single length, we observe that the fraction of oligomers that can be extended by a monomer reaches an asymptotic value for high VCG concentrations. Notably, this asymptotic value is independent of the oligomer length.

The fraction of oligomers that are in a monomer extension-competent state can be computed based on the equilibrium concentration of complexes facilitating the extension of an oligomer by a monomer,

The effective association constants of reactive triplexes can be computed based on the effective association constant of the duplexes. A If the hybridization region of the two VCG oligomers is LV − 1 nucleotides long, the monomer hybridizes to the end (start) of the template strand. As the template has no dangling end, the energy contribution of the hybridizing monomer is γ/2. B and C For hybridization regions that are shorter than LV − 1, but at least 1 nucleotide long, the energy contribution due to the added nucleotide is γ. In all cases, the prefactor 2 accounts for the two possible positions at which a monomer might be added.

Note that, in principle, the monomer can be added to the 5’- or the 3’-end of the oligomer, leading to two contributions in the numerator, which are identical due to symmetry, .

Just like in the previous section, denotes the equilibrium concentration of all complexes allowing for the templated ligation of oligomers of length L to oligomers of length M if templated by oligomers of length N. We assume that triplexes are the dominant type of complexes facilitating templated ligation, and write the equilibrium concentration as,

Therefore, the fraction of monomer-extendable oligomers reads,

In the limit of high VCG concentration, almost all VCG oligomers form duplexes. Hence, the equilibrium concentration of free oligomers can be approximated by,

implying,

As almost all monomers are free in solution, their equilibrium concentration may be approximated by the total concentration of monomers, .

From this representation of , it is not clear yet why should be independent of LV. In order to derive an LV-independent expression for , we need to express the association constant of triplexes, , in terms of the duplex association constant, . As illustrated in Fig. S13, the effective association constant of the triplex may be obtained by multiplying the effective association constant of the duplex by the binding affinity of the monomer. We need to distinguish the contributions of duplexes based on the length of their hybridization region: If the two strands in the duplex form a hybridization region that is as long as the oligomers, there is no free base pair for the monomer to hybridize; such duplexes must be excluded in the computation of the triplex association constant.

Similarly, monomers that hybridize to the end of a strand have a different binding affinity than monomers hybridizing to the center of an oligomer (Fig. S13A vs. Fig. S13B-C). All in all, the effective triplex association constant reads,

The notation implies that only complex configurations with a specific overlap length Lo are included. The full effective duplex association constant may be expressed as

We write the duplex association constants in terms of the binding energy due to their overlap length,

Therefore, the asymptotic ratio of monomer-extended oligomers reads,

Addition of a single base pair will decrease the energy by the energy of a half nearest neighbor block, i.e., by γ/2. Thus, we assign Kd(1) = exp (−|γ|/2), and find,

S-X. Analytical solution for equilibrium concentrations in multi-length VCG pools

We consider pools comprised of monomers, dimers and VCG oligomers of multiple lengths. For simplicity, we restrict ourselves to profiles with a uniform distribution of VCG oligomers, such that all VCG oligomers have the same concentration. Assuming that almost all oligomer mass is contained in single strands and duplexes, the mass conservation equations read

Note that there is a mass conservation equation for each oligomer length individually, i.e., we are dealing with a set of multiple coupled quadratic equations. We can make the system of equations dimensionless by introducing a dimensionless concentration ,

This rescaling is chosen as it is the solution to the (approximative) mass conservation equation in the high concentration limit,

under the assumption that . The latter assumption is reasonable given the concentration profile is uniform. We expect to be of order 1 in the high concentration limit.

Writing the full mass conservation equation in terms of the dimensionless concentration yields,

We drop all ratios of total concentrations, as we assume that the total concentration is the same for each oligomer length of VCG oligomer (uniform concentration profile),

By introducing the dimensionless prefactors,

we can rewrite the mass conservation equation,

To solve for , we make the assumption that the dimensionless equilibrium concentration is close to 1 for any length L. For this reason, we may assume that . This assumption is crucial, as it allows us to de-couple the quadratic equations. Therefore, the mass conservation reads,

We can use this representation to compute the equilibrium concentrations recursively: We start the recursion with the assumption , and compute with this assumption,

In the first recursion step, we compute , and solve the mass conservation equation for the equilibrium concentration,

This scheme can be applied until the approximated values of match the true values (obtained via numerical root finding) sufficiently well,

Note that the equilibrium concentration obtained after the (i + 1)-th iteration step depends on the total concentration ctot(L) via the rescaling prefactor as well as via αL. It turns out that the approximation converges after about five iteration steps; the relative error between the true (numerically computed) equilibrium concentration and the approximation is already below 1% in the first iteration step.

Comparison between approximate (analytical) and true (numerical) solution of the (de)hybridization equilibrium in multi-length VCG pools. A The approximation converges after roughly five iteration steps. The relative error drops below 1% in the first iteration step already. B Equilibrium concentrations as a function of total VCG concentration. Continuous lines show the numerical solution, while the dotted and dashed lines depict the approximation obtained in the zeroth or first iteration step respectively. The feedstock concentration is fixed, .

S-XI. Threshold concentration for the inversion of productivity

In VCG pools that contain VCG oligomers of multiple lengths as well as activated monomers, we observe “inversion of productivity”: The fraction of short VCG oligomers in a monomer extension-capable complex configuration exceeds that of long oligomers. However, this is only the case for sufficiently high concentration of VCG oligomers. The threshold concentration that is necessary for oligomers of length M to exceed the monomer-extension fraction of oligomers of length L (assuming M < L) is set by the condition r1+V(L) = r1+V(M). To compute r1+V(L), we need to account for all possible complexes, in which an oligomer L can be extended by a monomers,

As we are considering a uniform concentration profile, all VCG oligomers have the same total concentration, i.e.,ctot(L) = ctot(M). Thus, we can express the condition for the threshold concentration as

We combine this condition with the analytical approximation of the equilibrium concentration up to the first iteration step derived in Section S-X, cf. Eqs. S9-S11. As the equilibrium concentrations ceq(L), ceq(M), ceq(N), ceq(P) depend on the total concentration of VCG oligomers ctot(L) = ctot(M) = ctot(N) = ctot(P), this yields an semi-analytical criterion for the threshold concentration , at which the fraction of monomer-extended M -mers exceeds the fraction of monomer-extended L-mers.

S-XII. Inversion of productivity using parameters of system studied by Ding et al

In their experimental study, Ding et al. focus on a genome of length LG = 12 nt. In the VCG pool encoding the genome, Ding et al. include (Tab. S1 in [18])

  • dimers with 11 different sequences,

  • trimers with 20 different sequences, and

  • tetramers up to 12-mers with 24 different sequences each.

Every oligomer sequence is included with a concentration of 1 μM (so-called 1x profile), adding up to a total concentration of . The feedstock for the replication is comprised of activated imidazolium-bridged homo-dinucleotides with a total concentration of 20 mM. Over a timescale of around 4 h, these homo-dinucleotides react with each other to form 6 additional imidazolium-bridged hetero-dinucleotides as well as activated mononucleotides (Fig. S1 in [18]).

We construct a genome that mimics the properties of the genome used by Ding et al., but obeys our genome construction principles outlined in the method section of the main text. To this end, we consider a genome of length LG = 12 nt and a minimal unique subsequence length of LS = 3 nt. This implies that our VCG pool contains

  • dimers with 16 different sequences, and

  • trimers up to 12-mers with 24 different sequences.

Our model does not include imidazolium-bridged dinucleotides explicitely. Instead, we assume that the concentration of activated mononucleotides in our model equals the total concentration of activated homo-dinucleotides (20 mM) used in the experimental study. The total concentration of non-activated oligomers is treated as a free parameters.

We find that the system exhibits inversion of productivity: For the entire considered range of concentration of non-activated oligomers, 10-mers are more likely to be extended by a monomer than 12-mers (Fig. S15). Moreover, 8-mers are more productive than 12-mers (provided the concentration of non-activated oligomers exceeds roughly 0.3 μM), and more productive than 10-mers (provided ). Thus, for the experimentally used concentration of non-activated oligomers (, see vertical dashed line in Fig. S15), 8-mers are more productive than 10-mers, which are in turn more productive than 12-mers. Unlike in the experimental system, 6-mers are less likely than 8-mers and 10-mers to be extended by a monomer. We suppose that this difference can be attributed to the differences between the experimental setup and our theoretical model: (i) We choose a (slightly) different genome than Ding et al.. (ii) We model imidazolium-bridged dinucleotides as mononucleotides, as imidazolium-bridged dinucleotides only incorporate one mononucleotide at a time, just like activated mononucleotides. However, dinucleotides bind more stably to an existing complex than mononucleotides, which will affect the fraction of monomer-extended oligomers predicted.

Replication performance of multi-length VCG pools, in which only the monomers are activated. The pool includes activated monomers (ctot(1) = 20 mM, as well as non-activated oligomers of length L = 2 nt up to L = 12 nt with variable total concentration . The system exhibits inversion of productivity: 10-mers are more likely to be in a monomer-extension competent state than 12-mers, 8-mers are more likely to be extended by monomers than 10-mers (for provided ). For the experimentally used concentration (vertical dashed line), 8-mers are more productive than 10-mers, and those are more productive than 12-mers. However, unlike in the experimental system, 6-mers are less productive than 8-mers and 10-mers.