1. Computational and Systems Biology
  2. Evolutionary Biology
Download icon

An evolutionary model identifies the main evolutionary biases for the evolution of genome-replication profiles

  1. Rossana Droghetti
  2. Nicolas Agier
  3. Gilles Fischer
  4. Marco Gherardi
  5. Marco Cosentino Lagomarsino  Is a corresponding author
  1. Dipartimento di Fisica, Università degli Studi di Milano, via Celoria 16, Italy
  2. Sorbonne Universitè, CNRS, Institut de Biologie Paris-Seine, Laboratory of Computational and Quantitative Biology, France
  3. Dipartimento di Fisica, Università degli Studi di Milano, via Celoria 16, Milan, Italy and INFN sezione di Milano, Italy
  4. IFOM Foundation, FIRC Institute for Molecular Oncology, via Adamello 16, Italy
Research Article
  • Cited 0
  • Views 632
  • Annotations
Cite this article as: eLife 2021;10:e63542 doi: 10.7554/eLife.63542

Abstract

Recent results comparing the temporal program of genome replication of yeast species belonging to the Lachancea clade support the scenario that the evolution of the replication timing program could be mainly driven by correlated acquisition and loss events of active replication origins. Using these results as a benchmark, we develop an evolutionary model defined as birth-death process for replication origins and use it to identify the evolutionary biases that shape the replication timing profiles. Comparing different evolutionary models with data, we find that replication origin birth and death events are mainly driven by two evolutionary pressures, the first imposes that events leading to higher double-stall probability of replication forks are penalized, while the second makes less efficient origins more prone to evolutionary loss. This analysis provides an empirically grounded predictive framework for quantitative evolutionary studies of the replication timing program.

Introduction

Eukaryotes, from yeast to mammals, rely on predefined ‘replication origins’ along the genome to initiate replication Leonard and Méchali, 2013; Musiałek and Rybaczek, 2015; Ganier et al., 2019; Gilbert, 2001, but we still ignore most of the evolutionary principles shaping the biological properties of these objects. Binding by initiation complexes defines origins as discrete chromosomal loci, which are characterized by multiple layers of genomic properties, including the necessary presence of autonomously replicating sequences, nucleosome depletion, and absence of transcription (Méchali et al., 2013; Di Rienzi et al., 2012). Initiation at origins is stochastic, so that different cells of the same population undergoing genome replication in S-phase will typically initiate replication from different origins Bechhoefer and Rhind, 2012; Rhind et al., 2010.

Initiation from a single origin can be described by intrinsic rates and/or licensing events Hawkins et al., 2013. Indeed, the genome-wide replication kinetics of a population of cells can be accessed experimentally by different techniques Hawkins et al., 2013; Baker et al., 2012; Agier et al., 2013. Recent techniques also allow to measure replication progression at the single-cell level Müller et al., 2019; Hennion, 2020. The estimation of key origin parameters from data requires minimal mathematical models describing stochastic origin initiation and fork progression Retkute et al., 2012; Baker et al., 2012; de Moura et al., 2010; Zhang et al., 2017. Typically, one can extract from the data origin positions, as well as estimated origin-intrinsic characteristic firing times or rates. Knowledge of origins positions and rates makes it possible to estimate the ‘efficiency’ of an origin, that is, its probability of actively firing during S-phase, rather than being passively replicated.

Over evolution, a genome modifies its replication timing profile by ‘reprogramming’ origin positions and rates in order to maximize fitness under the constraints of the possible changes of these parameters that are physically and biologically accessible. Little is known about this process, and finding basic rules that drive origin evolution is our main focus here Koonin, 2011. The main recognized constraint determining negative selection is due to replication forks stalling between adjacent origins Newman et al., 2013; Letessier et al., 2011; Cha and Kleckner, 2002. If two converging replication forks stall with no origins in between them, it is generally agreed that replication cannot be rescued, and the event leads to cell death. Such deadly ‘double stalls’ can only happen with two converging forks generated from consecutive origins. A pioneering study by Newman et al., 2013 used a combination of data analysis and mathematical models to understand the role of lethal double-stall events on origin placement. They found that the fork per-base stall probability affects the distance between neighbor origins, and the optimal distance distribution tends to a regular spacing, which is confirmed by experimental data. Thus, origin placement is far from a uniform random distribution (which would translate into an exponential distribution of neighbor origin distances). Instead, the regular lattice-like spacing that origin tend to take is reminiscent of particles repelling each other.

Due to the streamlined genome and the experimental accessibility, yeasts are interesting systems to study experimentally the evolution of replication programs. However, at the level of the Saccharomyces genus, the replication program is highly conserved Müller and Nieduszynski, 2012. Hence, until recently, no experimental account of the evolution of the replication program was available. Our collaboration has recently produced data of this kind Agier et al., 2018 by comparing replication dynamics and origin usage of 10 distant Lachancea yeast species. This study highlights the dominance of origin birth-death events (rather than, e.g., chromosomal rearrangements) as the main evolutionary drive of the replication program changes and characterizes the main principles underlying origin birth-death events. Briefly, the fate of an origin strongly depends on its neighbourhood, particularly the distance from neighbor origins and their efficiency. Indeed, proximity to efficient origins correlates with weaker origin loss events. An evolutionary bias against weak origins could be due to the fact that their presence is neutral or even advantageous (e.g., in terms of reducing double stalls), but their advantage is not sufficiently high for them to survive drift. These findings open the question of capturing the relevant evolutionary biases acting on replication profiles in the framework of the empirical birth-death evolutionary dynamics, for which the data set Agier et al., 2018 provides an empirical testing ground.

Here, we define a minimal evolutionary birth-death model for replication program evolution encompassing all the empirical observations made by Agier et al., 2018, and we use it to investigate the main evolutionary trade-offs that could explain the data.

Results

Experimental data motivate an evolutionary model for origins turnover

This section presents a reanalysis of the experimental data from Agier et al., 2018. We summarize the main results of that study and present additional considerations on the same data, which motivate the evolutionary model framework used in the following.

Figure 1—figure supplement 1 recapitulates the Lachancea clade phylogenetic tree used in the analysis. The evolution of the temporal program of genome replication can be quantified by the divergence of the replication timing profiles across different species. Agier and coworkers found that timing profiles diverge gradually with increasing evolutionary divergence between species Agier et al., 2018. In principle, such divergence could be attributed to changes in the number, placement, and biological properties of all origins. However, a careful analysis of correlations (comparing the timing profiles and the activity of orthologous origins) shows that the main driver of program differentiation across species is the acquisition and loss of active replication origins. Specifically, the number of conserved origins decreases with increasing phylogenetic distance between species, following the same trend as the conservation of the timing profiles. This trend is the same in regions that are close to or away from breakpoints, pointing to a secondary role of genome rearrangements. In addition, the authors of Agier et al., 2018 show that the differences in the mere number of origins and the median difference in origin replication timing between pairs of species are nearly constant with phylogenetic distance, leading to exclude that origin reprogramming (rather than birth-death) plays a primary role in the evolution of the timing program.

Any model for the evolution of the replication program must (i) reproduce the empirical distribution of the inter-origin distances, (ii) reproduce the empirical distribution of the origin efficiencies, and (iii) account for the observed origin turnover dynamics. Previous analyses Agier et al., 2018; Newman et al., 2013 have shown that origins are far from following a uniform distribution along the genome. Figure 1A shows that the inter-origin distance distribution robustly shows a unimodal shape across the 10 Lachancea species studied in Agier et al., 2018. Specifically, distributions for each species show a marked peak around 35 kbp. This peak corresponds to a typical inter-origin distance, which is strikingly invariant across all Lachancea species. Figure 1B shows the distribution of the efficiencies, which is defined as the probability to actively fire during the S phase, estimated for each origin in the Lachancea clade using Equation 4 and a fit inferring the firing rates of all origins assuming a standard nucleation-growth model (see Materials and methods and Zhang et al., 2017). The single-species efficiency distributions show more variability across species than the inter-origin distance distributions, but they are consistent with a common shape and support.

Figure 1 with 4 supplements see all
Experimental data motivate an evolutionary model for replication origins turnover.

(A) Distribution of the distance between neighbor origins in 10 Lachancea species, each histogram refers to a different species (data from Agier et al., 2018), and all the plots show a marked peak around 35 kbp. (B) Distribution of the efficiency (calculated from a fit, using Equation 4) for all origins in 10 Lachancea yeast species Agier et al., 2018. (C) From Agier et al., 2018, box plot of the distribution of the distance from the nearest origin split by evolutionary events, for conserved (dark red), newly gained (red), and lost origins (black), estimated comparing six sister species of the Lachancea clade Agier et al., 2018. (D) Analysis of the origins that are nearest to conserved, newly gained, and lost, compared to the expected result if events were uncorrelated Agier et al., 2018. (E) Distribution of the efficiency of lost, conserved, and newly gained origins (respectively in black, dark red, and red) and their neighbors (gray). Note that the efficiency of lost origins is lower than average, while the efficiency of origins flanking a lost origin is higher. (F) Box plot of efficiency of all conserved and newly gained origins compared to those flanking a lost origin, which tend to be more efficient. Braces indicate subsampling (the box plots on the right side are defined by a subset of points of the box plots on the left). Box plots show the median (bar), 25–75 (box), and 10–90 (whiskers) percentiles. The data in panel (CF) refers to the six sister species of the Lachancea tree.

As mentioned above, a key result of Agier and coworkers is the insight that the evolution of the replication program is mainly shaped by the birth-death process of replication origins. Figure 1C–F recapitulates the main quantitative results that characterize this process. Note that the analyses in Figure 1C–F have been performed on the six sister species of Lachancea clade since the other species pairs are too distant to perform a reliable identification of conserved, newly gained, and lost origins Agier et al., 2018.

Figure 1C shows a box plot of the distance from the nearest origin for all the conserved (dark red), newly gained (red), and lost (black) origins. Lost replication origins tend to be closer to their neighbors, much more so than newly gained or conserved origins. This observation reveals that the distance of an origin from its nearest neighbor is correlated to the loss rate of the same origin over evolution. This is an essential feature that any evolutionary model of this process must take into account Newman et al., 2013; Agier et al., 2018. More in detail, Figure 1D further quantifies the correlation between gain and loss events of neighboring origins by comparing the fraction of observed events of loss, gain, or conservation, given the state of the nearest origin (conserved, lost, or gained). The distribution of event types for origins that are nearest neighbors of a newly gained origin deviates significantly from the null expectation of random uncorrelated events (i.e., in a simple scenario where the fractions of conserved, newly gained, and lost origins are fixed to the empirical values, and birth and death events of neighboring origins are independent). The same non-null behavior is observed for origins that are nearest to a lost origin, with the roles of gain and loss events exchanged. In summary, successive birth/death or death/birth events happen more frequently in the same genomic location than expected by chance. Beyond such a spatial correlation along the chromosomal coordinate, the analysis illustrates that birth and death events are correlated in time as well (in fact, the analyzed evolutionary events took place in the terminal branches of the phylogenetic tree, and thus they must have been close in term of evolutionary time).

Finally, Figure 1E and F show that origins lying near loci where origins were recently lost are typically in the high-efficiency range of the distribution, and that lost origins tend to be less efficient than conserved origins. Figure 1E compares the distribution of the efficiency of lost, conserved, and newly gained origins with the distribution of efficiency of the nearest origins. The efficiency of origins neighboring a loss event is higher than average, while the efficiency of lost origins is lower than average. These results clearly support the influence of origin efficiency on origin death events. This is confirmed by Figure 1F, which shows the distribution of efficiency of all conserved and newly gained origins. For both classes, considering only those origins that are nearest neighbors to a recently lost origin yields an increase in the efficiency.

Different mechanisms could lead to the correlations described above. Overall, it is clear that origin strength is somehow ‘coupled’ to birth-death events. For example, conserved origins may become more efficient after the loss of neighbor origins, or the birth of new highly efficient origins could facilitate the loss of neighbors, or losing an origin could expedite the acquisition of a new origin nearby. Overall, these results reveal that the origin birth-death process is following some specific ‘rules’ that involve both inter-origin distances and origin efficiency.

Note that the results of Figure 1E might appear to be incompatible with Figure 1D, but they are not. Figure 1E shows that the efficiency of newly gained origins is lower than average, and Figure 1D shows that the majority of origins that are nearest to a locus with a recent loss event are newly gained. The apparent contradiction arises from Figure 1E, which shows that the average efficiency of origins close to a lost one is higher than average. This inconsistency is resolved by the analysis shown in Figure 1F, which shows that origins appearing close to recently lost ones are among the most efficient.

A birth-death model including evolutionary bias from inter-origin replication fork double stalling recapitulates the main features of replication origin turnover

The joint stalling of two replication forks in the same inter-origin region along the genome is a well-characterized fatal event that may occur during S-phase. The frequency of this event in a clonal population clearly affects fitness. A previous modeling study Newman et al., 2013 focusing on yeast demonstrated that, in order to minimize the probability of a double stall anywhere along the chromosome, origins must be placed in the most ordered spatial configuration, namely all the consecutive origins must be equidistant from each other. However, the previous study did not incorporate this principle into an evolutionary dynamics of origin turnover. Thus, the important question arises of whether the tendency to avoid double stalls is related to origin gain and loss. To address this question, we defined a birth-death model, rooted in the experimental observations discussed in the previous section. This ‘double-stall-aversion model,’ described in detail below, biases the turnover of replication origins in such a way that events (in particular, birth events) leading to a decreasing double-stall probability are promoted because they increase the fitness of the cell.

In the double-stall-aversion model, the extent to which the acquisition of a new origin changes the probability of a double stall PiDS depends on the length li of the inter-origin region where the event occurred. This probability is therefore coordinate-dependent and can be derived by a procedure similar to the one carried out in Newman et al., 2013 (see more details in Materials and methods),

(1) PiDS=1-(1+πli)exp(-πli),

where li is the length of the genome region between the (i+1)th and the ith origin and π is the mean per-nucleotide fork-stall rate; we use the value from Newman et al., 2013, π=5×10-8 per nucleotide. Note that the double-stall probability is completely independent from the origin firing rates and efficiency, and depends only on the distance between the origins.

In our simulations of the model (see Materials and methods for a more detailed explanation), the genome was represented as a vector of origins, identified by the position and the firing rate. The model is a discrete-time Markov chain, and for the double-stall-aversion variant the chain is specified by the following update rules:

  • In each inter-origin region, the origin birth rate is biased by the value of the double-stall probability in that region. Specifically, the origin birth rate (per unit time) in the region i, of length li between the ith and (i+1)th origin is given by

    (2) bi=Nb¯(PiDS)γli,
  • where PiDS is the (constant) double-stall probability density in region i (Equation 1), b¯ is the birth rate (per Mbp and per unit time) extracted from experimental data (see Materials and methods), and γ is a positive real parameter that controls the strength of the bias. N is a normalization factor added to match the empirical birth rate b¯. Newborn origins are placed in the middle of the inter-origin region i.

  • Death (i.e., loss of origins) is unbiased and occurs at random origins with rate d¯ (estimated from experimental data, see Materials and methods), regardless of their efficiency or their neighbor’s efficiency.

The justification for the assumption that newborn origins are placed at midpoints in the model ultimately comes from data (Figure 1—figure supplement 2) where a strong bias in this direction is found. Relaxing this assumptions has consequences on the distance distribution and leads to poorer-performing models. We interpret this bias as the result of a faster (hence undetectable in our data) evolutionary process that counter-selects origins far from midpoints.

Firing rates in the model evolve by reshuffling of the empirical firing rate distribution, with a time scale that is set empirically (see Materials and methods and Figure 1—figure supplements 3 and 4). On shorter time scales, firing rate changes are likely more gradual, making firing rate evolution similar to a diffusion process. However, such changes are not quantifiable in our data set, which would leave the model with many extra parameters (a firing rate diffusion constant and bounds to set the empirical distributions) that are very difficult to estimate. Additionally, the firing rate distributions of the conserved (thus older) origins and of newborn (younger) ones are quite similar (Figure 1—figure supplement 3), and this condition is not generally met under a simple diffusive process.

Figure 2 shows the simulation results of the model with best-fitting parameter values (see Materials and methods and Figure 2 for other parameter values). Figure 2A and B show that the double-stall-aversion model reproduces the two main ‘structural’ features of yeast genome, namely the inter-origin distance distribution and the origin efficiency distribution. Additionally, Figure 2C and D show that the same model reproduces the observed correlations between the inter-origin distance and origin birth-death events, as well as the correlation between birth-death events and nature of the neighbor origins observed in the data (conserved, newly gained, or lost).

The double-stall-aversion model reproduces origin turnover and distributions but fails to capture correlations between origin turnover and origin strength.

The plots show the simulations of the best-fitting double-stall-aversion model compared with empirical data. (A) Inter-origin distance distribution in simulated species (blue bars) compared to the empirical distribution for the 10 Lachancea species (red diamonds). (B) Origin efficiency distribution in simulated (blue bars) vs. empirical species (red diamonds). (C) Box plot of the distance from the nearest origin split by evolutionary events, that is, for conserved (dark blue), newly gained (blue), and lost origins (black) for simulated species. (D) Fraction of origins that are nearest to conserved, newly gained, and lost for simulated species compared to the expected result for uncorrelated events. (E) Box plot of efficiency of lost, conserved, and newly gained origins (respectively in black, dark blue, and blue) and their neighbors (gray) in simulated species. The six distributions show very little variation. (F) The efficiency of all conserved and newly gained origins compared to the ones flanking a lost origin. Braces indicate subsampling. Box plots show the median (bar), 25–75 (box), and 10–90 (whiskers) percentiles. Simulation parameters (see Materials and methods): γ=2.4 overall birth and death rate b¯=13.6Mbp-1t-1, d¯=0.61t-1 and firing rate resampling rate R=0.92t-1, where t is measured by protein-sequence divergence. Panels (A) and (B) were generated using data from approximately 320,000 simulated origins, while panels (CF) were built using data from about 60,000 birth and death events and 240,000 conservation events.

The double-stall hypothesis alone fails to capture correlations of origin turnover with efficiency

In spite of the good performance of the double-stall-aversion model in explaining the empirical marginal distributions, we find that it fails to reproduce the observed correlations between the efficiency of an origin and the recent history of the nearest ones. Figure 2E shows very faint variations in efficiency of origins that are nearest neighbors to origins of different evolutionary fate. In particular, the observed huge divergence in efficiency between lost origins and their neighbors is absent in the model simulations. Note that Figure 2 and Figure 4 show that in the double-stall-aversion model origins nearest to a loss event are slightly more efficient than average. This trend is due to the fact that after an origin is lost, its neighbors are subject to lower interference and automatically become more efficient. However, Figure 4 shows that this null trend is too weak to explain the experimental data. These considerations indicate that a model without a direct mechanism linking the efficiency of an origin to the birth-death events of its neighbors cannot reproduce the data.

Double-stall aversion and interference between proximate origins explain the correlated evolution of origin presence and efficiency

Based on the above considerations, we defined a joint model that takes into account both the evolutionary pressure given by the double-stall probability and the direct effect of origin efficiency on birth-death events.

Specifically, this model is defined as follows:

  • The birth process is the same as in the double-stall-aversion model described above: the birth rate is biased by the double-stall probability in each inter-origin region Equation 1, and newborn origin are placed in the middle of the region.

  • Death of an origin is biased by its efficiency: less efficient origins are more easily lost. Specifically, the death rate (per unit time) for the ith origin is

    (3) di=Nd¯exp(-βeffi),
  • where effi is the efficiency of the ith origin, Equation 4, d¯ is the mean death rate extracted from experimental data (see Materials and methods). The positive parameter β tunes the interaction strength: the larger β, the steeper the dependence of di on effi. The normalizing factor N is chosen so as to match the empirical total death rate.

We note that the bias parameters β and γ are not inferred based on branch data, but on distributions of extant species (see Materials and methods).

Figure 3 gathers plots of the structural features (distribution of inter-origin distances and efficiencies, Figure 3A–B) and the evolutionary correlations involving efficiency, evolutionary fate, distance to nearest neighbor, and fate of nearest neighbor (Figure 3C–F). Overall, the joint model reproduces all the observations considered here regarding the layout of origins and their evolutionary dynamics, indicating that the experimental data can be rationalized by a fitness function that includes both the detrimental effects of nonreplicated regions and the evolutionary cost of maintaining inefficient replication origins.

Figure 3 with 1 supplement see all
A model where both fork stalling and interference affect fitness explain the correlations between origins of evolutionary events.

Result of the joint model best-fitting simulation compared with empirical data. (A) Inter-origin distance distribution in simulated species (blue bars) vs. empirical distribution for the 10 Lachancea species (red diamonds). (B) Origin efficiency distribution in simulated (blue bars) vs. empirical species (red diamonds). The agreement between simulation and experimental data shows that this joint evolutionary model reproduces the typical structural features of a yeast genome. (C) Box plot of the distance from the nearest origin split by evolutionary events, that is, for conserved (dark blue), newly gained (blue), and lost origins (black) for simulated species. (D) Fraction of origins that are nearest to conserved, newly gained, and lost for simulated species compared to the expected result for uncorrelated events. (E) Box plot of efficiency of lost, conserved, and newly gained origins (respectively in black, dark blue, and blue) and their neighbors (gray) in simulated species. (F) The efficiency of all conserved and newly gained origins compared to the ones flanking a lost origin. Braces indicate subsampling. Box plots show the median (bar), 25–75 (box), and 10–90 (whiskers) percentiles. Panels (DF) show that the model correctly reproduces the correlation between origin birth-death events over evolution and efficiency of the nearest origin. Simulation parameters (see Materials and methods): γ=2.2, β=1.9, overall birth and death rate b¯=13.6Mbp-1t-1, d¯=0.61t-1, and rate of origin firing rate reshuffling R=0.92t-1, where t is measured by protein-sequence divergence. Panels (A) and (B) show data from approximately 600,000 simulated origins, while panels (CF) data from about 100,000 birth and death events and 500,000 conservation events.

In particular, the coupling between the efficiency of an origin and the death rate of its neighbors, through the probability of passive replication, reproduces the empirical correlations shown in Figure 1. Figure 4 summarizes this crucial point of comparison between the joint efficiency/double-stall-aversion model and the pure double-stall-aversion case. The three plots compare efficiency distributions of lost, conserved, and newly gained origins (red for the data, blue for the models) with those of their neighbors (gray). Comparison of these plots shows that only the joint model reproduces the differences in efficiency of lost origins and their neighbors.

Figure 4 with 2 supplements see all
Comparison of model predictions for the correlations of origin birth-death events.

The plots in the red upper box compare efficiency distributions of the best-fitting simulation of the two different models (bottom and central panels) with experimental data (top panel). Comparison of the box plot of efficiency of lost, conserved, and newly gained origins (red for the data, blue for the models) shows better agreement of the joint efficiency/double-stall-aversion model (bottom panel) with the experimental data. Hence, the joint model reproduces well the correlation between evolutionary birth-death events of origins and efficiency of the nearest origin, while the double-stall-aversion model fails. Box plots show the median (bar), 25–75 (box), and 10–90 (whiskers) percentiles. Simulation parameters for the joint model (see Materials and methods): γ=2.2, β=1.9, and for the double-stall-aversion one: γ=2.4. General parameters: overall birth and death rate b¯=13.6Mbp-1t-1, d¯=0.61t-1 and rate of origin firing rate reshuffling R=0.92t-1, where t is measured by protein-sequence divergence. In the green lower box, we compare the predictive power of the two models for each of the tested feature of the experimental data. The box highlights that both the double-stall aversion model and the joint efficiency–double-stall model are able to reproduce the structural features of the genome. Also, the correlation between events–distance from the nearest and event–event of the nearest are correctly predicted by both models. The important difference between the two proposed models is found for the correlation between evolutionary events and origin efficiency, which is predicted and can be explained solely by the joint model.

In order to show that the stall-aversion and interference model has better quantitative agreement with the data, we also performed a simplified likelihood ratio analysis. The full likelihood of the model is complex, but we have defined ‘partial’ likelihoods for the joint and the double-stall-aversion model just taking into account the marginal probabilities shown as box plots in Figure 4, Figure 4—figure supplement 1 (see Materials and methods). Figure 1 shows that the joint model performs better for all the four chosen features. In our view, the qualitative difference shown in Figure 4 may be taken as a stronger argument in favor of the combined model in the sense that, beyond any quantitative agreement relying on parameters, the additional ingredient of a coupling between origin birth-death dynamics and origin rates is needed to explain the data.

The joint efficiency/double-stall-aversion model correctly predicts origin family divergence

Having established that the joint model is required to reproduce observations on single lineages, we turned to its predictions on observations that require knowledge of the whole phylogenetic tree, such as origin evolutionary families, defined as sets of orthologous origins Agier et al., 2018.

We thus set up a simulation of the model on a cladogenetic structure, fixed by the observed structure of the Lachancea clade phylogenetic tree (see Materials and methods for the simulations details). The outputs of each run in such simulations are nine different simulated genomes whose lineages are interconnected in the same way as the empirical species, and each branch follows the empirical divergence. We stress that these simulations just include intersecting lineages whose branched structure corresponds precisely to the lineages of the empirical tree. The phylogenetic structure does not emerge from the simulation as our model does not describe speciation. The model for the tree can simulate nine species, all the species except for Lachancea kluyveri, as this species was used as outgroup for the computation of the length of the tree branches Agier et al., 2018. We have repeated all the analyses on these simulations and verified that all the previous results hold (Figure 5—figure supplement 1). We then turned to other independent predictions of the joint model, which could be compared to measurements in Agier et al., 2018.

Figure 5A reports the dynamics of origin families. As reported in Agier et al., 2018, origins that belong to larger evolutionary families tend to have a higher efficiency compared to origins in smaller families, which is possibly due to the fact that, on average, high-efficiency origins tend to survive longer. Note, however, that there is no deterministic relation between family size and origin age because the relationship between these two is determined by the structure of the phylogenetic tree. Indeed, two families of the same size may have roots in different points of the tree, and thus the origins belonging to them may have very different ages. Thus, the prediction of the relation between origin efficiency and origin family size is not trivial. Figure 5A shows the results for the origin efficiency for families of varying size, comparing the experimental data and 100 different runs of the simulation.

Figure 5 with 2 supplements see all
The efficiency/double-stall-aversion model predicts origin divergence.

The plots compare predictions of the evolutionary model on the extent of origin divergence (simulations of the Lachancea phylogenetic tree) with empirical data. (A) Box plot of origins efficiency distributions split by family size. The plot compares origin families (sets of orthologous origins) in the nine Lachancea species (white line and red shaded areas) and simulated species (blue boxes, for 100 simulation runs). Medians are shown as white line for data, black bar for simulation, 25–75 percentiles as shaded area for data, box for simulation, and 10–90 percentiles as coarse shaded area for data, whiskers for simulation. (B) Origin divergence measured by the number of origins in the common ancestor that were lost in a pair of species, plotted as a function of total origin loss events. The plot compares model simulations (blue circles, 100 simulation runs), the experimental data (red squares), and a null model that shuffles the empirical birth-death events in each branch (green triangles, 1000 simulation runs). Error bars are standard deviations on y-axis values. Simulation parameters (for the evolutionary model, see Materials and methods): γ=2.2, β=1.9, overall birth and death rate b¯=13.6Mbp-1t-1, d¯=0.61t-1, and rate of origin firing rate reshuffling R=0.92t-1, where t is measured by protein-sequence divergence.

As a second step, we have considered the model prediction for the divergence of the shared origins in two species descending from a common ancestor. Specifically, we asked whether the number of origin death events occurring in two branches of the tree could justify the number of common origins in the two species. Indeed, whenever in a pair of species the number of shared origins is lower than the number of origins belonging to their common ancestor, this discrepancy must be due to the evolutionary loss events. These events are predicted by our model to be correlated in diverging species due to the common ancestry and the coupling of loss events to origin efficiency and distance. This correlation should lower the number of shared origins losses compared to a null expectation where loss events are not correlated. Figure 5B shows that the model correctly predicts the divergence in the number of shared origins lost during evolution without any parameter adjustment. We also verified that, as expected, a null evolutionary model is not able to reproduce this feature. The null model fixes in each branch of the simulated tree the same number of birth and death events that are present in the corresponding branch of Lachancea tree, but these events occur uniformly along the genome. The difference between the null model and the evolutionary model predictions shown in Figure 5B is a consequence of correlated origins losses due to the common genome structure, in terms of origins positions and efficiencies, that each pair of species inherit from their common ancestor.

We note that birth and death rate are inferred as global parameters, ignoring correlations. Despite this, Figure 5B shows that the model reproduces the higher correlation in birth and death events in closer-related branches than in distant branches as a consequence of the common positions and firing rates of the origins in the ancestor.

Discussion

Overall, this study provides a framework to study replication program evolution driven by replication origin birth-death events and demonstrates that both fork stalling and efficiency shape the adaptive evolution of replication programs. The model framework is predictive and falsifiable, and it can be used to formulate predictions on the phylogenetic tree. In future studies, it would be interesting to explore the predictions for the evolutionary dynamics under perturbations, such as evolution under increased replication stress or conditions where fork stalling becomes more frequent. Additionally, the framework can be used to discover specific trends, such as different evolutionary dynamics of specific genomic regions (subtelomeres Yue, 2017, regions containing repeats, etc. Arbona et al., 2018; Boos and Ferreira, 2019), role of genome spatial organization Marchal et al., 2019, and correlated firing of nearby origins.

A general question concerns the predictive value of the model proposed here on out-of-sample data. Figure 5 shows that fit-independent predictions apply across the tree. Importantly, the model is based on simple global parameters and not fine-tuned on local features of the tree. To underline this point, we verified that a model fit using only the subtree between LADA a LAWA yielded similar parameters. Clearly, we cannot exclude that the values of the birth and death rate, and also the bias parameters γ and β, could be Lachancea-specific, while we speculate that the conclusions on the relevant evolutionary mechanisms might apply more generally.

The previous approach by Newman et al., 2013 described the evolution of origin distance as an optimization process that minimizes double fork-stall events, without attempting to characterize explicitly the evolutionary dynamics. Such approaches are limited compared to the framework presented here because they can predict only the origin distance distribution, and they do not allow any prediction regarding origin and replication program evolution along lineages and across phylogenetic trees. In accordance with the results of Newman et al., we confirm that double-stall events are a primary driver of the evolution of replication programs, and we frame this finding into the empirically measured birth-death evolutionary dynamics of replication origins. Additionally, we show that next to fork-stall events, origin efficiency plays an important role in shaping the evolutionary landscape seen by a replication timing profile.

What could be the mechanisms coupling efficiency to origin birth-death? The actual process of origin death could be nearly neutral Koonin, 2016 as low-efficiency origins are – by definition – rarely used, and unused origins over evolutionary times are more prone to decay in sequence, and consequently in firing rate until they disappear. Equally, a newborn origin close to a very strong one (which would make the newborn origin relatively inefficient) could be used rarely. This would make this origin relatively less likely to establish over evolutionary times compared to an isolated newborn origin. However, rarely used origins could be essential in situations of stress (and in particular they could resolve double-stall events). Finally, a fitness cost for maintaining too many origins might set up an overall negative selection preventing a global increase in origin number Zhang et al., 2017; Karschau et al., 2012; Das et al., 2015.

Materials and methods

Data

The experimental data used in this work come from Agier et al., 2018. In particular, we made use of the data regarding the replication origins. For each origin in each of the 10 Lachancea species, this data set includes the chromosome coordinate and firing rate, and the inferred birth and death events occurred in the branches of the phylogenetic tree shown in Figure 1—figure supplement 1. Focusing on the terminal branches of the tree and on the extant replication origins, this study defines three categories of origins: (i) ‘conserved’ origins (which survived from the last ancestor), (ii) ‘newly gained’ origins gained in the last branch of the phylogenetic tree, and (iii) ‘lost’ origins, which were present in the last ancestor species and are not present in the terminal branch. Properties of the lost origins (e.g., position and firing rate) are inferred from the projection of the corresponding ones on the closest species, keeping into account synteny. Since the synteny map is less precise in distant species, the information on the origins events is only available for the six sister species in the tree, which belong to the three closest species pairs, highlighted with the red shaded area in Figure 1—figure supplement 1.

Computation of the efficiency

Request a detailed protocol

Origin efficiency was defined as the probability of actively firing during S phase (or, equivalently, the probability of not being passively replicated by forks coming from nearby origins). In practice, we computed it by the following formula:

(4) effi=(1Pi,i1)(1Pi,i+1) ,

where Pi,i+1 and Pi,i-1 are the probabilities for the ith origin of passive replication respectively from the (i+1)th and (i-1)th origins. Note that this efficiency formula Equation 4 is an approximation that only takes into account the possibility to be passively replicated by neighbor origins, neglecting the influence of other nearby origins. Following Agier et al., 2018, for computing the efficiency we assumed that the origin firing process has constant rate Zhang et al., 2017, and we thus obtain the following closed expressions for the probabilities of passive replication:

(5) Pi,i+1=λi+1λi+1+λiexp[λi|xi+1xi|v] ,

and

(6) Pi,i-1=λi-1λi-1+λiexp[-λi|xi-1-xi|v].

In the above equations, v is the typical velocity of replication forks, xi is the ith origin chromosome coordinate, and λi is the ith origin firing rate divided by the mean firing rate of the species the origin belong to. The raw firing rates in the data are affected by the different physiology of the nine Lachancea species in the experimental growth conditions (which were the same for all the species). In order to reduce these differences, we normalized the rates by their average for each given species. For this reason, we did not make use of the origin efficiency data already present in Agier et al., 2018.

Computation of the double-stall probability

Request a detailed protocol

The probability PiDS that two converging forks stall is easily computed in the limit where the stall probability per basepair is small and the number of basepairs is large. Under these assumptions, stalling is a Poisson process with rate (per basepair) π. PiDS can be written in terms of the probability PS(x) that a single fork stalls after replicating x nucleotides:

(7) PiDS=0lidx0lixdyPS(x)PS(y) ,

where li is the length (number of basepairs) of the ith inter-origin region. Imagine two converging replication forks starting from origins i and i+1: the two integration variables x and y represent the number of basepairs that each fork replicates before stalling. By using the Poisson process result PS(x)=πexp(-πx) and performing the integration, one obtains the result in Equation 1.

Evolutionary model

Request a detailed protocol

We defined origin birth-death models incorporating different evolutionary biases. In these models, the genome is described as a one-dimensional circle with discrete origin location xi, where the length of the genome is equal to the average genome length in Lachancea clade (10.7Mbp). We made use of a circular genome in order to avoid border effects. In the model, the set of origins change over evolution by three basic (stochastic) processes, birth of an origin in a certain genome region, origin death, and change of origins firing rate. We have verified that choosing linear chromosome does not alter significantly our findings, although it affects the distances between origins close to chromosome ends (Figure 3—figure supplement 1).

Overall origin birth/death rates were estimated from the data as follows. To estimate the overall birth rate b¯, we considered, for all the terminal branches of the phylogenetic tree, the number of birth events Nb, the genome length of the corresponding species L, and the length of the tree branch T, and divided Nb by LT. Then we averaged over all terminal branches. To estimate the overall death rate d¯, we followed a similar approach, taking the number of death events Nd in the terminal branches, the length of the branch T, and the number of origins in the corresponding species nori, then computing NdT-1nori-1 for all the terminal branches and averaging these values. The final results for overall birth and death rates from the origin birth-death events across the Lachancea clade are b¯=13.5627Mbp-1t-1 and d¯=0.612287t-1.

We verified that the assumption of constant rates was consistent with the empirical variability of the numbers of birth and death events per unit time along different branches of the tree by comparing simulations with data. Figure 5—figure supplement 2 shows that simulations and empirical data present similar spreading.

The process by which origin firing rates change over evolution was described as stochastic, with every origin having a fixed probability per unit time of changing its firing rate, given by R=0.92t-1, a value fixed from experimental data (see Appendix 1 and Figure 1—figure supplement 4). When a firing rate changes, it is resampled from the distribution of all the empirical normalized firing rates computed using the data in Agier et al., 2018 (see Appendix 1 and Figure 1—figure supplement 3 for more details).

Simulations

Code availability

Request a detailed protocol

The code used to run the simulations, together with instructions to run it, was shared as a repository on Mendeley data (Droghetti, 2020).

Algorithm

Request a detailed protocol

The prediction of the different evolutionary models was derived numerically, making use of custom simulations written in C++, which implement the origin birth-death dynamics as a Gillespie algorithm (Gillespie, 1976). Every model variant was required to reproduce the experimental overall rates, b¯=13.5627Mbp-1t-1 for origin birth, d¯=0.612287t-1 for origin death, and R=0.92t-1 for firing rate change. We simulated the three processes defining the model as follows. (i) the birth process has a common definition for the stall-aversion and joint model. The algorithm first tests each subsequent inter-origin region, calculates the birth probability from Equation 2, and stores the results. Subsequently, it computes the normalization factor N in order to match the empirical birth rate per nucleotide b¯. Finally, it samples all the inter-origin regions drawing birth events from the computed birth probability (Equation 2). New origins are placed the midpoints of the tested intervals. (ii) The death process is different for the stall-aversion model (unbiased) and the joint model (related to the origin efficiency). In the joint model, the algorithm first calculates the death rate for each origin using Equation 3 and stores the results. Subsequently, it computes the normalization factor N in order to match the empirical mean death rate d¯. Finally, it samples all origin drawing death events from the computed death probability. For the unbiased process (stall-aversion model), the dynamics is identical, but all the origins have the same death rate d¯ so that the algorithm can skip the calculation of N. (iii) The process updating origin firing rates over evolutionary times is common to all model variants. The probability of update per origin per unit time is R. Origins are sampled for each time step and assigned a new rate uniformly extracted from the empirical distribution of all normalized firing rates with probability Rdt.

During the simulation, the genome configuration (chromosome position, firing rate, efficiency for each origin) is known at each time step, which matches the empirical time (tree branch length, measured by protein-sequence divergence). For simulating single lineages, we started with a collection of 50 origins, with positions and firing rate uniformly drawn from all the possible ones. Rapidly, the inter-origin distances distribution, the efficiency one, and the number of origins reach a steady state (for the number of origins, set by the balance of birth and death rate, and characterized by approximately 225 origins). Configurations, including birth-death events, were printed at regular time intervals after steady state is reached. The time interval between prints is chosen to be equal to the average length of the Lachancea phylogenetic tree terminal branches in order to compare single-lineage simulations with empirical data. For simulations on a phylogenetic tree, after one species reaches the steady state, it is used as a root. To reproduce the empirical branching structure of the tree, we run the simulation, one for each branch of the phylogenetic tree, each time starting from the species at the previous branching point, for a period that matches the length of the branch. If the simulated branch is terminal, then the configuration corresponds to one of the empirical species, otherwise it corresponds to a ‘branching-point species’ and it can be used as a starting point for other simulations. Each simulation run gives nine different simulated species with the same cladogenetic structure as the empirical species (Figure 1—figure supplement 1).

Fitting procedure

Request a detailed protocol

The biased birth-death processes in the simulations rely on some parameters to tune the strength of the bias, which are the only parameters to fix by a fit, since all the other parameter values are fixed empirically. In the joint model, there are two free parameters, γ and β, that tune respectively the strength of the bias on the origin birth and on the origin death process. For a discrete set of parameter pairs spanning realistic intervals, we run hundred different simulations, each starting with a randomized genome. Considering the simulated species for all the pairs of parameter values, we quantify the discrepancy with experimental data by evaluating the L1 distance of the normalized histogram of efficiency and inter-origin distances. This quantity is a number between 0 and 2, 0 if the histograms perfectly overlap and 2 if they have completely different supports. For each pair of parameters, the analysis gives two values of discrepancy. We choose the value of γ (the parameter that tunes the bias on the birth rate based on double-stall aversion) by taking the smaller discrepancy from the inter-origin distances distribution. For the value of β (which tunes the interference bias on the death rate in joint model), we chose the one that gave us the smaller area on the efficiency distribution. For the double-stall-aversion model, the fitting procedure is identical and only requires to fix γ.

Simplified likelihood analysis

Request a detailed protocol

We performed a (simplified) likelihood ratio analysis in order to test the better quantitative performance of the combined model. The full likelihood of the models analyzed here is complex, but we have defined ‘partial’ likelihoods for the joint and the double-stall-aversion model only taking into account the marginal probabilities shown in Figure 4—figure supplement 1. Hence, the test evaluates for both models the goodness of the predicted correlation between the efficiency and firing rate of the lost origins and the ones of their neighbors. The likelihood ratio test quantifies how much the prediction of a certain model is better than a reference (‘null’) model. We chose the double-stall-aversion model as reference (equivalent to setting β=0 in the joint model). Specifically, one evaluates

(8) Lr=2log(Ljoint(γ,β)LDS(γ,β=0))=2(ljoint(γ,β)-lDS(γ,β=0)),

where LX are the likelihoods of the two models and lX are the log-likelihoods. Assuming that Lr is χ-squared distributed (this is generally the case for large samples), we could compute a p-value associated to this test.

Null birth-death model

Request a detailed protocol

We defined a null birth-death model where origin birth-death events in sister species are uncorrelated in order to analyze the divergence of shared origins and compare it with the prediction of the evolutionary model. This model implements birth and death events uniformly, regardless of origin position and firing rate, fixing the number of events for each branch of the simulated phylogenetic tree. These values are taken from the inference reported in Agier et al., 2018 (shown in Figure 3A of that study and in Figure 1—figure supplement 1). The simulation of this model starts with 220 origins (the number of origins inferred for to LA2, the species at the root of the tree). Subsequently, following the structure of the Lachancea phylogenetic tree, the simulation proceeds as follows: (i) at each branching point, the genome is copied into two daughters; (ii) for each daughter, the prescribed number of random death and birth events (in this order) is generated on random origins; and (iii) the simulation stops when it reaches the leaves of the Lachancea tree.

Appendix 1

Estimating parameters for the evolution of origin firing rates

This section motivates the model implementation of the evolutionary dynamics of firing rates. In order to quantify the change of origin firing rates over evolutionary times, we studied how the correlation between firing rates of conserved origins behaves as species diverge (Figure 1—figure supplement 3A). To quantify the divergence, for each pair of species in the Lachancea clade we calculated the Spearman correlation coefficient between the sets of firing rates belonging to corresponding origins in the two species considered (normalized by the species mean firing rate). We found that the more the species are distant, the less these two sets are correlated, which means that origin initiation rates diverge during evolution and origins lose memory of their initial firing rate. The model describes the evolution of firing rates as follows. Every origin changes its firing rate by extracting a new value from the distribution of empirical normalized ones, regardless of their previous firing rate. This process is characterized by a resampling rate R, common to all the origins, which defines the probability per unit time that an origin resamples its firing rate. The slope of the correlation coefficient in empirical data defines the speed at which the origin firing rates evolve. Hence, it is possible to fit this specific slope and extract the value of R.

In order to do that, we simulated the evolutionary process with unbiased origin death and update of the firing rate. This simulation can be performed without the birth process because the only origins that one needs to consider in computing the Spearman coefficient between two species are the conserved ones. Each simulation started from 225 origins, with firing rates randomly sampled from the empirical set of firing rates, evolving the genomes changing the firing rates with the resampling process described above and removing the origins according to the death rate estimated from the data. By performing several simulations with different values of the extracting rate R, it is possible to fit its best value. For each R tested, we ran 1000 simulations for an evolutionary time corresponding to 1.6.

After computing the Spearman correlations between snapshots at different evolutionary times, we performed an exponential fit in order to see which value of the R parameter gave the best agreement with the experimental data, finding the best-fit value R=0.92. Figure 1—figure supplement 4 shows the trend achieved by the simulation using R=0.92, and it shows a very good agreement between experimental data and simulations.

Note that in Agier et al., 2018 a similar analysis was carried out in order to verify if the reprogramming of the origins firing rate has an impact on the differentiation of replication timing. The authors analyzed the origin firing time differences between conserved replication origins in all pairs of species and found that this difference does not correlate with the phylogenetic distance between species. This finding is apparently in contrast with our results, which suggest that origin reprogramming increases with distance between species. We believe that this discrepancy is due to the higher sensitivity of the Spearman correlation and of the use of species-average normalized firing rates in this study.

The empirical data falsify the scenario where interference alone drives origin evolution

This section presents a theoretical analysis of the scenario where solely origin interference sets the evolutionary pressure on replication timing profiles. This analysis shows that a description that only takes into account the evolutionary pressure that acts on origin efficiency is not able to reproduce the origins spatial arrangement, a crucial feature in empirical yeast data. To carry out this analysis, we take a ‘maximum entropy’ approach (see Banavar JR, Maritan A, Volkov I. Applications of the principle of maximum entropy: from physics to ecology. J Phys Condens Matter. 2010;22(6):063101. doi:10.1088/0953-8984/22/6/063101) and infer an effective ‘force potential’ acting on inter-origin distance by looking at its (assumed equilibrium) distribution. Specifically, the effective potential acting on the origin efficiency starting from the empirical efficiency distribution can be analytically computed from the following formula:

where eff is the efficiency, eff[0,1], and P(eff) the efficiency probability density function.

The above potential, once given the relation between efficiency and distance between origins (Equation 4), defines another potential Hd(d) that act on the inter-origin distances. By taking the exponential of Hd(d) one obtains the expected probability distribution predicted for the distances at equilibrium.

In order to find Hd(d), one must invert Equation 4 and find d(eff). To accomplish this task, we have approximated the three-body interaction that gives the efficiency with a two-body interaction. This assumption implies that each origin feels the interference of only one of his two neighbors and is effective as long as three-origin interactions can be decomposed in two-origin components. Under this assumption, Equation 4 becomes

Note that origin efficiency (Equation 4) also depends on the firing rates of the origin and its neighbor, hence, strictly speaking, one has that

To eliminate the firing rate dependence, we computed an effective potential Hd on the distance, which averages the effect of the different firing rates. To this end, we used the mean value theorem for integrals as follows:

In other words, we substituted all the firing rates with the average one §lt;λ1 since the rates are normalized on the species average. With this simplification, going from Heff(eff) to Hd(d) is straightforward and gives

and

From the potential Hd, we can compute the prediction for the equilibrium probability distribution of inter-origin distances

where N is a normalization factor. In order to use this calculation on the data, we inferred the expected potential from the efficiency distribution, assuming that the interaction only depends on efficiency, and we then obtained the model prediction for the expected inter-origin distribution based on the efficiency profile. Comparison of this prediction with the empirical inter-origin distance distribution provides a test of the model. This procedure does not require to adjust any model parameter. Figure 4—figure supplement 2 shows the result of this analysis. The predicted distribution does not match the empirical one. This means that any evolutionary model that assumes a bias based only on the efficiency (in other words, one that takes into account only the evolutionary pressure given by origin interference) cannot reproduce (at steady state) the correct spatial organization of replication origins.

Data availability

The code used to run the simulations, together with a readme file, was shared as a Mendeley data repository (https://data.mendeley.com/datasets/vg3r5355bj/2).

The following previously published data sets were used

References

    1. Leonard AC
    2. Méchali M
    (2013) DNA replication origins
    Cold Spring Harbor Perspectives in Biology 5:a010116.
    https://doi.org/10.1101/cshperspect.a010116
    1. Rhind N
    2. Yang SC
    3. Bechhoefer J
    (2010) Reconciling stochastic origin firing with defined replication timing
    Chromosome Research: An International Journal on the Molecular, Supramolecular and Evolutionary Aspects of Chromosome Biology 18:35–43.
    https://doi.org/10.1007/s10577-009-9093-3

Decision letter

  1. Sandeep Krishna
    Reviewing Editor; National Centre for Biological Sciences­‐Tata Institute of Fundamental Research, India
  2. Naama Barkai
    Senior Editor; Weizmann Institute of Science, Israel
  3. Alessandro De Moura
    Reviewer

In the interests of transparency, eLife publishes the most substantive revision requests and the accompanying author responses.

Acceptance summary:

The reviewers and editors found that the manuscript presents a simple but compelling model that explains the dynamics of replication origin birth and death, which enhances our understanding of the selection pressures that have shaped the distribution of replication origins.

Decision letter after peer review:

Thank you for submitting your article "An evolutionary model identifies the main selective pressures for the evolution of genome-replication profiles" for consideration by eLife. Your article has been reviewed by 2 peer reviewers, and the evaluation has been overseen by a Reviewing Editor and Naama Barkai as the Senior Editor. The following individual involved in review of your submission has agreed to reveal their identity: Alessandro De Moura (Reviewer #3).

The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission.

The reviewers appreciate that the manuscript presents a simple but compelling model that explains the dynamics of replication origin birth and death, which enhances our understanding of the selection pressures that have shaped the distribution of replication origins. However, both reviewers had a series of concerns. Please go through their reviews below. In your revised manuscript please address in particular the major concerns they have raised.

We would like to draw your attention to changes in our policy on revisions we have made in response to COVID-19 (https://elifesciences.org/articles/57162). Specifically, when editors judge that a submitted work as a whole belongs in eLife but that some conclusions require a modest amount of additional new data, as they do with your paper, we are asking that the manuscript be revised to either limit claims to those supported by data in hand, or to explicitly state that the relevant conclusions require additional supporting data.

Our expectation is that the authors will eventually carry out the additional experiments and report on how they affect the relevant conclusions either in a preprint on bioRxiv or medRxiv, or if appropriate, as a Research Advance in eLife, either of which would be linked to the original paper.

Reviewer #2:

The manuscript entitled "An evolutionary model identifies the main selective pressures for the evolution of genome-replication profiles" is an examination of the principles shaping evolution of replication origin placement. Overall I found the manuscript to be engaging and interesting, and the topic of general importance. It is quite compelling that with just two parameters, origin efficiency and distance between origins, a good model can be built to describe the dynamics of origin birth and death. While this work on its own is sufficiently important for publication, it would be very interesting to see whether the model can be updated in the future to address whether there are fork-stalling or origin-generating mechanisms that shape evolution of specific inter-origin spaces. This work provides a very good foundation for such efforts.

I have a few major, general concerns I would like the authors to address.

If I'm interpreting the methods correctly, it seems the parameters used in these simulations, such as mean birth rate, mean death rate, γ, and β, were fit to the data once, and used as point estimates during simulation. If true, I expect the simulations to be yielding estimates of birth and death rates with a much narrower distribution of outcomes than is likely to be realistic given what an appropriate level of confidence in those parameter estimates would be. Could the parameters be fit to data in such a way that we attain an estimate of confidence in the parameter values, from which a distribution could be generated and sampled from during simulation?

Closely related to my prior concern, I would like the authors to demonstrate the general predictive value of their model on out-of-sample data. Can the model be applied to other data on replication timing? Without such an attempt to demonstrate the model's applicability to out-of-sample prediction, the reader cannot ascertain whether the model is overfit to the Lachancea data from Agier et al., 2018. Also, keeps the parameter estimates here from being overfit to better predict origin birth and death events in closely related branches of the Lachancea tree in Figure S1? Are γ and β inferred in a way that accounts for the higher correlation in birth and death events in closer-related branches than in distal branches, or has the fit ignored those correlations?

The authors state that their model identifies selective pressures. The authors imply, and specifically state in lines 238-242, that increased death rate of origins which happen to be nearby highly efficient origins represents selective pressure against the less efficient origins. It isn't until the discussion that the authors raise the possibility that there may simply be a lack of selective pressure to retain inefficient origins that are near highly efficient origins. In my view, it's more likely that selection for the existence of an inefficient origin is simply lower than the drift barrier, so mutagenesis and drift can passively remove such origins over time without the need to invoke selection against inefficient origins.

Figure 3 is intended to show that the stall-aversion and interference model performs better at predicting correlations between efficiency of lost origins and their nearest neighbor. I agree, but I do not think Figure 3 presents a strong case for this conclusion. Figure S6 presents stronger evidence to me. While Figure 3 does qualitatively suggest that the joint model may predict the correlation between neighboring origin efficiency and origin loss better than the double-stall model alone, it almost appears to me that the model with fork stalling and interference has significantly overestimated the correlation. Is there a quantitative way, perhaps using information criteria, though I admittedly am not sure how one would go about doing that with simulations such as these, to demonstrate that the model with both effects has better predictive value than the one with only fork stalling?

There are a couple of assumptions of the model that I would like the authors to examine in further detail. First, that origin birth events occur in the middle of an inter-origin space. I am not aware of evidence pointing to this being a good a priori assumption. Can you re-run the simulations, allowing origins to arise at a random site within the inter-origin space into which it is born? Second, is it reasonable to expect origin firing rates to reshuffle to a new value randomly, without any dependence on their prior rate? Perhaps I'm mistaken, but it seems to me that an origin's firing rate should evolve more gradually, and should have a higher probability of sampling from values near its current value than from values very far from its current value.

Reviewer #3:

This paper proposes a novel and relevant evolutionary model that explains many aspects of replication origin statistics in a family of yeast species. It is a step forward in our understanding of the evolutionary pressures that affect the distribution of replication origins in Eukaryotes. I recommend it should be published in eLife, provided the authors revise the paper to address the following issue:

1. Many of the conclusions of the paper are based on the claim that the extending the model by adding an efficiency bias to the origin death rate makes the model fit the data better; in particular, they say in line 213 that "the observed huge divergence in efficiency between lost origins and their neighbors is absent in the model simulations."

This is reinforced in line 243, and in other parts of the text. But inspecting Figure 3, the two models (with and without a death rate bias) yield almost identical box-plots; if anything, the box-plots for the lost/nearest fractions of the pure double-stall aversion model seem visually to match the data marginally better. So why do the authors claim that the model with death rate bias is a much better fit? This is far from clear by just inspecting the data. I see no "huge difference" in the plots. There is a difference, but it is far from huge – the differences in the mean are much smaller than the size of the boxes. It seems to me unjustified to use this to choose one model over another. One way to ascertain this is to do rigorous statistical tests to determine if the differences in the means of the simulated and observed data are statistically significant; for example, a t-test.

https://doi.org/10.7554/eLife.63542.sa1

Author response

Reviewer #2:

The manuscript entitled "An evolutionary model identifies the main selective pressures for the evolution of genome-replication profiles" is an examination of the principles shaping evolution of replication origin placement. Overall I found the manuscript to be engaging and interesting, and the topic of general importance. It is quite compelling that with just two parameters, origin efficiency and distance between origins, a good model can be built to describe the dynamics of origin birth and death. While this work on its own is sufficiently important for publication, it would be very interesting to see whether the model can be updated in the future to address whether there are fork-stalling or origin-generating mechanisms that shape evolution of specific inter-origin spaces. This work provides a very good foundation for such efforts.

I have a few major, general concerns I would like the authors to address.

If I'm interpreting the methods correctly, it seems the parameters used in these simulations, such as mean birth rate, mean death rate, γ, and β, were fit to the data once, and used as point estimates during simulation. If true, I expect the simulations to be yielding estimates of birth and death rates with a much narrower distribution of outcomes than is likely to be realistic given what an appropriate level of confidence in those parameter estimates would be. Could the parameters be fit to data in such a way that we attain an estimate of confidence in the parameter values, from which a distribution could be generated and sampled from during simulation?

We agree that this point is important. To verify consistency of using single values for these parameters, we compared the variability of the local numbers of birth and death events along different branches of the tree obtained from simulations with data. Figure 5—figure supplement 2 shows that simulations and empirical data present similar spreading, especially considering death events. For births, we noticed one relevant outlier for the longest branch of the tree, which could motivate a future deeper analysis. Overall, simulations using constant rates yield a similar distribution of events to data, which appears to be satisfactory, at least to a first approximation. For the bias parameters γ and β, this additional test is not necessary, as these two parameters were fixed by choosing the values giving the smallest error on these two distributions (hence a constant parameter model, neglecting variations, is sufficiently good for these parameters). This is shown in Figure 3 (panels A and B) of the revised version main text.

Closely related to my prior concern, I would like the authors to demonstrate the general predictive value of their model on out-of-sample data. Can the model be applied to other data on replication timing? Without such an attempt to demonstrate the model's applicability to out-of-sample prediction, the reader cannot ascertain whether the model is overfit to the Lachancea data from Agier et al., 2018.

This is an interesting point but unfortunately no systematic data exist to perform this analysis. Figure 5 of the revised version shows that t-independent predictions apply across the tree. It is important that this model is based on simple global parameters, and not ne tuned on local features of the tree. To emphasize this point, we performed a model t using only a sub-sample of the yeasts, and showed that it leads to the same conclusions. For this analysis we choose the subtree between LADA and LAWA (see Author response image 1) in order to sample a subset of species that were neither too close nor too far compared to the full sample. We anyway expect that the estimated parameters may vary with particular choices of subtrees, also due to sampling effects. This procedure yielded parameter values that are very similar to the t of the full tree. For example the new death (=0.58 t1) and the birth (=13.63 Mbp1t1) rate are really close to the values estimated using the whole phylogenetic tree, and the fitting procedure of the joint model with these new parameters leads to a values for the γ and β parameters of 2.25 and 1.9 respectively, which, again, are two values really close to the ones obtained with the full tree parameters. We added a comment in the text stating explicitly that we cannot exclude that the values of the birth and death rate, and also the bias parameters γ and β could be Lachancea-specific, while the conclusions on the relevant evolutionary mechanisms might apply more generally.

Author response image 1
Subtree chosen for the model fit.

Also, keeps the parameter estimates here from being overfit to better predict origin birth and death events in closely related branches of the Lachancea tree in Figure S1? Are γ and β inferred in a way that accounts for the higher correlation in birth and death events in closer-related branches than in distal branches, or has the fit ignored those correlations?

The birth and death rate are inferred as global parameters, ignoring correlations. Despite of this, the model reproduces the higher correlation in birth and death events in closer-related branches than in distant branches as a consequence of the common positions and firing rates of the origins in the ancestor, as Figure 5B of the main text shows. The bias parameters β and γ are not inferred based on branch data, but on distributions on extant species. We clarified these points in the revised text and the caption of Figure 5.

The authors state that their model identifies selective pressures. The authors imply, and specifically state in lines 238-242, that increased death rate of origins which happen to be nearby highly efficient origins represents selective pressure against the less efficient origins. It isn't until the discussion that the authors raise the possibility that there may simply be a lack of selective pressure to retain inefficient origins that are near highly efficient origins. In my view, it's more likely that selection for the existence of an inefficient origin is simply lower than the drift barrier, so mutagenesis and drift can passively remove such origins over time without the need to invoke selection against inefficient origins.

We fully agree with this point. We have modified the text in order to state early on that the evolutionary bias against weak origins could be simply due to the fact that they are neutral, or that their advantage is not sufficiently high for them to survive drift.

Figure 3 is intended to show that the stall-aversion and interference model performs better at predicting correlations between efficiency of lost origins and their nearest neighbor. I agree, but I do not think Figure 3 presents a strong case for this conclusion. Figure S6 presents stronger evidence to me. While Figure 3 does qualitatively suggest that the joint model may predict the correlation between neighboring origin efficiency and origin loss better than the double-stall model alone, it almost appears to me that the model with fork stalling and interference has significantly overestimated the correlation. Is there a quantitative way, perhaps using information criteria, though I admittedly am not sure how one would go about doing that with simulations such as these, to demonstrate that the model with both effects has better predictive value than the one with only fork stalling?

This point was also raised by reviewer 3. In order to address it, we performed a (simplified) likelihood ratio analysis in order to show the better quantitative performance of the combined model. The full likelihood of our model is complex, but we have de ned partial likelihoods for the joint and the double stall aversion model just taking into account the marginal probabilities shown as box plots in Figure 4 of the revised text (former Figure 3) and in Figure 4—figure supplement 1. In other words, we have tested the predicted correlation between the efficiency and firing rate of the lost origins and the ones of their neighbor for both models. The likelihood ratio test quantifies how much the prediction of a certain model is better than a reference ( null ) model. We chose the double stall aversion model as reference (equivalent to setting β = 0 in the full model). See Equation 8, where LX are the likelihoods of the two models and lX are the log-likelihoods. Assuming that Lr is χ-squared distributed (this is generally the case for large samples), we could compute a P-value associated to this test. Supplementary file 1 shows that the joint model performs better for all the four chosen features. In our view, the qualitative difference shown in Figure 4 may be taken as a stronger argument in favor of the combined model, in the sense that, beyond any quantitative agreement relying on parameters, the additional ingredient is needed to explain the data. We revised the text to highlight this point.

There are a couple of assumptions of the model that I would like the authors to examine in further detail. First, that origin birth events occur in the middle of an inter-origin space. I am not aware of evidence pointing to this being a good a priori assumption. Can you re-run the simulations, allowing origins to arise at a random site within the inter-origin space into which it is born?

We agree that these analyses are useful, and we implemented the analysis of the suggested model variants relaxing specific assumptions. Author response image 2 shows the resulting inter-origin distance distribution for a model where the double stall aversion is implemented, but where newborn origins are placed at uniformly chosen position of the associated interval. The figure shows that relaxing the assumption that new origins are born at mid points has consequences on the distance distribution, irregardless of the value of the parameter γ. We interpret this as the result of a faster (hence undetectable in our data) evolutionary process that counter selects origins far from midpoints. The reason for placing newborn origins at midpoints can then be justified by the results of Newman et al.: in order to minimize the global double stall probability the origins perform several attempts within a tree branch, of which we observe only the successful ones. This choice is also justified in the data by Figure 1—figure supplement 2, which shows that birth events show a strong bias for being close to mid points of their associated intervals. We added this figure as a supplementary figure to justify our assumption.

Author response image 2
The "uniform draw" model does not reproduce the inter origin distance distribution.

The figure compares the empirical distance distribution (red point line) with the one resulting from the simulation of the uniform draw model (blue bars) with γ=10. We choose this value because for γ=10 the error made on this distribution has already saturated to the lower reached level. The model tested here cannot reproduce the inter origin distance distribution.

Second, is it reasonable to expect origin firing rates to reshuffle to a new value randomly, without any dependence on their prior rate? Perhaps I'm mistaken, but it seems to me that an origin's firing rate should evolve more gradually, and should have a higher probability of sampling from values near its current value than from values very far from its current value.

We originally started this project with the same idea, modeling rate evolution as a diffusion process. However, on the evolutionary time scales associated to the Lachancea data, it turns out that this expected gradual evolution is not visible, leaving the model with too many extra parameters (a firing rate diffusion constant, and bounds to set the empirical distributions) that are very di cult to estimate. Indeed, Author response image 3 shows the firing-rate distributions of the conserved (thus older) origins and of newborn (younger) ones. The two distribution are similar (within their errors), suggesting that (on the observation time scale) newborn origins already take a firing rate sampled from the steady-state distribution. This is condition that is not generally met under a simple diffusive process. This is also shown in Figure 1—figure supplement 3 of the revised text, which shows that the cumulative probability of the firing rates of newborn and conserved origins are similar across the different Lachancea species. In light of these results, we reverted to a simpler parameter-poor version of the model. Possibly this choice can be improved upon in the future, when additional measurements will be available. We addressed this point in the revised text.

Author response image 3
Newborn origins and conserved ones have similar distributions of firing rates.

The plot shows the probability distribution functions for the normalized firing rates for all the origins in the ten Lachancea species (red diamonds) and for those origins which have been gained in the terminal branches belonging to the six sister species (green triangles). The two distributions are similar, supporting the idea that the firing rates of the new gained origins are at steady state.

Reviewer #3:

This paper proposes a novel and relevant evolutionary model that explains many aspects of replication origin statistics in a family of yeast species. It is a step forward in our understanding of the evolutionary pressures that affect the distribution of replication origins in Eukaryotes. I recommend it should be published in eLife, provided the authors revise the paper to address the following issue:

1. Many of the conclusions of the paper are based on the claim that the extending the model by adding an efficiency bias to the origin death rate makes the model fit the data better; in particular, they say in line 213 that "the observed huge divergence in efficiency between lost origins and their neighbors is absent in the model simulations."

This is reinforced in line 243, and in other parts of the text. But inspecting Figure 3, the two models (with and without a death rate bias) yield almost identical box-plots; if anything, the box-plots for the lost/nearest fractions of the pure double-stall aversion model seem visually to match the data marginally better. So why do the authors claim that the model with death rate bias is a much better fit? This is far from clear by just inspecting the data. I see no "huge difference" in the plots. There is a difference, but it is far from huge – the differences in the mean are much smaller than the size of the boxes. It seems to me unjustified to use this to choose one model over another. One way to ascertain this is to do rigorous statistical tests to determine if the differences in the means of the simulated and observed data are statistically significant; for example, a t-test.

This point was also raised by reviewer 2. For the description and results of our analysis please refer to the previous section and to Supplementary file 1.

https://doi.org/10.7554/eLife.63542.sa2

Article and author information

Author details

  1. Rossana Droghetti

    Dipartimento di Fisica, Università degli Studi di Milano, via Celoria 16, Milan, Italy
    Contribution
    Conceptualization, methodology, Formal analysis, Investigation
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-4596-7288
  2. Nicolas Agier

    Sorbonne Universitè, CNRS, Institut de Biologie Paris-Seine, Laboratory of Computational and Quantitative Biology, Paris, France
    Contribution
    software, Conceptualization
    Competing interests
    No competing interests declared
  3. Gilles Fischer

    Sorbonne Universitè, CNRS, Institut de Biologie Paris-Seine, Laboratory of Computational and Quantitative Biology, Paris, France
    Contribution
    Writing – review and editing, Conceptualization, Writing – original draft, data-curation
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0001-5732-2682
  4. Marco Gherardi

    Dipartimento di Fisica, Università degli Studi di Milano, via Celoria 16, Milan, Italy and INFN sezione di Milano, Milan, Italy
    Contribution
    Writing – review and editing, Conceptualization, methodology, data-curation
    Competing interests
    No competing interests declared
  5. Marco Cosentino Lagomarsino

    1. Dipartimento di Fisica, Università degli Studi di Milano, via Celoria 16, Milan, Italy and INFN sezione di Milano, Milan, Italy
    2. IFOM Foundation, FIRC Institute for Molecular Oncology, via Adamello 16, Milan, Italy
    Contribution
    Writing – review and editing, Conceptualization, Funding acquisition, methodology, Supervision, Investigation, data-curation
    For correspondence
    marco.cosentino-lagomarsino@ifom.eu
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0003-0235-0445

Funding

Associazione Italiana per la Ricerca sul Cancro (IG REF: 23258)

  • Marco Cosentino Lagomarsino

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

We are very grateful to Ludovico Calabrese, Simone Pompei, and Orso Maria Romano for the useful discussions. We also thank the editors and reviewers of this manuscript for their constructive and helpful feedback. This work was supported by the Italian Association for Cancer Research, AIRC-IG (REF: 23258).

Senior Editor

  1. Naama Barkai, Weizmann Institute of Science, Israel

Reviewing Editor

  1. Sandeep Krishna, National Centre for Biological Sciences­‐Tata Institute of Fundamental Research, India

Reviewer

  1. Alessandro De Moura

Publication history

  1. Received: September 28, 2020
  2. Accepted: May 20, 2021
  3. Accepted Manuscript published: May 20, 2021 (version 1)
  4. Version of Record published: June 18, 2021 (version 2)

Copyright

© 2021, Droghetti et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Metrics

  • 632
    Page views
  • 75
    Downloads
  • 0
    Citations

Article citation count generated by polling the highest count across the following sources: Crossref, PubMed Central, Scopus.

Download links

A two-part list of links to download the article, or parts of the article, in various formats.

Downloads (link to download the article as PDF)

Download citations (links to download the citations from this article in formats compatible with various reference manager tools)

Open citations (links to open the citations from this article in various online reference manager services)

Further reading

    1. Computational and Systems Biology
    2. Epidemiology and Global Health
    Hannah R Meredith et al.
    Research Article

    Human mobility is a core component of human behavior and its quantification is critical for understanding its impact on infectious disease transmission, traffic forecasting, access to resources and care, intervention strategies, and migratory flows. When mobility data are limited, spatial interaction models have been widely used to estimate human travel, but have not been extensively validated in low- and middle-income settings. Geographic, sociodemographic, and infrastructure differences may impact the ability for models to capture these patterns, particularly in rural settings. Here, we analyzed mobility patterns inferred from mobile phone data in four Sub-Saharan African countries to investigate the ability for variants on gravity and radiation models to estimate travel. Adjusting the gravity model such that parameters were fit to different trip types, including travel between more or less populated areas and/or different regions, improved model fit in all four countries. This suggests that alternative models may be more useful in these settings and better able to capture the range of mobility patterns observed.

    1. Computational and Systems Biology
    Daniel Griffith, Alex S Holehouse
    Tools and Resources

    The rise of high-throughput experiments has transformed how scientists approach biological questions. The ubiquity of large-scale assays that can test thousands of samples in a day has necessitated the development of new computational approaches to interpret this data. Among these tools, machine learning approaches are increasingly being utilized due to their ability to infer complex nonlinear patterns from high-dimensional data. Despite their effectiveness, machine learning (and in particular deep learning) approaches are not always accessible or easy to implement for those with limited computational expertise. Here we present PARROT, a general framework for training and applying deep learning-based predictors on large protein datasets. Using an internal recurrent neural network architecture, PARROT is capable of tackling both classification and regression tasks while only requiring raw protein sequences as input. We showcase the potential uses of PARROT on three diverse machine learning tasks: predicting phosphorylation sites, predicting transcriptional activation function of peptides generated by high-throughput reporter assays, and predicting the fibrillization propensity of amyloid beta with data generated by deep mutational scanning. Through these examples, we demonstrate that PARROT is easy to use, performs comparably to state-of-the-art computational tools, and is applicable for a wide array of biological problems.