An evolutionary model identifies the main evolutionary biases for the evolution of genomereplication profiles
Abstract
Recent results comparing the temporal program of genome replication of yeast species belonging to the Lachancea clade support the scenario that the evolution of the replication timing program could be mainly driven by correlated acquisition and loss events of active replication origins. Using these results as a benchmark, we develop an evolutionary model defined as birthdeath process for replication origins and use it to identify the evolutionary biases that shape the replication timing profiles. Comparing different evolutionary models with data, we find that replication origin birth and death events are mainly driven by two evolutionary pressures, the first imposes that events leading to higher doublestall probability of replication forks are penalized, while the second makes less efficient origins more prone to evolutionary loss. This analysis provides an empirically grounded predictive framework for quantitative evolutionary studies of the replication timing program.
Introduction
Eukaryotes, from yeast to mammals, rely on predefined ‘replication origins’ along the genome to initiate replication Leonard and Méchali, 2013; Musiałek and Rybaczek, 2015; Ganier et al., 2019; Gilbert, 2001, but we still ignore most of the evolutionary principles shaping the biological properties of these objects. Binding by initiation complexes defines origins as discrete chromosomal loci, which are characterized by multiple layers of genomic properties, including the necessary presence of autonomously replicating sequences, nucleosome depletion, and absence of transcription (Méchali et al., 2013; Di Rienzi et al., 2012). Initiation at origins is stochastic, so that different cells of the same population undergoing genome replication in Sphase will typically initiate replication from different origins Bechhoefer and Rhind, 2012; Rhind et al., 2010.
Initiation from a single origin can be described by intrinsic rates and/or licensing events Hawkins et al., 2013. Indeed, the genomewide replication kinetics of a population of cells can be accessed experimentally by different techniques Hawkins et al., 2013; Baker et al., 2012; Agier et al., 2013. Recent techniques also allow to measure replication progression at the singlecell level Müller et al., 2019; Hennion, 2020. The estimation of key origin parameters from data requires minimal mathematical models describing stochastic origin initiation and fork progression Retkute et al., 2012; Baker et al., 2012; de Moura et al., 2010; Zhang et al., 2017. Typically, one can extract from the data origin positions, as well as estimated originintrinsic characteristic firing times or rates. Knowledge of origins positions and rates makes it possible to estimate the ‘efficiency’ of an origin, that is, its probability of actively firing during Sphase, rather than being passively replicated.
Over evolution, a genome modifies its replication timing profile by ‘reprogramming’ origin positions and rates in order to maximize fitness under the constraints of the possible changes of these parameters that are physically and biologically accessible. Little is known about this process, and finding basic rules that drive origin evolution is our main focus here Koonin, 2011. The main recognized constraint determining negative selection is due to replication forks stalling between adjacent origins Newman et al., 2013; Letessier et al., 2011; Cha and Kleckner, 2002. If two converging replication forks stall with no origins in between them, it is generally agreed that replication cannot be rescued, and the event leads to cell death. Such deadly ‘double stalls’ can only happen with two converging forks generated from consecutive origins. A pioneering study by Newman et al., 2013 used a combination of data analysis and mathematical models to understand the role of lethal doublestall events on origin placement. They found that the fork perbase stall probability affects the distance between neighbor origins, and the optimal distance distribution tends to a regular spacing, which is confirmed by experimental data. Thus, origin placement is far from a uniform random distribution (which would translate into an exponential distribution of neighbor origin distances). Instead, the regular latticelike spacing that origin tend to take is reminiscent of particles repelling each other.
Due to the streamlined genome and the experimental accessibility, yeasts are interesting systems to study experimentally the evolution of replication programs. However, at the level of the Saccharomyces genus, the replication program is highly conserved Müller and Nieduszynski, 2012. Hence, until recently, no experimental account of the evolution of the replication program was available. Our collaboration has recently produced data of this kind Agier et al., 2018 by comparing replication dynamics and origin usage of 10 distant Lachancea yeast species. This study highlights the dominance of origin birthdeath events (rather than, e.g., chromosomal rearrangements) as the main evolutionary drive of the replication program changes and characterizes the main principles underlying origin birthdeath events. Briefly, the fate of an origin strongly depends on its neighbourhood, particularly the distance from neighbor origins and their efficiency. Indeed, proximity to efficient origins correlates with weaker origin loss events. An evolutionary bias against weak origins could be due to the fact that their presence is neutral or even advantageous (e.g., in terms of reducing double stalls), but their advantage is not sufficiently high for them to survive drift. These findings open the question of capturing the relevant evolutionary biases acting on replication profiles in the framework of the empirical birthdeath evolutionary dynamics, for which the data set Agier et al., 2018 provides an empirical testing ground.
Here, we define a minimal evolutionary birthdeath model for replication program evolution encompassing all the empirical observations made by Agier et al., 2018, and we use it to investigate the main evolutionary tradeoffs that could explain the data.
Results
Experimental data motivate an evolutionary model for origins turnover
This section presents a reanalysis of the experimental data from Agier et al., 2018. We summarize the main results of that study and present additional considerations on the same data, which motivate the evolutionary model framework used in the following.
Figure 1—figure supplement 1 recapitulates the Lachancea clade phylogenetic tree used in the analysis. The evolution of the temporal program of genome replication can be quantified by the divergence of the replication timing profiles across different species. Agier and coworkers found that timing profiles diverge gradually with increasing evolutionary divergence between species Agier et al., 2018. In principle, such divergence could be attributed to changes in the number, placement, and biological properties of all origins. However, a careful analysis of correlations (comparing the timing profiles and the activity of orthologous origins) shows that the main driver of program differentiation across species is the acquisition and loss of active replication origins. Specifically, the number of conserved origins decreases with increasing phylogenetic distance between species, following the same trend as the conservation of the timing profiles. This trend is the same in regions that are close to or away from breakpoints, pointing to a secondary role of genome rearrangements. In addition, the authors of Agier et al., 2018 show that the differences in the mere number of origins and the median difference in origin replication timing between pairs of species are nearly constant with phylogenetic distance, leading to exclude that origin reprogramming (rather than birthdeath) plays a primary role in the evolution of the timing program.
Any model for the evolution of the replication program must (i) reproduce the empirical distribution of the interorigin distances, (ii) reproduce the empirical distribution of the origin efficiencies, and (iii) account for the observed origin turnover dynamics. Previous analyses Agier et al., 2018; Newman et al., 2013 have shown that origins are far from following a uniform distribution along the genome. Figure 1A shows that the interorigin distance distribution robustly shows a unimodal shape across the 10 Lachancea species studied in Agier et al., 2018. Specifically, distributions for each species show a marked peak around 35 kbp. This peak corresponds to a typical interorigin distance, which is strikingly invariant across all Lachancea species. Figure 1B shows the distribution of the efficiencies, which is defined as the probability to actively fire during the S phase, estimated for each origin in the Lachancea clade using Equation 4 and a fit inferring the firing rates of all origins assuming a standard nucleationgrowth model (see Materials and methods and Zhang et al., 2017). The singlespecies efficiency distributions show more variability across species than the interorigin distance distributions, but they are consistent with a common shape and support.
As mentioned above, a key result of Agier and coworkers is the insight that the evolution of the replication program is mainly shaped by the birthdeath process of replication origins. Figure 1C–F recapitulates the main quantitative results that characterize this process. Note that the analyses in Figure 1C–F have been performed on the six sister species of Lachancea clade since the other species pairs are too distant to perform a reliable identification of conserved, newly gained, and lost origins Agier et al., 2018.
Figure 1C shows a box plot of the distance from the nearest origin for all the conserved (dark red), newly gained (red), and lost (black) origins. Lost replication origins tend to be closer to their neighbors, much more so than newly gained or conserved origins. This observation reveals that the distance of an origin from its nearest neighbor is correlated to the loss rate of the same origin over evolution. This is an essential feature that any evolutionary model of this process must take into account Newman et al., 2013; Agier et al., 2018. More in detail, Figure 1D further quantifies the correlation between gain and loss events of neighboring origins by comparing the fraction of observed events of loss, gain, or conservation, given the state of the nearest origin (conserved, lost, or gained). The distribution of event types for origins that are nearest neighbors of a newly gained origin deviates significantly from the null expectation of random uncorrelated events (i.e., in a simple scenario where the fractions of conserved, newly gained, and lost origins are fixed to the empirical values, and birth and death events of neighboring origins are independent). The same nonnull behavior is observed for origins that are nearest to a lost origin, with the roles of gain and loss events exchanged. In summary, successive birth/death or death/birth events happen more frequently in the same genomic location than expected by chance. Beyond such a spatial correlation along the chromosomal coordinate, the analysis illustrates that birth and death events are correlated in time as well (in fact, the analyzed evolutionary events took place in the terminal branches of the phylogenetic tree, and thus they must have been close in term of evolutionary time).
Finally, Figure 1E and F show that origins lying near loci where origins were recently lost are typically in the highefficiency range of the distribution, and that lost origins tend to be less efficient than conserved origins. Figure 1E compares the distribution of the efficiency of lost, conserved, and newly gained origins with the distribution of efficiency of the nearest origins. The efficiency of origins neighboring a loss event is higher than average, while the efficiency of lost origins is lower than average. These results clearly support the influence of origin efficiency on origin death events. This is confirmed by Figure 1F, which shows the distribution of efficiency of all conserved and newly gained origins. For both classes, considering only those origins that are nearest neighbors to a recently lost origin yields an increase in the efficiency.
Different mechanisms could lead to the correlations described above. Overall, it is clear that origin strength is somehow ‘coupled’ to birthdeath events. For example, conserved origins may become more efficient after the loss of neighbor origins, or the birth of new highly efficient origins could facilitate the loss of neighbors, or losing an origin could expedite the acquisition of a new origin nearby. Overall, these results reveal that the origin birthdeath process is following some specific ‘rules’ that involve both interorigin distances and origin efficiency.
Note that the results of Figure 1E might appear to be incompatible with Figure 1D, but they are not. Figure 1E shows that the efficiency of newly gained origins is lower than average, and Figure 1D shows that the majority of origins that are nearest to a locus with a recent loss event are newly gained. The apparent contradiction arises from Figure 1E, which shows that the average efficiency of origins close to a lost one is higher than average. This inconsistency is resolved by the analysis shown in Figure 1F, which shows that origins appearing close to recently lost ones are among the most efficient.
A birthdeath model including evolutionary bias from interorigin replication fork double stalling recapitulates the main features of replication origin turnover
The joint stalling of two replication forks in the same interorigin region along the genome is a wellcharacterized fatal event that may occur during Sphase. The frequency of this event in a clonal population clearly affects fitness. A previous modeling study Newman et al., 2013 focusing on yeast demonstrated that, in order to minimize the probability of a double stall anywhere along the chromosome, origins must be placed in the most ordered spatial configuration, namely all the consecutive origins must be equidistant from each other. However, the previous study did not incorporate this principle into an evolutionary dynamics of origin turnover. Thus, the important question arises of whether the tendency to avoid double stalls is related to origin gain and loss. To address this question, we defined a birthdeath model, rooted in the experimental observations discussed in the previous section. This ‘doublestallaversion model,’ described in detail below, biases the turnover of replication origins in such a way that events (in particular, birth events) leading to a decreasing doublestall probability are promoted because they increase the fitness of the cell.
In the doublestallaversion model, the extent to which the acquisition of a new origin changes the probability of a double stall ${P}_{i}^{\mathrm{DS}}$ depends on the length l_{i} of the interorigin region where the event occurred. This probability is therefore coordinatedependent and can be derived by a procedure similar to the one carried out in Newman et al., 2013 (see more details in Materials and methods),
where l_{i} is the length of the genome region between the (i+1)th and the ith origin and π is the mean pernucleotide forkstall rate; we use the value from Newman et al., 2013, $\pi =5\times {10}^{8}$ per nucleotide. Note that the doublestall probability is completely independent from the origin firing rates and efficiency, and depends only on the distance between the origins.
In our simulations of the model (see Materials and methods for a more detailed explanation), the genome was represented as a vector of origins, identified by the position and the firing rate. The model is a discretetime Markov chain, and for the doublestallaversion variant the chain is specified by the following update rules:
In each interorigin region, the origin birth rate is biased by the value of the doublestall probability in that region. Specifically, the origin birth rate (per unit time) in the region $i$, of length l_{i} between the ith and (i+1)th origin is given by
(2) ${b}_{i}=N\overline{b}{\left({P}_{i}^{\mathrm{DS}}\right)}^{\gamma}{l}_{i},$
where ${P}_{i}^{\mathrm{DS}}$ is the (constant) doublestall probability density in region $i$ (Equation 1), $\overline{b}$ is the birth rate (per Mbp and per unit time) extracted from experimental data (see Materials and methods), and γ is a positive real parameter that controls the strength of the bias. $N$ is a normalization factor added to match the empirical birth rate $\overline{b}$. Newborn origins are placed in the middle of the interorigin region $i$.
Death (i.e., loss of origins) is unbiased and occurs at random origins with rate $\overline{d}$ (estimated from experimental data, see Materials and methods), regardless of their efficiency or their neighbor’s efficiency.
The justification for the assumption that newborn origins are placed at midpoints in the model ultimately comes from data (Figure 1—figure supplement 2) where a strong bias in this direction is found. Relaxing this assumptions has consequences on the distance distribution and leads to poorerperforming models. We interpret this bias as the result of a faster (hence undetectable in our data) evolutionary process that counterselects origins far from midpoints.
Firing rates in the model evolve by reshuffling of the empirical firing rate distribution, with a time scale that is set empirically (see Materials and methods and Figure 1—figure supplements 3 and 4). On shorter time scales, firing rate changes are likely more gradual, making firing rate evolution similar to a diffusion process. However, such changes are not quantifiable in our data set, which would leave the model with many extra parameters (a firing rate diffusion constant and bounds to set the empirical distributions) that are very difficult to estimate. Additionally, the firing rate distributions of the conserved (thus older) origins and of newborn (younger) ones are quite similar (Figure 1—figure supplement 3), and this condition is not generally met under a simple diffusive process.
Figure 2 shows the simulation results of the model with bestfitting parameter values (see Materials and methods and Figure 2 for other parameter values). Figure 2A and B show that the doublestallaversion model reproduces the two main ‘structural’ features of yeast genome, namely the interorigin distance distribution and the origin efficiency distribution. Additionally, Figure 2C and D show that the same model reproduces the observed correlations between the interorigin distance and origin birthdeath events, as well as the correlation between birthdeath events and nature of the neighbor origins observed in the data (conserved, newly gained, or lost).
The doublestall hypothesis alone fails to capture correlations of origin turnover with efficiency
In spite of the good performance of the doublestallaversion model in explaining the empirical marginal distributions, we find that it fails to reproduce the observed correlations between the efficiency of an origin and the recent history of the nearest ones. Figure 2E shows very faint variations in efficiency of origins that are nearest neighbors to origins of different evolutionary fate. In particular, the observed huge divergence in efficiency between lost origins and their neighbors is absent in the model simulations. Note that Figure 2 and Figure 4 show that in the doublestallaversion model origins nearest to a loss event are slightly more efficient than average. This trend is due to the fact that after an origin is lost, its neighbors are subject to lower interference and automatically become more efficient. However, Figure 4 shows that this null trend is too weak to explain the experimental data. These considerations indicate that a model without a direct mechanism linking the efficiency of an origin to the birthdeath events of its neighbors cannot reproduce the data.
Doublestall aversion and interference between proximate origins explain the correlated evolution of origin presence and efficiency
Based on the above considerations, we defined a joint model that takes into account both the evolutionary pressure given by the doublestall probability and the direct effect of origin efficiency on birthdeath events.
Specifically, this model is defined as follows:
The birth process is the same as in the doublestallaversion model described above: the birth rate is biased by the doublestall probability in each interorigin region Equation 1, and newborn origin are placed in the middle of the region.
Death of an origin is biased by its efficiency: less efficient origins are more easily lost. Specifically, the death rate (per unit time) for the ith origin is
(3) ${d}_{i}=N\overline{d}\mathrm{exp}\left(\beta {\mathrm{eff}}_{i}\right),$where eff_{i} is the efficiency of the ith origin, Equation 4, $\overline{d}$ is the mean death rate extracted from experimental data (see Materials and methods). The positive parameter β tunes the interaction strength: the larger β, the steeper the dependence of d_{i} on eff_{i}. The normalizing factor $N$ is chosen so as to match the empirical total death rate.
We note that the bias parameters β and γ are not inferred based on branch data, but on distributions of extant species (see Materials and methods).
Figure 3 gathers plots of the structural features (distribution of interorigin distances and efficiencies, Figure 3A–B) and the evolutionary correlations involving efficiency, evolutionary fate, distance to nearest neighbor, and fate of nearest neighbor (Figure 3C–F). Overall, the joint model reproduces all the observations considered here regarding the layout of origins and their evolutionary dynamics, indicating that the experimental data can be rationalized by a fitness function that includes both the detrimental effects of nonreplicated regions and the evolutionary cost of maintaining inefficient replication origins.
In particular, the coupling between the efficiency of an origin and the death rate of its neighbors, through the probability of passive replication, reproduces the empirical correlations shown in Figure 1. Figure 4 summarizes this crucial point of comparison between the joint efficiency/doublestallaversion model and the pure doublestallaversion case. The three plots compare efficiency distributions of lost, conserved, and newly gained origins (red for the data, blue for the models) with those of their neighbors (gray). Comparison of these plots shows that only the joint model reproduces the differences in efficiency of lost origins and their neighbors.
In order to show that the stallaversion and interference model has better quantitative agreement with the data, we also performed a simplified likelihood ratio analysis. The full likelihood of the model is complex, but we have defined ‘partial’ likelihoods for the joint and the doublestallaversion model just taking into account the marginal probabilities shown as box plots in Figure 4, Figure 4—figure supplement 1 (see Materials and methods). Figure 1 shows that the joint model performs better for all the four chosen features. In our view, the qualitative difference shown in Figure 4 may be taken as a stronger argument in favor of the combined model in the sense that, beyond any quantitative agreement relying on parameters, the additional ingredient of a coupling between origin birthdeath dynamics and origin rates is needed to explain the data.
The joint efficiency/doublestallaversion model correctly predicts origin family divergence
Having established that the joint model is required to reproduce observations on single lineages, we turned to its predictions on observations that require knowledge of the whole phylogenetic tree, such as origin evolutionary families, defined as sets of orthologous origins Agier et al., 2018.
We thus set up a simulation of the model on a cladogenetic structure, fixed by the observed structure of the Lachancea clade phylogenetic tree (see Materials and methods for the simulations details). The outputs of each run in such simulations are nine different simulated genomes whose lineages are interconnected in the same way as the empirical species, and each branch follows the empirical divergence. We stress that these simulations just include intersecting lineages whose branched structure corresponds precisely to the lineages of the empirical tree. The phylogenetic structure does not emerge from the simulation as our model does not describe speciation. The model for the tree can simulate nine species, all the species except for Lachancea kluyveri, as this species was used as outgroup for the computation of the length of the tree branches Agier et al., 2018. We have repeated all the analyses on these simulations and verified that all the previous results hold (Figure 5—figure supplement 1). We then turned to other independent predictions of the joint model, which could be compared to measurements in Agier et al., 2018.
Figure 5A reports the dynamics of origin families. As reported in Agier et al., 2018, origins that belong to larger evolutionary families tend to have a higher efficiency compared to origins in smaller families, which is possibly due to the fact that, on average, highefficiency origins tend to survive longer. Note, however, that there is no deterministic relation between family size and origin age because the relationship between these two is determined by the structure of the phylogenetic tree. Indeed, two families of the same size may have roots in different points of the tree, and thus the origins belonging to them may have very different ages. Thus, the prediction of the relation between origin efficiency and origin family size is not trivial. Figure 5A shows the results for the origin efficiency for families of varying size, comparing the experimental data and 100 different runs of the simulation.
As a second step, we have considered the model prediction for the divergence of the shared origins in two species descending from a common ancestor. Specifically, we asked whether the number of origin death events occurring in two branches of the tree could justify the number of common origins in the two species. Indeed, whenever in a pair of species the number of shared origins is lower than the number of origins belonging to their common ancestor, this discrepancy must be due to the evolutionary loss events. These events are predicted by our model to be correlated in diverging species due to the common ancestry and the coupling of loss events to origin efficiency and distance. This correlation should lower the number of shared origins losses compared to a null expectation where loss events are not correlated. Figure 5B shows that the model correctly predicts the divergence in the number of shared origins lost during evolution without any parameter adjustment. We also verified that, as expected, a null evolutionary model is not able to reproduce this feature. The null model fixes in each branch of the simulated tree the same number of birth and death events that are present in the corresponding branch of Lachancea tree, but these events occur uniformly along the genome. The difference between the null model and the evolutionary model predictions shown in Figure 5B is a consequence of correlated origins losses due to the common genome structure, in terms of origins positions and efficiencies, that each pair of species inherit from their common ancestor.
We note that birth and death rate are inferred as global parameters, ignoring correlations. Despite this, Figure 5B shows that the model reproduces the higher correlation in birth and death events in closerrelated branches than in distant branches as a consequence of the common positions and firing rates of the origins in the ancestor.
Discussion
Overall, this study provides a framework to study replication program evolution driven by replication origin birthdeath events and demonstrates that both fork stalling and efficiency shape the adaptive evolution of replication programs. The model framework is predictive and falsifiable, and it can be used to formulate predictions on the phylogenetic tree. In future studies, it would be interesting to explore the predictions for the evolutionary dynamics under perturbations, such as evolution under increased replication stress or conditions where fork stalling becomes more frequent. Additionally, the framework can be used to discover specific trends, such as different evolutionary dynamics of specific genomic regions (subtelomeres Yue, 2017, regions containing repeats, etc. Arbona et al., 2018; Boos and Ferreira, 2019), role of genome spatial organization Marchal et al., 2019, and correlated firing of nearby origins.
A general question concerns the predictive value of the model proposed here on outofsample data. Figure 5 shows that fitindependent predictions apply across the tree. Importantly, the model is based on simple global parameters and not finetuned on local features of the tree. To underline this point, we verified that a model fit using only the subtree between LADA a LAWA yielded similar parameters. Clearly, we cannot exclude that the values of the birth and death rate, and also the bias parameters γ and β, could be Lachanceaspecific, while we speculate that the conclusions on the relevant evolutionary mechanisms might apply more generally.
The previous approach by Newman et al., 2013 described the evolution of origin distance as an optimization process that minimizes double forkstall events, without attempting to characterize explicitly the evolutionary dynamics. Such approaches are limited compared to the framework presented here because they can predict only the origin distance distribution, and they do not allow any prediction regarding origin and replication program evolution along lineages and across phylogenetic trees. In accordance with the results of Newman et al., we confirm that doublestall events are a primary driver of the evolution of replication programs, and we frame this finding into the empirically measured birthdeath evolutionary dynamics of replication origins. Additionally, we show that next to forkstall events, origin efficiency plays an important role in shaping the evolutionary landscape seen by a replication timing profile.
What could be the mechanisms coupling efficiency to origin birthdeath? The actual process of origin death could be nearly neutral Koonin, 2016 as lowefficiency origins are – by definition – rarely used, and unused origins over evolutionary times are more prone to decay in sequence, and consequently in firing rate until they disappear. Equally, a newborn origin close to a very strong one (which would make the newborn origin relatively inefficient) could be used rarely. This would make this origin relatively less likely to establish over evolutionary times compared to an isolated newborn origin. However, rarely used origins could be essential in situations of stress (and in particular they could resolve doublestall events). Finally, a fitness cost for maintaining too many origins might set up an overall negative selection preventing a global increase in origin number Zhang et al., 2017; Karschau et al., 2012; Das et al., 2015.
Materials and methods
Data
The experimental data used in this work come from Agier et al., 2018. In particular, we made use of the data regarding the replication origins. For each origin in each of the 10 Lachancea species, this data set includes the chromosome coordinate and firing rate, and the inferred birth and death events occurred in the branches of the phylogenetic tree shown in Figure 1—figure supplement 1. Focusing on the terminal branches of the tree and on the extant replication origins, this study defines three categories of origins: (i) ‘conserved’ origins (which survived from the last ancestor), (ii) ‘newly gained’ origins gained in the last branch of the phylogenetic tree, and (iii) ‘lost’ origins, which were present in the last ancestor species and are not present in the terminal branch. Properties of the lost origins (e.g., position and firing rate) are inferred from the projection of the corresponding ones on the closest species, keeping into account synteny. Since the synteny map is less precise in distant species, the information on the origins events is only available for the six sister species in the tree, which belong to the three closest species pairs, highlighted with the red shaded area in Figure 1—figure supplement 1.
Computation of the efficiency
Request a detailed protocolOrigin efficiency was defined as the probability of actively firing during S phase (or, equivalently, the probability of not being passively replicated by forks coming from nearby origins). In practice, we computed it by the following formula:
where ${P}_{i,i+1}$ and ${P}_{i,i1}$ are the probabilities for the ith origin of passive replication respectively from the (i+1)th and (i1)th origins. Note that this efficiency formula Equation 4 is an approximation that only takes into account the possibility to be passively replicated by neighbor origins, neglecting the influence of other nearby origins. Following Agier et al., 2018, for computing the efficiency we assumed that the origin firing process has constant rate Zhang et al., 2017, and we thus obtain the following closed expressions for the probabilities of passive replication:
and
In the above equations, $v$ is the typical velocity of replication forks, x_{i} is the ith origin chromosome coordinate, and ${\lambda}_{i}^{{}^{\prime}}$ is the ith origin firing rate divided by the mean firing rate of the species the origin belong to. The raw firing rates in the data are affected by the different physiology of the nine Lachancea species in the experimental growth conditions (which were the same for all the species). In order to reduce these differences, we normalized the rates by their average for each given species. For this reason, we did not make use of the origin efficiency data already present in Agier et al., 2018.
Computation of the doublestall probability
Request a detailed protocolThe probability ${P}_{i}^{\mathrm{DS}}$ that two converging forks stall is easily computed in the limit where the stall probability per basepair is small and the number of basepairs is large. Under these assumptions, stalling is a Poisson process with rate (per basepair) π. ${P}_{i}^{\mathrm{DS}}$ can be written in terms of the probability ${P}^{\mathrm{S}}\left(x\right)$ that a single fork stalls after replicating $x$ nucleotides:
where l_{i} is the length (number of basepairs) of the ith interorigin region. Imagine two converging replication forks starting from origins $i$ and $i+1$: the two integration variables $x$ and $y$ represent the number of basepairs that each fork replicates before stalling. By using the Poisson process result ${P}^{\mathrm{S}}\left(x\right)=\pi \mathrm{exp}\left(\pi x\right)$ and performing the integration, one obtains the result in Equation 1.
Evolutionary model
Request a detailed protocolWe defined origin birthdeath models incorporating different evolutionary biases. In these models, the genome is described as a onedimensional circle with discrete origin location x_{i}, where the length of the genome is equal to the average genome length in Lachancea clade ($10.7Mbp$). We made use of a circular genome in order to avoid border effects. In the model, the set of origins change over evolution by three basic (stochastic) processes, birth of an origin in a certain genome region, origin death, and change of origins firing rate. We have verified that choosing linear chromosome does not alter significantly our findings, although it affects the distances between origins close to chromosome ends (Figure 3—figure supplement 1).
Overall origin birth/death rates were estimated from the data as follows. To estimate the overall birth rate $\overline{b}$, we considered, for all the terminal branches of the phylogenetic tree, the number of birth events ${N}_{b}$, the genome length of the corresponding species $L$, and the length of the tree branch $T$, and divided ${N}_{b}$ by $LT$. Then we averaged over all terminal branches. To estimate the overall death rate $\overline{d}$, we followed a similar approach, taking the number of death events ${N}_{d}$ in the terminal branches, the length of the branch $T$, and the number of origins in the corresponding species ${n}_{ori}$, then computing ${N}_{d}{T}^{1}{n}_{ori}^{1}$ for all the terminal branches and averaging these values. The final results for overall birth and death rates from the origin birthdeath events across the Lachancea clade are $\overline{b}=13.5627Mb{p}^{1}{t}^{1}$ and $\overline{d}=0.612287{t}^{1}$.
We verified that the assumption of constant rates was consistent with the empirical variability of the numbers of birth and death events per unit time along different branches of the tree by comparing simulations with data. Figure 5—figure supplement 2 shows that simulations and empirical data present similar spreading.
The process by which origin firing rates change over evolution was described as stochastic, with every origin having a fixed probability per unit time of changing its firing rate, given by $R=0.92{t}^{1}$, a value fixed from experimental data (see Appendix 1 and Figure 1—figure supplement 4). When a firing rate changes, it is resampled from the distribution of all the empirical normalized firing rates computed using the data in Agier et al., 2018 (see Appendix 1 and Figure 1—figure supplement 3 for more details).
Simulations
Code availability
Request a detailed protocolThe code used to run the simulations, together with instructions to run it, was shared as a repository on Mendeley data (Droghetti, 2020).
Algorithm
Request a detailed protocolThe prediction of the different evolutionary models was derived numerically, making use of custom simulations written in C++, which implement the origin birthdeath dynamics as a Gillespie algorithm (Gillespie, 1976). Every model variant was required to reproduce the experimental overall rates, $\overline{b}=13.5627Mb{p}^{1}{t}^{1}$ for origin birth, $\overline{d}=0.612287{t}^{1}$ for origin death, and $R=0.92{t}^{1}$ for firing rate change. We simulated the three processes defining the model as follows. (i) the birth process has a common definition for the stallaversion and joint model. The algorithm first tests each subsequent interorigin region, calculates the birth probability from Equation 2, and stores the results. Subsequently, it computes the normalization factor $N$ in order to match the empirical birth rate per nucleotide $\overline{b}$. Finally, it samples all the interorigin regions drawing birth events from the computed birth probability (Equation 2). New origins are placed the midpoints of the tested intervals. (ii) The death process is different for the stallaversion model (unbiased) and the joint model (related to the origin efficiency). In the joint model, the algorithm first calculates the death rate for each origin using Equation 3 and stores the results. Subsequently, it computes the normalization factor $N$ in order to match the empirical mean death rate $\overline{d}$. Finally, it samples all origin drawing death events from the computed death probability. For the unbiased process (stallaversion model), the dynamics is identical, but all the origins have the same death rate $\overline{d}$ so that the algorithm can skip the calculation of $N$. (iii) The process updating origin firing rates over evolutionary times is common to all model variants. The probability of update per origin per unit time is $R$. Origins are sampled for each time step and assigned a new rate uniformly extracted from the empirical distribution of all normalized firing rates with probability $Rdt$.
During the simulation, the genome configuration (chromosome position, firing rate, efficiency for each origin) is known at each time step, which matches the empirical time (tree branch length, measured by proteinsequence divergence). For simulating single lineages, we started with a collection of 50 origins, with positions and firing rate uniformly drawn from all the possible ones. Rapidly, the interorigin distances distribution, the efficiency one, and the number of origins reach a steady state (for the number of origins, set by the balance of birth and death rate, and characterized by approximately 225 origins). Configurations, including birthdeath events, were printed at regular time intervals after steady state is reached. The time interval between prints is chosen to be equal to the average length of the Lachancea phylogenetic tree terminal branches in order to compare singlelineage simulations with empirical data. For simulations on a phylogenetic tree, after one species reaches the steady state, it is used as a root. To reproduce the empirical branching structure of the tree, we run the simulation, one for each branch of the phylogenetic tree, each time starting from the species at the previous branching point, for a period that matches the length of the branch. If the simulated branch is terminal, then the configuration corresponds to one of the empirical species, otherwise it corresponds to a ‘branchingpoint species’ and it can be used as a starting point for other simulations. Each simulation run gives nine different simulated species with the same cladogenetic structure as the empirical species (Figure 1—figure supplement 1).
Fitting procedure
Request a detailed protocolThe biased birthdeath processes in the simulations rely on some parameters to tune the strength of the bias, which are the only parameters to fix by a fit, since all the other parameter values are fixed empirically. In the joint model, there are two free parameters, γ and β, that tune respectively the strength of the bias on the origin birth and on the origin death process. For a discrete set of parameter pairs spanning realistic intervals, we run hundred different simulations, each starting with a randomized genome. Considering the simulated species for all the pairs of parameter values, we quantify the discrepancy with experimental data by evaluating the L1 distance of the normalized histogram of efficiency and interorigin distances. This quantity is a number between 0 and 2, 0 if the histograms perfectly overlap and 2 if they have completely different supports. For each pair of parameters, the analysis gives two values of discrepancy. We choose the value of γ (the parameter that tunes the bias on the birth rate based on doublestall aversion) by taking the smaller discrepancy from the interorigin distances distribution. For the value of β (which tunes the interference bias on the death rate in joint model), we chose the one that gave us the smaller area on the efficiency distribution. For the doublestallaversion model, the fitting procedure is identical and only requires to fix γ.
Simplified likelihood analysis
Request a detailed protocolWe performed a (simplified) likelihood ratio analysis in order to test the better quantitative performance of the combined model. The full likelihood of the models analyzed here is complex, but we have defined ‘partial’ likelihoods for the joint and the doublestallaversion model only taking into account the marginal probabilities shown in Figure 4—figure supplement 1. Hence, the test evaluates for both models the goodness of the predicted correlation between the efficiency and firing rate of the lost origins and the ones of their neighbors. The likelihood ratio test quantifies how much the prediction of a certain model is better than a reference (‘null’) model. We chose the doublestallaversion model as reference (equivalent to setting $\beta =0$ in the joint model). Specifically, one evaluates
where ${L}_{X}$ are the likelihoods of the two models and ${l}_{X}$ are the loglikelihoods. Assuming that ${L}_{r}$ is χsquared distributed (this is generally the case for large samples), we could compute a pvalue associated to this test.
Null birthdeath model
Request a detailed protocolWe defined a null birthdeath model where origin birthdeath events in sister species are uncorrelated in order to analyze the divergence of shared origins and compare it with the prediction of the evolutionary model. This model implements birth and death events uniformly, regardless of origin position and firing rate, fixing the number of events for each branch of the simulated phylogenetic tree. These values are taken from the inference reported in Agier et al., 2018 (shown in Figure 3A of that study and in Figure 1—figure supplement 1). The simulation of this model starts with 220 origins (the number of origins inferred for to LA2, the species at the root of the tree). Subsequently, following the structure of the Lachancea phylogenetic tree, the simulation proceeds as follows: (i) at each branching point, the genome is copied into two daughters; (ii) for each daughter, the prescribed number of random death and birth events (in this order) is generated on random origins; and (iii) the simulation stops when it reaches the leaves of the Lachancea tree.
Appendix 1
Estimating parameters for the evolution of origin firing rates
This section motivates the model implementation of the evolutionary dynamics of firing rates. In order to quantify the change of origin firing rates over evolutionary times, we studied how the correlation between firing rates of conserved origins behaves as species diverge (Figure 1—figure supplement 3A). To quantify the divergence, for each pair of species in the Lachancea clade we calculated the Spearman correlation coefficient between the sets of firing rates belonging to corresponding origins in the two species considered (normalized by the species mean firing rate). We found that the more the species are distant, the less these two sets are correlated, which means that origin initiation rates diverge during evolution and origins lose memory of their initial firing rate. The model describes the evolution of firing rates as follows. Every origin changes its firing rate by extracting a new value from the distribution of empirical normalized ones, regardless of their previous firing rate. This process is characterized by a resampling rate $R$, common to all the origins, which defines the probability per unit time that an origin resamples its firing rate. The slope of the correlation coefficient in empirical data defines the speed at which the origin firing rates evolve. Hence, it is possible to fit this specific slope and extract the value of $R$.
In order to do that, we simulated the evolutionary process with unbiased origin death and update of the firing rate. This simulation can be performed without the birth process because the only origins that one needs to consider in computing the Spearman coefficient between two species are the conserved ones. Each simulation started from 225 origins, with firing rates randomly sampled from the empirical set of firing rates, evolving the genomes changing the firing rates with the resampling process described above and removing the origins according to the death rate estimated from the data. By performing several simulations with different values of the extracting rate $R$, it is possible to fit its best value. For each $R$ tested, we ran 1000 simulations for an evolutionary time corresponding to 1.6.
After computing the Spearman correlations between snapshots at different evolutionary times, we performed an exponential fit in order to see which value of the $R$ parameter gave the best agreement with the experimental data, finding the bestfit value $R=0.92$. Figure 1—figure supplement 4 shows the trend achieved by the simulation using $R=0.92$, and it shows a very good agreement between experimental data and simulations.
Note that in Agier et al., 2018 a similar analysis was carried out in order to verify if the reprogramming of the origins firing rate has an impact on the differentiation of replication timing. The authors analyzed the origin firing time differences between conserved replication origins in all pairs of species and found that this difference does not correlate with the phylogenetic distance between species. This finding is apparently in contrast with our results, which suggest that origin reprogramming increases with distance between species. We believe that this discrepancy is due to the higher sensitivity of the Spearman correlation and of the use of speciesaverage normalized firing rates in this study.
The empirical data falsify the scenario where interference alone drives origin evolution
This section presents a theoretical analysis of the scenario where solely origin interference sets the evolutionary pressure on replication timing profiles. This analysis shows that a description that only takes into account the evolutionary pressure that acts on origin efficiency is not able to reproduce the origins spatial arrangement, a crucial feature in empirical yeast data. To carry out this analysis, we take a ‘maximum entropy’ approach (see Banavar JR, Maritan A, Volkov I. Applications of the principle of maximum entropy: from physics to ecology. J Phys Condens Matter. 2010;22(6):063101. doi:10.1088/09538984/22/6/063101) and infer an effective ‘force potential’ acting on interorigin distance by looking at its (assumed equilibrium) distribution. Specifically, the effective potential acting on the origin efficiency starting from the empirical efficiency distribution can be analytically computed from the following formula:
where $\mathrm{eff}$ is the efficiency, $\mathrm{e}\mathrm{f}\mathrm{f}\in [0,1]$, and $P\left(\mathrm{eff}\right)$ the efficiency probability density function.
The above potential, once given the relation between efficiency and distance between origins (Equation 4), defines another potential ${H}_{d}\left(d\right)$ that act on the interorigin distances. By taking the exponential of ${H}_{d}\left(d\right)$ one obtains the expected probability distribution predicted for the distances at equilibrium.
In order to find ${H}_{d}\left(d\right)$, one must invert Equation 4 and find $d\left(\mathrm{eff}\right)$. To accomplish this task, we have approximated the threebody interaction that gives the efficiency with a twobody interaction. This assumption implies that each origin feels the interference of only one of his two neighbors and is effective as long as threeorigin interactions can be decomposed in twoorigin components. Under this assumption, Equation 4 becomes
Note that origin efficiency (Equation 4) also depends on the firing rates of the origin and its neighbor, hence, strictly speaking, one has that
To eliminate the firing rate dependence, we computed an effective potential ${H}_{d}^{\prime}$ on the distance, which averages the effect of the different firing rates. To this end, we used the mean value theorem for integrals as follows:
In other words, we substituted all the firing rates with the average one $\S lt;\lambda \ge 1$ since the rates are normalized on the species average. With this simplification, going from ${H}_{\mathrm{eff}}\left(\mathrm{eff}\right)$ to ${H}_{d}^{\prime}\left(d\right)$ is straightforward and gives
and
From the potential ${H}_{d}^{\prime}$, we can compute the prediction for the equilibrium probability distribution of interorigin distances
where $N$ is a normalization factor. In order to use this calculation on the data, we inferred the expected potential from the efficiency distribution, assuming that the interaction only depends on efficiency, and we then obtained the model prediction for the expected interorigin distribution based on the efficiency profile. Comparison of this prediction with the empirical interorigin distance distribution provides a test of the model. This procedure does not require to adjust any model parameter. Figure 4—figure supplement 2 shows the result of this analysis. The predicted distribution does not match the empirical one. This means that any evolutionary model that assumes a bias based only on the efficiency (in other words, one that takes into account only the evolutionary pressure given by origin interference) cannot reproduce (at steady state) the correct spatial organization of replication origins.
Data availability
The code used to run the simulations, together with a readme file, was shared as a Mendeley data repository (https://data.mendeley.com/datasets/vg3r5355bj/2).

SRA  project number: SRP111158. Accessions ranging from SRR5807795 to SRR5807891ID SRP111158. The evolution of the temporal program of genome replication.
References

The spatiotemporal program of replication in the genome of lachancea kluyveriGenome Biology and Evolution 5:370–388.https://doi.org/10.1093/gbe/evt014

The evolution of the temporal program of genome replicationNature Communications 9:2199.https://doi.org/10.1038/s41467018046284

Inferring where and when replication initiates from genomewide replication timing dataPhysical Review Letters 108:268101.https://doi.org/10.1103/PhysRevLett.108.268101

Replication timing and its emergence from stochastic processesTrends in Genetics: TIG 28:374–381.https://doi.org/10.1016/j.tig.2012.03.011

Replication timing is regulated by the number of mcms loaded at originsGenome Research 25:1886–1892.https://doi.org/10.1101/gr.195305.115

Mathematical modelling of whole chromosome replicationNucleic Acids Research 38:5623–5633.https://doi.org/10.1093/nar/gkq343

Maintaining replication origins in the face of genomic changeGenome Research 22:1940–1952.https://doi.org/10.1101/gr.138248.112

Metazoan dna replication originsCurrent Opinion in Cell Biology 58:134–141.https://doi.org/10.1016/j.ceb.2019.03.003

A general method for numerically simulating the stochastic time evolution of coupled chemical reactionsJournal of Computational Physics 22:403–434.https://doi.org/10.1016/00219991(76)900413

FORKseq: replication landscape of the Saccharomyces cerevisiae genome by nanopore sequencingGenome BiologyNature MethodsCell Reports 21:125.https://doi.org/10.1186/s13059020020133

Optimal placement of origins for dna replicationPhysical Review Letters 108:058101.https://doi.org/10.1103/PhysRevLett.108.058101

Are there laws of genome evolution?PLOS Computational Biology 7:e1002173.https://doi.org/10.1371/journal.pcbi.1002173

DNA replication originsCold Spring Harbor Perspectives in Biology 5:a010116.https://doi.org/10.1101/cshperspect.a010116

Control of dna replication timing in the 3d genomeNature Reviews Molecular Cell Biology 20:721–737.https://doi.org/10.1038/s415800190162y

Genetic and epigenetic determinants of dna replication origins, position and activationCurrent Opinion in Genetics & Development 23:124–131.https://doi.org/10.1016/j.gde.2013.02.010

Replisome stall events have shaped the distribution of replication origins in the genomes of yeastsNucleic Acids Research 41:9705–9718.https://doi.org/10.1093/nar/gkt728

Mathematical modeling of genome replicationPhysical Review. E 86:031916.https://doi.org/10.1103/PhysRevE.86.031916

Reconciling stochastic origin firing with defined replication timingChromosome Research: An International Journal on the Molecular, Supramolecular and Evolutionary Aspects of Chromosome Biology 18:35–43.https://doi.org/10.1007/s1057700990933

Celltocell variability and robustness in sphase duration from genome replication kineticsNucleic Acids Research 45:8190–8198.https://doi.org/10.1093/nar/gkx556
Decision letter

Sandeep KrishnaReviewing Editor; National Centre for Biological Sciences‐Tata Institute of Fundamental Research, India

Naama BarkaiSenior Editor; Weizmann Institute of Science, Israel

Alessandro De MouraReviewer
In the interests of transparency, eLife publishes the most substantive revision requests and the accompanying author responses.
Acceptance summary:
The reviewers and editors found that the manuscript presents a simple but compelling model that explains the dynamics of replication origin birth and death, which enhances our understanding of the selection pressures that have shaped the distribution of replication origins.
Decision letter after peer review:
Thank you for submitting your article "An evolutionary model identifies the main selective pressures for the evolution of genomereplication profiles" for consideration by eLife. Your article has been reviewed by 2 peer reviewers, and the evaluation has been overseen by a Reviewing Editor and Naama Barkai as the Senior Editor. The following individual involved in review of your submission has agreed to reveal their identity: Alessandro De Moura (Reviewer #3).
The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission.
The reviewers appreciate that the manuscript presents a simple but compelling model that explains the dynamics of replication origin birth and death, which enhances our understanding of the selection pressures that have shaped the distribution of replication origins. However, both reviewers had a series of concerns. Please go through their reviews below. In your revised manuscript please address in particular the major concerns they have raised.
We would like to draw your attention to changes in our policy on revisions we have made in response to COVID19 (https://elifesciences.org/articles/57162). Specifically, when editors judge that a submitted work as a whole belongs in eLife but that some conclusions require a modest amount of additional new data, as they do with your paper, we are asking that the manuscript be revised to either limit claims to those supported by data in hand, or to explicitly state that the relevant conclusions require additional supporting data.
Our expectation is that the authors will eventually carry out the additional experiments and report on how they affect the relevant conclusions either in a preprint on bioRxiv or medRxiv, or if appropriate, as a Research Advance in eLife, either of which would be linked to the original paper.
Reviewer #2:
The manuscript entitled "An evolutionary model identifies the main selective pressures for the evolution of genomereplication profiles" is an examination of the principles shaping evolution of replication origin placement. Overall I found the manuscript to be engaging and interesting, and the topic of general importance. It is quite compelling that with just two parameters, origin efficiency and distance between origins, a good model can be built to describe the dynamics of origin birth and death. While this work on its own is sufficiently important for publication, it would be very interesting to see whether the model can be updated in the future to address whether there are forkstalling or origingenerating mechanisms that shape evolution of specific interorigin spaces. This work provides a very good foundation for such efforts.
I have a few major, general concerns I would like the authors to address.
If I'm interpreting the methods correctly, it seems the parameters used in these simulations, such as mean birth rate, mean death rate, γ, and β, were fit to the data once, and used as point estimates during simulation. If true, I expect the simulations to be yielding estimates of birth and death rates with a much narrower distribution of outcomes than is likely to be realistic given what an appropriate level of confidence in those parameter estimates would be. Could the parameters be fit to data in such a way that we attain an estimate of confidence in the parameter values, from which a distribution could be generated and sampled from during simulation?
Closely related to my prior concern, I would like the authors to demonstrate the general predictive value of their model on outofsample data. Can the model be applied to other data on replication timing? Without such an attempt to demonstrate the model's applicability to outofsample prediction, the reader cannot ascertain whether the model is overfit to the Lachancea data from Agier et al., 2018. Also, keeps the parameter estimates here from being overfit to better predict origin birth and death events in closely related branches of the Lachancea tree in Figure S1? Are γ and β inferred in a way that accounts for the higher correlation in birth and death events in closerrelated branches than in distal branches, or has the fit ignored those correlations?
The authors state that their model identifies selective pressures. The authors imply, and specifically state in lines 238242, that increased death rate of origins which happen to be nearby highly efficient origins represents selective pressure against the less efficient origins. It isn't until the discussion that the authors raise the possibility that there may simply be a lack of selective pressure to retain inefficient origins that are near highly efficient origins. In my view, it's more likely that selection for the existence of an inefficient origin is simply lower than the drift barrier, so mutagenesis and drift can passively remove such origins over time without the need to invoke selection against inefficient origins.
Figure 3 is intended to show that the stallaversion and interference model performs better at predicting correlations between efficiency of lost origins and their nearest neighbor. I agree, but I do not think Figure 3 presents a strong case for this conclusion. Figure S6 presents stronger evidence to me. While Figure 3 does qualitatively suggest that the joint model may predict the correlation between neighboring origin efficiency and origin loss better than the doublestall model alone, it almost appears to me that the model with fork stalling and interference has significantly overestimated the correlation. Is there a quantitative way, perhaps using information criteria, though I admittedly am not sure how one would go about doing that with simulations such as these, to demonstrate that the model with both effects has better predictive value than the one with only fork stalling?
There are a couple of assumptions of the model that I would like the authors to examine in further detail. First, that origin birth events occur in the middle of an interorigin space. I am not aware of evidence pointing to this being a good a priori assumption. Can you rerun the simulations, allowing origins to arise at a random site within the interorigin space into which it is born? Second, is it reasonable to expect origin firing rates to reshuffle to a new value randomly, without any dependence on their prior rate? Perhaps I'm mistaken, but it seems to me that an origin's firing rate should evolve more gradually, and should have a higher probability of sampling from values near its current value than from values very far from its current value.
Reviewer #3:
This paper proposes a novel and relevant evolutionary model that explains many aspects of replication origin statistics in a family of yeast species. It is a step forward in our understanding of the evolutionary pressures that affect the distribution of replication origins in Eukaryotes. I recommend it should be published in eLife, provided the authors revise the paper to address the following issue:
1. Many of the conclusions of the paper are based on the claim that the extending the model by adding an efficiency bias to the origin death rate makes the model fit the data better; in particular, they say in line 213 that "the observed huge divergence in efficiency between lost origins and their neighbors is absent in the model simulations."
This is reinforced in line 243, and in other parts of the text. But inspecting Figure 3, the two models (with and without a death rate bias) yield almost identical boxplots; if anything, the boxplots for the lost/nearest fractions of the pure doublestall aversion model seem visually to match the data marginally better. So why do the authors claim that the model with death rate bias is a much better fit? This is far from clear by just inspecting the data. I see no "huge difference" in the plots. There is a difference, but it is far from huge – the differences in the mean are much smaller than the size of the boxes. It seems to me unjustified to use this to choose one model over another. One way to ascertain this is to do rigorous statistical tests to determine if the differences in the means of the simulated and observed data are statistically significant; for example, a ttest.
https://doi.org/10.7554/eLife.63542.sa1Author response
Reviewer #2:
The manuscript entitled "An evolutionary model identifies the main selective pressures for the evolution of genomereplication profiles" is an examination of the principles shaping evolution of replication origin placement. Overall I found the manuscript to be engaging and interesting, and the topic of general importance. It is quite compelling that with just two parameters, origin efficiency and distance between origins, a good model can be built to describe the dynamics of origin birth and death. While this work on its own is sufficiently important for publication, it would be very interesting to see whether the model can be updated in the future to address whether there are forkstalling or origingenerating mechanisms that shape evolution of specific interorigin spaces. This work provides a very good foundation for such efforts.
I have a few major, general concerns I would like the authors to address.
If I'm interpreting the methods correctly, it seems the parameters used in these simulations, such as mean birth rate, mean death rate, γ, and β, were fit to the data once, and used as point estimates during simulation. If true, I expect the simulations to be yielding estimates of birth and death rates with a much narrower distribution of outcomes than is likely to be realistic given what an appropriate level of confidence in those parameter estimates would be. Could the parameters be fit to data in such a way that we attain an estimate of confidence in the parameter values, from which a distribution could be generated and sampled from during simulation?
We agree that this point is important. To verify consistency of using single values for these parameters, we compared the variability of the local numbers of birth and death events along different branches of the tree obtained from simulations with data. Figure 5—figure supplement 2 shows that simulations and empirical data present similar spreading, especially considering death events. For births, we noticed one relevant outlier for the longest branch of the tree, which could motivate a future deeper analysis. Overall, simulations using constant rates yield a similar distribution of events to data, which appears to be satisfactory, at least to a first approximation. For the bias parameters γ and β, this additional test is not necessary, as these two parameters were fixed by choosing the values giving the smallest error on these two distributions (hence a constant parameter model, neglecting variations, is sufficiently good for these parameters). This is shown in Figure 3 (panels A and B) of the revised version main text.
Closely related to my prior concern, I would like the authors to demonstrate the general predictive value of their model on outofsample data. Can the model be applied to other data on replication timing? Without such an attempt to demonstrate the model's applicability to outofsample prediction, the reader cannot ascertain whether the model is overfit to the Lachancea data from Agier et al., 2018.
This is an interesting point but unfortunately no systematic data exist to perform this analysis. Figure 5 of the revised version shows that tindependent predictions apply across the tree. It is important that this model is based on simple global parameters, and not ne tuned on local features of the tree. To emphasize this point, we performed a model t using only a subsample of the yeasts, and showed that it leads to the same conclusions. For this analysis we choose the subtree between LADA and LAWA (see Author response image 1) in order to sample a subset of species that were neither too close nor too far compared to the full sample. We anyway expect that the estimated parameters may vary with particular choices of subtrees, also due to sampling effects. This procedure yielded parameter values that are very similar to the t of the full tree. For example the new death (=0.58 t^{−}1) and the birth (=13.63 Mbp^{−}1t^{−}1) rate are really close to the values estimated using the whole phylogenetic tree, and the fitting procedure of the joint model with these new parameters leads to a values for the γ and β parameters of 2.25 and 1.9 respectively, which, again, are two values really close to the ones obtained with the full tree parameters. We added a comment in the text stating explicitly that we cannot exclude that the values of the birth and death rate, and also the bias parameters γ and β could be Lachanceaspecific, while the conclusions on the relevant evolutionary mechanisms might apply more generally.
Also, keeps the parameter estimates here from being overfit to better predict origin birth and death events in closely related branches of the Lachancea tree in Figure S1? Are γ and β inferred in a way that accounts for the higher correlation in birth and death events in closerrelated branches than in distal branches, or has the fit ignored those correlations?
The birth and death rate are inferred as global parameters, ignoring correlations. Despite of this, the model reproduces the higher correlation in birth and death events in closerrelated branches than in distant branches as a consequence of the common positions and firing rates of the origins in the ancestor, as Figure 5B of the main text shows. The bias parameters β and γ are not inferred based on branch data, but on distributions on extant species. We clarified these points in the revised text and the caption of Figure 5.
The authors state that their model identifies selective pressures. The authors imply, and specifically state in lines 238242, that increased death rate of origins which happen to be nearby highly efficient origins represents selective pressure against the less efficient origins. It isn't until the discussion that the authors raise the possibility that there may simply be a lack of selective pressure to retain inefficient origins that are near highly efficient origins. In my view, it's more likely that selection for the existence of an inefficient origin is simply lower than the drift barrier, so mutagenesis and drift can passively remove such origins over time without the need to invoke selection against inefficient origins.
We fully agree with this point. We have modified the text in order to state early on that the evolutionary bias against weak origins could be simply due to the fact that they are neutral, or that their advantage is not sufficiently high for them to survive drift.
Figure 3 is intended to show that the stallaversion and interference model performs better at predicting correlations between efficiency of lost origins and their nearest neighbor. I agree, but I do not think Figure 3 presents a strong case for this conclusion. Figure S6 presents stronger evidence to me. While Figure 3 does qualitatively suggest that the joint model may predict the correlation between neighboring origin efficiency and origin loss better than the doublestall model alone, it almost appears to me that the model with fork stalling and interference has significantly overestimated the correlation. Is there a quantitative way, perhaps using information criteria, though I admittedly am not sure how one would go about doing that with simulations such as these, to demonstrate that the model with both effects has better predictive value than the one with only fork stalling?
This point was also raised by reviewer 3. In order to address it, we performed a (simplified) likelihood ratio analysis in order to show the better quantitative performance of the combined model. The full likelihood of our model is complex, but we have de ned partial likelihoods for the joint and the double stall aversion model just taking into account the marginal probabilities shown as box plots in Figure 4 of the revised text (former Figure 3) and in Figure 4—figure supplement 1. In other words, we have tested the predicted correlation between the efficiency and firing rate of the lost origins and the ones of their neighbor for both models. The likelihood ratio test quantifies how much the prediction of a certain model is better than a reference ( null ) model. We chose the double stall aversion model as reference (equivalent to setting β = 0 in the full model). See Equation 8, where L_{X} are the likelihoods of the two models and l_{X} are the loglikelihoods. Assuming that L_{r} is χsquared distributed (this is generally the case for large samples), we could compute a Pvalue associated to this test. Supplementary file 1 shows that the joint model performs better for all the four chosen features. In our view, the qualitative difference shown in Figure 4 may be taken as a stronger argument in favor of the combined model, in the sense that, beyond any quantitative agreement relying on parameters, the additional ingredient is needed to explain the data. We revised the text to highlight this point.
There are a couple of assumptions of the model that I would like the authors to examine in further detail. First, that origin birth events occur in the middle of an interorigin space. I am not aware of evidence pointing to this being a good a priori assumption. Can you rerun the simulations, allowing origins to arise at a random site within the interorigin space into which it is born?
We agree that these analyses are useful, and we implemented the analysis of the suggested model variants relaxing specific assumptions. Author response image 2 shows the resulting interorigin distance distribution for a model where the double stall aversion is implemented, but where newborn origins are placed at uniformly chosen position of the associated interval. The figure shows that relaxing the assumption that new origins are born at mid points has consequences on the distance distribution, irregardless of the value of the parameter γ. We interpret this as the result of a faster (hence undetectable in our data) evolutionary process that counter selects origins far from midpoints. The reason for placing newborn origins at midpoints can then be justified by the results of Newman et al.: in order to minimize the global double stall probability the origins perform several attempts within a tree branch, of which we observe only the successful ones. This choice is also justified in the data by Figure 1—figure supplement 2, which shows that birth events show a strong bias for being close to mid points of their associated intervals. We added this figure as a supplementary figure to justify our assumption.
Second, is it reasonable to expect origin firing rates to reshuffle to a new value randomly, without any dependence on their prior rate? Perhaps I'm mistaken, but it seems to me that an origin's firing rate should evolve more gradually, and should have a higher probability of sampling from values near its current value than from values very far from its current value.
We originally started this project with the same idea, modeling rate evolution as a diffusion process. However, on the evolutionary time scales associated to the Lachancea data, it turns out that this expected gradual evolution is not visible, leaving the model with too many extra parameters (a firing rate diffusion constant, and bounds to set the empirical distributions) that are very di cult to estimate. Indeed, Author response image 3 shows the firingrate distributions of the conserved (thus older) origins and of newborn (younger) ones. The two distribution are similar (within their errors), suggesting that (on the observation time scale) newborn origins already take a firing rate sampled from the steadystate distribution. This is condition that is not generally met under a simple diffusive process. This is also shown in Figure 1—figure supplement 3 of the revised text, which shows that the cumulative probability of the firing rates of newborn and conserved origins are similar across the different Lachancea species. In light of these results, we reverted to a simpler parameterpoor version of the model. Possibly this choice can be improved upon in the future, when additional measurements will be available. We addressed this point in the revised text.
Reviewer #3:
This paper proposes a novel and relevant evolutionary model that explains many aspects of replication origin statistics in a family of yeast species. It is a step forward in our understanding of the evolutionary pressures that affect the distribution of replication origins in Eukaryotes. I recommend it should be published in eLife, provided the authors revise the paper to address the following issue:
1. Many of the conclusions of the paper are based on the claim that the extending the model by adding an efficiency bias to the origin death rate makes the model fit the data better; in particular, they say in line 213 that "the observed huge divergence in efficiency between lost origins and their neighbors is absent in the model simulations."
This is reinforced in line 243, and in other parts of the text. But inspecting Figure 3, the two models (with and without a death rate bias) yield almost identical boxplots; if anything, the boxplots for the lost/nearest fractions of the pure doublestall aversion model seem visually to match the data marginally better. So why do the authors claim that the model with death rate bias is a much better fit? This is far from clear by just inspecting the data. I see no "huge difference" in the plots. There is a difference, but it is far from huge – the differences in the mean are much smaller than the size of the boxes. It seems to me unjustified to use this to choose one model over another. One way to ascertain this is to do rigorous statistical tests to determine if the differences in the means of the simulated and observed data are statistically significant; for example, a ttest.
This point was also raised by reviewer 2. For the description and results of our analysis please refer to the previous section and to Supplementary file 1.
https://doi.org/10.7554/eLife.63542.sa2Article and author information
Author details
Funding
Associazione Italiana per la Ricerca sul Cancro (IG REF: 23258)
 Marco Cosentino Lagomarsino
The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.
Acknowledgements
We are very grateful to Ludovico Calabrese, Simone Pompei, and Orso Maria Romano for the useful discussions. We also thank the editors and reviewers of this manuscript for their constructive and helpful feedback. This work was supported by the Italian Association for Cancer Research, AIRCIG (REF: 23258).
Senior Editor
 Naama Barkai, Weizmann Institute of Science, Israel
Reviewing Editor
 Sandeep Krishna, National Centre for Biological Sciences‐Tata Institute of Fundamental Research, India
Reviewer
 Alessandro De Moura
Publication history
 Received: September 28, 2020
 Accepted: May 20, 2021
 Accepted Manuscript published: May 20, 2021 (version 1)
 Version of Record published: June 18, 2021 (version 2)
Copyright
© 2021, Droghetti et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.
Metrics

 928
 Page views

 97
 Downloads

 0
 Citations
Article citation count generated by polling the highest count across the following sources: Crossref, PubMed Central, Scopus.
Download links
Downloads (link to download the article as PDF)
Open citations (links to open the citations from this article in various online reference manager services)
Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)
Further reading

 Computational and Systems Biology
 Genetics and Genomics
Cardiometabolic diseases encompass a range of interrelated conditions that arise from underlying metabolic perturbations precipitated by genetic, environmental, and lifestyle factors. While obesity, dyslipidaemia, smoking, and insulin resistance are major risk factors for cardiometabolic diseases, individuals still present in the absence of such traditional risk factors, making it difficult to determine those at greatest risk of disease. Thus, it is crucial to elucidate the genetic, environmental, and molecular underpinnings to better understand, diagnose, and treat cardiometabolic diseases. Much of this information can be garnered using systems genetics, which takes populationbased approaches to investigate how genetic variance contributes to complex traits. Despite the important advances made by human genomewide association studies (GWAS) in this space, corroboration of these findings has been hampered by limitations including the inability to control environmental influence, limited access to pertinent metabolic tissues, and often, poor classification of diseases or phenotypes. A complementary approach to human GWAS is the utilisation of model systems such as genetically diverse mouse panels to study natural genetic and phenotypic variation in a controlled environment. Here, we review mouse genetic reference panels and the opportunities they provide for the study of cardiometabolic diseases and related traits. We discuss how the postGWAS era has prompted a shift in focus from discovery of novel genetic variants to understanding gene function. Finally, we highlight key advantages and challenges of integrating complementary genetic and multiomics data from human and mouse populations to advance biological discovery.

 Computational and Systems Biology
Predicting the function of a protein from its amino acid sequence is a longstanding challenge in bioinformatics. Traditional approaches use sequence alignment to compare a query sequence either to thousands of models of protein families or to large databases of individual protein sequences. Here we introduce ProteInfer, which instead employs deep convolutional neural networks to directly predict a variety of protein functions – Enzyme Commission (EC) numbers and Gene Ontology (GO) terms – directly from an unaligned amino acid sequence. This approach provides precise predictions which complement alignmentbased methods, and the computational efficiency of a single neural network permits novel and lightweight software interfaces, which we demonstrate with an inbrowser graphical interface for protein function prediction in which all computation is performed on the user’s personal computer with no data uploaded to remote servers. Moreover, these models place fulllength amino acid sequences into a generalised functional space, facilitating downstream analysis and interpretation. To read the interactive version of this paper, please visit https://googleresearch.github.io/proteinfer/.