Evolutionary dynamics of incubation periods
Abstract
The incubation period for typhoid, polio, measles, leukemia and many other diseases follows a rightskewed, approximately lognormal distribution. Although this pattern was discovered more than sixty years ago, it remains an open question to explain its ubiquity. Here, we propose an explanation based on evolutionary dynamics on graphs. For simple models of a mutant or pathogen invading a networkstructured population of healthy cells, we show that skewed distributions of incubation periods emerge for a wide range of assumptions about invader fitness, competition dynamics, and network structure. The skewness stems from stochastic mechanisms associated with two classic problems in probability theory: the coupon collector and the random walk. Unlike previous explanations that rely crucially on heterogeneity, our results hold even for homogeneous populations. Thus, we predict that two equally healthy individuals subjected to equal doses of equally pathogenic agents may, by chance alone, show remarkably different time courses of disease.
https://doi.org/10.7554/eLife.30212.001eLife digest
When one child goes to school with a throat infection, many of his or her classmates will often start to come down with a sore throat after two or three days. A few of the children will get sick sooner, the very next day, while others may take about a week. As such, there is a distribution of incubation periods – the time from exposure to illness – across the children in the class.
When plotted on a graph, the distribution of incubation periods is not the normal bell curve. Rather the curve looks lopsided, with a long tail on the right. Plotting the logarithms of the incubation periods, however, rather than the incubation periods themselves, does give a normal distribution. As such, statisticians refer to this kind of curve as a “lognormal distribution". Remarkably, many other, completely unrelated, diseases – like typhoid fever or bladder cancer – also have approximately lognormal distributions of incubation periods. This raised the question: why do such different diseases show such a similar curve?
Working with a simple mathematical model in which chance plays a key role, OttinoLöffler et al. calculate how long it takes for a bacterial infection or cancer cell to take over a network of healthy cells. The model explains why a lognormallike distribution of incubation periods, modeled as takeover times, is so ubiquitous. It emerges from the random dynamics of the incubation process itself, as the diseasecausing microbe or mutant cancer cell competes with the cells of the host.
Intuitively, this new analysis builds on insights from the “coupon collector’s problem”: a classical problem in mathematics that describes the situation where a person collects items like baseball cards, stamps, or cartoon monsters in a videogame. If a random item arrives every day, and the collector’s luck is bad, they may have to wait a long time to collect those last few items. Similarly, in the model of OttinoLöffler et al., the takeover time is dominated by dramatic slowdowns near the start or end of the infection process. These effects lead to an approximately lognormal distribution, with long waits, as seen in so many diseases.
OttinoLöffler et al. do not anticipate that their findings will have direct benefits for medicine or public health. Instead, they believe their results could help to advance basic research in the fields of epidemiology, evolutionary biology and cancer research. The findings might also make an impact outside biology. The term “contagion” has now become a familiar metaphor for the spread of everything from computer viruses to bank failures. This model sheds light on how long it takes for a contagion to take over a network, for a variety of idealized networks and spreading processes.
https://doi.org/10.7554/eLife.30212.002Introduction
The discovery that incubation periods tend to follow rightskewed distributions originally came from epidemiological investigations of incidents in which many people were simultaneously and inadvertently exposed to a pathogen. For example, at a church dinner in Hanford, California on March 17, 1914, ninetythree individuals became infected with typhoid fever after eating contaminated spaghetti prepared by an asymptomatic carrier known to posterity as Mrs. X. Using the known time of exposure and onset of symptoms for the 93 cases, Sawyer, 1914 found that the incubation periods ranged from 3 to 29 days, with a mode of only 6 days and a distribution that was strongly skewed to the right. Similar results were later found for other infectious diseases. Surveying the literature in 1950, Sartwell noted a striking pattern: the incubation periods of diseases as diverse as streptococcal sore throat (Sartwell, 1950) (Figure 1a), measles (Stillerman and Thalhimer, 1944), polio, malaria, chicken pox, and the common cold were all, to a good approximation, lognormally distributed (Sartwell, 1950). On a time scale of years instead of days, the incubation periods for bladder cancer (Goldblatt, 1949) (Figure 1b), skin cancer, radiationinduced leukemia, and other cancers were also found to be approximately lognormally distributed (Armenian and Lilienfeld, 1974).
Two natural questions arise: Why should incubation periods be distributed at all, and why should they be distributed in the same way for different diseases? Previous explanations rest on the presumed heterogeneity of the host, the pathogen, or the dose (Sartwell, 1950; Nishiura, 2007; Horner and Samsa, 1992). To see how this works, return to the typhoid outbreak at the Hanford church dinner (Sawyer, 1914). Every person who ate that spaghetti presumably had a different level of overall health and immune function, and every plate of spaghetti was likely contaminated with a different dose and possibly even strain of typhoid.
Suppose the typhoid bacteria proliferated exponentially fast within the hosts and triggered symptoms when they reached a fixed threshold. Then, if the bacterial dose, growth rate, or triggering threshold were normally distributed across the hosts, one can show that the resulting distribution of incubation periods would have been either exactly or approximately lognormal (see Results, ‘Influence of heterogeneity’). On the other hand, there is counterevidence that lognormal distributions can occur even if some of these sources of heterogeneity are lacking. For example, Sartwell, 1950 reanalyzed data from a study (Bodian et al., 1949) in which identical doses and strains of polio virus were injected into the brains of hundreds of rhesus monkeys. The incubation period, defined as the time from inoculation to the onset of paralysis, was still found to be approximately lognormally distributed, even though the route of infection and the viral dose and strain were held constant. Moreover, the lognormal distributions commonly observed for human diseases have a particular shape, with a dispersion factor (Sartwell, 1950) around $1.11.5$, which previous models cannot explain without special parameter tuning. (See Box 1 for the definition of dispersion factors.)
Here, we propose a new explanation for the skewed distribution of incubation periods. Instead of heterogeneity, it relies on the stochastic dynamics of the incubation process, as the pathogen invades, multiplies, and competes with itself and the cells of the host in a structured network topology. The theory predicts that under a broad range of circumstances, incubation periods should follow a rightskewed distribution that resembles a lognormal, but is actually a Gumbel, one of the universal extreme value distributions (Kotz and Nadarajah, 2000). Heterogeneity is not required, but it is allowed; it does not qualitatively alter our results when included.
Results
Mathematical Model
We model the incubation process using the formalism of evolutionary graph theory (Lieberman et al., 2005; Nowak, 2006; Ohtsuki et al., 2006; Ashcroft et al., 2015). A network of $N\gg 1$ nodes is used to represent an environment within a host where a pathogenic agent, such as a harmful bacterium or a cancer cell, is invading and reproducing. The network could represent several plausible biological scenarios, for example the intestinal microbiome, where harmful typhoid bacteria are competing against a benign resident population of gut flora in a mixing system (modeled as a complete graph); or it could represent mutated leukemic stemcells vying for space against healthy hematopoietic stem cells within the wellorganized threedimensional bone marrow space (modeled as a 3D lattice); or a flat epithelial sheet with an early squamous cancer compromising and invading nearby healthy cells (modeled as a 2D lattice). For the sake of generality, we will refer to the two types of agents as healthy residents and harmful invaders.
While Sartwell’s law has been applied to many different types of diseases with diverse etiologies, the model we propose makes the most sense for asexually reproducing invaders, like cancer cells or bacteria. Viruses, on the other hand, often reproduce with a ‘onetomany’ dynamic, which is not faithfully captured in this model. So, while the general phenomenon of network invasion seems to apply to viruses as well, the model in its present form is not well suited to describe their dynamics.
Box 1.
Dispersion factors.
The Dispersion Factor of a distribution or dataset is defined to be its geometric standard deviation. Or more explicitly, given a positive dataset ${x}_{n}$, it is ${\sigma}_{G}$, where
We measure this quantity for multiple reasons. While ${\mu}_{G}$ is a dimensionful quantity, ${\sigma}_{G}$ is dimensionless. Secondly, $\mathrm{log}({\sigma}_{G})$ is the maximum likelihood estimator for the scale parameter of an unshifted lognormal distribution. Moreover, this is the quantity Sartwell used to describe the variability of incubation periods (Sartwell, 1950), so it is a useful point of comparison.
https://doi.org/10.7554/eLife.30212.004Considering asexually reproducing and competing invaders, then, we choose to model the invasion dynamics as a Moran process (Moran, 1958; Williams and Bjerknes, 1972; Lieberman et al., 2005; Nowak, 2006). Invaders are assigned a relative fitness $r$ (suggestively called the carcinogenic advantage by Williams and Bjerknes, 1972). The fitness of residents is normalized to 1. We consider two versions of the Moran process. In the Birthdeath (Bd) version (Figure 2a), a random node is chosen, with probability proportional to its fitness. It gives birth to a single offspring. Then, one of its neighbors is chosen uniformly at random to die and is replaced by the offspring (Figure 2b). We also consider Deathbirth (Db) updates (Figure 2c,d). In this version of the model, a node is randomly selected for death, with probability proportional to $1/r$; then a copy of a uniformly random neighbor replaces it. To test the robustness of our results, we study both versions of the Moran model on various networks: complete graphs, star graphs, ErdősRényi random graphs, one, two, and threedimensional lattices, and smallworld, scalefree, and $k$regular networks. We also vary the invader fitness $r$ and the model criterion for the onset of symptoms. These extensions are presented in the Materials and methods, Figures 5, 6. Box 2 discusses other variants of the Moran model. Here we focus on the simplest cases to elucidate the basic mechanisms.
Box 2.
Nomenclature for the Moran model.
There are many distinct variations on the basic Moran model, but the six most popular all consist of a twostep update. The first step selects a node over the entire population, whereas the second step selects a neighboring node from the neighbors of the first.
However, the content of each step can vary from one model to another. For example, the selection in each step can occur in a fitnessweighted fashion. Also, the node in the first step can be interpreted as either giving birth, or dying.
To avoid confusion, we use standard abbreviations to distinguish the different models, as illustrated by the table below. The order of letters indicates the order of the steps, and the capitalization denotes which steps have a fitness dependence. For example, dB refers to the update rule where the first step uniformly selects a node from the entire population to die, and then one of its neighbors is selected, with probability proportional to fitness, to replace it.
In this paper, we focus exclusively on Bd and Db. Although much the same analysis could be done on any of the other update rules, these two were showcased due to their relative popularity and simplicity.
Step order  Fitnessstep first  Fitnessstep second  Both fitnesssteps 

Birth first  Bd  bD  BD 
Death first  Db  dB  DB 
Our simulations start with a single invader placed at a random node in a network of otherwise healthy residents. The update rule is applied at discrete time steps. In the long run, either the invaders replace all the residents, or vice versa. If symptoms are triggered when the entire network has been taken over by invaders, then the incubation period is the number of time steps between the introduction of the invader and its fixation. On the other hand, if the invaders die out and the healthy cells take over, then the process is stopped and no observable symptoms manifest. Later, in the paper, we consider a generalization from complete to partial takeovers, but for now the incubation period will refer to a complete takeover.
Our notion of time in this model is linked directly to the biology of invasion of a reproducing asexual pathogen that divides and replaces other cells sequentially. Instead of considering divisions as a rate, and therefore linking the dynamics to real time, we consider time steps to be individual division events. This is more akin to the standard methods of modeling chemical interactions, as in the Gillespie algorithm (Gillespie, 1977). This focus on the biology of the individual pathogen (or cancer cell) also provides a simple explanation for how diseases with very different natural histories can have the same analytic distribution of incubation time. As each different disease would have a different characteristic mean doubling time, while the shape of the distributions might be the same, the physical time taken would scale with the characteristic proliferation time. Future iterations of this model could consider deriving an exact scaling between physical time and this biological eventbased updating of time.
Infinitely fit invaders
First, consider what happens if the invaders have infinite fitness ($r\to \mathrm{\infty}$) in the Birthdeath model. While an exaggeration, this case is instructive and is a reasonable approximation for aggressive cancers and infections. In this limit, the dynamics simplify enormously: only the invaders reproduce. But because they give birth and replace their neighbors blindly, they waste time whenever they compete between themselves and one invader replaces another. These random selfreplacements slow down the incubation process, and make it highly variable. In fact, the level of infighting is what determines the incubation period in this case. Beyond fitness, the topology of the network matters too. For lowdimensional networks, exemplified by a twodimensional lattice (Figure 3a , red circles), the growth rate of the invader population remains roughly constant as takeover occurs. This leads to a normal distribution of incubation periods (Figure 3a, red circles; and see Methods and Materials, ‘Birthdeath, other solvable networks’). However, on very highdimensional networks like the complete graph (Figure 3a, blue circles), the distribution becomes right skewed. Intuitively, this happens because every invader now has a chance of replacing any healthy node or any other invader. It is as if at every time step a candidate node for replacement gets blindly drawn from a bag, relabeled as an invader, and returned to the bag. At the start of the incubation process, almost every draw adds another invader to the population and the infection progresses rapidly. But near the end, it will take many, many draws to blindly fish out the last remaining healthy node, as needed to terminate the incubation period. This slowingdown phenomenon near the end should feel familiar to anyone who has tried to complete a collection of baseball cards, stamps, or coupons, since they are all manifestations of the coupon collector's problem, a wellstudied concept in probability theory (Pósfai, 2010; Feller, 1968; Erdős and Rényi, 1961). Because of those frustratingly long waits to collect the final healthy node, the incubation period distribution gets skewed to the right. In the infinite$N$ limit (see Methods and Materials, ‘Birthdeath, complete graph’), the coupon collector’s process returns a Gumbel distribution, which resembles a lognormal and can be mistaken for it (Read, 1998). Indeed, when a Gumbel and a lognormal are fit to the same real data, as in Figure 1, it is hard to tell them apart. All this analysis can easily be repeated for the Deathbirth model with minimal changes.
Neutrally fit invaders
At the other extreme, suppose the invaders have no selective advantage ($r=1$). Then a different stochastic mechanism skews the distribution of incubation periods to the right (Figure 3b and Methods and Materials, ‘Random Walk Skewness’). For many networks, the dynamics reduce to an unbiased random walk on the number of invaders, with waiting times at each population level. There are two absorbing states, corresponding to both $0$0 and $N$ invaders for the two kinds of fixation. However, we only care about random walks that successfully hit $N$, as these represent disease processes that manifest symptoms, so we must always condition on its success. This demands that the invader experience early success and growth, pushing it away from probable extinction. This conditioning introduces a bias that makes short incubation times probable, but long walks may still occasionally occur, driving the mean time above the median. In short, a conditioned random walk will introduce a right skew in the distribution of incubation periods. This effect holds for both high and lowdimensional networks (Figure 3b), and for Birthdeath and Deathbirth dynamics.
Testing robustness to update rule and truncation
Rightskewed distributions typically persist in the face of various perturbations to the model, but some perturbations can turn them into normal distributions. For example, suppose we allow symptoms to occur when invaders take over only a fraction $f$ of the whole network. This is a reasonable consideration as leukemic cells need not take over all the bone marrow before leukemia becomes evident, nor does typhoid need to overwhelm all the cells in the microbiome before causing fever; indeed it is likely far fewer in both cases. Figure 4 contrasts what happens for Birthdeath and Deathbirth dynamics under these assumptions. When $r=\mathrm{\infty}$, the Gumbel distribution of Figure 3a persists for $f=1$ (Figure 4a), but turns into a normal distribution (Baum and Billingsley, 1965) when $f=0.9$ (Figure 4b) or $f=0.1$ (Figure 4c). Yet under Deathbirth dynamics, the distribution stays Gumbel for all nonzero values of $f$ (Figure 4d,e,f). The fact that birthdeath dynamics returns a normal for $0<f<1$ whereas Deathbirth still returns a Gumbel can be rationalized via various convergence theorems (Baum and Billingsley, 1965; OttinoLöffler et al., 2017; Pósfai, 2010). However, the fact that similar update rules behave so differently under a reasonable perturbation should caution us to be mindful of our choice of models.
Influence of heterogeneity
Historically, the distribution of incubation periods has been ascribed to heterogeneity (Sartwell, 1950; Nishiura, 2007; Horner and Samsa, 1992) in the fitness (growth rate, say) or dose of the pathogen, or in host factors like immune response. To see how these potential sources of heterogeneity could account for the skewed and approximately lognormal distribution of incubation periods, consider a pathogen growing exponentially with rate $r$ from an initial population ${N}_{0}$, so that its population at time $t$ is given by $N(t)={N}_{0}{e}^{rt}$. If an immune response or other detectable symptoms are triggered when $N$ reaches a threshold population $\theta $, then the incubation time $T$ satisfies $N(T)={N}_{0}{e}^{rT}=\theta $. Solving for $T$ yields
So if either the threshold $\theta $ or the inoculum ${N}_{0}$ are normally distributed across the host population, the incubation period $T$ will be lognormally distributed. Likewise, but in a more qualitative sense, a normal distribution of pathogen growth rates $r$ will also produce a skewed distribution that resembles a lognormal (Nishiura, 2007). However, if there is no randomness in any of those sources, this model predicts a single deterministic value of $T$ for the incubation period.
In contrast, the stochastic model proposed here does not need these sources of heterogeneity to produce rightskewed distributions. But if they happen to be present, as they likely are for many real diseases, our model can accommodate them. Indeed, when any of the three sources of heterogeneity are included in our model, they only serve to make the predicted distributions even more rightskewed, as we now show.
First, to emulate the heterogeneity of the strength of the pathogen, we assume heterogeneity in the parameter $r$ (which, in our model, governs the fitness of the invading cells relative to those of the host). In particular we randomly draw a different $r>0$ in each simulation, to simulate different hosts being infected with different pathogenic strains. The resulting distribution of invader fixation times depends on the distribution of the $r$’s, but our investigations demonstrate they consistently produce rightskewed distributions (Figure 5a).
Second, to emulate the heterogeneity of host factors like immune response, we allow variability in the parameter $f$, which quantifies the fraction of the network that needs to be invaded before symptoms appear. Let ${T}_{f}$ denote the time it takes for $N\cdot f$ of the original resident nodes to be replaced by invaders. If we draw $f$ randomly from some distribution, then essentially each host has a different threshold at which symptoms appear. In contrast to Figure 4b, where we saw that repeated simulations for a host population with a single, fixed, deterministic $f$ can cause skewed distributions to turn into normal distributions, that is no longer the case when heterogeneity is included, as Figure 5b indicates. In fact, the heterogeneity actually causes even more rightskew than before.
Third, emulating variable doses is also straightforward. Instead of always starting with a single invader cell, we choose the initial number of invaders according to some distribution. Again, this modification does not remove the rightskewed behavior established in the Moran model (Figure 5c).
Finally, we can apply all these sources of heterogeneity at once, and remain with a rightskewed distribution (Figure 5d). In summary, although our main results were obtained by analyzing stochastic models of homogeneous host and pathogen populations, allowing for heterogeneity makes the predicted rightskewed distributions more, not less, prominent.
Discussion
The evolutionary dynamical model presented here is intended to mimic the withinhost development of certain cancers and bacterial infections. It is not well suited to the dynamics of viruses. Thus, explaining why Sartwell's law also holds for so many viral diseases remains an open question.
Our model suggests two basic mechanisms underlie the observed rightskewed, approximately lognormal distributions of incubation periods. When the fitness of the pathogen is high, the skew comes from coupon collection; when the pathogen fitness is neutral or low, the skew comes from conditioned random walks; and at intermediate fitnesses, a combination of the two creates skew. Neither of these effects demand any heterogeneity from the invader or the host. However, the model can accommodate such heterogeneity, either by having the invader fitness $r$ be randomly drawn, or by having symptoms occur when a random fraction $f$ of the host network has been invaded. Our simulations show that both sources of heterogeneity only exaggerate the level of rightskewness we would have seen without them (See Results,‘Influence of heterogeneity’, Figure 5).
Beyond accounting qualitatively for the distributions of incubation periods, our model accounts for a quantitative feature that has never been explained before. As shown in Methods and Materials, Table 1, the distributions generated by highly fit pathogens and mutants are predicted to have dispersion factors (also known as geometric standard deviations; see Box 1) of about $1.11.4$, close to the actual values of $1.11.5$ observed for various infectious diseases (Sartwell, 1950;Sartwell, 1966;Nishiura, 2007). Moreover, the model also helps to explain why so few infectious diseases yield dispersion factors greater than 1.5. Such high dispersion factors arise only for $r\approx 1$, corresponding to pathogens or mutants that are only slightly more fit than the resident populations against which they are competing.
On the other hand, it is tempting to speculate that this regime of nearly neutral fitness may be more relevant to cancer development. While it is likely that tumor cells late in the disease process have much higher fitness than healthy cells secondary to continued selection (Scott and Marusyk, 2017), there is ample evidence that most cancers have long latency periods, for example in genetic data from pancreatic cancers (Yachida et al., 2010). One could speculate that during this early period, which accounts for the majority of the cancer’s time in the patient, the fitness is nearly neutral. For the cancer data reviewed by (Armenian and Lilienfeld, 1974), the observed distributions typically had dispersion factors around $1.41.9$. In our model, these high dispersion factors tend to arise when the invader is only slightly more fit than residents. This is also consistent with the suggestion of (Williams and Bjerknes, 1972); the shape of tumors in the model most closely resembled that of real tumors when the fitness of the invaders was only slightly above neutral.
In 1546, Fracastorii, 1930 described the incubation of rabies after a bite from an rabid dog as ‘stealthy, slow, and gradual.’ Today, nearly five centuries later, the dynamics of incubation processes remain stealthy and slow to yield their secrets. We have tried to shed light on their patterns of variability with the help of a new conceptual tool, evolutionary graph theory. This approach provides a possible solution to the longstanding question of why so many disparate diseases show such similarlyshaped distributions of incubation periods. What remains is to quantify the dynamics of incubation processes experimentally with highresolution measurements in time and space.
Aside from their possible application to incubation processes, our results also shed light on a broader theoretical question in evolutionary dynamics: when a mutant invades a structured population of residents, how does the distribution of mutant fixation times depend on the network structure of the population? Early work in evolutionary graph theory (Lieberman et al., 2005; Nowak, 2006; Ohtsuki et al., 2006) concentrated on the network’s impact on the probability of mutant fixation and the mean time to fixation. More recent studies have gone beyond the mean time to consider the full distribution of fixation times (Ashcroft et al., 2015), as we have also done here. We hope that our exact results for disparate topologies and dynamics will stimulate further investigations of these important questions in evolutionary biology.
Materials and methods
Here we describe the model and our analytical and numerical results in further detail. We also test the robustness of our claims with respect to relaxation of the various assumptions in the model. See the Appendix for complete proofs of analytical results.
Birthdeath, complete graph
Request a detailed protocolThe population of cells is represented by a network of $N$ nodes. Edges between nodes indicate which cells can potentially interact with each other. There are two types of cells: harmful invaders with fitness $r$, and healthy residents with fitness 1. All simulations are initialized with a single invader placed at a random node.
The Moran Birthdeath (Bd) update rule has two steps. First, a node is randomly selected out of the total population, with probability proportional to its fitness. Second, a neighbor of the first node is chosen, uniformly at random, and takes on the type of the first node.
In a complete graph, all nodes are adjacent. Therefore, the probability of adding a new invader, given there are currently $m$ invaders, is
In the limit of infinite fitness, $(r\to \mathrm{\infty})$, the first term approaches one and we get
and the probability of the invader population ever decreasing is 0. So the time $T$ to invader fixation is sum of all the transition times $m\to m+1$ for $m=1,2,\mathrm{\dots},N1$. These transition times can be calculated as follows. For the population to take $t$ steps to go from $m$ to $m+1$ invaders, nothing must have happened for $t1$ steps before advancing on the $t$’th step. The probability of this happening is exactly
In other words, the time to add a new invader is exactly a geometric random variable. Therefore, the total fixation time is just
This random variable $T$ describes a process identical to that of the coupon collector’s problem (Pósfai, 2010; Feller, 1968). In both, we have a collection of $N1$ nodes, and draw a random one with replacement at each time step. If we pick a healthy node, we relabel it and toss it back, and repeat until there are no healthy nodes left. By adapting classic results (Erdős and Rényi, 1961; Baum and Billingsley, 1965), we show in the Appendix that it is straightforward to find the asymptotic distribution of $T$ as $N$ gets large. To normalize this distribution, note that its mean is $\mu ={\sum}_{m}{p}_{m}^{1}\approx N\mathrm{log}(N)+N\gamma $. Then we find
Here $\gamma \approx 0.5772$ is the EulerMascheroni constant, $\stackrel{\mathit{d}}{\to}$ denotes convergence in distribution, and a Gumbel($\alpha ,\beta $) random variable has a density given by
This prediction for the normalized distribution of the incubation period $T$ agrees with simulations on large networks (Figure 6a).
A Gumbel distribution of incubation periods has previously been obtained for a variant of this model. Instead of working with the large$N$ limit of a complete graph, it assumed a continuoustime birthdeath model of an invading microbial population whose dynamics were governed by differential equations (Williams, 1965).
Birthdeath, other solvable networks
Request a detailed protocolThe analysis of the finite$N$ complete graph sets up an important framework that can be applied to more complicated networks. For example, in the Appendix we prove that the distribution of fixation times $T$ for a star network also converges to a Gumbel for $N\gg 1$, specifically:
This prediction matches simulations (Figure 6b).
The same framework also applies to a onedimensional (1D) ring lattice, but instead of using the couponcollector framework, we need to cite the LindebergFeller central limit theorem (Durrett, 1991). As shown in the Appendix, this gives us
This prediction agrees with simulations (Figure 6c).
For a twodimensional square lattice, it is more difficult to produce analytical results that are both rigorous and exact. But by making an approximation based on the geometry of the lattice, and using the fact that the population growth rate is proportional to its surface area (see the Appendix, ”Normally distributed fixation times for 2D lattice’), we can make a nonrigorous analytical guess about the distribution of the fixation times $T$. Via these arguments, and given $\mu =\text{E}[T]$ and ${\sigma}^{2}=\text{Var}(T)$, we predict
Despite the approximation, this prediction works well (Figure 6d).
By similar arguments, we predict that lattices of dimension $d\ge 3$ have rightskewed asymptotic distributions of fixation times. Specifically, given $\eta :=11/d$, we predict
where $\zeta $ is the Riemann zeta function. The methods used to derive that can also be used to create approximate finitesize distributions for the lattices (Figure 6e).
In particular, we predict positive skew for all $d\ge 3$ and for the skew to increase monotonically with dimension (see the Appendix). Meanwhile, both 1D and 2D lattices have normal asymptotic distributions, and therefore no skew. This establishes $d=2$ as a critical dimension in these dynamics, transitioning from zero skew to positive skew.
Incidentally, these arguments also suggest that appropriate infinitedimensional networks will asymptotically have a Gumbel distribution. This is numerically true for the ErdősRényi random graph (Figure 6f).
For more complex networks, such as the WattsStrogatz smallworld network, the $k$regular random graph, and the BarabasiAlbert scalefree network, we currently lack theory to predict the asymptotic distributions analytically. However, numerical simulations produce simulations that are all wellapproximated by a noncentral lognormal, obeying Sartwell’s law (Sartwell, 1950) (Figure 7a,c,e).
Table 1 shows that geometric standard deviations of the incubation period distributions for all of these networks fall around $1.11.4$, in agreement with the dispersion factors of $1.11.5$ observed for many infectious diseases (Sartwell, 1950; Horner and Samsa, 1992).
Random walk skewness
Request a detailed protocolSo far we have focused on infinitely fit invaders ($r\to \mathrm{\infty}$). Now we consider the opposite extreme, where invaders have nearly neutral fitness ($r\approx 1$) relative to the residents. We will show that rightskewed distributions of incubation periods occur in this limit as well, but for a completely different reason than coupon collection.
The analysis is again simplest for the complete graph, so we return to that case. As before, the probability of an invader replacing a resident in the next time step is
Similarly, the probability of an invader being replaced by a resident in the next time step is
So the probability of the next replacement adding a new invader is
This defines a random walk with drift $q$ on the invader population.
Only a special subset of these walks are relevant to the computation of the incubation period distribution. For the incubation period to be welldefined, the invader population must not go extinct. Therefore, we need to condition on the fact that the invader population $m$ hits $N$ before it ever hits 0. For the limiting case $r=1$, corresponding to a perfectly neutral invader, we can show with martingale methods that the resulting distribution of incubation periods will be strongly skewed to the right as $N$ gets large (see the Appendix). This is to be expected: there are only a few ways to walk from one to $N$ quickly, while there are many ways to have a long, meandering excursion before finally getting there.
The variance from this conditioned random walk process tends to drown out the effects of network topology. The distribution of incubation periods ends up looking similar for diverse networks (Figure 8), including complex networks (Figure 7b,d,f). So even though no coupon collection happens at low finesses $r\approx 1$, the effect of the conditioned random walk is more than enough to generate rightskewed distributions of incubation periods. In fact, this conditioned random walk mechanism at low $r$ produces an even higher dispersion factor ($\approx 1.7$) than coupon collection does at high $r$ (see Table 1).
Influence of nonstatic population
Request a detailed protocolIn many diseases, it is unlikely that the total network size would remain constant in time. For example, targeted radiation and chemotherapy leads to a loss of mass in both the tumor and the substrate tissue. Depending on the specific physical case, the population levels of invaders and residents can have many nontrivial time dependencies. As a firstorder examination of the effects of timevarying populations, three simple cases were considered on the complete graph for the intermediate fitness of $r=10$. As a baseline, the distribution for a constant population was measured in Figure 9a.
We considered a case when the resident population was growing. At every time step, a new resident node was added with probability $1/N$, which was chosen so that takeover would happen in finite time. Even still, the majority of the run will be spent when the resident population is small, with takeovers and new additions occurring at a roughly even pace. This led to an accentuated level of right skew in Figure 9b.
We then considered a case where the resident population was constantly shrinking. Again, the probability of change was $1/N$ every time step, but this time it decreased the resident population by 1. While there is still a visible right skew in Figure 9c, it was somewhat lessened due to the global shrinkage speeding up the coupon collecting process.
Finally, we considered a randomly varying resident population. Here, the resident population increases or decreases by one every time step, each with probability 1/2. This randomwalking population level also leads to an extreme level of skew in Figure 9d.
Appendix 1
Agreement of geometric and exponential variables I
Proposition: Suppose we have a family of sequences ${({p}_{m})}_{m=1}^{M}$, with $0\le {p}_{m}\le 1$ for all $m$ and $M$, where ${p}_{m}$ may depend on $M$. Define $\text{Geo}(p)$ to be a geometric random variable with distribution
for $k=1,2,\mathrm{\dots}$. Further, let $\mathcal{E}(p)$ be an exponential random variable with distribution
for $x\ge 0$. Given some function $L:=L(M)$ such that ${lim}_{M\to \mathrm{\infty}}L=\mathrm{\infty}$ and
and given ${T}_{G}:={\sum}_{m=1}^{M}X({p}_{m})$, ${T}_{E}:={\sum}_{m=1}^{M}\mathcal{E}({p}_{m})$, and $\mu :={\sum}_{m=1}^{M}1/{p}_{m}$, then
The symbol “$\sim $” means the ratio of characteristic functions goes to one as $N$ gets large. That is, the random variables on both sides converge to each other in distribution as $M$ gets large.
Proof: The proof of this claim simply involves calculating the characteristic functions and taking a limit. We have presented the details elsewhere (OttinoLöffler et al., 2017).
Agreement of geometric and exponential variables II
Proposition: Given the setup in the previous proposition, define ${\sigma}_{G}^{2}=\text{Var}({T}_{G})$ and ${\sigma}_{E}^{2}=\text{Var}({T}_{E})$. If
then
Proof: Our first goal is to show that
We do this by using the proposition in ‘Agreement of geometric and exponential variables I’, substituting ${\sigma}_{G}$ for $L$. First we check that Equation (8) is satisfied. Notice that
by hypothesis. Hence
But notice that
Therefore, the proposition is proven.
Condition for normality
Proposition: Let $T={\sum}_{m=1}^{M}\mathcal{E}({p}_{m})$, define ${\sigma}^{2}=\text{Var}(T)={\sum}_{m}^{M}{p}_{m}^{2},$ and let ${lim}_{M\to \mathrm{\infty}}{p}_{m}\sigma =\mathrm{\infty}$. If
then
Proof: Apply the LindebergFeller central limit theorem (Durrett, 1991) to the random variables
By construction, ${\sum}_{m}({Y}_{m},M)=(T\mu )/\sigma $, $E[{Y}_{m,M}]=0$ and ${\sum}_{m}E[{Y}_{m,M}^{2}]=1$. So in order to apply the theorem (and thereby get our desired result), we simply need satisfy the Lindeberg condition for any $\u03f5>0$, as given by
Notice that ${Y}_{m,M}<\u03f5$ implies
By hypothesis, the right hand side will eventually be less than 0, meaning that eventually ${Y}_{m,M}<\u03f5$ will be impossible. So if we define ${c}_{m}=1+\u03f5{p}_{m}\sigma $, then we know that $\underset{M\to \mathrm{\infty}}{lim}{\text{Lind}}_{M}=\underset{M\to \mathrm{\infty}}{lim}{\displaystyle \sum _{m=1}^{M}}E[{Y}_{m,M}^{2};{Y}_{m,M}>\u03f5].$
So for large enough $M$, we have
By hypothesis, ${p}_{m}\sigma $ grows without bound, so the term ${p}_{m}^{2}{\sigma}^{2}$ will be dominant. Therefore, there is some constant $D$ such that we can create the upper bound
From here, we can apply the hypothesis to get $\underset{M\to \mathrm{\infty}}{lim}{\text{Lind}}_{M}\le \underset{M\to \mathrm{\infty}}{lim}D{\displaystyle \sum _{m=1}^{M}}\mathrm{exp}\left(\u03f5{p}_{m}\sigma \right)=0.$
Thus the Lindeberg condition holds. Therefore, by applying the theorem, we conclude that
Normally distributed fixation times for 1D lattice
Let us start with a onedimensional (1D) ring of $N$ nodes, with exactly one invader at the start. Under infinite$r$ Birthdeath (Bd) dynamics, a uniform random invader gives birth at every time step and replaces one of its neighbors, also uniformly at random.
Because of the simple topology of the ring, the growing chain of invaders will always advance from the left or right ends. This means that if there are currently $m$ invaders, then the probability of a new invader being added in this time step is exactly given by
for $m=1,2,\mathrm{\dots},M$, where we have defined $M:=N1$.
The probability of spending exactly $t$ time steps at $m$ invaders is given by the odds of doing nothing for exactly $t1$ steps and then advancing at the last step, and so is given by ${(1{p}_{m})}^{t1}{p}_{m}$. In other words, the time spent with $m$ invaders is given by the geometric random variable $\text{Geo}({p}_{m})$. Therefore the total fixation time $T$ is given by $T={\sum}_{m=1}^{M}\text{Geo}({p}_{m})$.
By applying the results of section ‘Agreement of geometric and exponential variables II’, we can switch to using exponential random variables. From here we wish to apply results from section ‘Condition for normality’, so we need to check if ${\sum}_{m=1}^{M}\mathrm{exp}(\u03f5{p}_{m}\sigma )$ converges to zero, given $\sigma ={\sum}_{m=1}^{M}{p}_{m}^{2}$. Using asymptotics and a constant $D$, we find
This lets us cite our proposition, and conclude that the fixation time is asymptotically distributed as a normal, with
Gumbel distributed fixation times for star graph
Next consider the infinite$r$ Bd dynamics on a star graph. This network consists of one ‘hub’ node and $N$‘spoke’ nodes, with edges exclusively between the hub and spokes. We will place the initial invader at the hub, since starting from a spoke is a trivial perturbation off of that.
Now, given that $m$ of the $N$ spokes are invaders, then the odds of another one turning into an invader in one time step is simply the odds of choosing the hub from the set of all invaders, times the probability of replacing an existing healthy spoke. So
for $m=0,1,\mathrm{\dots},N1$. As before, the fixation time $T$ is the sum of geometric random variables $\text{Geo}({p}_{m})$. However, it is easy to use section ‘Agreement of geometric and exponential variables II’ to show that the total fixation time is well approximated by the sum of exponential random variables $\mathcal{E}({p}_{m}),$ given we normalize by ${N}^{2}$. So we know
The crux of finding the limiting distribution of $T$ is noticing that $N{p}_{m}\approx (Nm)/N$ for large $m$. The sequence ${r}_{m}:=(Nm)/N,m=0,\mathrm{\dots},N1$ corresponds to the wellknown coupon collector’s problem (see main text). Imagine a child trying to complete a collection of $N$ cards by buying one random card each week. The probability of getting a new card the first week is 1; the probability of getting a new card after the first has been collected is $(N1)/N$; the probability of getting another new card after two cards have been collected is $(N2)/N$; and so on until the probability of getting the last card is $1/N$.
The time ${T}_{C}$ to complete this collection (which is also well approximated by the sum of exponentials) has been the subject of much historical study. In fact, an exact distribution for ${T}_{C}$ in the limit of large $N$ is known (OttinoLöffler et al., 2017; Pósfai, 2010; Feller, 1968; Erdős and Rényi, 1961; Baum and Billingsley, 1965; Rubin and Zidek, 1965), and is given in normalized form by
where $\gamma \approx 0.5772$ is the EulerMascheroni constant.
Therefore, if we can connect our fixation time $T$ to the coupon collector’s time ${T}_{C}$, we would know its distribution. We do so by taking the ratio of their respective characteristic functions, using their respective approximations as exponential random variables. Letting $k=Nm$, we find a characteristic function
for our fixation time, and
for the coupon collector’s time.
Taking the ratio gives
Taking the Taylor expansion of the logarithm for $N\gg 1$, we can substitute in an appropriate function ${R}_{m}$ which gets small as $N$ gets large. Thus
The first sum is bounded above in norm by a constant times $\mathrm{log}(N)$, so the first term goes to zero as $N$ gets large. Similarly, the second term goes to zero quickly, meaning that ${\varphi}_{C}/\varphi \to 0$. Hence
Using Equation (16) to connect Equation (14) to Equation (15), we find
as desired.
Normally distributed fixation times for 2D lattice
We wish to find the limiting distribution of fixation times of infinite$r$ Bd growth on a $d$dimensional square lattice, assuming periodic boundaries. We will eventually focus on the twodimensional (2D) lattice, but let us set up every case for $2\le d<\mathrm{\infty}$ right now.
Unlike the previous cases, the probability of adding a new invader always depends on the exact configuration of the existing invaders. So we can no longer define exact values for ${p}_{m}$ that describe the fixation time $T$ as a simple sum of random variables.
But even though we do not know the exact shape of the invader cluster, the simple network structure can motivate a reasonable approximation. In particular, given a sufficiently smooth and convex volume $V$ in a $d$dimensional lattice, we should expect the volume to have a surface area proportional to ${V}^{\eta}$, where $\eta =11/d$.
Assuming this to be true, recall the basic dynamics of infinite$r$ Bd with $N$ nodes and $m$ invaders. First, we uniformly select one node out of the population of invaders, which is always a probability of 1/$m$per node. Then we replace one of the invader’s neighbors, uniformly at random. However, only invaders on the surface of the cluster even have a chance at replacing a healthy node!
Given sufficient regularity of the boundary of the cluster, this means that the probability of an invader replacing a healthy node is proportional to
Using this logic, we expect the probability of adding an invader to go as ${m}^{\eta}/m$ at the start. However, remember we have a periodic lattice, so using ${m}^{\eta}$ as the surface area of the cluster stops being true halfway through, and using the smaller healthy cluster’s volume is a better approximation. In other words, the surface area grows as ${m}^{\eta}$ at the start when the cluster begins forming, and shrinks as ${(Nm)}^{\eta}$ at the end when there are only a few healthy cells left.
This intuitive reasoning suggests that the ‘true’ probability of adding a new invader, given that there are already $m$ of them, should be roughly proportional to
The fact that we only estimated ${q}_{m}$ up to proportionality is adequate, because when we use section ‘Agreement of geometric and exponential variables II’, any multiplicative factors will just be absorbed by the variance anyway when we write down the normalized distribution.
The case of $d=2$ is special, so let us plug in $d=2$ and $\eta =1/2$ where appropriate. This gives
Now use section ‘Agreement of geometric and exponential variables II’. Skip ahead to defining $T$ to be the sum of exponential random variables, and split the sum in half. Thus
We assume $N$ is even without loss of generality.
First, we show that ${T}_{a}$ is normally distributed using the section ‘Condition for normality’. To use this result, we first need to calculate $\text{Var}({T}_{a})$. This is exactly
Therefore, we have
as $N\to \mathrm{\infty}$, so the first condition is satisfied.
Next, we need to show that a certain sum of exponentials converges to zero. In particular, for any $\u03f5>0$, we examine
We can make the bound $\sqrt{\text{Var}({T}_{a})}>N/8$ and ${q}_{m}=1/\sqrt{m}>1/\sqrt{N}$, so
Therefore ${S}_{N}\to 0$ as $N$ gets large. So the second condition is satisfied. Therefore, ${T}_{a}$ is distributed according to a normal.
However, we need to do the same to ${T}_{b}$. So let us estimate the variance of this second contribution. Letting $k=Nm,$ we get
So for large $N$, we obtain the following bound:
Therefore,
This satisfies the first condition from section ‘Condition for normality’.
To satisfy the second condition, we again fix some arbitrary $\u03f5>0$ and calculate a certain sum of exponentials. By calculating and choosing careful bounds, we get
This sum can be approximated from above by an appropriate integral:
Therefore, ${S}_{N}\to 0$ as $N$ gets large. This satisfies the second condition in section ‘Condition for normality’, meaning that we now know that ${T}_{b}$ is distributed as a normal.
Since the sum of normal variables returns a normal variable, this means that $T={T}_{a}+{T}_{b}$ is also normal. Hence, we expect for the fixation time of infinite$r$ Bd on a 2D lattice to be distributed as a normal, like the 1D ring lattice but, as we will now show, unlike $d\ge 3$.
Nonnormality for $d\ge 3$
Here, we wish to find the limiting distribution of fixation times of infinite$r$ Bd growth on a $d$dimensional square lattice, assuming periodic boundaries. Right now, we will look only at $3\le d<\mathrm{\infty}$, since $d=1,2,$ and $\mathrm{\infty}$ are special cases.
We did the bulk of the setup for this case in section ‘Normally distributed fixation times for 2D lattice’, so we have the approximate probabilities of adding an invader to be given by
as before. And again, we define the ‘approximate’ fixation time $T$ to be the sum of the exponential random variables $\mathcal{E}({q}_{m})$. Splitting the sum into a front and back half gives
But even with such aggressive approximations, we cannot present a closed form for the distribution of $T$. However, we can still calculate an important quantity: the skew of the distribution.
Since we use section ‘Agreement of geometric and exponential variables II’, let us skip to defining $T$ to be the sum of exponential random variables, and split the sum in half, as we did with the 2D lattice. Thus
We assume $N$ is even without loss of generality.
First, let us find which half contributes more variance. Bound Var(${T}_{a}$) as
We similarly approximate Var(${T}_{b})$, setting $k=Nm$ and finding
Since $\eta \ge 2/3$, only the first term survives as $N$ gets large, so
where $\zeta $ is the usual Riemann zeta function. Notice that $2>2/d+1$ for $d\ge 3$.
To use both these variances, recall the skewness summation formulation: if we have random variables ${X}_{i}$ with variances ${\sigma}_{i}^{2}$ and skews ${\kappa}_{i}$, then their sum has a skewness of
So this means that
as $N\to \mathrm{\infty}$. Here, we have used the fact that ${T}_{a}$ has a finite skew (actually, it is easy to use section ‘Condition for normality’ to show that ${T}_{a}$ is distributed as a normal, and thus has zero skew.) Hence the asymptotic skew of $T$ is just the asymptotic skew of ${T}_{b}$.
We can calculate the skew of ${T}_{b}$ by reusing Equation (20), this time on the exponential variables defining ${T}_{b}$. Therefore
By Equation (19), the denominator limits to ${N}^{3}\zeta {(2\eta )}^{3/2}$. Meanwhile, the numerator looks like
The first term in the parentheses converges to $\zeta (3\eta )$, whereas the rest of the terms converge to 0. Combining the numerator and denominator gives us the conclusion that the skew for the full distribution is given by
Recall that $\eta =11/d$, so the denominator diverges for $d=2$ and both then numerator and denominator diverge for $d=1$. However, since this expression for $\text{Skew}(T)$ is monotone increasing in $\eta $ for $2/3\le \eta <1$, every dimension $d\ge 3$ will attain a unique skew, and therefore a unique limiting distribution. Moreover, if we take $d\to \mathrm{\infty}$, then $\eta \to 1$, and therefore the skew becomes $12\sqrt{6}\zeta (3)/{\pi}^{3}$, which is exactly the skew for a Gumbel distribution, supporting our assertion that highdimensional systems attain higher skews.
For the purposes of estimating the distributions at finite $N$, it is convenient to use the random variable
where
Since the second half of the dynamics contribute the majority of the variance, we should expect this to provide a reasonable approximation of $T$.
Asymptotic skew of conditioned random walk
When $r=1$ for Bd dynamics on the complete graph, the number of invaders becomes very flexible. In fact, the probability of adding an invader on any time step is exactly equal to the probability of removing an invader, so
Hence the probability that the next event increases the number of invaders is always 1/2. This means that, if we ignore the waiting times, the population of invaders obeys a simple random walk on the values $m=1,\mathrm{\dots},N1$ with 0 and $N$ acting as absorbing states. So we can understand the times required to take over the network by understanding these simple dynamics. To set up the analysis, let ${X}_{n}$ denote the number of invaders after $n$ population changes. Therefore
where ${x}_{i}\in \{1,+1\}$, each with probability 1/2. We will use the waitomitted time $n$ in this section as a firstorder approximation of the true takeover time. This way, the present analysis is generalizable to most networks. Moreover, scaling and numerical arguments based on the results here can show that the bulk of the final distribution is defined by this randomwalk process.
Although we always start at $m=1$, we only care about the invader takeover result because that is the only case for which disease symptoms would be manifested. Let us define the stopping time
which records the first time the random walk ${X}_{n}$ hits the value $m$. Given that the invader population cannot go negative or above $N$, the walk’s stopping time is
In the main text, we cared only about the conditioned random walk times and whether they tended to be right skewed. So we now study the first few conditional moments
for $i=1,2,3$.
It isn’t hard to set up a linear recurrence relation to find the probability of hitting 0 or $N$. In fact, if the state of the walk is at $m$, then the probability of hitting $N$ is exactly
This is a useful fact for simulation; instead of directly simulating ${X}_{n}$ and discarding all the cases that hit 0, we can directly simulate the conditioned random walk. If we define
and treat it as a Markov chain, then we can easily calculate the transition probabilities. By applying Bayes’s law,
While this speeds up simulations by a good deal in certain cases, it is not particularly useful for quantifying the distribution of the random walk times themselves.
To identify the moments of $T$, we will want to apply results from martingale theory in general, and optional stopping in particular. To start, define the random variable
We are going to want this to be a martingale. Let’s define ${\mathcal{F}}_{n}$ to be the sigma field consisting of all information from the first $n$ steps of the random walk. Therefore, $E({x}_{n+1}{\mathcal{F}}_{n})=0$, since we cannot predict the direction of the next step. However, we do know that $E({x}_{n+1}^{2}{\mathcal{F}}_{n})=1$, because the steps will always be size 1, regardless of our ignorance. Putting this together gives
So ${M}_{n}^{(1)}$ wellapproximates its future, meaning that it is a proper martingale.
Thank to this, we can cite an optional stopping theorem (Durrett, 1991). So we expect the expectation of this variable to be the same at the stopping time as at the start, so
Since we start at $n=0$ and ${X}_{0}=1$, the left hand side trivially gives
However, the right hand side gives something a bit more complicated, since we need to condition on the possible endpoints, remembering we start at ${X}_{0}=1$. So
Thus we have now calculated both sides of Equation (22), meaning that we now have a value for the first moment of time of the conditioned random walk, given by
The procedure to find the next two moments is not too substantially different: just repeat the same steps of verification and evaluation on ${M}_{n}^{(1)}$’s siblings
These two martingales reveal the next two moments, which are given by
This is all we need to compute the asymptotic skew of the conditioned fixation times. In the limit of large $N$, only the dominant terms of each ${\mu}_{i}$ will survive. The $N$’s cancel out in this limit, leading to a constant given by
The conclusion is that in the limit of large $N$, the skew of the distribution of fixation times for a conditioned random walk will always be positive. Therefore, we expect rightskewed distributions to be typical for the $r=1$ limit of Birthdeath dynamics.
References

The distribution of incubation periods of neoplastic diseasesAmerican Journal of Epidemiology 99:92–100.https://doi.org/10.1093/oxfordjournals.aje.a121599

Asymptotic distributions for the coupon collector's problemThe Annals of Mathematical Statistics 36:1835–1839.https://doi.org/10.1214/aoms/1177699813

Differentiation of types of poliomyelitis viruses III. The grouping of fourteen strains Into three basic immunological typesAmerican Journal of Hygiene 49:234–245.

On a classical problem of probability theoryPublication of The Mathematical Institute of the Hungarian Academy of Sciences 6:215–220.

BookAn Introduction to Probability Theory and Its Applications: Volume INew York: John Wiley & Sons.

BookDe Contagione et Contagiosis Morbis et Eorum Curatione, Libri IIIGP Putnam and Sons.

Exact stochastic simulation of coupled chemical reactionsThe Journal of Physical Chemistry 81:2340–2361.https://doi.org/10.1021/j100540a008

Vesical tumours induced by chemical compoundsOccupational and Environmental Medicine 6:65–81.https://doi.org/10.1136/oem.6.2.65

Criteria for the use of Sartwell's incubation period model to study chronic diseases with uncertain etiologyJournal of Clinical Epidemiology 45:1071–1080.https://doi.org/10.1016/08954356(92)90147F

The effect of selection in a haploid genetic populationMathematical Proceedings of the Cambridge Philosophical Society 54:463–467.https://doi.org/10.1017/S0305004100003017

Early efforts in modeling the incubation period of infectious diseases with an acute course of illnessEmerging Themes in Epidemiology 4:2.https://doi.org/10.1186/1742762242

BookEvolutionary Dynamics: Exploring the Equations of LifeHarvard University Press.

Takeover times for a simple model of network infectionPhysical Review E 96:012313.https://doi.org/10.1103/PhysRevE.96.012313

A lognormal approximation for the collector’s problemThe American Statistician 52:175–180.

ReportA Waiting Time Distribution Arising From the Coupon Collector’s ProblemStanford, United States: Department of Statistics, Stanford University.

The distribution of incubation periods of infectious diseaseAmerican Journal of Hygiene 51:310–318.

The incubation period and the dynamics of infectious diseaseAmerican Journal of Epidemiology 83:204–216.https://doi.org/10.1093/oxfordjournals.aje.a120576

Ninetythree persons infected by a typhoid carrier at a public dinnerJournal of the American Medical Association LXIII:1537–1542.https://doi.org/10.1001/jama.1914.02570180023005

Somatic clonal evolution: A selectioncentric perspectiveBiochimica et Biophysica Acta (BBA)  Reviews on Cancer 1867:139–150.https://doi.org/10.1016/j.bbcan.2017.01.006

Attack rate and incubation period of measles: significance of age and of conditions of exposureAmerican Journal of Diseases of Children 67:15–21.https://doi.org/10.1001/archpedi.1944.02020010022002

The basic birthdeath model for microbial infectionsJournal of the Royal Statistical Society. Series B 27:338–360.
Decision letter

Ben CooperReviewing Editor; Mahidol Oxford Tropical Medicine Research Unit, Thailand
In the interests of transparency, eLife includes the editorial decision letter and accompanying author responses. A lightly edited version of the letter sent to the authors after peer review is shown, indicating the most substantive concerns; minor comments are not usually included.
Thank you for submitting your article "Evolutionary dynamics of incubation periods" for consideration by eLife. Your article has been reviewed by three peer reviewers, and the evaluation has been overseen by a Reviewing Editor and Naama Barkai as the Senior Editor. One of the reviewers, Martin A Nowak, has agreed to share his name.
The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission.
Essential revisions:
1) All the reviewers found the calculations extremely interesting and considered the results to be novel and important. However, there was a shared concern that the connection of these results with the biological phenomenon of the incubation period was not on firm ground. In particular, the assumption that 100% – or close to 100% – of the host cells are infected when symptomatic infection starts was not wellmotivated, and the biological plausibility of this assumption was unclear.
2) We would like to encourage the authors to reconsider the interpretation of their findings. Ideally they should provide a broader a list of biological phenomena (not only incubation periods) that could be described by their model.
3) If they want to strengthen their interpretation of incubation periods, a mechanistic description how the model describes specific diseases is needed along with a discussion of possible caveats.
4) It would also be of interest to discuss the question of timevarying N (either growing N, or declining N as a result of the pathogen killing host cells). However, following consultation, it was felt that anything more than discussion of this question is beyond the scope of the current work.
Reviewer #1:
Strogatz et al. present a model for why the length of disease incubation periods tend to have a rightskewed probability distribution. The core idea is marvelously simple: even in the absence of complications such as host or pathogen heterogeneity, a simple stochastic model for withinhost disease spread predicts a rightskewed distribution for the time between inoculation and the pathogen reaching a threshold value (corresponding to the threshold of immune activation that marks the end of the incubation period). The main text is unusually clearly written for a pure theory paper.
My primary concerns surround the relatively limited connection back to the biology. The authors consider multiple different possibilities for withinhost network connectivity and provide some biological justification for the different topologies (e.g., structured tissue = 3d lattice, epithelium = 2d lattice, etc.); however:
1) The analysis of the Moran model focuses on the time to fixation of the invading allele (i.e. 100% infection of the modeled network); however, immune activation does not wait until 100% of the population is infected but rather activates at a much lower threshold (admittedly again in a stochastic fashion). One of the key findings of this work is that a rightskewed distribution arises due to the "coupon collector problem" where the last few uninfected nodes in the graph can take a long time to be infected, particularly for highdimensional topologies. However, if the incubation period were to end when, say, any 1% of cells were infected rather than 100%, then the coupon collector problem seems less (or perhaps not at all) relevant.
2) Infection / reproduction in a spatial situation will often happen in parallel rather than serially. The argument that highdimensional topology may lead to a slow down of the final infections appears predicated on the idea that the rate of infection is constant due to the serial nature of the Moran model – but the overall infection rate presumably increases as a function of the surface area of the infected volume (and analogously for other topologies). Does this really not matter for the distribution of time to fixation? This area is not my specialty, but the theory surrounding Fisher's wave of advance may be relevant.
3) Many pathogens cause infections in a bursting fashion (e.g., lytic viruses) for which it is not obvious if the Moran model assumption of one random birth / one random death at a time is relevant.
Reviewer #2:
OttinoLoeffler et al. propose a simple and elegant solution, based on invasion dynamics in structured population, to the interesting observation that incubation periods for a variety of diseases are rightskewed. I found the idea to be very elegant, the text wellwritten, and overall the arguments to be compelling and easy to follow. I especially liked the interpretation for the dispersal coefficients naturally ranging between certain limits. However, I have a few concerns/questions that I would like to see addressed:
1) In order to apply the ideas of evolutionary graph theory the authors assume that the populations are finite (all graphs have only N nodes); however, this is not always true (certainly not for all the diseases that the authors mention in the abstract and introduction) and I would have liked to have seen from the authors at the very least an acknowledgement of this strong assumption and a discussion of how relaxing this assumption might affect the conclusions. It would be even better (and really interesting) if the authors could pick one example of dynamicallygrowing network (this should be doable at least for the complete graph) and see how that affects their predictions.
2) The Deathbirth (DB) dynamic that the authors employ is actually different from the DB dynamic proposed by Ohtsuki et al. 2006 (unlike the BD dynamic which is the same). Ohtsuki et al. consider death to be random (all nodes have probability 1/N) and birth competition to be among the neighbors, proportional to their fitness. Of course, it's no problem proposing a new variant; I'm just curious whether the authors had a biological reason for choosing this variant of the DB update rule. Especially since they find in their Materials and methods (section on truncation) that the update rule actually matters a lot, a fact that has been observed in evolutionary graph theory more broadly. On that note, I thought this result was sufficiently important that it deserved at least a couple of sentences in the Discussion (rather than just being mentioned in the Materials and methods); I had the same reaction to all the results in that section ("Testing robustness to update rule, fitness, and truncation"), which I thought deserved some mention in the main Discussion.
3) My final point is more a question than a concern: the authors apply this method to inhost dynamics and incubation periods; however, it seems like it could apply to epidemiological questions as well (e.g. spread of flu in a population). Have the authors considered the parallels? Are there any data analogous to incubation periods that could be employed to show the applicability of this model to epidemiological questions as well?
Reviewer #3:
In Evolutionary Dynamics of Incubation Periods, OttinoLoffler et al. investigate the distribution of times to fixation and to near fixation of mutants on an evolutionary graph as a model for incubation periods. Using this conceptual framework, they endeavor to explain features of the observed distribution of incubation times, including approximate lognormality with dispersion in the range of 1.11.5, without invoking heterogeneity in invader fitness, disease burden at which symptoms manifest, or initial dosage.
OttinoLoffler et al. provide analytical calculations for the distribution of fixation times on complete, star, and cycle graphs in the infinitefitness mutant limit, as well as numerical simulation results on a variety of other graphs. The distribution of times to fixation for a mutant in a structured population is an extremely important and interesting theoretical question. The authors provide novel and fascinating mathematical results.
I am very much in favor of publication of these findings. I do have some questions or concerns regarding the biological interpretation of the paper.
What is the relationship between the model, specifically time to fixation in a Moran process on a graph, and the biological phenomenon of an incubation period. Clarification about the biological motivation for this theoretical framework and for the different update rules could strengthen the paper. In particular, which of the several diseases mentioned in the paper does the model apply to, and why? Also, clarification regarding the choice of fixation time to quantify the length of the incubation period would strengthen the paper. Lowering the threshold to below 100 percent appears to remove the skew in some instances.
Second, might the distribution of clock times differ from the distribution of step times in the process.
Third, certain model choices produce symmetric, rather than skewed, distributions of times to a threshold. I am curious if a more complete investigation of this observation could shed light on the appearance of skewness in different evolutionary scenarios.
A final small, technical clarification would help to understand the derivation of skewness when r = 1 at the end of the paper. How is the number of steps which do not change the number of mutants being counted? Further, at the end of the fourth paragraph of the subsection “Asymptotic skew of conditioned random walk”, should An(1) be Mn(1)?
Again I wish to emphasize the great interest and novelty of this work. While some issues associated with the interpretation of the paper remain unclear, I think that the authors will be able to address them by pointing to biological scenarios, perhaps even outside of infection, in which their calculations provide valuable insight.
https://doi.org/10.7554/eLife.30212.019Author response
Essential revisions:
1) All the reviewers found the calculations extremely interesting and considered the results to be novel and important. However, there was a shared concern that the connection of these results with the biological phenomenon of the incubation period was not on firm ground. In particular, the assumption that 100% – or close to 100% – of the host cells are infected when symptomatic infection starts was not wellmotivated, and the biological plausibility of this assumption was unclear.
We agree, and have addressed this point in a new paragraph added to Results, called “Testing robustness to update rule and truncation.” The relevant sentences are: “For example, suppose we allow symptoms to occur when invaders take over only a fraction f of the whole network. This is a reasonable consideration as leukemic cells need not take over all the bone marrow before leukemia becomes evident, nor does typhoid need to overwhelm all the cells in the microbiome before causing fever; indeed it is likely far fewer in both cases.” Figure 6 (D, E, F) show that the rightskewed distribution persists for Deathbirth dynamics, even if only 10% takeover triggers the appearance of symptoms. (Actually, the rightskewed distribution can be proven to persist for any fixed, nonzero fraction f. We cite the relevant math literature in the text.)
On the other hand, Birthdeath dynamics are sensitive to the choice of f. They produce a distribution that switches from a Gumbel when f = 100% to a normal distribution for any f less than 100%.
We explain why the two cases differ in the caption to Figure 6. “The difference in sensitivity between the two types of dynamics can be explained intuitively by when coupon collection occurs. […] In contrast, coupon collection occurs near the end of the invasion for Birthdeath dynamics (when residents are scarce), and hence gets truncated when f < 1, giving rise to a normal instead of a rightskewed distribution.”
2) We would like to encourage the authors to reconsider the interpretation of their findings. Ideally they should provide a broader a list of biological phenomena (not only incubation periods) that could be described by their model.
We appreciate this helpful suggestion. In response, we have added the following paragraph at the end of the Discussion: “Aside from their possible application to incubation processes, our results also shed light on a broader theoretical question in evolutionary dynamics: when a mutant invades a structured population of residents, how does the distribution of mutant fixation times depend on the network structure of the population? […] We hope that our exact results for disparate topologies and dynamics will stimulate further investigations of these important questions in evolutionary biology.”
3) If they want to strengthen their interpretation of incubation periods, a mechanistic description how the model describes specific diseases is needed along with a discussion of possible caveats.
We have tried to connect the model more closely to specific diseases. In the section of the Results called “Mathematical model,” the relevant passage reads: “A network of N ≫ 1 nodes is used to represent an environment within a host where a pathogenic agent, such as a bacterium or cancer cell, is invading and reproducing. The network could represent several plausible biological scenarios, for example the intestinal microbiome, where harmful typhoid bacteria are competing against a benign resident population of gut flora in a mixing system (modeled as a complete graph); or it could represent mutated leukemic stemcells vying for space against healthy hematopoietic stem cells within the wellorganized threedimensional bone marrow space (modeled as a 3D lattice); or a flat epithelial sheet with an early squamous cancer compromising and invading nearby healthy cells (modeled as a 2D lattice).”
Regarding possible caveats, we have clarified (in Results, “Mathematical model”) that the model probably does not apply to viruses:
“While Sartwell’s law has been applied to diseases as varied as measles and leukemia, the model we propose makes the most sense for asexually reproducing invaders, like cancer cells or bacteria. […] So, while the general phenomenon of network invasion seems to apply to viruses as well, this model is not well suited to describe their dynamics.”
We also draw a closer connection to cancer dynamics in the Discussion: “On the other hand, it is tempting to speculate that this regime of nearly neutral fitness may be more relevant to cancer development. […] This is also consistent with the suggestion of (Williams and Bjerknes, 1972); the shape of tumors in the model most closely resembled that of real tumors when the fitness of the invaders was only slightly above to neutral.”
4) It would also be of interest to discuss the question of timevarying N (either growing N, or declining N as a result of the pathogen killing host cells). However, following consultation, it was felt that anything more than discussion of this question is beyond the scope of the current work.
We are also interested in the effect of timevarying N, but also agree that a full investigation is outside of the scope of this paper. To give a preliminary sense of what might occur, we have added a new section in the Materials and methods, “Influence of nonstatic population,” as well as an additional figure (Figure 9). These show the results of an initial exploration of the effects of a nonstatic population on the complete graph. We specifically examine cases where the resident population has a chance at growing at every time step, shrinking at every time step, and randomly growing or shrinking at each time step.
Reviewer #1:
[…] My primary concerns surround the relatively limited connection back to the biology. The authors consider multiple different possibilities for withinhost network connectivity and provide some biological justification for the different topologies (e.g., structured tissue = 3d lattice, epithelium = 2d lattice, etc.); however:
1) The analysis of the Moran model focuses on the time to fixation of the invading allele (i.e. 100% infection of the modeled network); however, immune activation does not wait until 100% of the population is infected but rather activates at a much lower threshold (admittedly again in a stochastic fashion). One of the key findings of this work is that a rightskewed distribution arises due to the "coupon collector problem" where the last few uninfected nodes in the graph can take a long time to be infected, particularly for highdimensional topologies. However, if the incubation period were to end when, say, any 1% of cells were infected rather than 100%, then the coupon collector problem seems less (or perhaps not at all) relevant.
Please see our response to Essential revisions comment 1.
2) Infection / reproduction in a spatial situation will often happen in parallel rather than serially. The argument that highdimensional topology may lead to a slow down of the final infections appears predicated on the idea that the rate of infection is constant due to the serial nature of the Moran model – but the overall infection rate presumably increases as a function of the surface area of the infected volume (and analogously for other topologies). Does this really not matter for the distribution of time to fixation? This area is not my specialty, but the theory surrounding Fisher's wave of advance may be relevant.
In the models presented, the overall infection rate does indeed grow with the surface area of the infected population. This has been clarified in Materials and methods, “Birthdeath, other solvable networks,” with the following sentences: “For a twodimensional square lattice, it is more difficult to produce analytical results that are both rigorous and exact. But by making an approximation based on the geometry of the lattice, and using the fact that the population growth rate is proportional to its surface area (see the Appendix, "Normally distributed fixation times for 2D lattice"), we can make a nonrigorous analytical guess about the distribution of the fixation times T.”
This is also elaborated upon in the Appendix, "Normally distributed fixation times for 2D lattice," with: “Assuming this to be true, recall the basic dynamics of infiniter Bd with N nodes and m invaders. First, we uniformly select one node out of the population of invaders, which is always a probability of 1/m per node. Then we replace one of the invader's neighbors, uniformly at random. However, only invaders on the surface of the cluster even have a chance at replacing a healthy node!
Given sufficient regularity of the boundary of the cluster, this means that the probability of an invader replacing a healthy node is proportional to
qm=1m⋅
3) Many pathogens cause infections in a bursting fashion (e.g., lytic viruses) for which it is not obvious if the Moran model assumption of one random birth / one random death at a time is relevant.
We agree, and we have now clarified (in Results, “Mathematical model”) that the model probably does not apply to viruses: “While Sartwell's law has been applied to diseases as varied as measles and leukemia, the model we propose makes the most sense for asexually reproducing invaders, like cancer cells or bacteria. […] So, while the general phenomenon of network invasion seems to apply to viruses as well, this model is not well suited to describe their dynamics.”
Reviewer #2:
OttinoLoeffler et al. propose a simple and elegant solution, based on invasion dynamics in structured population, to the interesting observation that incubation periods for a variety of diseases are rightskewed. I found the idea to be very elegant, the text wellwritten, and overall the arguments to be compelling and easy to follow. I especially liked the interpretation for the dispersal coefficients naturally ranging between certain limits. However, I have a few concerns/questions that I would like to see addressed:
1) In order to apply the ideas of evolutionary graph theory the authors assume that the populations are finite (all graphs have only N nodes); however, this is not always true (certainly not for all the diseases that the authors mention in the abstract and introduction) and I would have liked to have seen from the authors at the very least an acknowledgement of this strong assumption and a discussion of how relaxing this assumption might affect the conclusions. It would be even better (and really interesting) if the authors could pick one example of dynamicallygrowing network (this should be doable at least for the complete graph) and see how that affects their predictions.
Please see our response to Essential revisions comment 4.
2) The Deathbirth (DB) dynamic that the authors employ is actually different from the DB dynamic proposed by Ohtsuki et al. 2006 (unlike the BD dynamic which is the same). Ohtsuki et al. consider death to be random (all nodes have probability 1/N) and birth competition to be among the neighbors, proportional to their fitness. Of course, it's no problem proposing a new variant; I'm just curious whether the authors had a biological reason for choosing this variant of the DB update rule. Especially since they find in their Materials and methods (section on truncation) that the update rule actually matters a lot, a fact that has been observed in evolutionary graph theory more broadly. On that note, I thought this result was sufficiently important that it deserved at least a couple of sentences in the Discussion (rather than just being mentioned in the Materials and methods); I had the same reaction to all the results in that section ("Testing robustness to update rule, fitness, and truncation"), which I thought deserved some mention in the main Discussion.
To clarify this, we have added a feature box (Box 2, “Nomenclature for the Moran Model”), which includes an explanation and table describing and naming the most common Moran models. In particular, we would call the model described by the reviewer “dB,” as opposed to the Db model we used. The relevant sentences are: “To avoid confusion, we use standard abbreviations to distinguish the different models, as illustrated by the table below. […] For example, dB refers to the update rule where the first step uniformly selects a node from the entire population to die, and then one of its neighbors is selected, with probability proportional to fitness, to replace it.”
3) My final point is more a question than a concern: the authors apply this method to inhost dynamics and incubation periods; however, it seems like it could apply to epidemiological questions as well (e.g. spread of flu in a population). Have the authors considered the parallels? Are there any data analogous to incubation periods that could be employed to show the applicability of this model to epidemiological questions as well?
We find this possibility interesting, and have previously considered the parallels. However, a full investigation into the particulars of epidemic dynamics would be beyond the scope of the current paper. Nonetheless a connection can likely be made via evolutionary graph theory, which we mention in a new paragraph added to the Discussion: “Aside from their possible application to incubation processes, our results also shed light on a broader theoretical question in evolutionary dynamics: when a mutant invades a structured population of residents, how does the distribution of mutant fixation times depend on the network structure of the population? […] We hope that our exact results for disparate topologies and dynamics will stimulate further investigations of these important questions in evolutionary biology.”
Reviewer #3:
[...] I am very much in favor of publication of these findings. I do have some questions or concerns regarding the biological interpretation of the paper.
What is the relationship between the model, specifically time to fixation in a Moran process on a graph, and the biological phenomenon of an incubation period. Clarification about the biological motivation for this theoretical framework and for the different update rules could strengthen the paper. In particular, which of the several diseases mentioned in the paper does the model apply to, and why?
Please see our response to Essential revisions comment 3.
Also, clarification regarding the choice of fixation time to quantify the length of the incubation period would strengthen the paper. Lowering the threshold to below 100 percent appears to remove the skew in some instances.
Please see our response to Essential revisions comment 1.
Second, might the distribution of clock times differ from the distribution of step times in the process.
To clarify this important point, we have added a new paragraph to the Results section, “Mathematical model.” The relevant paragraph reads: “Our notion of time in this model is linked directly to the biology of invasion of a reproducing asexual pathogen that divides and replaces other cells sequentially. […] Future iterations of this model could consider deriving an exact scaling between physical time and this biological eventbased updating of time.”
Third, certain model choices produce symmetric, rather than skewed, distributions of times to a threshold. I am curious if a more complete investigation of this observation could shed light on the appearance of skewness in different evolutionary scenarios.
This is a fascinating question. A universal condition to ensure the appearance of symmetric distributions is beyond the scope of this paper, but the current conditions appear to involve infinite invader fitness combined with either a lowdimensional network or a truncation condition that avoids the Coupon Collector’s effect.
A final small, technical clarification would help to understand the derivation of skewness when r = 1 at the end of the paper. How is the number of steps which do not change the number of mutants being counted?
For the sake of generality, the specific calculation mentioned here calculates the skew that arises from the random walk alone, not including the waiting times. In the Appendix section “Asymptotic skew of conditioned random walk,” the following sentences have been added to clarify and justify this choice: “We will use the waitomitted time n in this section as a firstorder approximation of the true takeover time. […] Moreover, scaling and numerical arguments based on the results here can show that the bulk of the final distribution is defined by this randomwalk process.”
Further, at the end of the fourth paragraph of the subsection “Asymptotic skew of conditioned random walk”, should An(1) be Mn(1)?
Thank you, this has been corrected.
Again I wish to emphasize the great interest and novelty of this work. While some issues associated with the interpretation of the paper remain unclear, I think that the authors will be able to address them by pointing to biological scenarios, perhaps even outside of infection, in which their calculations provide valuable insight.
We have added a paragraph about the model’s wider relevance to evolutionary dynamics: “Aside from their possible application to incubation processes, our results also shed light on a broader theoretical question in evolutionary dynamics: when a mutant invades a structured population of residents, how does the distribution of mutant fixation times depend on the network structure of the population? […] We hope that our exact results for disparate topologies and dynamics will stimulate further investigations of these important questions in evolutionary biology.”
https://doi.org/10.7554/eLife.30212.020Article and author information
Author details
Funding
National Science Foundation (DMS1513179)
 Steven H Strogatz
National Institutes of Health (Loan Repayment Grant)
 Jacob G Scott
National Science Foundation (Graduate Student Fellowship DGE1650441)
 Bertrand OttinoLoffler
National Science Foundation (CCF1522054)
 Steven H Strogatz
The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.
Acknowledgements
We thank David Aldous, Rick Durrett, Remco van der Hofstad, Lionel Levine, Piet Van Mieghem, and Steve Schiff for comments.
Reviewing Editor
 Ben Cooper, Mahidol Oxford Tropical Medicine Research Unit, Thailand
Version history
 Received: July 6, 2017
 Accepted: November 29, 2017
 Version of Record published: December 21, 2017 (version 1)
Copyright
© 2017, OttinoLoffler et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.
Metrics

 8,158
 Page views

 711
 Downloads

 18
 Citations
Article citation count generated by polling the highest count across the following sources: Crossref, PubMed Central, Scopus.
Download links
Downloads (link to download the article as PDF)
Open citations (links to open the citations from this article in various online reference manager services)
Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)
Further reading

 Computational and Systems Biology
 Neuroscience
Cerebellar climbing fibers convey diverse signals, but how they are organized in the compartmental structure of the cerebellar cortex during learning remains largely unclear. We analyzed a large amount of coordinatelocalized twophoton imaging data from cerebellar Crus II in mice undergoing 'Go/Nogo' reinforcement learning. Tensor component analysis revealed that a majority of climbing fiber inputs to Purkinje cells were reduced to only four functional components, corresponding to accurate timing control of motor initiation related to a Go cue, cognitive errorbased learning, reward processing, and inhibition of erroneous behaviors after a Nogo cue. Changes in neural activities during learning of the first two components were correlated with corresponding changes in timing control and error learning across animals, indirectly suggesting causal relationships. Spatial distribution of these components coincided well with boundaries of AldolaseC/zebrin II expression in Purkinje cells, whereas several components are mixed in single neurons. Synchronization within individual components was bidirectionally regulated according to specific task contexts and learning stages. These findings suggest that, in close collaborations with other brain regions including the inferior olive nucleus, the cerebellum, based on anatomical compartments, reduces dimensions of the learning space by dynamically organizing multiple functional components, a feature that may inspire newgeneration AI designs.

 Computational and Systems Biology
 Genetics and Genomics
Tissue fibrosis affects multiple organs and involves a masterregulatory role of macrophages which respond to an initial inflammatory insult common in all forms of fibrosis. The recently unravelled multiorgan heterogeneity of macrophages in healthy and fibrotic human disease suggests that macrophages expressing osteopontin (SPP1), associate with lung and liver fibrosis. However, the conservation of this SPP1^{+} macrophage population across different tissues, and its specificity to fibrotic diseases with different etiologies remain unclear. Integrating 15 single cell RNAsequencing datasets to profile 235,930 tissue macrophages from healthy and fibrotic heart, lung, liver, kidney, skin and endometrium, we extended the association of SPP1^{+} macrophages with fibrosis to all these tissues. We also identified a subpopulation expressing matrisomeassociated genes (e.g., matrix metalloproteinases and their tissue inhibitors), functionally enriched for ECM remodelling and cell metabolism, representative of a matrisomeassociated macrophage (MAM) polarization state within SPP1^{+} macrophages. Importantly, the MAM polarization state follows a differentiation trajectory from SPP1^{+} macrophages and is associated with a core set of regulon activity. SPP1^{+} macrophages without the MAM polarization state (SPP1^{+}MAM^{}) show a positive association with ageing lung in mice and humans. These results suggest an advanced and conserved polarization state of SPP1^{+} macrophages in fibrotic tissues resulting from prolonged inflammatory cues within each tissue microenvironment.