Improved inference of population histories by integrating genomic and epigenomic data

Thibaut Sellinger; Frank Johannes; Aurélien Tellier

doi:10.7554/eLife.89470.3

eLife assessment

This important study extends existing sequentially Markovian coalescent approaches to include the combined use of SNPs and hypervariable loci such as epimutations. This is an intriguing addition to infer population size history in the recent past, and the authors provide solid validation of their methods via simulation and analysis of empirical data in Arabidopsis thaliana. Given the increasing availability of such data, this work is a timely contribution and represents a foundation for further developments to explore when and where these methods will be best used.

https://doi.org/10.7554/eLife.89470.3.sa3

Significance of findings

important: Findings that have theoretical or practical implications beyond a single subfield

Strength of evidence

solid: Methods, data and analyses broadly support the claims with only minor weaknesses

During the peer-review process the editor and reviewers write an eLife assessment that summarises the significance of the findings reported in the article (on a scale ranging from landmark to useful) and the strength of the evidence (on a scale ranging from exceptional to inadequate).

Abstract

With the availability of high-quality full genome polymorphism (SNP) data, it becomes feasible to study the past demographic and selective history of populations in exquisite detail. However, such inferences still suffer from a lack of statistical resolution for recent events (e.g. bottlenecks) and/or for populations with small nucleotide diversity. Additional heritable (epi)genetic markers, such as indels, transposable elements, microsatellites or cytosine methylation, may provide further, yet untapped, information on the recent past population history. We extend the Sequential Markovian Coalescent (SMC) framework to jointly use SNPs and other hyper-mutable markers. We are able to 1) improve the accuracy of demographic inference in recent times, 2) uncover past demographic events hidden to SNP-based inference methods, and 3) infer the hyper-mutable marker mutation rates under a finite site model. As a proof of principle, we focus on demographic inference in A. thaliana using DNA methylation diversity data from 10 European natural accessions. We demonstrate that segregating Single Methylated Polymorphisms (SMPs) satisfy the modelling assumptions of the SMC framework, while Differentially Methylated Regions (DMRs) are not suitable as their length exceeds the genomic distance between two recombination events. Combining SNPs and SMPs while accounting for site- and region-level epimutation processes, we provide new estimates of the glacial age bottleneck and post-glacial population expansion of the European A. thaliana population. Our SMC framework readily accounts for a wide range of heritable genomic markers, thus paving the way for next generation inference of evolutionary history by combining information from several genetic and epigenetic markers.

Introduction

A central goal in population genetics is to reconstruct the evolutionary history of populations from patterns of genetic variation observed in the present. Relevant aspects of these histories include past demographic changes as well as signatures of selection. Inference methods based on Deep Learning (DL, [38]), Approximate Bayesian Computation (ABC, [9]) or Sequential Markovian Coalescent (SMC, [40, 58]) aim to infer this information directly from full genome sequencing data, which is becoming rapidly available for many (non-model) species due to decreasing costs. The SMC, in particular, offers an elegant theoretical framework that builds on the classical Wright-Fisher and the backward-in-time Kingman coalescent stochastic models (e.g. [36, 13, 75]). Both models conceptualize Mendelian inheritance as generating the genealogy of a population (or a sample), that is, the unique history of a fragment of DNA passing from parents to offspring. When this genealogy includes the effect of recombination, it is called the Ancestral Recombination Graph (ARG, [27, 79]).

Under the Kingman coalescent model, the true genealogy of a population (or sample) is defined by its topology and branch lengths, and contains the information on past demographic changes and life history traits [50, 63, 68, 70] as well as selective events [13, 75]. The genealogical and the mutational processes of any heritable marker can therefore be disentangled, and the frequency of any given marker state is given by the shape of the genealogy in time (see Figure 1A). A central assumption about heritable genomic markers is that they are generated by two homogeneous Poisson mutation processes, one along the genome and one through time. This entails that mutations in different genealogies are independent due to the effect of recombination [79, 47], and that there are no time periods with a large excess, or a severe lack, of mutations along a genealogy (mutations are independently distributed in time within a DNA fragment). In other words, the frequencies of polymorphisms at DNA markers observed across a sample of sequences are constrained by, as well as inform on, the underlying genealogy at this locus (Figure 1A). To clarify these assumptions, we present a schematic representation of a marker 1 (yellow in Figure 1) which fulfills both homogeneous Poisson processes, in time and along the genome. We also present cases applicable to a second genomic marker 2 that violates the model assumptions, namely by not being heritable (Figure 1B) or by not following a homogeneous Poisson process along the genome (Figure 1C) or in time (Figure 1D).

Schematic distribution of two markers along the genealogy and four genomes.
A) Schematic distribution of marker 1 (yellow star) and marker 2 (green star) along the genealogies in a sample of four genomes, both following a homogeneous Poisson process. B) The green marker 2 is not heritable, so that its distribution is independent from the genealogy. C) The green marker 2 is spatially structured along the genome, violating the distribution of the Poisson process along the genome and conflicting with the genealogy. D) The green marker 2 does not follow a Poisson process through time, *e.g.* a burst of mutations at a specific time point represented by given branches of the genealogies in green. The yellow marker 1 has an identical Poisson process along the genome and the genealogy in all four panels, and for readability, marker 2 exhibits light and dark green states.

Despite the power of the SMC, well-known model violations such as variation in recombination and mutation rates along the genome [5, 4] or pervasive selection [61, 31, 30] can compromise the accuracy of demographic and selective inference [24, 64]. There are two other important issues that have received less attention in the literature. The first issue occurs when the population recombination rate (ρ) is higher than the population mutation rate (θ). In such cases, inferences can be biased if not erroneous [71, 64, 63], because several recombination events cannot be inferred due to the lack of Single Nucleotide Polymorphisms (SNPs for point mutations). This problem affects many species, though interestingly not humans, which have a ratio ρ/θ ≈ 1. A second issue occurs when the mutational process along the genealogy is too slow to be informative about sudden and strong variation in population size (i.e. population bottlenecks), such as during colonization events of novel habitats. The typical low mutation rate of 10⁻⁹ up to 10⁻⁸ (per base, per generation) found in most species therefore places strong limitations on SMC analysis of recent bottleneck events (up to ca. 10⁴ generations ago) when inference is based solely on SNP data. Indeed, bottlenecks are often either not found, or when inferred, their timing and magnitude are not well estimated (inferred smoother than in reality, [31, 64]), even when a large number of samples is used. A typical example is the large uncertainty of the timing and magnitude of the population size bottleneck during the Last Glacial Maximum (LGM) and post-LGM expansion in A. thaliana European populations based on several studies using different accessions and SMC inference methods [2, 19].

Nonetheless, current SMC, DL or ABC inference methods making use of full genome sequence data rely almost exclusively on SNPs for inference [58, 71, 63, 9, 37]. There are both practical and theoretical reasons for using SNPs: They are easily detectable from short-read re-sequencing data and their mutational process is well approximated by the infinite site model [13, 75], simplifying the inference of the underlying genealogy. However, other heritable genomic markers exist whose mutation rates can be several orders of magnitude higher than that of SNPs, and could thus be more informative about recent demographic events. These include microsatellites, insertions, deletions and transposable elements (TEs). Although those heritable markers are not necessarily neutral (such as TEs which are likely to be under weak purifying selection) they contain information on the evolutionary history of the population. Current technological limitations still impede the easy detection and estimation of allele frequencies for many of these markers [81, 53, 76]. For example, identifying insertion/excision variation of transposable elements or copy number variation of microsatellites requires a high quality reference genome and ideally long-read sequencing approaches [53]. In addition to these genomic markers, DNA cytosine methylation is emerging as a potentially useful epigenetic marker for phylogenetic inference in plants [83, 84]. Stochastic gains and losses of DNA methylation at CG dinucleotides, in particular, arise at a rate of ca. 10⁻⁴ up to 10⁻⁵ per site per generation (that is 4 to 5 orders of magnitude faster than DNA point mutations, [73]), and can be inherited across generations [54, 78]. These so-called spontaneous epimutations are likely neutral at the genome-wide scale ([74, 29], but see [49, 54]), and can be easily detected from bisulphite converted short read sequencing data [41, 60]. Recent work suggests that CG methylation data can be used as a molecular clock for timing divergence between pairs of lineages over timescales ranging from years to decades [84].

However, theoretical integration of the above-mentioned (epi)genomic markers into a population genomics and SMC inference framework is not trivial. Because of the high mutation rate, the mutational process at these (hyper-mutable) markers is reversible and more consistent with a finite site, rather than infinite site, model, which can result in extensive homoplasy (as known for microsatellite markers, [20]). Indeed, classic expectations of population genetics diversity statistics, mostly built for SNPs, need to be revised for these hyper-mutable markers [14, 77]. Here we develop the theoretical and methodological inference framework named SMCtheo for the inclusion of additional (potentially hyper-mutable) markers into the SMC. We showcase our model using extensive simulations as well as application to published DNA cytosine methylation data (in genic regions) from local populations of A. thaliana [60, 74]. We demonstrate that integration of hyper-mutable genomic markers into SMC models significantly improves the inference accuracy of past variation of population size, and can even uncover demographic events not uncovered using SNPs alone. Our proof-of-principle approach opens up novel avenues for studying population genetic processes over time-scales that have been largely inaccessible using traditional SNP-based approaches. This may prove particularly useful when exploring recent demographic changes of endangered species as a way to assess their potential for extinction in the context of biodiversity loss and global change.

Results

Theoretical results with two markers underlying the SMC computations

We study polymorphic sites across genomes of several sampled individuals which exhibit several possible markers (DNA nucleotides, methylation, TEs, indels, microsatellites, …). We define any marker by 1) its maximum number of possible states (nb_s), for example nucleotide sites have four states (A, T, C and G) while a methylation site has two states (methylated or unmethylated), and 2) its mutation rate µ, i.e. the rate at which the state of a marker changes into another state per position and per generation [3] (for simplicity we assume equal mutation rates between all bases, known as the Jukes-Cantor model). More specifically, we are interested in two rates: the DNA mutation rate for changes in DNA nucleotides, and the epimutation rate for changes in methylation state. Furthermore, we assume that at each position on the genome only one type of marker can occur and be observed. We obtain as a first theoretical result (equation 1) the probability for a given site in the genome to be identical (P (id)) or segregating (P (seg)) (i.e. polymorphic) in a sample of size two (n = 2, i.e. two sampled chromosomes are compared).

This probability is a function of the time to the most recent common ancestor (TMRCA in text and t_M in equation 1, details in Supplementary Text). The probability for a mutation to occur for a given marker increases with an increased TMRCA [13, 75], but under high mutation rates (and high effective population size) the marker may not be polymorphic in the sample as mutations may be reversed (so-called homoplasy, [20, 14]). In Supplementary Figure S1 we illustrate these properties by computing the probabilities of equation 1 for different mutation rates. The inference of recent demographic events and bottlenecks relies on the presence of polymorphic sites to detect recent coalescent events (TMRCA), and should be improved by using markers with a high (or fast) mutation rate (i.e. hyper-mutable markers).
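
To make the saturation effect concrete, the following minimal sketch computes P (id) and P (seg) for a pair of sequences. It is an illustration under stated assumptions and not necessarily the exact form of equation 1: we assume a symmetric model with nb_s states in which a site leaves its current state at total rate µ per generation, so that the two sampled copies are separated by 2 t_M generations. The function names `p_identical` and `p_segregating` are hypothetical.

```python
import math


def p_identical(t_mrca, mu, nb_s):
    """P(two sampled copies carry the same state | TMRCA in generations).

    Assumption: symmetric nb_s-state model, total per-site rate mu of
    leaving the current state (Jukes-Cantor-like).
    """
    if nb_s < 2:
        raise ValueError("a marker needs at least two states")
    k = float(nb_s)
    # The two lineages are separated by 2 * t_mrca generations in total.
    decay = math.exp(-(k / (k - 1.0)) * mu * 2.0 * t_mrca)
    return 1.0 / k + (1.0 - 1.0 / k) * decay


def p_segregating(t_mrca, mu, nb_s):
    """P(the site is polymorphic in a sample of size two | TMRCA)."""
    return 1.0 - p_identical(t_mrca, mu, nb_s)
```

Under these assumptions P (seg) grows roughly as 2 µ t_M for small µ t_M and plateaus at 1 − 1/nb_s once reverse mutations become frequent, which is the homoplasy effect described above.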

In the following, we simulate data under different demographic scenarios using the sequence simulator program msprime [6, 33], which generates the ARG of n sampled diploid individuals (set to n = 5 throughout this study, leading to 10 haploid genomes). This ARG contains the genealogy of a given sample at each position of the simulated chromosomes. We then process the ARG to create DNA sequences according to the model parameters and the type of marker considered. We first assume a set of genomic markers obtained for a sample size n, and mutating according to a homogeneous Poisson process along the genome and in time (along the genealogy) as in Figure 1A. To simulate the sequence data, we define the number of marker types (any number between 1 and the sequence length) and the proportion of sites of each marker type in the sequence. Each marker is characterized by both parameters nb_s and µ. For simplicity, we simulate sequences with two markers, but note that the method can be easily extended to additional markers. Marker 1 represents 98% of the sequence, and has a per site mutation rate µ₁ = 10⁻⁸ mimicking nucleotide SNP markers under an infinite site model (thus considered as bi-allelic at a given DNA site, [82]). By contrast, marker 2 composes the complementary 2% of the sequence length, with a per site mutation rate of µ₂ = 10⁻⁴ per generation between two possible states. Marker 2 is thus hyper-mutable compared to marker 1 and mimics methylation/epimutation sites. Note that mutation events in markers 1 and 2 are simulated under a finite site model.
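
The two-marker setup can be mimicked with a toy simulation. The sketch below is not the msprime-based pipeline described above: it skips the ARG entirely, considers a single pair of haploid genomes under constant population size, draws one pairwise coalescence time per non-recombining segment, and decides for each site whether it segregates using a symmetric two-state finite site model. The `MARKERS` table and the function names are hypothetical.

```python
import math
import random

# Hypothetical marker table: name -> (proportion of sites, mu per site per
# generation, number of states). Values follow the text above.
MARKERS = {
    "marker1": (0.98, 1e-8, 2),
    "marker2": (0.02, 1e-4, 2),
}


def p_seg(t, mu, nb_s):
    """P(site segregates in a pair | TMRCA t), symmetric finite site model."""
    k = float(nb_s)
    return (1.0 - 1.0 / k) * (1.0 - math.exp(-(k / (k - 1.0)) * mu * 2.0 * t))


def simulate_pair(n_segments, sites_per_segment, pop_size, seed=1):
    """Count sites and segregating sites per marker for one pair of genomes."""
    rng = random.Random(seed)
    names = list(MARKERS)
    weights = [MARKERS[m][0] for m in names]
    counts = {m: {"sites": 0, "segregating": 0} for m in names}
    for _ in range(n_segments):
        # Pairwise coalescence time (generations), diploid size N: mean 2N.
        t = rng.expovariate(1.0 / (2.0 * pop_size))
        for _ in range(sites_per_segment):
            m = rng.choices(names, weights=weights)[0]
            _, mu, nb_s = MARKERS[m]
            counts[m]["sites"] += 1
            if rng.random() < p_seg(t, mu, nb_s):
                counts[m]["segregating"] += 1
    return counts
```

Calling `simulate_pair(200, 500, 10_000)` returns per-marker counts. Although marker 2 covers only about 2% of the sites, it contributes far more segregating sites than marker 1 in such a run, which is the extra signal the hyper-mutable marker is expected to provide.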

We use different SMC-based methods throughout this study. These methods include: 1) MSMC2, used as a reference method [45], 2) SMCtheo, an extension of the PSMC’ [40, 58] accounting for any number of heritable theoretical markers, and 3) eSMC2, which is equivalent to SMCtheo but accounts only for SNP markers [64] (to avoid any bias from implementation differences between SMCtheo and MSMC2). All methods are Hidden Markov Models (HMM) derived from the Pairwise Sequentially Markovian Coalescent (PSMC’) [58] and assume neutral evolution and a panmictic population. The hidden states of these methods are the coalescence times of a sample of size two at each position on the sequence. From the distribution of the hidden states along the genome, all methods can infer population size variation through time as well as the recombination rate [58, 45, 64].
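
The notion of hidden states as discretized coalescence times can be sketched as follows. This is an illustration under simplifying assumptions, not the discretization or emission model used by MSMC2, eSMC2 or SMCtheo: time is split into bins of equal probability under a constant-size pairwise coalescent (mean 2N generations), each bin is summarized by its expected coalescence time, and the emission probability of a segregating site is computed per marker with a symmetric two-state model. All function names are hypothetical.

```python
import math


def time_boundaries(n_states, pop_size):
    """Bin boundaries (generations) giving equal-probability hidden states."""
    mean_t = 2.0 * pop_size
    bounds = [-mean_t * math.log(1.0 - i / n_states) for i in range(n_states)]
    return bounds + [math.inf]


def representative_times(bounds, pop_size):
    """Expected coalescence time within each bin (exponential, mean 2N)."""
    mean_t = 2.0 * pop_size
    reps = []
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        p_lo = math.exp(-lo / mean_t)
        p_hi = 0.0 if math.isinf(hi) else math.exp(-hi / mean_t)
        upper = 0.0 if math.isinf(hi) else (hi + mean_t) * p_hi
        # E[T | lo <= T < hi] for an exponential distribution
        reps.append(((lo + mean_t) * p_lo - upper) / (p_lo - p_hi))
    return reps


def p_seg(t, mu, nb_s=2):
    """P(site segregates in a pair | TMRCA t), symmetric finite site model."""
    k = float(nb_s)
    return (1.0 - 1.0 / k) * (1.0 - math.exp(-(k / (k - 1.0)) * 2.0 * mu * t))


def emission_table(rep_times, rates):
    """Rows: hidden states; columns: P(segregating) for each marker rate."""
    return [[p_seg(t, mu) for mu in rates] for t in rep_times]
```

With `rates = [1e-8, 1e-4]`, the second column of the table varies strongly across the most recent hidden states while the first column stays close to zero there, which illustrates why a fast marker helps to tell recent coalescence times apart.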

The inclusion of hyper-mutable genomic markers improves demographic inference

We assume that the mutation rate of marker 1 is µ₁ = 10⁻⁸ per generation per bp. We use this information to estimate the mutation rate of marker 2, which we vary from µ₂ = 10⁻⁸ to µ₂ = 10⁻² per generation per bp. The estimation results based on simulated data under a constant population size of N = 10,000 are displayed in Table 1. We find that our approach is capable of inferring µ₂ with high accuracy for rates up to µ₂ = 10⁻⁴. However, when the mutation rate µ₂ is 10⁻², our approach underestimates it by a factor of three, suggesting the existence of an accuracy limit. To demonstrate that information can be gained by integrating marker 2 (with µ₂ = 10⁻⁴), we compared the ability of several inference methods to recover a recent bottleneck (Figure 2A). All methods correctly infer the amplitude of population size variation. When accounting only for marker 1 (with µ₁ = 10⁻⁸), MSMC2 and eSMC2 fail to accurately infer the sudden variation of population size. However, with the inclusion of hyper-mutable marker 2, our SMCtheo approach correctly infers the rapid change of population size of the bottleneck (Figure 2A, green). It is encouraging that an accurate estimation of the demography is obtained, even when the mutation rate of marker 2 is unknown (Figure 2A, blue).
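
One plausible intuition for such an accuracy limit is saturation of a two-state marker. The sketch below is not the SMCtheo estimator and does not claim to explain the factor of three reported above; it only assumes a symmetric two-state marker in a constant-size population, for which the expected pairwise heterozygosity is H = 4Nµ / (1 + 8Nµ), and inverts this relation. The function names are hypothetical.

```python
def expected_heterozygosity(mu, pop_size):
    """E[P(seg)] for a pair, symmetric two-state marker, constant size N."""
    x = 4.0 * pop_size * mu
    return x / (1.0 + 2.0 * x)


def mu_from_heterozygosity(h, pop_size):
    """Invert expected_heterozygosity; only defined for 0 <= h < 0.5."""
    if not 0.0 <= h < 0.5:
        raise ValueError("h must lie in [0, 0.5) for a two-state marker")
    return h / (4.0 * pop_size * (1.0 - 2.0 * h))
```

With N = 10,000, H is far from its ceiling of 0.5 for µ = 10⁻⁸, so a small error on H translates into a similarly small error on µ. For µ = 10⁻², H is already very close to 0.5 and a 0.1% underestimate of H changes the recovered rate by almost a factor of two under this toy model.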

Performance of SMC approaches using different markers.
Estimated demographic history of a bottleneck (black line) by SMC approaches using two genomic markers. The estimates by MSMC2 and eSMC2 based on only marker 1 are in orange and red. Estimates from SMCtheo integrating both markers are in green (with known µ₂), and in blue with unknown µ₂. The demographic scenarios are A) 10-fold recent bottleneck with an ancestral population size N = 10,000, B) 10-fold recent bottleneck with an ancestral population size N = 1,000, C) 10-fold bottleneck with an ancestral population size N = 10,000, and D) a very severe (1,000 fold) and very recent bottleneck with incomplete size recovery. In A, B and D, we assume *r/µ*₁ = 1 (with r = µ₁ = 10⁻⁸, µ₂ = 10⁻⁴ per generation per bp) and in C, *r/µ*₁ = 10 (with r = 10⁻⁷, µ₁ = 10⁻⁸, and µ₂ = 10⁻⁴ per generation per bp). In all cases (A, B, C and D) 10 sequences (5 diploid individuals) of 100 Mb were used as input.

Average estimated values of the mutation rate of marker 2 (µ₂), knowing that of marker 1. We use 10 sequences (5 diploid individuals) of 100 Mb (r = µ₁ = 10⁻⁸ per generation per bp) under a constant population size fixed at N = 10,000. The coefficient of variation over 10 repetitions is indicated in parentheses.

Furthermore, some species or populations might feature small effective population sizes (ca. N = 1,000), potentially resulting in reduced genomic diversity. In such cases the inclusion of hyper-mutable markers should also improve demographic inference. We present the results of such a scenario in Figure 2B, where the population size was divided by a factor of 10 compared to the previous scenario in Figure 2A. We find that in the absence of the hyper-mutable marker 2, no approach can correctly infer the variation of population size. From the shape of the inferred demography, methods using only marker 1 do not suggest the existence of a bottleneck followed by recovery (the "U-shaped" demographic scenario is not apparent with the orange and red lines, Figure 2B). Yet, when integrating both markers, the population size can be recovered, even if the mutation rate of marker 2 is not a priori known. In both Figure 2A and B, we assume that marker 2 occurs at a frequency of 2% in the genome. This percentage may be unrealistically high depending on the marker and the species. To test the impact of reducing marker 2 frequency, we repeat the simulations shown in Figure 2A, but set its frequency to as low as 0.1% (a 20-fold reduction). We find that the inclusion of the hyper-mutable marker 2 continues to improve inference accuracy in very recent times, albeit less markedly than in Figure 2A (see Supplementary Figure 2). This suggests that a very small proportion of hyper-mutable genomic sites is sufficient to significantly improve the accuracy of inferences.

All full genome inference methods, especially SMC approaches, display lower accuracy when the population recombination rate (ρ = 4Nr) is larger than the population mutation rate of marker 1 (θ₁ = 4Nµ₁). We simulate sequence data under a bottleneck scenario slightly more ancient than in Figure 2A and assume that ρ/θ₁ = r/µ₁ = 10 and ρ/θ₂ = r/µ₂ = 10⁻³. Our results show that by integrating the genomic marker 2, whose mutation rate is larger than the recombination rate, estimates of the recombination rate as well as past population size variation are substantially improved (Table 2, Figure 2C). For the very severe and very recent bottleneck with incomplete size recovery (Figure 2D), eSMC2 and MSMC2, which analyze only marker 1, identify the bottleneck (albeit smoothed) and only slightly overestimate recent population size. By integrating the hyper-mutable marker 2, our SMCtheo approach correctly infers the strength and time of the bottleneck when µ₁ and µ₂ are known (Figure 2D, green line), while the timing of the bottleneck is slightly shifted into the past when µ₂ is unknown and estimated by our method (Figure 2D, blue line). When µ₂ is unknown, SMCtheo additionally infers a spurious sudden variation of population size between 10,000 and 100,000 generations ago. Using only marker 1, the estimates of the recombination rate are inaccurate (Table 2). To complete the visual representation and provide a quantitative assessment of inference accuracy, we compute the root mean square error (RMSE) values for demographic inference (Supplementary Table 1). We further improve the accuracy of estimation by optimizing the likelihood (LH) to estimate the recombination rate and demography compared to the classically used Baum-Welch (BW) algorithm (Table 2 and Supplementary Figure S3). Our results demonstrate that SNPs are limiting and insufficient for accurate inferences in recent times and that the inclusion of an additional marker with a mutation rate higher than the recombination rate generates significant improvements in demographic inference. However, by directly optimizing the likelihood, the true recombination rate can be well recovered even with marker 1 only.
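
An RMSE between a true and an inferred demographic history can be computed in several ways. The sketch below is one possible definition, given as an assumption and not necessarily the one used for Supplementary Table 1: both histories are piecewise-constant functions of time, and squared differences of log10 population sizes are averaged over points evenly spaced in log10 time. The helper names `step_function` and `log_rmse` are hypothetical.

```python
import math


def step_function(times, sizes):
    """Piecewise-constant N(t): times[i] (generations ago, sorted, times[0]
    equal to 0) is the start of the epoch with population size sizes[i]."""
    def n_at(t):
        current = sizes[0]
        for start, size in zip(times, sizes):
            if t >= start:
                current = size
            else:
                break
        return current
    return n_at


def log_rmse(true_hist, inferred_hist, t_min, t_max, n_points=200):
    """RMSE of log10 N(t) over points evenly spaced in log10(time)."""
    lo, hi = math.log10(t_min), math.log10(t_max)
    total = 0.0
    for i in range(n_points):
        t = 10 ** (lo + (hi - lo) * i / (n_points - 1))
        diff = math.log10(true_hist(t)) - math.log10(inferred_hist(t))
        total += diff * diff
    return math.sqrt(total / n_points)
```

Working on a log scale gives recent and ancient epochs comparable weight, and an inferred history that is off by a constant factor of 10 at all times yields a value of 1 under this definition.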

Estimates of recombination rates with one or both markers. For SMCtheo, BW stands for the use of the Baum-Welch algorithm to infer parameters, and LH to the use of the likelihood. We use 10 sequences of 100 Mb with r = 10⁻⁷, µ₁ = 10⁻⁸ and µ₂ = 10⁻⁴ per generation per bp in a population with a past bottleneck event. The coefficient of variation over 10 repetitions is indicated in brackets.

Integrating DNA methylation improves the accuracy of inference

Definition of the theoretical model for DNA methylation

Following the previously encouraging results of demographic inference with SNPs and a hyper-mutable marker under the specific assumptions of Figure 1A, we develop a specific SMCm method to jointly analyse SNPs and CG methylation as an epigenetic hyper-mutable marker. Since our SMCm stems from the eSMC [63, 68], it corrects for the effect of self-fertilization when applied to A. thaliana. We focus here on methylation located in CG contexts within genic regions as these have been found to evolve neutrally [74, 83, 84]. The methylation of individual CG dinucleotides produces a biallelic heritable marker with a finite number of (epi)mutable sites (Figure 3). In a sample of several sequences from a population, variation in the methylation status of individual CGs is known as single methylation polymorphism (SMP, Figure 3A) which could be used for demographic and divergence inference [73, 74]. However, CG methylation sites can also be organized in spatial clusters (of similar state) due to region level epimutation (Figure 3B, [78, 18, 49]). Region level epimutations can have different epimutation rates than individual CG sites. Population-level variation in the methylation status of these clusters is known as differentially methylated regions (DMRs). Furthermore, when integrating SMP and DMR epimutational processes (the latter being what we here call region level epimutation), the methylation status of CG sites is affected by the superposition of both processes. The simulation and modeling of epimutational processes of SMPs is thus more complex than in our previous model, as we need to account for the effect of region methylation as well as for methylation and demethylation epimutation rates that are different and asymmetrical [73, 18].

Schematic representation of site and region epimutations
Schematic representation of a sequence undergoing epimutation at A) the cytosine site level, and B) at the region level. A methylated cytosine in CG context is indicated in black and an unmethylated cytosine in white.

To make our simulations realistic, we use the A. thaliana genome sequence as a starting point, and focus on CG dinucleotides within genic regions. To that end, we select random 1kb regions within genes and choose only those CG sites that are clearly methylated or unmethylated in A. thaliana natural populations based on whole genome bisulphite sequencing (WGBS) measurements from the 1001G project (SI text). Our simulator for CG methylation is built in a similar way as the one described above, but the epimutation rates are allowed to be asymmetric, with a per-site methylation rate (µ_SM) and a per-site demethylation rate (µ_SU). Region-level epimutations are also implemented, setting the region length to either 1kb [49] or 150 bp [18]. The region level methylation and demethylation rates are defined as µ_RM and µ_RU, respectively. We assume that site-level and region-level epimutation processes are independent. Making this assumption explicit later allows us to test if it is violated in comparisons with actual data. Our simulator also assumes that DNA mutations and epimutations are independent of one another. That is, for simplicity we ignore the fact that methylated cytosines are more likely to transition to thymines as a result of spontaneous deamination [28]. We also ignore the possibility that new DNA mutations could act as CG methylation quantitative trait loci and affect CG methylation patterns in both cis and trans. Such events are extremely rare so that the above assumptions should hold reasonably well over short evolutionary time-scales. As the goal is to apply our approach to A. thaliana, we simulate sequence data for a sample size n = 10 (treating A. thaliana as haploid) from a population displaying 90% selfing [63?] under a recent severe population bottleneck demographic scenario. We simulate data assuming previous estimates of the rates of recombination [56], DNA mutation [52], and site- and region-level methylation [73, 18].
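
The asymmetric site-level process can be written down explicitly. The sketch below covers site-level epimutations only (no region-level process) and assumes a discrete-generation two-state Markov chain in which an unmethylated site becomes methylated with probability µ_SM and a methylated site becomes unmethylated with probability µ_SU per generation. It is an illustration, not the simulator described above, and the function names are hypothetical.

```python
def stationary_methylated(mu_sm, mu_su):
    """Long-run fraction of methylated sites under the two-state chain."""
    return mu_sm / (mu_sm + mu_su)


def prob_methylated(t, start_methylated, mu_sm, mu_su):
    """Closed-form P(site is methylated after t generations)."""
    pi_m = stationary_methylated(mu_sm, mu_su)
    lam = (1.0 - mu_sm - mu_su) ** t  # second eigenvalue raised to power t
    if start_methylated:
        return pi_m + (1.0 - pi_m) * lam
    return pi_m * (1.0 - lam)


def prob_methylated_iterative(t, start_methylated, mu_sm, mu_su):
    """Same quantity obtained by stepping the chain one generation at a time."""
    p = 1.0 if start_methylated else 0.0
    for _ in range(t):
        p = p * (1.0 - mu_su) + (1.0 - p) * mu_sm
    return p
```

Because methylation gain and loss rates differ, the chain does not settle at 50% methylation: the equilibrium fraction of methylated sites is µ_SM / (µ_SM + µ_SU), and the memory of the initial state decays geometrically at rate 1 − µ_SM − µ_SU per generation.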

As guidance for future analyses of demographic inference using SNPs and DNA methylation data, the theoretical and empirical analysis of A. thaliana methylomes consists of the following five steps: 1) assessing the relevance of region-level methylation (DMRs) for inference, 2) inference of site and region epimutation rates, 3) comparing statistics for the SNPs, SMPs and DMRs distributions, 4) demographic inference using SNPs with SMPs or DMRs, and 5) demographic inference using SNPs with SMPs and DMRs.

Step 1: assessing the relevance of region-level methylation (DMRs) for inference

We determine our ability to detect the existence of spatial correlations between epimutations. That is, we ask if site-specific epimutations can lead to region-level methylation status changes across a range of epimutation rates (assuming two sequences of 100 Mb, r = µ₁ = 10⁻⁸ per generation per bp and a constant population size N = 10,000, results in Supplementary Table 2). If site-specific epimutations are independently distributed, the probability of a given site to be in a given (methylated or unmethylated) state should be independent from the state of nearby sites (knowing the epimutation rate per site). Conversely, if there is a region effect on epimutation (DMRs), two consecutive sites along the genome would exhibit a positive correlation in their methylated states. We therefore calculate from the per-site (de)methylation rates µ_SM and µ_SU the probability that two successive cytosine positions are identical in their methylation state, assuming they are independent. This probability can be compared to the one observed from methylation data (here simulated) so that we obtain a statistical test for the existence of a positive correlation in the methylation status of nearby sites, interpreted as a regional-level epimutation process (significance threshold of 0.05) according to Figure 1A. A small p-value of the test (<0.05) suggests the existence of a region effect for methylation/demethylation affecting neighbouring cytosines, whereas a high p-value indicates no spatial structure of the methylation distribution. We find that when region epimutation rates are higher than (or similar to) site-level epimutation rates (namely µ_RM ⪆ µ_SM and µ_RU ⪆ µ_SU), the existence of regions of consecutive cytosines is detected with high accuracy. However, when site-level epimutation rates are higher (µ_SU > µ_RU and µ_SM > µ_RM) than region-level epimutation rates, region-level changes cannot be readily detected (Supplementary Table 2). When methylated regions are detected, we can further determine their length using a specifically developed Hidden Markov Model (HMM) using all pairs of genomes (similarly to [65, 18, 69]). While the length of the methylated region is pre-determined in our simulations (1kb or 150bp), site-level epimutations occur which can change the distribution of methylation states in that region and across individuals. DMR regions can thus vary in length along the genome and between pairs of chromosomes.
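
The logic of the test in step 1 can be sketched as a one-sided binomial test. The details of the actual test are given in the SI text, so the code below is only an illustration under additional assumptions: sites are taken to be at the equilibrium of the site-level process, so that a site is methylated with probability µ_SM / (µ_SM + µ_SU), and overlapping pairs of neighbouring sites are treated as independent trials, which is only approximately true. The function names are hypothetical.

```python
import math


def binom_sf(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), with terms computed in log space."""
    if k <= 0:
        return 1.0
    if p <= 0.0:
        return 0.0
    if p >= 1.0:
        return 1.0
    log_p, log_q = math.log(p), math.log1p(-p)
    total = 0.0
    for i in range(k, n + 1):
        log_choose = (math.lgamma(n + 1) - math.lgamma(i + 1)
                      - math.lgamma(n - i + 1))
        total += math.exp(log_choose + i * log_p + (n - i) * log_q)
    return min(1.0, total)


def region_effect_pvalue(states, mu_sm, mu_su):
    """states: 0/1 methylation calls of successive CG sites on one genome."""
    p_meth = mu_sm / (mu_sm + mu_su)
    # Probability that two independent neighbouring sites share a state.
    p_same = p_meth ** 2 + (1.0 - p_meth) ** 2
    pairs = list(zip(states[:-1], states[1:]))
    n_same = sum(a == b for a, b in pairs)
    return binom_sf(n_same, len(pairs), p_same)
```

A run of sites arranged in long blocks of identical state yields a p-value far below 0.05 with this sketch, whereas a sequence with fewer identical neighbours than expected under independence yields a p-value close to 1.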

Step 2: inference of site- and region-level epimutation rates

As the epimutation rates of most plant species remain unknown, we assess the accuracy of SMCm to infer epimutation rates at the site- and region-level directly from simulated data. We first assume that either only site- or only region epimutations can occur, and infer their respective rates (see Supplementary Tables 3 and 4). Our SMCm approach can accurately recover these rates except when these are higher than 10⁻⁴. Next, we assess the accuracy of our approach to simultaneously infer site- and region-level epimutation rates assuming that region and site epimutation rates are equal (Supplementary Table 5 and Supplementary Figure 4). Similar to our previous observation, we find that when the epimutation rates are very high (e.g. close to 10⁻²), accuracy is lost compared to slower epimutation rates. Nonetheless, our average estimated rates are off from the true value by less than a factor of 10. Hence, under our model assumptions, we are able to recover the correct order of magnitude for site- and region-level methylation and demethylation rates.

Step 3: distribution of statistics for SNPs, SMPs and DMRs

To gain insights on the distribution of epimutations under the described assumptions, we look at key statistics from our simulations: the distribution of distance between two recombination events versus the distribution of the length of estimated DMR regions (Figure 4A), and the LD decay for SMPs (in genic regions) and SNPs (in all contexts) (Figure 4C, D). In our simulations DMR regions have a maximum fixed size, but their length depends on the interaction between the region- and site-level epimutation rates. As mentioned in step 1, the methylated/demethylated regions are detected using the binomial test and their length estimated by the HMM. Therefore, while variation exists for the length of these regions (Figure 4A), regions are on average shorter than the span of genealogies along the genome, which are defined by the frequency of recombination events along the genome (r = 3.5 × 10⁻⁸ as in A. thaliana). There is virtually no linkage disequilibrium (LD) between epimutations due to the high epimutation rate (Figure 4C), while the LD between SNPs can range over a few kbp (Figure 4D, as observed in A. thaliana [12, 60]). Note however, that the region methylation process in itself does not generate LD because this measure can only be computed if SMPs are present in frequency higher than 2/n in the sample, i.e. there is no LD measure defined for monomorphic methylated/unmethylated regions. In other words, our simulator generates SNPs, SMPs and DMRs which fulfill the three key assumptions of Figure 1A. We note that by using a constant population size N = 10,000, the LD decay for SNPs is higher than in the A. thaliana data which exhibit an effective population size of ca. N = 250,000 [12] and past changes in size.
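
For readers who wish to reproduce an LD decay curve for biallelic markers (SNPs or SMPs), a minimal sketch is given below. It uses the standard r² statistic between pairs of sites and averages it within distance bins; the binning, the input layout and the function names are assumptions made for illustration and do not describe the pipeline used for Figure 4.

```python
def r_squared(site_a, site_b):
    """r^2 between two biallelic sites coded 0/1 across the same genomes.

    Returns None when either site is monomorphic, since LD is undefined.
    """
    n = len(site_a)
    if n == 0 or n != len(site_b):
        raise ValueError("sites must be non-empty and of equal length")
    p_a = sum(site_a) / n
    p_b = sum(site_b) / n
    if p_a in (0.0, 1.0) or p_b in (0.0, 1.0):
        return None
    p_ab = sum(1 for a, b in zip(site_a, site_b) if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b
    return d * d / (p_a * (1.0 - p_a) * p_b * (1.0 - p_b))


def ld_decay(positions, haplotypes, bin_size, max_dist):
    """Mean r^2 per distance bin; positions must be sorted in bp and
    haplotypes[i] holds the 0/1 states of site i across sampled genomes."""
    sums, counts = {}, {}
    for i in range(len(positions)):
        for j in range(i + 1, len(positions)):
            dist = positions[j] - positions[i]
            if dist > max_dist:
                break
            r2 = r_squared(haplotypes[i], haplotypes[j])
            if r2 is None:
                continue
            b = dist // bin_size
            sums[b] = sums.get(b, 0.0) + r2
            counts[b] = counts.get(b, 0) + 1
    return {b * bin_size: sums[b] / counts[b] for b in sorted(sums)}
```

Monomorphic sites are skipped, in line with the remark above that no LD measure is defined for regions that are uniformly methylated or unmethylated in the sample.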

Key statistics for epimutations and mutations.
A) Histogram of the length between two recombination events (genomic span of a genealogy) and DMRs size in bp of the simulated data. B) Histogram of genealogy span and DMRs size in bp from the *A. thaliana* data (10 German accessions). C) Linkage disequilibrium decay of epimutations in our samples of *A. thaliana* (red) and simulated data (blue). D) Linkage disequilibrium decay of mutations in our *A. thaliana* samples (red) and simulated data (blue). The simulations reproduce the outcome of a recent bottleneck with sample size n = 5 diploid of 100 Mb, the rates per generation per bp are r = 3.5 × 10⁻⁸, µ₁ = 7 × 10⁻⁹, *µ_SM* = 3.5 × 10⁻⁴, *µ_SU* = 1.5 × 10⁻³, and per 1kb region *µ_RM* = 2 × 10⁻⁴ and *µ_RU* = 1 × 10⁻³.

Step 4: demographic inference based on SNPs with SMPs or DMRs

We test the usefulness of either SMPs or DMRs for demographic inference. Simulations under the demographic model from steps 1-3 assume DNA mutations (SNPs) and only site epimutations (SMPs), i.e. no region-level methylation (µ_RM = µ_RU = 0). We perform inference of past demographic history under different amounts of potentially methylated sites with and without a priori knowledge of the methylation/demethylation rates (Figure 5A, B). When the site epimutation rates are a priori known, the sharp decrease of population size can be accurately detected. When epimutation rates are unknown, the shape of the past demographic history is also well inferred except for a scaling issue (a shift along the x- and y-axes similar to that in Figure 5D). When we vary the amount of potentially methylated sites (2%, 10% and 20%) our inference results remain largely unchanged. This suggests that having methylation measurements for as low as 2% of all CG sites being epimutable in the genome is entirely sufficient to improve SNP-based demographic inference (eSMC2 in Figure 5A). The RMSE values for demographic inference are computed for all cases in Figure 5 to provide an additional quantitative understanding of our results (Supplementary Table 6).

Performance of SMC approaches using site epimutations (SMPs) and mutations (SNPs) under a bottleneck scenario.
Estimated demographic history by eSMC2 (blue) and SMCm assuming the epimutation rate is known (B and D) or not (A and C) where the percentage of CG sites with methylated information varies between 20% (red), 10% (orange) and 2% (green) using 10 sequences of 100 Mb in A and B (with 10 repetitions) and 10 sequences of 10 Mb in C and D (three repetitions displayed) under a recent severe bottleneck (black). The parameters are: r = 3.5 × 10⁻⁸ per generation per bp, mutation rate µ₁ = 7 × 10⁻⁹, methylation rate *µ_SM* = 3.5 × 10⁻⁴ and demethylation rate *µ_SU* = 1.5 × 10⁻³ per generation per bp.

The amount of sequence data used in Figure 5A and B is fairly large compared to real datasets (10 haploid genomes of length 100 Mb). We therefore ran the SMCm and eSMC2 on sequence data simulated under the same scenario but with a reduced sequence length of 10 Mb (n = 5 diploid, Figure 5C and D, only 3 repetitions are presented for visibility). In this case, we found that inference is significantly affected when using only SNPs (eSMC2 in blue), as we are unable to correctly recover the demographic scenario. However, incorporating SMPs with known site-level epimutations into the model leads to substantial inference improvements (Figure 5C and D, Supplementary Table 6).

We additionally quantify the accuracy gain in ARG inference by inferring the expected coalescent time (TMRCA) at each position in the genome by the three approaches (eSMC2, SMCm with unknown epimutation rates and SMCm with known epimutation rates) under the same scenario from Figure 5. The RMSE values of the TMRCA inference are presented in Supplementary Table 7. We confirm our intuition that integrating epimutations slightly improves the accuracy of TMRCA when the epimutation rates are known, but does not when the rates are unknown.

To quantify the effect of DMRs on inference, we simulate data under the same demographic scenario, but assume only region-level epimutations (DMRs, µ_SM = µ_SU = 0). The results for DMR region sizes 1kb and 150bp are displayed in Supplementary Figure S5 and S6, respectively. As in Figure 5, we observed a gain of accuracy in inference when region-level epimutation rates are known, while the length of the region (1kb or 150bp) does not seem to affect the result. However, no significant gain of information is observed when integrating DMR data with unknown epimutation rates (Supplementary Figure 5 and 6). In summary, CG methylation SMPs and, to a lesser extent, DMRs can be used jointly with SNPs to improve demographic inference (Supplementary Table 8 presents the corresponding RMSE values for demographic inference shown in Supplementary Figure 5 and 6), especially in recent times (Supplementary Table 6 and 8).

Step 5: demographic inference based on SNPs with SMPs and DMRs

Since site- and region-level methylation processes can occur in real data, we run SMCm on simulated data under the same demographic scenario, but now using both site (SMPs) and region (DMRs) epimutations and accounting for both mutation processes (with rates similar to those found in Arabidopsis thaliana). Inference results are displayed in Supplementary Figure 7 (RMSE values in Supplementary Table 9). When the epimutation rates are unknown, we observe a gain of accuracy when integrating epimutations, especially in recent times. However, when epimutation rates are a priori known, we observe a loss of accuracy when accounting for epimutations. This loss of accuracy is due to the mislabeling of the methylation region status (in step 1) when site and region-level epimutations occur jointly at similar rates (as there will be methylated sites in unmethylated regions and unmethylated sites in methylated regions).

Finally, we assess the inference accuracy when using SNPs and SMPs but ignoring in SMCm the region methylation effect (DMRs), even though this latter process takes place (Supplementary Figure 8, RMSE values in Supplementary Table 10). The inference accuracy decreases compared to the previous results (Supplementary Figure 5-7), and while the sudden variation of population size is somewhat recovered, the estimates of the time and magnitude of size change are not well recovered in recent time. Hence those results demonstrate the importance of accounting for site and region-level epimutation processes in steps 1 to 5.

We demonstrate that our SMCm can exhibit, to some extent, an improved statistical power for demographic inference using SNPs and SMPs while accounting for site and region-level methylation processes under the assumptions of Figure 1A. We show that 1) using SMPs we can unveil past demographic events hidden by limitations in SNPs, 2) the correct demography can be uncovered irrespective of knowing a priori the epimutation rates, 3) ignoring site or region-level processes can decrease the accuracy of inference, and 4) knowing the epimutation rates may improve the estimate of demography compared to simultaneously estimating them with SMCm.

Joint use of SNPs and SMPs improves the inference of recent demographic history in A. thaliana

Step 1: assessing the strength of region-level methylation process in A. thaliana

We apply our inference model to genome and methylome data from 10 A. thaliana plants from a German local population [12]. We start by assessing the strength of a region effect on the distribution of methylated CG sites along the genome. As expected from [18], for all 10 individual full methylomes we reject the hypothesis of a binomial distribution of methylated and unmethylated sites along the genomes, suggesting the existence of region effect methylation (yielding DMRs) meaning that CG are more likely to be methylated if in a highly methylated region, and conversely for unmethylated CG. This is consistent with the autocorrelations in mCG found in [18, 11, 43]. As a first measure of methylated region length, we test the independence between two annotated CG methylation given a minimum genomic distance between them (within one genome). We observe an average p-value smaller than 0.05 for distances up to 2,000bp but then the p-value rapidly increases (>0.4) (Supplementary Figure 9). As a second measure, our HMM (based on pairs of genomes) yields a DMR average length of 222 bp (distribution in Figure 4B).

We conclude that the minimum distance for epimutations to be independent along a genome is over 2kb and spans a larger distance than the typically proposed DMR size (ca. 150 bp in [18] and 222bp in our analysis) and can therefore cover the size of a gene (see [49, 11]). The simulations and data from A. thaliana indicate that the epimutation process that produces DMRs at the population level in plants cannot simply result from the cumulative action of single-site epimutations. This insight is consistent with recent analyses of epimutational processes in gene bodies, which seem to indicate that the autocorrelation in CG methylation is a function of cooperative methylation maintenance and the distribution of histone modifications [11, 43].

Step 2: site- and region-level epimutation rates

We use the rates empirically estimated in A. thaliana and used in the above simulations (µ_SM = 3.5 × 10⁻⁴ and µ_SU = 1.5 × 10⁻³ per bp per generation and µ_RM = 2 × 10⁻⁴ and µ_RU = 1 × 10⁻³ per region per generation, [73, 18]).

Step 3: distribution statistics for SNPs, SMPs and DMRs in A. thaliana

Since our SMC model assumes that DNA, SMP and DMR polymorphisms are determined by the underlying population/sample genealogy, DMRs which span long genomic regions may spread across multiple genealogies and thus violate our modelling assumptions. We thus further investigate the potential discrepancies between the data and our model (Figure 4). We infer the DMR sizes from all 10 A. thaliana accessions using our ad hoc HMM, and measure the bp distance between changes in the expected hidden state (i.e. coalescent time) along the genome, which we interpret as recombination events (called the genomic span of a genealogy). The resulting distributions are found in Figure 4B. We observe that both distributions have a similar shape but DMRs are on average twice as large as the inferred genomic genealogy span: average length of 222 bp (DMR) vs 137 bp (genealogy) and median length of 134 bp (DMR) vs 62 bp (genealogy). This means that on average DMRs are larger than the average distance between two recombination events, thus violating the homogeneous distribution of epimutations along the genome (Figure 1C).

To further unveil potential non-homogeneity of the distribution of epimutations, we assess the decay of LD of mutations (SNPs) and epimutations (SMPs) (Figure 4C and D) confirming the results in [60]. We find the LD between SMPs in the data to be high (and higher than LD between SNPs) for distances smaller than 100 bp (red line in Figure 4C and D). The LD decay of SMPs is much faster than for SNPs (no linkage disequilibrium between epimutations for distances > 100bp), likely stemming from 1) epimutation rates being much higher than the DNA mutation rate, and 2) the high per site recombination rate in A. thaliana. Moreover, since the LD between SMPs at distances smaller than 100bp in A. thaliana is much higher than in our simulations (Figure 4C), we suggest that additional local mechanisms of epimutation processes may not be accounted for in our model of the region-level methylation process.
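As an illustration of how an LD decay curve of this kind can be computed, the following is a minimal Python sketch. It is not the code used for Figure 4: the function names, the 0/1 coding of alleles (or of methylated/unmethylated states) per haploid genome, and the binning by distance are all assumptions made for the example. It uses the squared correlation r² between pairs of biallelic sites and skips monomorphic sites, for which r² is undefined.

```python
from collections import defaultdict


def r_squared(a, b):
    """Squared correlation (r^2) between two biallelic sites coded 0/1 per haplotype."""
    n = len(a)
    p_a = sum(a) / n
    p_b = sum(b) / n
    # r^2 is undefined when either site is monomorphic in the sample.
    if p_a in (0.0, 1.0) or p_b in (0.0, 1.0):
        return None
    p_ab = sum(1 for x, y in zip(a, b) if x == 1 and y == 1) / n
    d = p_ab - p_a * p_b
    return d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))


def ld_decay(positions, haplotypes, bin_size=100, max_dist=1000):
    """Mean r^2 per distance bin.

    positions must be sorted; haplotypes[i] is the 0/1 vector observed
    at positions[i] across the sampled haploid genomes.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for i in range(len(positions)):
        for j in range(i + 1, len(positions)):
            dist = positions[j] - positions[i]
            if dist > max_dist:
                break  # later sites are even further away
            r2 = r_squared(haplotypes[i], haplotypes[j])
            if r2 is None:
                continue
            b = dist // bin_size
            sums[b] += r2
            counts[b] += 1
    # Keys are the lower bound (in bp) of each distance bin.
    return {b * bin_size: sums[b] / counts[b] for b in sorted(counts)}
```

Running the same function separately on SNPs and on SMPs gives two curves that can be compared in the way described above.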

Step 4: demographic inference for A. thaliana based only on SNPs and SMPs

Finally, we apply the SMCm approach to data from the German accessions of A. thaliana. When using SNP data only, the demographic results are similar to those previously found [63, 68] (Figure 6 purple lines), with no strong evidence for an expansion post-Last Glacial Maximum (LGM) [12]. We then sub-sample and analyze segregating SMPs, which exhibit both methylated and unmethylated states in our sample (as in [73]). Here we ignore DMRs and account only for SMPs. When we use as input the methylation and demethylation rates that have been inferred experimentally [73], a mild bottleneck post-LGM is followed by recent expansion (Figure 6 blue lines). By contrast, letting our SMCm estimate the epimutation rates, we find in recent times a somewhat similar but stronger demographic change post-LGM. We find a strong bottleneck event occurring between ca. 5,000 and 10,000 generations ago followed by an expansion until today (Figure 6 green lines). The inferred site epimutation rates are 10,000 times faster than the DNA mutation rate (Supplementary Table 11), which is close to the expected order of magnitude from experimental measures with and without DMR effects [73, 18]. Both estimates thus yield a post-LGM bottleneck followed by a recent population expansion.

Integrating epimutations and mutations on German accessions of *A. thaliana*.
Estimated demographic history of the German population by eSMC2 (only SNPs, purple) and SMCm when keeping polymorphic methylation sites (SMPs) only: green with epimutation rates estimated by SMCm, blue with epimutation rates fixed to empirical values. The region epimutation effect is ignored. The parameters are r = 3.6 × 10⁻⁸, µ₁ = 6.95 × 10⁻⁹, and when assumed known, the site methylation rate is *µ_SM* = 3.5 × 10⁻⁴ and demethylation rate is *µ_SU* = 1.5 × 10⁻³.

These results indicate that the inclusion of DNA methylation data can aid in the accurate reconstruction of the evolutionary history of populations, particularly in the recent past where SNPs reach their resolution limit. This is made possible by the fact that the DNA methylation status at CG dinucleotide undergoes stochastic changes at rates that are several orders of magnitude higher than the DNA mutation rate, and can be inherited across generations similar to DNA mutations.

Step 5: demographic inference correcting for DMRs in A. thaliana

To assess the robustness of our inference results, we run SMCm using all cytosine (CG) sites with an annotated methylation status (segregating or not) while accounting or not for DMRs (Supplementary Figure 10). We fix epimutation rates to the empirically estimated values, and confirm the estimates from Figure 6. When the region-level methylation process is ignored, the inferred demography (blue lines in Supplementary Figure 10) is similar to the estimates from SMPs with fixed rates in Figure 6 (blue lines). When the region-level methylation process is taken into account (orange lines in Supplementary Figure 10), the inferred demography is similar to that of Figure 6 (green lines). In the case where we infer the epimutation rates (site and region), the demographic history inference is not improved compared to that estimated using SNPs only (Supplementary Figure 10, green and red lines) and the inferred epimutation rates are smaller than expected (Supplementary Table 11 and 12), but the ratio of site to region epimutation rates is consistent with empirical estimates [18].

Discussion

Current approaches analyzing whole genome sequences rely on statistics derived from the distribution of ancestral recombination graphs [23, 64, 37, 68, 10, 80, 66, 34]. In this study we present a new SMC method that combines SNP data with other types of genomic (TEs, microsatellites) and epigenomic (DNA methylation) markers. We focus mainly on the inclusion of genomic markers whose mutation rates exceed the DNA point mutation rate, as such (hyper-mutable) markers can provide increased temporal resolution in the recent evolutionary past of populations, and aid in the identification of demographic changes (e.g. population bottlenecks). We demonstrate that by integrating multiple heritable genomic markers, the population size variation in very recent time can be more accurately recovered (outperforming any other methods given the amount of data used in this study [71, 66]). Our results indicate that correctly integrating multiple genomic markers can improve TMRCA inference, which is becoming a field of high interest [37, 26, 44]. Our simulations demonstrate that if the SNP mutation rate is known, the mutation rate of other markers can be recovered (under the condition that the marker follows all hypotheses described in Figure 1). Moreover, our method accounts for the finite site problem that arises at reversible (hyper-mutable) markers and/or where effective population size is high [70, 72]. Overall, the simulator and SMC methods presented here therefore pave the way for a rigorous statistical framework to test if a common ARG can explain the observed diversity patterns under the model hypotheses laid out in Figure 1. We find that comparing LD for different markers along the genome is a useful way to assess violations of our model assumptions.

As proof of principle, we apply our approach on data originating from whole genome and methylome data of A. thaliana natural accessions (focusing on CG context in genic regions, as in [74, 83, 84]). Indeed, A. thaliana presents the largest genetic and epigenetic data-set of high quality. Additionally, the methylation states in CG context have been proven mainly heritable and are well documented [18, 25, 73]. We first investigate the distribution of epimutations along the genomes. Our model-based approach provides strong evidence that DMRs cannot simply emerge from spontaneous site-level epimutations that arise according to a Poisson process along the genome. Instead, stochastic changes in region-level methylation states must be the outcome of spontaneous methylation and demethylation events that operate at both the site- and region-level (as corroborated by [54, 11, 43]). Our epimutation model cannot fully describe the observed diversity of epimutations along the genome [18], meaning that the epimutation processes may indeed be more complex than expected [18, 25, 11, 43]. We observe non-independence between annotated methylation sites spanning genomic regions larger than the span of the underlying genealogy (determined by recombination events), which no model can currently describe. Additionally, we find high LD between SMPs over short distances which does not appear in our simulations (simulations performed under the current measures of epimutation rates). Thus, methylation probably violates the assumptions of a Poisson process distribution along the genome and in time (i.e. Figure 1), in line with recent functional studies [54, 25, 42, 43]. We thus further caution against conclusions on the role of natural (purifying) selection [49] or its absence [74] based on population epigenomic data due to the violation of the above mentioned assumptions.
Additionally, we suspect those model violations to explain the discrepancy between the epimutation rates we inferred and the ones measured experimentally [73, 18]. To solve this discrepancy, one would need to develop a theoretical epimutation model capable of describing the observed diversity at the evolutionary time scale and then use this model to reanalyse the sequence data from the biological experiment to re-estimate the epimutation rates. We thus suggest a possible way forward for modeling epimutations through an Ising model [86] to account for the heterogeneous methylation process. However, our preliminary work and the simulation results in [11] indicate that such a model generates a non-homogeneous mutation process in space (i.e. along the genome) and time, violating our current SMC assumptions (Figure 1C and D). Hence, there is a need to develop a more realistic methylation model for epimutations. A model accounting for heterogeneous rates would probably need to rely on a more sophisticated HMM (e.g. continuous time Markov chains [35] for SMC approaches) than what is presented here or to use other full genome inference methods (see [37]) which are not constrained by the SMC assumptions (Figure 1) but depend on simulations.

Interestingly, the distance of LD decay for SMPs matches quite well the estimated distance between recombination events (Figure 4). In addition to our theoretical results in Table 2, this observation reinforces the usefulness of using SMPs (or any hyper-mutable marker) to improve estimates of the recombination rate along the genome in species where the per site DNA mutation rate (µ) is smaller than the per site recombination rate (r) as in A. thaliana.

Nonetheless, we find that a restricted focus on segregating SMPs in genic regions could meet our model assumptions reasonably well, and thus provides a promising way forward. Using these segregating SMPs, we recover a past demographic bottleneck followed by an expansion which could fit the post-Last Glacial Maximum (LGM) colonization of Europe (although caution must be taken concerning the reliability of those results as pointed out above), a hypothesized scenario [21] which could not be clearly identified using SNPs only from European (relic and non-relic) accessions [12]. Currently, strong evidence from inference methods is lacking ([12], Figure 4 in [19]). Indeed, beyond the limits of using SNPs only, current results are limited by theoretical frameworks unable to simultaneously account for (and disentangle) extensive background selection (reinforced by very high selfing), population structure and variation in molecular rates (e.g. mutation rates, [48]), which are all known to be present in A. thaliana. Those various forces are known to bias inference results when not accounted for [15, 55], and may explain the variance in our demographic estimates. We note also that using CG methylated sites in genic regions may be problematic as the typical genealogies at these loci could be shorter than the genome average due to the presence of background selection, thus making the inference of such short TMRCA more difficult (even with SMPs) than in non-coding regions (which do not harbour desirable CG methylation sites, [73, 74, 83]).

We suggest that simultaneously accounting for multiple heritable markers can help disentangle different evolutionary forces, such as selection and variation in mutation rate: selection has a local effect on the population genealogy, while mutation rate variation would only locally affect the given marker but not the genealogy [15]. The absence of conflicting demography inferred from SNPs and from methylation confirms that, at the time scale of thousands of generations, CG methylation sites are mainly heritable and can be modeled using population genetics theory [14, 74] (but see [54]) and used to estimate divergence between lineages [84, 83]. In other words, fast ecological local adaptation [59] and response to stresses [67] are likely not prominent forces endlessly reshaping CG methylation patterns (non-heritability in Figure 1B).

Overall, our results demonstrate that our approach can be used in different cases. If the evolutionary mechanisms of the epimutations/genomic markers are not well understood [54, 11, 43], our approach provides inference tools to study the markers’ rates and distribution process along the genome, without requiring additional experimental data. If the evolution of epimutations/genomic markers is well understood (including a measure of the mutation rates) and can be modeled to describe the observed intra-population diversity, these can be integrated to improve the SMC performance. Hence when applying our approach to genome-wide genetic and epigenetic data, it is advisable to use accurately annotated markers with, if possible, information regarding their inheritance and mutational properties. Regarding methylation specifically, while the set of gene body methylated genes previously used [74, 84] is likely the optimal choice [83], these are too few and too scattered across the genome to maximize the statistical power of SMC methods. We therefore use methylation sites at all genic regions. Yet, despite the wealth of functional studies and data on methylation in A. thaliana, the distribution of epimutations is not fully understood [25, 54], but independent rates for sites and region-level have been estimated [73, 18, 84]. We note here the promising methylation modelling framework by [11, 43], although it does not yet consider evolutionary processes at the population level. Our results shed light on the inference accuracy in the presence of site and region-level epimutations when occurring at similar rates (Supplementary Figure 7). When accounting for region-level epimutations, our algorithm first needs to infer, via an HMM, the methylation status of a region in order to later compute the epimutation probabilities (i.e. the emission matrix of the SMC HMM).
Hence, in the presence of site and region-level epimutations occurring at similar rates, recovering the region methylation status becomes harder as methylated sites are observed in the unmethylated regions (and unmethylated sites observed in the methylated regions). The mislabelling of the region methylation status leads to accuracy loss due to the use of the wrong emission probability at the later steps of the SMC inference (Forward-Backward algorithm). In the case where epimutation rates are freely inferred, their values are based on the estimated methylation region status. Therefore, even if the inferred rates are incorrect, these are sufficiently consistent with the inferred region methylation status to contain information and slightly improve inference accuracy. Additionally, extra care must be taken when dealing with epigenomic data in other species as the SMP calling might not be as simple as for Arabidopsis thaliana due to potential differences in methylation between different tissues or pools of cells. Similarly, we ignore here the potential dependence between SNPs and SMPs, as more empirical evidence (and modelling) is required to quantify the potential interaction between both mutational processes.

On a brighter note, with the release of new sequencing technology [39], long and accurate reads are becoming accessible, leading to the availability of high quality reference genomes for model and non-model species alike [51, 7]. Additionally, the quality of re-sequencing (population sample) genome data and their annotations is enhanced so that additional markers such as transposable elements, insertions, deletions or microsatellites can be called with increasing confidence. These accurate genomes will provide access to new classes of genomic markers that span the entire mutational spectrum. We therefore expect in the near future an improvement in our understanding of the heritability of many markers besides SNPs. Adding other genomic markers besides SNPs will improve full genome approaches, which are currently limited by the observed nucleotide diversity [34, 66, 62]. Additionally, the potential complexity resulting from integrating multiple independent markers could be tackled by the use of continuous time Markov chains for the emission matrix. We predict that our results pave the way to improve the inference of 1) biological traits or recombination rate through time [17, 68], 2) multiple merger events [37], and 3) recombination and mutation rate maps [5, 4]. Our method should also help to dissect the effect of evolutionary forces on genomic diversity [32, 31], and to improve the simultaneous detection, quantification and dating of selection events [1, 8, 30].

Hence, extending our work by simultaneously integrating diverse types of genomic markers into other theoretical frameworks (e.g. ABC approaches) likely represents the future of population genomics, especially to study species for which many thousands of samples cannot be obtained. We believe our approach helps to develop more general classes of models capable of leveraging information from any type and amount of diversity observed in sequencing data, and thus to challenge our current understanding of genome evolution.

Materials and Methods

Simulating two genomic markers

The sequence is written as a sequence of markers with a given state. Each site is annotated as MXSY, where X indicates the marker type and Y the current state of that marker: for example, M1S1 indicates at this position a marker of type 1 in state 1. To simulate sequences of theoretical markers, we start by simulating an ARG, which is then split into a series of genealogies (i.e. a sequence of coalescent trees) along the chromosome, and create an ancestral sequence (based on the equilibrium probabilities of marker states). Mutation events (nucleotides or epimutations for methylable cytosine) are then added when going along the sequence, i.e. along the series of genealogies. The ancestral sequence is thus modified by mutation events assuming a finite site model [82] conditioned on the branch length and topology of the genealogies. Each leaf of the genealogy is one of the n samples. Our model thus has two important features: 1) markers are independent from one another, and 2) a given marker has a polymorphism distribution between samples (frequencies of alleles) determined by one given genealogy. The simulator can be found in the latest version of the eSMC2 R package (https://github.com/TPPSellinger/eSMC2).

Simulating methylome data

We now focus on methylation data located at cytosines in CG context within genic regions. Only CG sites in those regions are considered "methylable", and CG sites outside those defined genic regions do not have a methylation status and are considered "unmethylable". We vary the percentage of CG sites with an annotated methylation state from 2 to 20% of the sequence length. The simulator can in principle simulate epimutations in different methylation contexts and with different rates [41, 16, 87, 85]. We simulate epimutations as described above but with asymmetric rates: the methylation rate per site is µ_SM = 3.5 × 10⁻⁴, and the demethylation rate per site is µ_SU = 1.5 × 10⁻³ [73, 18]. For simplicity and computational tractability, we assume that when an epimutation occurs, it occurs on both DNA strands which then present the same information. In other words, for a haploid individual, a cytosine site can only be methylated or unmethylated (as in [69]). For region-level epimutations, the region length is either 1kbp [49] or 150 bp [18]. The region-level methylation and demethylation rates are set to µ_RM = 2 × 10⁻⁴ and µ_RU = 10⁻³ respectively (similar to rates measured in A. thaliana, [18]). In addition, unlike for the theoretical markers described above, mutations, site and region epimutations can occur at the same position of the sequence.

To simulate methylation data, we start with an ancestral sequence of random nucleotides and then randomly select regions in which CG sites have their methylation state annotated (representing the genic regions). Cytosines in CG context in those regions are either methylated or unmethylated (noted as M or U). Cytosines in other contexts or regions are considered unmethylable (and noted as C). The ancestral methylation state is then randomly attributed according to the equilibrium probabilities. Our simulator then introduces DNA mutations, site- and region-epimutations in a similar way as described above.
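To make the notion of equilibrium probabilities concrete, here is a minimal Python sketch of a two-state (methylated/unmethylated) site model with asymmetric rates. It is an illustration only, not the eSMC2 simulator: the function names are hypothetical, and the use of a continuous-time two-state Markov chain is an assumption made for the example. The rates are the site-level values quoted in the text.

```python
import math
import random

MU_SM = 3.5e-4  # site methylation rate per generation (value from the text)
MU_SU = 1.5e-3  # site demethylation rate per generation (value from the text)


def equilibrium_methylated(mu_m=MU_SM, mu_u=MU_SU):
    """Stationary probability that a site is methylated."""
    return mu_m / (mu_m + mu_u)


def prob_methylated_after(start_methylated, t, mu_m=MU_SM, mu_u=MU_SU):
    """P(site is methylated after t generations), continuous-time approximation."""
    pi_m = equilibrium_methylated(mu_m, mu_u)
    decay = math.exp(-(mu_m + mu_u) * t)
    start = 1.0 if start_methylated else 0.0
    # The chain relaxes exponentially from its starting state to equilibrium.
    return pi_m + (start - pi_m) * decay


def draw_ancestral_states(n_sites, rng, mu_m=MU_SM, mu_u=MU_SU):
    """Draw ancestral states ('M' or 'U') from the equilibrium distribution."""
    pi_m = equilibrium_methylated(mu_m, mu_u)
    return ["M" if rng.random() < pi_m else "U" for _ in range(n_sites)]


ancestral = draw_ancestral_states(10, random.Random(42))
```

With these rates, roughly 19% of methylable sites are expected to be methylated at equilibrium (3.5 × 10⁻⁴ / 1.85 × 10⁻³).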

SMC Methods

All three methods (eSMC2, SMCtheo and SMCm) are based on the same mathematical foundations and implemented in a similar way within the eSMC2 R package (https://github.com/TPPSellinger) [68, 37, 64]. This allows us to specifically quantify the accuracy gained by accounting for multiple genomic markers.

SMC optimization function

All current SMC approaches rely on the Baum-Welch (BW) algorithm for parameter estimation in order to reduce computational load (as described in [71]). Yet, the Baum-Welch algorithm is an Expectation-Maximization algorithm, and can hence get stuck in local extrema when optimizing the likelihood. As an alternative, we extend SMCtheo to estimate parameters by directly optimizing the likelihood (LH), at a greater cost in computation time (even when using the speed-up techniques described in [57]). We run this approach on a sub-sample of six haploid genomes to limit the required computational time.
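For readers unfamiliar with the distinction, the quantity that a direct likelihood optimization evaluates is the HMM likelihood itself, which can be obtained with a forward pass. The sketch below is a generic, minimal Python version for a discrete HMM and is not the eSMC2 implementation; the function name and the list-based representation of the initial, transition and emission probabilities are assumptions made for the example. A numerical optimizer would call such a function repeatedly with matrices rebuilt from candidate parameter values, whereas Baum-Welch re-estimates parameters from expected counts.

```python
import math


def forward_log_likelihood(obs, init, trans, emit):
    """Log-likelihood of an observation sequence under a discrete HMM.

    obs:   list of observation indices
    init:  init[k] = P(first hidden state is k)
    trans: trans[j][k] = P(next state is k | current state is j)
    emit:  emit[k][o] = P(observation o | hidden state k)
    The forward variables are rescaled at each step to avoid underflow.
    """
    n_states = len(init)
    alpha = [init[k] * emit[k][obs[0]] for k in range(n_states)]
    log_lik = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = [
                sum(alpha[j] * trans[j][k] for j in range(n_states)) * emit[k][obs[t]]
                for k in range(n_states)
            ]
        scale = sum(alpha)
        log_lik += math.log(scale)  # accumulate the log of each scaling factor
        alpha = [a / scale for a in alpha]
    return log_lik
```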

eSMC2 and MSMC2

SMC methods based on the PSMC’ [58], such as eSMC2 and MSMC2, focus on the coalescent events between two individuals (i.e. two haploid genomes or one diploid genome). The algorithm moves along the sequence and estimates the coalescence time at each position by assessing whether the two sequences are identical or different at that position. If the two sequences differ, a mutation took place in the genealogy of the sample. The intuition is that the absence of mutations (i.e. the two sequences are identical) is likely due to a recent common ancestor of the two sequences, whereas the presence of several mutations likely reflects a most recent common ancestor that lies far in the past. In the event of recombination, there is a break in the current genealogy and the coalescence time consequently takes a new value according to the model parameters [46, 58]. A detailed description of the algorithm can be found in [45, 63].

SMCtheo based on several genomic markers

Our SMCtheo approach is equivalent to PSMC’ but takes as input a sequence of several genomic markers. The algorithm moves along a pair of haploid genomes and checks at each position which marker is observed and whether the two states of the marker are identical or not. The approach is identical to the one described above, except that the probability of both sequences being identical at one site depends on the mutation rate of the marker at this site (equation 1). While the mutation rates of many heritable genomic markers are unknown, there is an increasing number of measurements of the DNA (SNP) mutation rate for many species. Our SMCtheo approach is able to leverage the information from the distribution of one theoretical marker (e.g. mutations for SNPs) to infer the mutation rate of the other marker (marker 2), assuming both mutation rates to be symmetrical. If more than 1% of sites are polymorphic in a sequence, we use the finite site assumption. If not, the different mutation rates can be recovered from the observed diversity by simply comparing Watterson’s theta (θ_W) between the reference marker (i.e. with known rate) and the marker with the unknown rate. For example, if the diversity (θ_W) at marker 2 is ten times smaller than at the reference marker 1 (and no marker violates the infinite site hypothesis), the mutation rate of marker 2 is inferred to be ten times smaller (corrected by the number of possible states). However, if marker 2 violates the infinite site hypothesis, a Baum-Welch algorithm is run to infer the most likely mutation rates under the SMC (the description of the Baum-Welch algorithm can be found in [63]).
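The infinite-site shortcut described above can be sketched as follows. This is a minimal illustration, not the eSMC2 implementation: the function names are hypothetical, both markers are assumed to be observed in the same sample, and the correction for the number of possible states mentioned in the text is deliberately left out.

```python
def harmonic_number(n_samples):
    """a_n = sum_{i=1}^{n-1} 1/i, the normalizing constant of Watterson's estimator."""
    return sum(1.0 / i for i in range(1, n_samples))

def watterson_theta(n_segregating, n_samples, n_sites):
    """Per-site Watterson's theta from the number of segregating sites."""
    return n_segregating / (harmonic_number(n_samples) * n_sites)

def violates_infinite_sites(n_polymorphic, n_sites, threshold=0.01):
    """Apply the 1% rule: above the threshold, the finite site model is used."""
    return n_polymorphic / n_sites > threshold

def infer_rate_from_reference(theta_ref, theta_new, mu_ref):
    """Scale the known reference rate by the ratio of the two diversities."""
    return mu_ref * theta_new / theta_ref

# Illustrative numbers only (not taken from the paper).
n_samples, n_sites = 10, 1_000_000
theta_1 = watterson_theta(5000, n_samples, n_sites)  # reference marker, known rate
theta_2 = watterson_theta(500, n_samples, n_sites)   # marker with unknown rate
mu_2 = infer_rate_from_reference(theta_1, theta_2, mu_ref=1e-8)
```

Here marker 2 shows ten times less diversity than the reference marker, so its inferred rate is ten times smaller than the reference rate. When `violates_infinite_sites` returns true, this shortcut no longer applies and the rates are estimated with the Baum-Welch algorithm instead.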

SMCm

When integrating epimutations, the number of possible observations increases compared to eSMC2. As in eSMC2, if the two nucleotides (DNA mutation) at one position are identical at a non-methylable site, we indicate this as 0. If the two nucleotides are different, it is indicated as 1 (i.e. a DNA mutation occurred). When assuming site-level epimutations only, three observations are possible at a given methylable position: 1) if the two cytosines from the two chromosomes are unmethylated, it is indicated as a 2, 2) if the two cytosines are methylated, it is indicated as a 3, and 3) if one cytosine is methylated and the other one unmethylated, it is indicated as a 4. Depending on the mutation, methylation and demethylation rates, different frequencies of these states are possible in the sample of sequences, which provides information on the emission rate in the SMC method. When both site- and region-level methylation processes occur, the methylation state is conditioned on the region-level methylation state (increasing the number of possible observations to 9).
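The site-level coding described above can be written as a short sketch. The state symbols ('M', 'U', and nucleotide letters for non-methylable positions) and the function names are assumptions made for illustration. Positions where a DNA mutation and a methylation state coincide are simply treated as a nucleotide comparison here, which is a simplification.

```python
METHYLATION_STATES = {"M", "U"}

def encode_site(a, b):
    """Return the observation code (0-4) for one position of a pair of haploid genomes."""
    if a in METHYLATION_STATES and b in METHYLATION_STATES:
        if a == b:
            return 2 if a == "U" else 3  # both unmethylated -> 2, both methylated -> 3
        return 4                         # one methylated, one unmethylated
    return 0 if a == b else 1            # non-methylable site: identical -> 0, different -> 1

def encode_pair(seq1, seq2):
    """Encode two aligned sequences of equal length into a list of observation codes."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    return [encode_site(a, b) for a, b in zip(seq1, seq2)]

genome_1 = ["A", "C", "U", "M", "M", "G"]
genome_2 = ["A", "T", "U", "M", "U", "G"]
codes = encode_pair(genome_1, genome_2)
```

The resulting list of codes is the kind of observation sequence an HMM with five emission symbols would take as input. The nine-symbol case with region-level states is not covered by this sketch.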

To choose the appropriate settings for SMCm (i.e. whether there are region-level epimutations), we test whether the methylation states are distributed independently from one another along one genome. In the absence of a region methylation effect, the probability of a site (position) being methylated or unmethylated should be independent of the previous position (or any other position). Conversely, if there is a region effect on epimutation, two consecutive sites along one genome would exhibit a positive correlation in their methylation states (and across pairs of sequences). We therefore calculate the probability that two successive positions with an annotated methylation state are identical under a binomial distribution of methylation along a given genome. We then compare the theoretical expectations to the observed data and build the statistical test based on a binomial distribution of probabilities. If region-level epimutation is detected, the region-level methylation states are recovered through a hidden Markov model (HMM), similarly to [65, 18, 69]. Note that this HMM does not include information on epimutation rates known from empirical studies. The complete description of the mathematical models and probabilities is given in the supplementary material Text S1.
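One possible reading of the independence test described above is sketched below. It is an assumption-laden illustration rather than the test implemented in eSMC2 (see Text S1 for the authors' derivation). If a fraction p of annotated sites is methylated and sites are independent, two successive sites are identical with probability p² + (1 − p)². The number of identical adjacent pairs is then compared to a binomial distribution with that probability, treating adjacent pairs as independent trials, which is itself an approximation.

```python
import math

def log_binom_pmf(k, n, p):
    """Log of the binomial probability mass function, computed in log space."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1 - p))

def binom_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), with 0 < p < 1."""
    return min(1.0, sum(math.exp(log_binom_pmf(x, n, p)) for x in range(k, n + 1)))

def region_effect_pvalue(states):
    """One-sided p-value for an excess of identical successive methylation states.

    `states` is a list of 'M'/'U' symbols (at least two) along one genome.
    """
    n = len(states)
    p_meth = states.count("M") / n
    p_same = p_meth ** 2 + (1 - p_meth) ** 2  # expected under independence
    n_pairs = n - 1
    n_same = sum(1 for x, y in zip(states, states[1:]) if x == y)
    if p_same >= 1.0:                          # monomorphic sequence: nothing to test
        return 1.0
    return binom_upper_tail(n_same, n_pairs, p_same)

clustered = ["M"] * 50 + ["U"] * 50    # strong region-like structure
alternating = ["M", "U"] * 50          # no excess of identical neighbours
p_clustered = region_effect_pvalue(clustered)
p_alternating = region_effect_pvalue(alternating)
```

A small p-value, as for the clustered example, would point towards a region effect and hence towards the SMCm setting with region-level epimutations.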

We postulate that the epimutation rates remain unknown in most species, while the DNA mutation rate may be known (or approximated based on a closely related species). Hence, we develop an approach based on the SMC capable of leveraging information from the distribution of DNA mutations to infer the epimutation rates (similar to what is described above). Our approach first tests whether epimutations violate the infinite site assumption. If less than 1% of sites with an annotated methylation state are polymorphic in a sequence, we use the infinite site assumption: the site- and region-level epimutation rates can be recovered straightforwardly from the observed diversity (θ_W, see above). Otherwise, a Baum-Welch algorithm is run to infer the most likely epimutation rates (site rates for SMPs, and region rates for DMRs) [73, 74, 69].

Calculation of the root mean square error (RMSE)

To quantify the accuracy of each demographic inference, we evaluate the root mean square error (RMSE). To do so, we choose a hundred points uniformly spread across the time window (on a log₁₀ scale) and compare the actual population size with the one estimated by a given method at each of these points. We thus have the following formula:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - y_i^{*}\right)^{2}}$$

where n = 100 is the number of time points, y_i is the true population size at time point i, and y_i^∗ is the estimated population size at time point i.
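The RMSE computation described above can be sketched in a few lines. The helper names and the toy demographic scenario are hypothetical. The sketch compares raw population sizes; comparing log-transformed sizes would be a variant that the text does not specify.

```python
import math

def log_spaced_times(t_min, t_max, n_points=100):
    """Time points uniformly spread between t_min and t_max on a log10 scale."""
    lo, hi = math.log10(t_min), math.log10(t_max)
    return [10 ** (lo + (hi - lo) * i / (n_points - 1)) for i in range(n_points)]

def rmse(true_sizes, estimated_sizes):
    """Root mean square error between true and estimated population sizes."""
    n = len(true_sizes)
    squared = sum((y - y_hat) ** 2 for y, y_hat in zip(true_sizes, estimated_sizes))
    return math.sqrt(squared / n)

def true_size(t):
    """Toy scenario: a smaller population in the recent past, larger before."""
    return 10_000 if t < 1_000 else 50_000

def estimated_size(t):
    """Toy estimate that overshoots the true size by a constant 5,000."""
    return true_size(t) + 5_000

times = log_spaced_times(100, 1_000_000)
error = rmse([true_size(t) for t in times], [estimated_size(t) for t in times])
```

Because the points are spread on a log scale, recent time intervals contribute as many points to the error as ancient ones, even though they span far fewer generations.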

Inference of the Time to the Most Recent Common Ancestor (TMRCA)

To infer the TMRCA at each position of the genome, we use an approach similar to the PSMC’ described in [58]. We first run a forward and a backward algorithm on our sequence data (see the appendices of [63, 71] for computational details). From the output, we calculate the probability of being in each hidden state at each position of the genome (note that the product of the forward and backward outputs is rescaled so that the probabilities sum to one), which we use to compute the expected coalescent time at each position of the genome using the following formula:

$$\mathbb{E}[\mathrm{TMRCA}_i] = \sum_{j=1}^{n} fo_{i,j}\, ba_{i,j}\, Tc_j$$

where i is the position on the genome, j is the hidden state index, n is the number of hidden states, fo is the output of the forward algorithm, ba is the output of the backward algorithm, and Tc is a vector containing all the hidden states (i.e. coalescent times).
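The posterior expectation described above can be illustrated on a toy hidden Markov model. This is a minimal sketch under stated assumptions: two hidden states, made-up transition and emission probabilities, and a generic rescaled forward-backward pass rather than the SMC transition matrix of eSMC2. The names `fo`, `ba` and `tc` follow the symbols in the text; everything else is hypothetical.

```python
def forward_backward(obs, init, trans, emit):
    """Rescaled forward (fo) and backward (ba) tables for a discrete HMM."""
    n_pos, n_states = len(obs), len(init)
    fo = [[0.0] * n_states for _ in range(n_pos)]
    ba = [[1.0] * n_states for _ in range(n_pos)]
    fo[0] = [init[j] * emit[j][obs[0]] for j in range(n_states)]
    scale = sum(fo[0])
    fo[0] = [x / scale for x in fo[0]]
    for i in range(1, n_pos):
        for j in range(n_states):
            incoming = sum(fo[i - 1][k] * trans[k][j] for k in range(n_states))
            fo[i][j] = emit[j][obs[i]] * incoming
        scale = sum(fo[i])
        fo[i] = [x / scale for x in fo[i]]
    for i in range(n_pos - 2, -1, -1):
        for j in range(n_states):
            ba[i][j] = sum(trans[j][k] * emit[k][obs[i + 1]] * ba[i + 1][k]
                           for k in range(n_states))
        scale = sum(ba[i])
        ba[i] = [x / scale for x in ba[i]]
    return fo, ba

def expected_tmrca(fo, ba, tc):
    """Posterior mean coalescent time at each position: sum_j fo*ba*Tc, renormalized."""
    result = []
    for f_row, b_row in zip(fo, ba):
        post = [f * b for f, b in zip(f_row, b_row)]
        total = sum(post)
        result.append(sum(p * t for p, t in zip(post, tc)) / total)
    return result

tc = [100.0, 10000.0]                      # toy coalescent times of the hidden states
init = [0.5, 0.5]
trans = [[0.99, 0.01], [0.01, 0.99]]
emit = [[0.999, 0.001], [0.9, 0.1]]        # emit[state][obs]; obs 0 = identical, 1 = different
obs = [0] * 200 + [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]
fo, ba = forward_backward(obs, init, trans, emit)
tmrca = expected_tmrca(fo, ba, tc)
```

In this toy example the long run of identical sites yields a young expected TMRCA, while the cluster of differences at the end pulls the expectation towards the older hidden state.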

Sequence data of Arabidopsis thaliana

We download genome and methylome data of A. thaliana from the 1001 genome project [12]. We select 10 German accessions with the following accession numbers: 9783, 9794, 9808, 9809, 9810, 9811, 9812, 9816, 9813, 9814. We only keep methylome data in CG context and in genic regions [74, 18]. The genic regions are based on the current reference genome TAIR 10.1. The SNPs and epimutations are called according to previously published pipelines [69, 18]. As in previous studies [63, 22, 19], we assume A. thaliana data to be haploid due to high homozygosity (caused by a high selfing rate). The resulting files are available on GitHub at https://github.com/TPPSellinger. For the analysis, we choose µ = 6.95 × 10⁻⁹ per generation per bp as the DNA mutation rate [52] and r = 3.6 × 10⁻⁸ per generation per bp as the recombination rate [56]. In order to have the most realistic model, we assume that the methylome of A. thaliana undergoes both region-level (RMM) and site-level (SMM) epimutations [18]. When fixed, the site methylation and demethylation rates are set to µ_SM = 3.48 × 10⁻⁴ and µ_SU = 1.47 × 10⁻³ per generation per bp respectively, according to [73]. We additionally set the region-level methylation and demethylation rates to µ_RM = 1.6 × 10⁻⁴ and µ_RU = 9.5 × 10⁻⁴ per generation per bp, according to [18]. Because we do not account for the effect of variable mutation or recombination rates along the genome, we cut the five chromosomes of A. thaliana into eight smaller scaffolds [4, 5]. By doing this, we remove centromeric regions and limit the effect of the variation of mutation and recombination rates along the genome. The selected regions and the SNP density (from the German accessions) are represented in Supplementary Figures 11 to 15.

Supporting information

SI Figures and Tables

Data Availability

The eSMC2 R package can be found at: https://github.com/TPPSellinger/eSMC2. The input files created from Arabidopsis thaliana sequence data are available on GitHub at: https://github.com/TPPSellinger/Arabidopsis_thaliana_methylation.

Acknowledgements

We thank Zhilin Zhang and Rashmi Hazarika for providing and processing the Arabidopsis thaliana data. TS is supported by the Deutsche Forschungsgemeinschaft, project number 317616126 (TE809/7-1) to AT, and the Austrian Science Fund (project no. TAI 151-B) to Anja Hörger.

References

[1] Albers P. K., McVean G. (2020). Dating genomic variants and shared ancestry in population-scale sequencing data. PLOS Biology 18. https://doi.org/10.1371/journal.pbio.3000586
[2] Alonso-Blanco C., Andrade J., Becker C., Bemm F., Bergelson J., Borgwardt K. M., Cao J., Chae E., Dezwaan T. M., Ding W., et al. (2016). 1,135 genomes reveal the global pattern of polymorphism in Arabidopsis thaliana. Cell 166:481–491.
[3] Anzai T., Shiina T., Kimura N., Yanagiya K., Kohara S., Shigenari A., Yamagata T., Kulski J. K., Naruse T. K., Fujimori Y., et al. (2003). Comparative sequencing of human and chimpanzee MHC class I regions unveils insertions/deletions as the major path to genomic divergence. Proceedings of the National Academy of Sciences 100:7708–7713.
[4] Barroso G. V., Dutheil J. Y. (2021). Mutation rate variation shapes genome-wide diversity in Drosophila melanogaster. https://doi.org/10.1101/2021.09.16.460667
[5] Barroso G. V., Puzovic N., Dutheil J. Y. (2019). Inference of recombination maps from a single pair of genomes and its application to ancient samples. PLOS Genetics 15. https://doi.org/10.1371/journal.pgen.1008449
[6] Baumdicker F., Bisschop G., Goldstein D., Gower G., Ragsdale A. P., Tsambos G., Zhu S., Eldon B., Ellerman E. C., Galloway J. G., Gladstein A. L., Gorjanc G., Guo B., Jeffery B., Kretzschumar W. W., Lohse K., Matschiner M., Nelson D., Pope N. S., Quinto-Cortes C. D., Rodrigues M. F., Saunack K., Sellinger T., Thornton K., van Kemenade H., Wohns A. W., Wong Y., Gravel S., Kern A. D., Koskela J., Ralph P. L., Kelleher J. (2022). Efficient ancestry and mutation simulation with msprime 1.0. Genetics 220. https://doi.org/10.1093/genetics/iyab229
[7] Beichman A. C., Huerta-Sanchez E., Lohmueller K. E. (2018). In: Futuyma D. J. (ed.), Annual Review of Ecology, Evolution, and Systematics 49, pp. 433–456. https://doi.org/10.1146/annurev-ecolsys-110617-062431
[8] Bisschop G., Lohse K., Setter D. (2021). Sweeps in time: leveraging the joint distribution of branch lengths. Genetics 219. https://doi.org/10.1093/genetics/iyab119
[9] Boitard S., Rodríguez W., Jay F., Mona S., Austerlitz F. (2016). Inferring population size history from large samples of genome-wide molecular data - an approximate Bayesian computation approach. PLOS Genetics 12:e1005877. https://doi.org/10.1371/journal.pgen.1005877
[10] Brandt D. Y. C., Wei X., Deng Y., Vaughn A. H., Nielsen R. (2022). Evaluation of methods for estimating coalescence times using ancestral recombination graphs. Genetics 221. https://doi.org/10.1093/genetics/iyac044
[11] Briffa A., Hollwey E., Shahzad Z., Moore J. D., Lyons D. B., Howard M., Zilberman D. (2023). Millennia-long epigenetic fluctuations generate intragenic DNA methylation variance in Arabidopsis populations. Cell Systems.
[12] Cao J., Schneeberger K., Ossowski S., Günther T., Bender S., Fitz J., Koenig D., Lanz C., Stegle O., Lippert C., et al. (2011). Whole-genome sequencing of multiple Arabidopsis thaliana populations. Nature Genetics 43:956–963. https://doi.org/10.1038/ng.911
[13] Charlesworth B., Charlesworth D. (2010). Elements of evolutionary genetics.
[14] Charlesworth B., Jain K. (2014). Purifying selection, drift, and reversible mutation with arbitrarily high mutation rates. Genetics 198:1587. https://doi.org/10.1534/genetics.114.167973
[15] Charlesworth B., Jensen J. D. (2023). Population genetic considerations regarding evidence for biased mutation rates in Arabidopsis thaliana. Molecular Biology and Evolution 40:msac275.
[16] Cokus S. J., Feng S., Zhang X., Chen Z., Merriman B., Haudenschild C. D., Pradhan S., Nelson S. F., Pellegrini M., Jacobsen S. E. (2008). Shotgun bisulphite sequencing of the Arabidopsis genome reveals DNA methylation patterning. Nature 452:215–219.
[17] Deng Y., Song Y. S., Nielsen R. (2021). The distribution of waiting distances in ancestral recombination graphs. Theoretical Population Biology 141:34–43. https://doi.org/10.1016/j.tpb.2021.06.003
[18] Denkena J., Johannes F., Colome-Tatche M. (2021). Region-level epimutation rates in Arabidopsis thaliana. Heredity 127:190–202. https://doi.org/10.1038/s41437-021-00441-w
[19] Durvasula A., Fulgione A., Gutaker R. M., Alacakaptan S. I., Flood P. J., Neto C., Tsuchimatsu T., Burbano H. A., Picó F. X., Alonso-Blanco C., et al. (2017). African genomes illuminate the early history and transition to selfing in Arabidopsis thaliana. Proceedings of the National Academy of Sciences 114:5213–5218.
[20] Estoup A., Jarne P., Cornuet J.-M. (2002). Homoplasy and mutation model at microsatellite loci and their consequences for population genetics analysis. Molecular Ecology 11:1591–1604.
[21] François O., Blum M. G. B., Jakobsson M., Rosenberg N. A. (2008). Demographic history of European populations of Arabidopsis thaliana. PLOS Genetics 4:1–15. https://doi.org/10.1371/journal.pgen.1000075
[22] Fulgione A., Koornneef M., Roux F., Hermisson J., Hancock A. M. (2018). Madeiran Arabidopsis thaliana reveals ancient long-range colonization and clarifies demography in Eurasia. Molecular Biology and Evolution 35:564–574. https://doi.org/10.1093/molbev/msx300
[23] Gattepaille L., Guenther T., Jakobsson M. (2016). Inferring past effective population size from distributions of coalescent times. Genetics 204:1191. https://doi.org/10.1534/genetics.115.185058
[24] Gattepaille L. M., Jakobsson M., Blum M. G. B. (2013). Inferring population size changes with sequence and SNP data: lessons from human bottlenecks. Heredity 110:409–419. https://doi.org/10.1038/hdy.2012.120
[25] Hazarika R. R., Serra M., Zhang Z., Zhang Y., Schmitz R. J., Johannes F. (2022). Molecular properties of epimutation hotspots. Nature Plants 8:146–156.
[26] Hubisz M. J., Williams A. L., Siepel A. (2020). Mapping gene flow between ancient hominins through demography-aware inference of the ancestral recombination graph. PLOS Genetics 16. https://doi.org/10.1371/journal.pgen.1008895
[27] Hudson R. (1983). Properties of a neutral allele model with intragenic recombination. Theoretical Population Biology 23:183–201. https://doi.org/10.1016/0040-5809(83)90013-8
[28] Johannes F. (2019). DNA methylation makes mutational history. Nature Plants 5:772–773. https://doi.org/10.1038/s41477-019-0491-z
[29] Johannes F., Schmitz R. J. (2019). Spontaneous epimutations in plants. New Phytologist 221:1253–1259.
[30] Johri P., Charlesworth B., Jensen J. D. (2020). Toward an evolutionarily appropriate null model: jointly inferring demography and purifying selection. Genetics 215:173–192. https://doi.org/10.1534/genetics.119.303002
[31] Johri P., Riall K., Becher H., Excoffier L., Charlesworth B., Jensen J. D. (2021). The impact of purifying and background selection on the inference of population history: problems and prospects. Molecular Biology and Evolution 38:2986–3003. https://doi.org/10.1093/molbev/msab050
[32] Johri P., Aquadro C. F., Beaumont M., Charlesworth B., Excoffier L., Eyre-Walker A., Keightley P. D., Lynch M., McVean G., Payseur B. A., Pfeifer S. P., Stephan W., Jensen J. D. (2022). Recommendations for improving statistical inference in population genomics. PLOS Biology 20:e3001669. https://doi.org/10.1371/journal.pbio.3001669
[33] Kelleher J., Etheridge A. M., McVean G. (2016). Efficient coalescent simulation and genealogical analysis for large sample sizes. PLOS Computational Biology 12. https://doi.org/10.1371/journal.pcbi.1004842
[34] Kelleher J., Wong Y., Wohns A. W., Fadil C., Albers P. K., McVean G. (2019). Inferring whole-genome histories in large population datasets (vol 51, pg 1330, 2019). Nature Genetics 51:1660. https://doi.org/10.1038/s41588-019-0523-7
[35] Ki C., Terhorst J. (2020). Exact decoding of the sequentially Markov coalescent. https://doi.org/10.1101/2020.09.21.307355
[36] Kingman J. (1982). The coalescent. Stochastic Processes and their Applications 13.
[37] Korfmann K., Sellinger T. P. P., Freund F., Fumagalli M., Tellier A. (2022). Simultaneous inference of past demography and selection from the ancestral recombination graph under the beta coalescent. bioRxiv.
[38] Korfmann K., Gaggiotti O. E., Fumagalli M. (2023). Deep learning in population genetics. Genome Biology and Evolution 15. https://doi.org/10.1093/gbe/evad008
[39] Lang D., Zhang S., Ren P., Liang F., Sun Z., Meng G., Tan Y., Li X., Lai Q., Han L., Wang D., Hu F., Wang W., Liu S. (2020). Comparison of the two up-to-date sequencing technologies for genome assembly: HiFi reads of Pacific Biosciences Sequel II system and ultralong reads of Oxford Nanopore. GigaScience 9. https://doi.org/10.1093/gigascience/giaa123
[40] Li H., Durbin R. (2011). Inference of human population history from individual whole-genome sequences. Nature 475(7357). https://doi.org/10.1038/nature10231
[41] Lister R., O’Malley R. C., Tonti-Filippini J., Gregory B. D., Berry C. C., Millar A. H., Ecker J. R. (2008). Highly integrated single-base resolution maps of the epigenome in Arabidopsis. Cell 133:523–536. https://doi.org/10.1016/j.cell.2008.03.029
[42] Lyons D. B., Briffa A., He S., Choi J., Hollwey E., Colicchio J., Anderson I., Feng X., Howard M., Zilberman D. (2022). Extensive de novo activity stabilizes epigenetic inheritance of CG methylation in Arabidopsis transposons. bioRxiv. https://doi.org/10.1101/2022.04.19.488736
[43] Lyons D. B., Briffa A., He S., Choi J., Hollwey E., Colicchio J., Anderson I., Feng X., Howard M., Zilberman D. (2023). Extensive de novo activity stabilizes epigenetic inheritance of CG methylation in Arabidopsis transposons. Cell Reports 42.
[44] Mahmoudi A., Koskela J., Kelleher J., Chan Y.-B., Balding D. (2022). Bayesian inference of ancestral recombination graphs. PLOS Computational Biology 18:e1009960.
[45] Malaspinas A.-S., Westaway M. C., Muller C., Sousa V. C., Lao O., Alves I., Bergström A., Athanasiadis G., Cheng J. Y., Crawford J. E., et al. (2016). A genomic history of Aboriginal Australia. Nature 538:207–214.
[46] Marjoram P., Wall J. (2006). Fast “coalescent” simulation. BMC Genetics 7. https://doi.org/10.1186/1471-2156-7-16
[47] McVean G., Cardin N. (2005). Approximating the coalescent with recombination. Philosophical Transactions of the Royal Society B: Biological Sciences 360(1459). https://doi.org/10.1098/rstb.2005.1673
[48] Monroe J. G., Srikant T., Carbonell-Bejerano P., Becker C., Lensink M., Exposito-Alonso M., Klein M., Hildebrandt J., Neumann M., Kliebenstein D., Weng M.-L., Imbert E., Agren J., Rutter M. T., Fenster C. B., Weigel D. (2022). Mutation bias reflects natural selection in Arabidopsis thaliana. Nature 602(7895):101. https://doi.org/10.1038/s41586-021-04269-6
[49] Muyle A., Ross-Ibarra J., Seymour D. K., Gaut B. S. (2021). Gene body methylation is under selection in Arabidopsis thaliana. Genetics 218:iyab061.
[50] Nordborg M. (2000). Linkage disequilibrium, gene trees and selfing: an ancestral recombination graph with partial self-fertilization. Genetics 154:923–929.
[51] Nurk S., Walenz B. P., Rhie A., Vollger M. R., Logsdon G. A., Grothe R., Miga K. H., Eichler E. E., Phillippy A. M., Koren S. (2020). HiCanu: accurate assembly of segmental duplications, satellites, and allelic variants from high-fidelity long reads. Genome Research 30:1291–1305. https://doi.org/10.1101/gr.263566.120
[52] Ossowski S., Schneeberger K., Lucas-Lledo J. I., Warthmann N., Clark R. M., Shaw R. G., Weigel D., Lynch M. (2010). The rate and molecular spectrum of spontaneous mutations in Arabidopsis thaliana. Science 327:92–94. https://doi.org/10.1126/science.1180677
[53] Ou S., Su W., Liao Y., Chougule K., Agda J. R. A., Hellinga A. J., Lugo C. S. B., Elliott T. A., Ware D., Peterson T., Jiang N., Hirsch C. N., Hufford M. B. (2019). Benchmarking transposable element annotation methods for creation of a streamlined, comprehensive pipeline. Genome Biology 20. https://doi.org/10.1186/s13059-019-1905-y
[54] Pisupati R., Nizhynska V., Mollá Morales A., Nordborg M. (2023). On the causes of gene-body methylation variation in Arabidopsis thaliana. PLOS Genetics 19:e1010728.
[55] Rodriguez W., Mazet O., Grusea S., Arredondo A., Corujo J. M., Boitard S., Chikhi L. (2018). The IICR and the non-stationary structured coalescent: towards demographic inference with arbitrary changes in population structure. Heredity 121:663–678. https://doi.org/10.1038/s41437-018-0148-0
[56] Salome P. A., Bomblies K., Fitz J., Laitinen R. A. E., Warthmann N., Yant L., Weigel D. (2012). The recombination landscape in Arabidopsis thaliana F-2 populations. Heredity 108:447–455. https://doi.org/10.1038/hdy.2011.95
[57] Sand A., Kristiansen M., Pedersen C. N. S., Mailund T. (2013). zipHMMlib: a highly optimised HMM library exploiting repetitions in the input to speed up the forward algorithm. BMC Bioinformatics 14. https://doi.org/10.1186/1471-2105-14-339
[58] Schiffels S., Durbin R. (2014). Inferring human population size and separation history from multiple genome sequences. Nature Genetics 46:919–925. https://doi.org/10.1038/ng.3015
[59] Schmid M. W., Heichinger C., Coman Schmid D., Guthörl D., Gagliardini V., Bruggmann R., Aluri S., Aquino C., Schmid B., Turnbull L. A., et al. (2018). Contribution of epigenetic variation to adaptation in Arabidopsis. Nature Communications 9:1–12.
[60] Schmitz R. J., Schultz M. D., Urich M. A., Nery J. R., Pelizzola M., Libiger O., Alix A., McCosh R. B., Chen H., Schork N. J., et al. (2013). Patterns of population epigenomic diversity. Nature 495:193–198.
[61] Schraiber J. G., Akey J. M. (2015). Methods and models for unravelling human evolutionary history. Nature Reviews Genetics 16:727–740. https://doi.org/10.1038/nrg4005
[62] Schweiger R., Durbin R. (2023). Ultra-fast genome-wide inference of pairwise coalescence times. bioRxiv.
[63] Sellinger T. P. P., Abu Awad D., Moest M., Tellier A. (2020). Inference of past demography, dormancy and self-fertilization rates from whole genome sequence data. PLOS Genetics 16. https://doi.org/10.1371/journal.pgen.1008698
[64] Sellinger T. P. P., Abu-Awad D., Tellier A. (2021). Limits and convergence properties of the sequentially Markovian coalescent. Molecular Ecology Resources 21:2231–2248. https://doi.org/10.1111/1755-0998.13416
[65] Shahryary Y., Symeonidi A., Hazarika R. R., Denkena J., Mubeen T., Hofmeister B., van Gurp T., Colome-Tatche M., Verhoeven K. J. F., Tuskan G., Schmitz R. J., Johannes F. (2020). AlphaBeta: computational inference of epimutation rates and spectra from high-throughput DNA methylation data in plants. Genome Biology 21. https://doi.org/10.1186/s13059-020-02161-6
[66] Speidel L., Forest M., Shi S., Myers S. R. (2019). A method for genome-wide genealogy estimation for thousands of samples. Nature Genetics 51:1321. https://doi.org/10.1038/s41588-019-0484-x
[67] Srikant T., Drost H.-G. (2021). How stress facilitates phenotypic innovation through epigenetic diversity. Frontiers in Plant Science 11:606800.
[68] Strütt S., Sellinger T., Glémin S., Tellier A., Laurent S. (2023). Joint inference of evolutionary transitions to self-fertilization and demographic history using whole-genome sequences. eLife 12:e82384.
[69] Taudt A., Roquis D., Vidalis A., Wardenaar R., Johannes F., Colome-Tatche M. (2018). METHimpute: imputation-guided construction of complete methylomes from WGBS data. BMC Genomics 19. https://doi.org/10.1186/s12864-018-4641-x
[70] Tellier A., Laurent S. J. Y., Lainer H., Pavlidis P., Stephan W. (2011). Inference of seed bank parameters in two wild tomato species using ecological and genetic data. Proceedings of the National Academy of Sciences of the United States of America 108:17052–17057. https://doi.org/10.1073/pnas.1111266108
[71] Terhorst J., Kamm J. A., Song Y. S. (2017). Robust and scalable inference of population history from hundreds of unphased whole genomes. Nature Genetics 49:303–309. https://doi.org/10.1038/ng.3748
[72] Upadhya G., Steinrücken M. (2021). Robust inference of population size histories from genomic sequencing data. https://doi.org/10.1101/2021.05.22.445274
[73] van der Graaf et al. (2015). Rate, spectrum, and evolutionary dynamics of spontaneous epimutations. Proceedings of the National Academy of Sciences of the United States of America 112:6676–6681. https://doi.org/10.1073/pnas.1424254112
[74] Vidalis A., Zivkovic D., Wardenaar R., Roquis D., Tellier A., Johannes F. (2016). Methylome evolution in plants. Genome Biology 17. https://doi.org/10.1186/s13059-016-1127-5
[75] Wakeley J. (2008). Coalescent theory: an introduction. Roberts and Company, Greenwood Village.
[76] Wang C., Liang C. (2018). MSIpred: a Python package for tumor microsatellite instability classification from tumor mutation annotation data using a support vector machine. Scientific Reports 8. https://doi.org/10.1038/s41598-018-35682-z
[77] Wang J., Fan C. (2015). A neutrality test for detecting selection on DNA methylation using single methylation polymorphism frequency spectrum. Genome Biology and Evolution 7:154–171. https://doi.org/10.1093/gbe/evu271
[78] Weigel D., Colot V. (2012). Epialleles in plant evolution. Genome Biology 13:1–6.
[79] Wiuf C., Hein J. (1999). Recombination as a point process along sequences. Theoretical Population Biology 55:248–259. https://doi.org/10.1006/tpbi.1998.1403
[80] Wohns A. W., Wong Y., Jeffery B., Akbari A., Mallick S., Pinhasi R., Patterson N., Reich D., Kelleher J., McVean G. (2022). A unified genealogy of modern and ancient genomes. Science 375:836. https://doi.org/10.1126/science.abi8264
[81] Yang R., Van Etten J. L., Dehm S. M. (2018). Indel detection from DNA and RNA sequencing data with transIndel. BMC Genomics 19. https://doi.org/10.1186/s12864-018-4671-4
[82] Yang Z. (1996). Statistical properties of a DNA sample under the finite-sites model. Genetics 144:1941–1950.
[83] Yao N., Schmitz R. J., Johannes F. (2021). Epimutations define a fast-ticking molecular clock in plants. Trends in Genetics 37:699–710.
[84] Yao N., Zhang Z., Yu L., Hazarika R., Yu C., Jang H., Smith L. M., Ton J., Liu L., Stachowicz J. J., Reusch T. B. H., Schmitz R. J., Johannes F. (2023). An evolutionary epigenetic clock in plants. Science 381:1440–1445.
[85] Zhang X., Yazaki J., Sundaresan A., Cokus S., Chan S. W.-L., Chen H., Henderson I. R., Shinn P., Pellegrini M., Jacobsen S. E., et al. (2006). Genome-wide high-resolution mapping and functional analysis of DNA methylation in Arabidopsis. Cell 126:1189–1201.
[86] Zhang Y., Wang S., Wang X. (2018). Data-driven-based approach to identifying differentially methylated regions using modified 1D Ising model. BioMed Research International 2018. https://doi.org/10.1155/2018/1070645
[87] Zilberman D., Gehring M., Tran R. K., Ballinger T., Henikoff S. (2007). Genome-wide analysis of Arabidopsis thaliana DNA methylation uncovers an interdependence between methylation and transcription. Nature Genetics 39:61–69. https://doi.org/10.1038/ng1929

Article and author information

Author information

Thibaut Sellinger
Department of Environment and Biodiversity, Paris Lodron University of Salzburg, Professorship for Population Genetics, Department of Life Science Systems, Technical University of Munich
Frank Johannes
Professorship for Plant Epigenomics, Department of Molecular Life Sciences, Technical University of Munich
Aurélien Tellier
Professorship for Population Genetics, Department of Life Science Systems, Technical University of Munich
ORCID iD: 0000-0002-8895-0785
- Corresponding author, aurelien.tellier@tum.de

Version history

Preprint posted: May 16, 2023
Sent for peer review: May 19, 2023
Reviewed Preprint version 1: August 24, 2023
Reviewed Preprint version 2: February 27, 2024
Reviewed Preprint version 3: August 5, 2024
Version of Record published: September 12, 2024

Cite all versions

You can cite all versions using the DOI https://doi.org/10.7554/eLife.89470. This DOI represents all versions, and will always resolve to the latest one.

Copyright

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Revised: This Reviewed Preprint has been revised by the authors in response to the previous round of peer review; the eLife assessment and the public reviews have been updated where necessary by the editors and peer reviewers.

Reviewing Editor
Vincent Castric
Université de Lille, Lille, France
Senior Editor
Yamini Dalal
National Cancer Institute, Bethesda, United States of America

Reviewer #1 (Public Review):

The authors developed an extension to the pairwise sequentially Markov coalescent model that allows multiple types of polymorphism data to be analyzed simultaneously. In this paper, they focus on SNPs and DNA methylation data. Since methylation markers mutate at a much faster rate than SNPs, this potentially gives the method better power to infer size history in the recent past. Additionally, they explored a model where there are both local and regional epimutational processes.

Integrating additional types of heritable markers into SMC is a nice idea which I like in principle. However, a major caveat to this approach seems to be a strong dependence on knowing the epimutation rate. In Fig. 6 it is seen that, when the epimutation rate is known, inferences do indeed look better; but this is not necessarily true when the rate is not known. (See also major comment #1 below about the interpretation of these plots.) A roughly similar pattern emerges in Supp. Figs. 4-7; in general, results when the rates have to be estimated don't seem that much better than when focusing on SNPs alone. This carries over to the real data analysis too: the interpretation in Fig. 7 appears to hinge on whether the rates are known or estimated, and the estimated rates differ by a large amount from earlier published ones.

Overall, this is an interesting research direction, and I think the method may hold more promise as we get more and better epigenetic data, and in particular better knowledge of the epigenetic mutational process.

https://doi.org/10.7554/eLife.89470.3.sa2

Reviewer #2 (Public Review):

A limitation in using SNPs to understand recent histories of genomes is their low mutation frequency. Tellier et al. explore the possibility of adding hypermutable markers to SNP based methods for better resolution over short time frames. In particular, they hypothesize that epimutations (CG methylation and demethylation) could provide a useful marker for this purpose. Individual CGs in Arabidopsis tend to be either close to 100% methylated or close to 0%, and are inherited stably enough across generations that they can be treated as genetic markers. Small regions containing multiple CGs can also be treated as genetic markers based on their cumulative methylation level. In this manuscript, Tellier et al. develop computational methods to use CG methylation as a hypermutable genetic marker and test them on theoretical and real data sets. They do this both for individual CGs and small regions. My review is limited to the simple question of whether using CG methylation for this purpose makes sense at a conceptual level, not at the level of evaluating specific details of the methods. I have a small concern in that it is not clear that CG methylation measurements are nearly as binary in other plants and other eukaryotes as they are in Arabidopsis. However, I see no reason why this work is not conceptually sound. Especially in the future as new sequencing technologies provide both base calling and methylation calling capabilities, using CG methylation in addition to SNPs could become a useful and feasible tool for population genetics in situations where SNPs are insufficient.

https://doi.org/10.7554/eLife.89470.3.sa1

Author response:

The following is the authors’ response to the previous reviews.

Public Reviews:

Reviewer #1 (Public Review):

The authors developed an extension to the pairwise sequentially Markov coalescent model that allows multiple types of polymorphism data to be analyzed simultaneously. In this paper, they focus on SNPs and DNA methylation data. Since methylation markers mutate at a much faster rate than SNPs, this potentially gives the method better power to infer size history in the recent past. Additionally, they explored a model where there are both local and regional epimutational processes. Integrating additional types of heritable markers into SMC is a nice idea which I like in principle. However, a major caveat to this approach seems to be a strong dependence on knowing the epimutation rate. In Fig. 6 it is seen that, when the epimutation rate is known, inferences do indeed look better; but this is not necessarily true when the rate is not known. (See also major comment #1 below about the interpretation of these plots.) A roughly similar pattern emerges in Supp. Figs. 4-7; in general, results when the rates have to be estimated don't seem that much better than when focusing on SNPs alone. This carries over to the real data analysis too: the interpretation in Fig. 7 appears to hinge on whether the rates are known or estimated, and the estimated rates differ by a large amount from earlier published ones.

Overall, this is an interesting research direction, and I think the method may hold more promise as we get more and better epigenetic data, and in particular better knowledge of the epigenetic mutational process. At the same time, I would be careful about placing too much emphasis on new findings that emerge solely by switching to SNP+SMP analysis.

Major comments:

- For all of the simulated demographic inference results, only plots are presented. This allows for qualitative but not quantitative comparisons to be made across different methods. It is not easy to tell which result is actually better. For example, in Supp. Fig. 5, eSMC2 seems slightly better in the ancient past, and times the trough more effectively, while SMCm seems a bit better in the very recent past. For a more rigorous approach, it would be useful to have accompanying tables that measure e.g. mean-squared error (along with confidence intervals) for each of the different scenarios, similar to what is already done in Tables 1 and 2 for estimating $r$.

We believe this comment was addressed in the previous revision (Supp. Tables 6-10) by adding root mean square errors (RMSE) for the demographic estimates, including separate RMSE values for the recent and past portions of the demography.
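
As a rough illustration of the kind of quantitative summary requested in this comment, a root mean square error between a true and an estimated population size trajectory could be computed as sketched below. This is not the procedure used for the supplementary tables: the function names, the shared time grid, the optional log10 scale and the split index separating the recent and past portions are all assumptions made for illustration.

```python
import math

def rmse(true_values, estimated_values, log10_scale=True):
    """Root mean square error between two equally long sequences.

    With log10_scale=True, each error is a difference of log10 values, so a
    tenfold over- or underestimate contributes an error of 1.
    """
    if len(true_values) != len(estimated_values) or not true_values:
        raise ValueError("sequences must be non-empty and of equal length")
    squared_errors = []
    for truth, estimate in zip(true_values, estimated_values):
        if log10_scale:
            error = math.log10(estimate) - math.log10(truth)
        else:
            error = estimate - truth
        squared_errors.append(error * error)
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def rmse_recent_and_past(true_values, estimated_values, split_index):
    """RMSE computed separately before and after a (hypothetical) split index."""
    recent = rmse(true_values[:split_index], estimated_values[:split_index])
    past = rmse(true_values[split_index:], estimated_values[split_index:])
    return recent, past

# Hypothetical population sizes on a shared time grid, ordered recent to ancient.
true_sizes = [1000, 1000, 10000, 10000]
estimated_sizes = [100, 100, 10000, 10000]

overall = rmse(true_sizes, estimated_sizes)
recent, past = rmse_recent_and_past(true_sizes, estimated_sizes, split_index=2)
```

In this toy input the estimate is off by a factor of ten in the recent half only, so the recent RMSE is larger than the past RMSE, which is the kind of contrast a per-period table would make visible.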

- 434: The discussion downplays the really odd result that inputting the true value of the mutation rate, in some cases, produces much worse estimates than when they are learned from data (SFig. 6)! I can't think of any reason why this should happen other than some sort of mathematical error or software bug. I strongly encourage the authors to pin down the cause of this puzzling behaviour. (Comment addressed in revision. Still, I find the explanation added at 449ff to be somewhat puzzling -- shouldn't the results of the regional HMM scan only improve if the true mutation rate is given?)

We understand that our results and explanation can appear counter-intuitive. As acknowledged by the reviewer, in the previous round of revision we explained this puzzling behaviour at length: it arises from a discrepancy between the HMM used to assess methylated regions and the HMM used for the SMC inference. We are happy to clarify further in response to the new question from Reviewer #1:

If Reviewer #1 means the SNP mutations (e.g. A → T), knowing the true mutation rate does not help the HMM to recover the region-level methylation status.

If Reviewer #1 means the epimutations (whether at the region level, the site level or both), knowing the true epimutation rates could theoretically help the HMM to recover the region-level methylation status. However, at present, our method does not leverage information from epimutation rates to infer the region-level methylation status. Because inferring the epimutation rates is one of the goals of the SMC inference in this study, and because the region-level methylation status is required to infer those rates, we suspect that using epimutation rates to infer the region-level methylation status could be statistically inappropriate (generating a form of circular estimation). Instead, our HMM uses only the proportion of methylated and unmethylated sites (estimated from the genome) to determine whether a region is most likely to be methylated or unmethylated. We now state this explicitly in the description of the HMM for methylated regions in the Methods section.

We acknowledge that our HMM to infer region-level methylation status could be improved, but this would be a complete project and study on its own (due to the underlying complexity of the finite site model and the lack of a consensus model for epimutations at the evolutionary time scale). We believe our HMM was the best compromise between what was known about methylation and our goals when the study was conducted, and future work on the estimation of methylated regions is definitely worth conducting.

- As noted at 580, all of the added power from integrating SMPs/DMRs should come from improved estimation of recent TMRCAs. So, another way to study how much improvement there is would be to look at the true vs. estimated/posterior TMRCAs. Although I agree that demographic inference is ultimately the most relevant task, comparing TMRCA inference would eliminate other sources of differences between the methods (different optimization schemes, algorithmic/numerical quirks, and so forth). This could be a useful addition, and may also give you more insight into why the augmented SMC methods do worse in some cases. (Comment addressed in revision via Supp. Table 7.)

- A general remark on the derivations in Section 2 of the supplement: I checked these formulas as best I could. But a cleaner, less tedious way of calculating these probabilities would be to express the mutation processes as continuous time Markov chains. Then all that is needed is to specify the rate matrices; computing the emission probabilities needed for the SMC methods reduces to manipulating the results of some matrix exponentials. In fact, because the processes are noninteracting, the rate matrix decomposes into a Kronecker sum of the individual rate matrices for each process, which is very easy to code up. And this structure can be exploited when computing the matrix exponential, if speed is an issue.

We believe this comment was acknowledged in the previous revision (line 649), and we thank the reviewer for this interesting insight.
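
The continuous-time Markov chain formulation suggested in this comment can be sketched as follows. This is a minimal illustration under stated assumptions, not the derivation used in the supplement: two independent two-state processes stand in for a site-level and a region-level marker, the rates and the elapsed time are hypothetical values, and all function names are invented for this example. The matrix exponential is computed with a NumPy eigendecomposition to keep the sketch dependency-light; a general-purpose routine such as `scipy.linalg.expm` would be the more usual choice.

```python
import numpy as np

def two_state_rate_matrix(gain_rate, loss_rate):
    """Rate matrix of a two-state chain: 0 -> 1 at gain_rate, 1 -> 0 at loss_rate."""
    return np.array([[-gain_rate, gain_rate],
                     [loss_rate, -loss_rate]])

def kronecker_sum(a, b):
    """Rate matrix of two non-interacting chains run jointly.

    Joint state index is (state of a) * b.shape[0] + (state of b).
    """
    return np.kron(a, np.eye(b.shape[0])) + np.kron(np.eye(a.shape[0]), b)

def transition_matrix(rate_matrix, time):
    """exp(rate_matrix * time) via eigendecomposition (assumes a diagonalizable matrix)."""
    eigenvalues, eigenvectors = np.linalg.eig(rate_matrix)
    # Broadcasting scales column j of the eigenvector matrix by exp(lambda_j * time).
    scaled = eigenvectors * np.exp(eigenvalues * time)
    return (scaled @ np.linalg.inv(eigenvectors)).real

# Hypothetical per-generation rates and elapsed time (illustrative values only).
Q_site = two_state_rate_matrix(3e-4, 1e-3)
Q_region = two_state_rate_matrix(1e-4, 5e-4)
elapsed = 1000.0

# Transition probabilities of the joint process, computed two ways.
P_joint = transition_matrix(kronecker_sum(Q_site, Q_region), elapsed)
P_factored = np.kron(transition_matrix(Q_site, elapsed),
                     transition_matrix(Q_region, elapsed))
```

Because the two terms of the Kronecker sum commute, the exponential of the joint rate matrix equals the Kronecker product of the individual exponentials, so `P_joint` and `P_factored` agree up to numerical error. The factored form only requires exponentials of the small matrices, which is the structure the reviewer suggests exploiting when speed is an issue.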

- Most (all?) of the SNP-only SMC methods allow for binning together consecutive observations to cut down on computation time. I did not see binning mentioned anywhere, did you consider it? If the method really processes every site, how long does it take to run?

We believe this comment was addressed in the previous revision and was added to the manuscript in the Methods section (subsection: SMC optimization function).

- 486: The assumed site and region (de)methylation rates listed here are several OOM different from what your method estimated (Supp. Tables 5-6). Yet, on simulated data your method is usually correct to within an order of magnitude (Supp. Table 4). How are we to interpret this much larger difference between the published estimates and yours? If the published estimates are not reliable, doesn't that call into question your interpretation of the blue line in Fig. 7 at 533? (Comment addressed in revision.)

Reviewer #2 (Public Review):

A limitation in using SNPs to understand recent histories of genomes is their low mutation frequency. Tellier et al. explore the possibility of adding hypermutable markers to SNP based methods for better resolution over short time frames. In particular, they hypothesize that epimutations (CG methylation and demethylation) could provide a useful marker for this purpose. Individual CGs in Arabidopsis tend to be either close to 100% methylated or close to 0%, and are inherited stably enough across generations that they can be treated as genetic markers. Small regions containing multiple CGs can also be treated as genetic markers based on their cumulative methylation level. In this manuscript, Tellier et al. develop computational methods to use CG methylation as a hypermutable genetic marker and test them on theoretical and real data sets. They do this both for individual CGs and small regions. My review is limited to the simple question of whether using CG methylation for this purpose makes sense at a conceptual level, not at the level of evaluating specific details of the methods. I have a small concern in that it is not clear that CG methylation measurements are nearly as binary in other plants and other eukaryotes as they are in Arabidopsis. However, I see no reason why this work is not conceptually sound. Especially in the future as new sequencing technologies provide both base calling and methylation calling capabilities, using CG methylation in addition to SNPs could become a useful and feasible tool for population genetics in situations where SNPs are insufficient.

We again thank Reviewer #2 for their positive comments.

Reviewer #3 (Public Review):

I very much like this approach and the idea of incorporating hypervariable markers. The method is intriguing, and the ability to e.g. estimate recombination rates, the size of DMRs, etc. is a really nice plus. I am not able to comment on the details of the statistical inference, but from what I can evaluate it seems reasonable and in principle the inclusion of highly mutable sites is a nice advance. This is an exciting new avenue for thinking about inference from genomic data. I remain a bit concerned about how well this will work in systems where much less is understood about methylation.

The authors include some good caveats about applying this approach to other systems, but I think it would be helpful to empiricists outside of thaliana or perhaps mammalian systems to be given some indication of what to watch out for. In maize, for example, there is a non-bimodal distribution of CG methylation (35% of sites are greater than 10% and less than 90%) but this may well be due to mapping issues. The authors solve many of the issues I had concerns with by using gene body methylation, but this is only briefly mentioned on line 659. I'm assuming the authors' hope is that this method will be widely used, and I think it worth providing some guidance to workers who might do so but who are not as familiar with this kind of data.

We thank Reviewer #3 for their positive comments. We agree with Reviewer #3 that the application to data requires care and that our approach needs to be carefully thought through before it is applied. Our results clearly show that methylation processes are not yet understood well enough to apply our approach as we initially (perhaps naively) designed it. Further investigations need to be conducted, and appropriate theoretical models need to be developed, before reliable results can be obtained. We hope that our discussion points this out. However, our approach, the theoretical models and the additional tools contained in this study can help researchers investigate whether or not to use different genomic markers to build a common (potentially more reliable) ancestral history. In this second revision, we enhanced the discussion by also clarifying the use of methylation from genic regions, to avoid confusion (lines 700-731).

Recommendations for the authors:

Reviewer #1 (Recommendations For The Authors):

In added Supp. Table 7, I don't think these are in log10 units as stated in the caption.

Well spotted! Indeed, the RMSE is not in log10 scale; we corrected the caption. We also added that the TMRCA used for RMSE calculations is in units of generations, to avoid potential confusion.

Reviewer #3 (Recommendations for The Authors):

I very much appreciate the authors' attention to previous questions. I would ask that a bit more is spent in the discussion on concerns/approaches empiricists should keep in mind -- I am wary of this being uncritically applied to data from non-model species. It was not clear to me, for example (only mentioned on line 659 in the discussion) that the thaliana data is only using gene-body methylation. This poses potential issues with background selection that the authors acknowledge appropriately, but also assuages many of my concerns about using genome-wide data. I think text with recommendations for data/filtering/etc or at least cautions of assumptions empiricists should be aware of would help.

We apologize for the confusion at line 659. As written in other sections of the manuscript, we meant CG sites in genic regions (and not only gene-body-methylated regions).

Due to the manuscript’s structure, the data from Arabidopsis thaliana are only described at the very end of the manuscript (line 900+). A brief description can also be found at lines 291-296, and we added a sentence in the introduction (line 128) for clarity.

We agree, however, with the comment made by Reviewer #3 concerning the application to data. In the discussion, we pointed out the risk of applying our approach to ill-understood (or ill-prepared) data and stressed the current need for studies of epimutation processes at the evolutionary time scale (i.e. at the Ne time scale) (lines 700-703).

https://doi.org/10.7554/eLife.89470.3.sa0