Nucleosome wrapping energy in CpG islands and the role of epigenetic base modifications

Rasa Giniūnaitė; Rahul Sharma; John H. Maddocks; Skirmantas Kriaučionis; Daiva Petkevičiūtė-Gerlach

doi:10.7554/eLife.98468.1

eLife assessment

This valuable simulation study proposes a new coarse-grained model to explain the effects of CpG methylation on nucleosome wrapping energy and nucleosome positioning. The evidence to support the claims in the paper looks solid, although the novelty of the findings should be discussed in connection with the previous works. This work will be of interest to the researchers working on gene regulation and mechanisms of DNA methylation.

https://doi.org/10.7554/eLife.98468.1.sa3

Significance of findings

valuable: Findings that have theoretical or practical implications for a subfield

landmark
fundamental
important
valuable
useful

Strength of evidence

solid: Methods, data and analyses broadly support the claims with only minor weaknesses

exceptional
compelling
convincing
solid
incomplete
inadequate

During the peer-review process the editor and reviewers write an eLife assessment that summarises the significance of the findings reported in the article (on a scale ranging from landmark to useful) and the strength of the evidence (on a scale ranging from exceptional to inadequate). Learn more about eLife assessments

Abstract

The majority of vertebrate promoters have a distinct DNA composition, known as a CpG island. Cytosine methylation in promoter CpG islands is associated with a substantial reduction of transcription initiation. We hypothesise that both atypical sequence composition, and epigenetic base modifications may affect the mechanical properties of DNA in CpG islands, influencing the ability of proteins to bind and initiate transcription. In this work, we model two scalar measures of the sequence-dependent propensity of DNA to wrap into nucleosomes: the energy of DNA required to assume a particular nucleosomal configuration and a measure related to the probability of linear DNA spontaneously reaching the nucleosomal configuration. We find that CpG density and modification state can alter DNA mechanics by creating states more or less compatible with nucleosome formation.

Introduction

CpG islands (CGIs) are regions in vertebrate genomes with a higher frequency of CpG dinucleotide steps Bird et al. (1985); Gardiner-Garden and Frommer (1987) than surrounding DNA. This is a reflection of the general depletion of CpGs outside CGIs, where CpGs are observed at around one fifth of the randomly expected frequency International Human Genome Sequencing Consortium (2001). Most vertebrate, including human, genes often have associated CGIs Cooper et al. (1983); Larsen et al. (1992) typically coinciding with sites of transcription initiation and likely contributing to the regulation of gene activity Deaton and Bird (2011). One way CGIs function is by attracting chromatin proteins with the CxxC domain, which recognise epigenetically unmodified CpGs and are instrumental for the establishment of characteristic chromatin modification profiles at CGIs Long et al. (2013a).

The general consensus is that the majority of CGIs are epigenetically unmodified, whereas in the regions outside CGIs most cytosines in the CpG dinucleotides are methylated Ioshikhes and Zhang (2000); Hannenhalli and Levy (2001); Bock et al. (2007); Han and Zhao (2008). Recently, Long et al. (2013b) have experimentally identified regions with non-methylated DNA in seven diverse vertebrates. They called those regions non-methylated islands (NMIs). Long et al. (2013b) demon-strated that in some instances NMIs do not coincide with computationally classified CGIs (Table 1). Furthermore, they showed that NMIs, and not CGIs, are central to the definition of gene promoters in the vertebrates that they studied.

Total number of considered sequences from different regions of the human genome. Sequence listings available at https://github.com/rginiunaite/CGI-NMI-sequences.

For understanding how CGIs and NMIs impact the local chromatin structure and contribute to gene regulation, it is important to know how DNA mechanics is influenced by its sequence and epigenetic modifications. (In this work we are solely concerned with double-stranded or dsDNA, which we therefore just hereafter refer to as DNA.) One of the widely studied properties of DNA are the sequence-dependent effects on nucleosome positioning. A nucleosome comprises 147 base pairs of DNA wrapped around the histone core, and is the elementary unit of DNA packing into chromatin. The positions and dynamics of nucleosomes contribute to DNA transcription, replication, and repair Andrews and Luger (2011); Yasuda et al. (2005); Chen et al. (2010). Various computational models have been developed for predicting nucleosome positioning based on DNA sequence Ioshikhes et al. (2006); Segal et al. (2006); Gupta et al. (2008); Struhl and Segal (2013), physical properties Gabdank et al. (2009, 2010) and deformation free energy Ruscio and Onufriev (2006a); Battistini et al. (2010); Chen et al. (2016); Eslami-Mossallam et al. (2016); Liu et al. (2018). It has been shown that methylation and hydroxymethylation change DNA mechanical properties Pérez et al. (2012); Battistini et al. (2021) and nucleosome forming affinity Buitrago et al. (2021); Choy et al. (2010); Lee and Lee (2012); Lee et al. (2015); Jimenez-Useche and Yuan (2012); Li et al. (2022). For example, Ngo et al. Ngo et al. (2016) demonstrated that methylation of DNA decreases the mechanical stability of a nucleosome, as measured by a fluorescence-force spectroscopy assay. Whereas, multiple studies reveal that DNA methylation induces a more compact and rigid nucleosome structure Choy et al. (2010); Lee and Lee (2012); Lee et al. (2015). Another computational study by Yoo et al. Yoo et al. (2021) showed that DNA methylation of CpG sites can significantly increase the bending energy.

In this work, we compute the free energy, required for DNA to reach a configuration in a nucleosome, as well as the probability density, associated with the optimal nucleosomal configuration of DNA, for ensembles of sequence fragments drawn from different regions across the human genome, and compare with analogous computations on sequence ensembles generated artificially. To model sequence-dependent DNA mechanics we use the cgNA+ model (https://cgdnaweb.epfl.ch Sharma et al. (2023); Bruin and Maddocks (2018)).

In previous work we presented a method for predicting a sequence-dependent configuration and associated free energy of DNA wrapped on a nucleosome Giniūnaitė and Petkevičiūtė-Gerlach (2022). The method is based on minimisation of the cgNA+ model free energy for a given sequence while constraining the positions of phosphates bound to the histone core. The indices and allowed positions of bound phosphates were identified from the cylindrical coordinates of 30 experimental PDB structures of nucleosomes.

In this article we use an improved version of this method to explore the differences in nucleosome wrapping energies and the probability densities for nucleosomal configurations between sequences drawn from inside and outside both CGIs and NMIs. We first show that the nucleosome wrapping energy increases with increasing concentration of CpG dinucleotide steps only when the cytosines in those steps is methylated or hydroxymethylated. Then we investigate intersections and disjunctions of CGI and NMI regions and demonstrate that the intersection of these two sequence ensembles ensures the lowest probability densities of nucleosomal configurations. We also show that the probability densities of nucleosomal configurations decrease with increasing CpG numbers. Finally we investigate the relation between wrapping energies and experimentally observed nucleosome occupancy scores Schwartz et al. (2019); Yazdi et al. (2015).

Methods

The cgNA+ model

The cgNA+ coarse-grain model of dsNA predicts an equilibrium distribution on coordinates w ∈ ℝ^N of a linear fragment of a double helical nucleic acid with arbitrary sequence 𝒮 in the Gaussian, or multidimensional normal, form

Here Z is the normalising constant, or partition function,

We call μ(𝒮, 𝒫) ∈ ℝ^N, with N = 24 n − 18, the expected, or ground, or minimum free energy configuration, which has strong and nonlocal dependence on the n base pair sequence 𝒮. K(𝒮, 𝒫) ∈ ℝ^N×N is a (banded) stiffness, or inverse covariance, matrix scaled such that

is a (free) energy expressed in units of kT. 𝒫 is a parameter set, which in this presentation we restrict to cases describing DNA with arbitrary sequences in the alphabet {A, T, C, G, MpN, HpK}, where MpN and HpK are CpG dinucleotode steps in which the cytosines are either both methylated or both hydroxymethylated, respectively.

The cgNA+ model is an extension in two directions of the precursor cgDNA model Gonzalez et al. (2013); Petkevičiūtė et al. (2014) in which the configuration coordinate w was restricted to rescaled versions of the standard intra and inter base-pair Curves+ Lavery et al. (2009) coordinates which determine the relative rigid body displacements of all the bases in a DNA (and which respect the Tsukuba convention Olson et al. (2001)). For our purposes, the first critical extension of cgDNA was to cgDNA+ Patelli (2019) in which the coordinate vector w was extended to explicitly include the relative rigid body displacements between bases and adjacent phosphate groups, also assumed to be rigid, but only with a parameter set 𝒫 allowing sequences 𝒮 in the standard {A, T, C, G} alphabet. The second crucial extension from cgDNA+ to cgNA+ Sharma et al. (2023); Sharma (2023) was to estimate, and test, parameter sets for other dsNAs and with extended alphabets including epigenetically modified bases. In this presentation we consider only the case of DNA but with a parameter set that distinguishes between unmodified CpG dinucleotide steps, methylated CpG dinucleotide steps (symmetrically so that both cytosines are modified, denoted MpN), and hydroxymethylated CpG dinucleotide steps (again symmetrically and denoted HpK).

cgNA+ parameter sets are estimated by fitting model predictions for first and second moments (or respectively μ(𝒮, 𝒫) and K⁻¹(𝒮, 𝒫)) for a training library of sequences 𝒮_i to statistics drawn directly from large scale, fully atomistic molecular dynamics (or MD) simulations. The MD simulation protocol reflects both assumed physical solvent conditions, such as counter ion species and concentration, and the choice of atomistic MD simulation potentials. The parameter set 𝒫 adopted here was based on simulations with 150mM KCl ions and the AMBER software Pearlman et al. (1995); Case et al. (2005) with the parmbsc1 force field Ivani et al. (2013), explicit TIP3P water Jorgensen et al. (1983) and the Joung and Cheatham ion model Joung and Cheatham III (2008). The additional MD force field parameters for modified cytosines were taken from Pérez et al. Pérez et al. (2012) and Battistini et al. Battistini et al. (2021). MD simulations of twelve 24 base-pair length sequences were used for training model parameters for methylated DNA. These sequences contained methylated CpG steps and combinations of methylated CpG steps in diverse sequence contexts. An analogous training library was used to train hydroxymethylated DNA parameters. The model parameters for unmodified DNA were separately trained on diverse and comprehensive library of 16 sequences containing all possible tetranucleotides at least once.

The predictions of the cgNA+ model were found to be extremely accurate compared to an extensive set of test MD simulations and in good agreement with limited experimental protein-DNA X-ray crystallography data Sharma (2023). Above all, the cgNA+ model is computationally so efficient that predictions of statistics for hundreds of thousands of sequences can be easily handled, which is not feasible with direct MD simulation. Thus, we used cgNA+ free energy for linear fragments as the starting point for developing a method for computing sequence-dependent nucleosome wrapping energies.

Nucleosome wrapping energy for a DNA sequence

A sequence-dependent configuration w_opt of 147 bp of DNA wrapped into a nucleosome is modelled by minimising the cgNA+ free energy U (w; 𝒮, 𝒫) (3):

where

is a set of elastic constraints on the positions p_i(w) of the 28 DNA phosphates, that are closest to the histone core. 𝒮 is any given DNA sequence of length 147 bp and 𝒫 is our cgNA+ model parameter set. The reference positions were obtained from a set of 100 experimental PDB structures of nucleosomes by averaging, and the indices of the 28 phosphates, closest to the nucleosome core, are identified as in our previous work Giniūnaitė and Petkevičiūtė-Gerlach (2022). The penalty coefficients c_i are set through numerical experiments to keep the distances within the ranges observed in the PDB structures, while avoiding steric clashes in DNA.

The energy minimisation (4) is performed numerically using the fminunc function of Matlab with provided gradient and Hessian values. An averaged configuration of 100 experimental structures of DNA in nucleosomes is used as the initial or starting configuration for the optimisation procedure for all sequences. The optimisation for a 147 bp DNA sequence takes approximately 30 seconds.

The energy minimisation algorithm used in this work improves its previous versionGiniūnaitė and Petkevičiūtė-Gerlach (2022) which was sensitive to the starting configuration and reached to minimum energies with inflated magnitude while keeping the trends similar to experimental observations. This development improves those shortcomings by incorporating two main changes. Firstly, the constraints (5) are elastic, in contrast to previously used hard intervals. In addition, rather then performing the optimisation (4) in the cgNA+ coordinates and computing the absolute positions of the constrained phosphates after every minimisation step, we use a mixed coordinate vector, with absolute positions of constrained phosphates, absolute positions and orientations of their adjacent bases and all the base pairs, while the rest of the configuration is described in the cgNA+ coordinates. Although the conversion to the full cgNA+ coordinates for evaluating the energy is still necessary after every optimisation step, this approach provides the possibility to derive the gradient vector and the Hessian matrix for the constrained optimisation problem (4), which significantly improves the performance of the algorithm. As a consequence of these modifications, nucleosome wrapping energies are similar in magnitude as well as in trends to those observed in experiments (as discussed in results).

In this work we compare sequence-dependent energy values U (w_opt; 𝒮, 𝒫) (3) with units kT as well as the natural logarithms of the optimal nucleosomal configuration probability density (1)

for different sequences 𝒮. The probability densities can be regarded as proportional to probabilities of DNA spontaneously reaching the configuration w_opt, when these probabilities are estimated in a small domain around w_opt, with the same domain volume for all the sequences.

A more detailed mathematical description of the computational method will be published separately, and the Matlab code is available at https://github.com/daivaaviad/optDNA_nucleosome.

Experimental data

Computationally predicted CGI regions from the human genome are obtained from the UCSC genome browser Kent et al. (2002), whereas experimentally identified NMIs for human liver cells are taken from Long et al. (2013b). The human genome version used in these studies is Genome Reference Consortium Human Build 37 (GRCh37). Note that to make the necessary computations feasible, for each specific sequence in an ensemble such as CGIs, NMIs or their intersections or complements, we only consider one specific central 147 bp sequence per region. The exact sequences used in our analysis are available at https://github.com/rginiunaite/CGI-NMI-sequences. Data for nucleosome occupancy scores for HeLa cells was taken from Schwartz et al. (2019) and for human genome embryonic stem cells from Yazdi et al. (2015)

Results

The spread of predicted DNA nucleosomal configurations is similar to that of experimental structures

We first compare our predicted sequence-dependent optimal DNA nucleosomal configurations (4) for 100 human genome sequences with 100 experimental configurations from the Protein Data Bank Berman et al. (2000). In Figure 1a we observe an orderly positioning of phosphates in the aligned experimental structures. Note that, because the structures are aligned, the phosphates adjacent to the nucleosome dyad fall into the same spatial cluster despite the variation in sequence length across the PDB structures. For each helical turn, we choose one phosphate cluster that is closest to the nucleosome center (points coloured in red) and use the index of that cluster to define the constraints in (5). Figure 1c shows the analogous scatter plot for the configurations w_opt(𝒮, 𝒫) (4) predicted for the first 100 non-methylated CGI sequences in our human genome sample. The positioning of phosphates in Figure 1c is rather similar to the one in Figure 1a, and the clusters of phosphates are of comparable sizes in both plots, even though there seems to be more variation in the experimental structures. This difference can be explained by the diversity of experimental settings, such as differences in ion concentration, the presence of histone modifications, additional ligands and other experimental conditions, that are not captured in our model. Another difference between the two plots is the unwrapping of approximately five base-pairs at each end of the predicted configurations. While in our model there are no restrictions for this behaviour, in the experimental setting there could be other factors, such as histone tails, keeping the DNA ends closer to the nucleosome core.

Left column: locations of the Watson strand phosphates for 100 aligned nucleosome structures, projected to a plane perpendicular to the nucleosome central axis. Top row corresponds to 100 experimental PDB nucleosome structures (not all with independent sequences). Red points are phosphates with local minima of radial distance used to identify bound indices. Bottom row analogous data over 100 predicted minimal energy nucleosomal configurations for sequences drawn from human genome CpG islands. The phosphates with bound indices that are constrained during the optimisation are coloured in red. Right panels: standard deviations over sequence of radial distance of all phosphates against index along the Watson strand. Top PDB structures, bottom model computations. Bound indices are marked with solid red vertical lines. Dashed black vertical lines mark indices of bound complementary (Crick) strand phosphates.

The spread of phosphate positions in each cluster is quantified by the standard deviations of phosphate distances to the nucleosome central axis, plotted in Figure 1b and Figure 1d. As already seen in the scatter plots on the left, the spread of the predicted configurations is smaller and more regular than that of the experimental structures. The main conclusion here is that the standard deviation reaches its local minima for the phosphate clusters closest to the nucleosome core (indices marked by solid red vertical lines, corresponding to the red points on the left plots). Interestingly, the local minima of standard deviations is also reached for positions corresponding to the histone touching phosphates on the complementary (Watson) strand (marked by dashed black vertical lines). This observation holds for both experimental and predicted nucleosomal configurations and indicates that the phosphates chosen to be constrained in our optimisation method are also constrained (bound to the histone core) in the experimental nucleosomes.

CpG step (hydroxy)methylation affects DNA nucleosome wrapping energy and the probability density of nucleosomal configuration

To assess the sequence dependence of the nucleosome wrapping energy and of the probability density of the optimal nucleosomal configuration, we initially perform a computational experiment in which we generate four sets of sequences of length 147 bp, each containing a thousand sequences with a varying number of CpG dinucleotide steps, ranging from 0 to 4, from 5 to 14, from 15 to 24 and from 25 to 34. Each sequence is first generated with equal probabilities for each base, and then if the desired density of CpG steps needs to be increased, dinucleotide steps in random positions are replaced by CpGs. Similarly, if the density needs to be decreased, a base in a CpG dinucleotide is replaced by another, all in a randomised way. From these sequence ensembles, we also create another eight sets of sequences, first by symmetrically methylating (MpN), and second by hydroxymethylating (HpK), both cytosines in all the instances of the CpG dinucleotides.

We then use our optimisation algorithm to compute the energies required for these sequences to wrap onto nucleosomes. The resulting energy values are shown in Figure 2. The average of the predicted nucleosome wrapping energy over all the 4K unmodified random sequences is 86.12 kT. As expected, this value is higher than the energy prediction for the synthetic nucleosome positioning sequence Widom 601 Lowary and Widom (1998) (76.23 kT) and the naturally occurring sequence 5S, known to have a high nucleosome forming affinity Simpson and Stafford (1983) (83.76 kT). An opposite extreme, the 147 bp poly-A sequence, has a high predicted wrapping energy of 95.08 kT. Above examples illustrate that the modeling matches expectations for some known DNA sequences. When we vary unmodified CpG density, only minor differences of wrapping energy are observed (Figure 2a). However, the average energy increases substantially when cytosines are methylated or hydroxymethylated to obtain MpN or HpK steps. These results can be well-associated with the findings that suggest that methylation increases DNA stiffness Lee and Lee (2012); Pérez et al. (2012); Ngo et al. (2016). The effects of hydroxymethylation and methylation are quite similar.

Spectra of nucleosome wrapping energies and logarithms of probability densities for the optimal nucleosomal configurations for 147 bp sequences (a,c) generated randomly and (b,d) drawn from the human genome, grouped by the indicated ranges of numbers of CpG dinucleotide steps: dots averages, bars standard deviation in sequence. For methylated and hydroxymethylated data all CpG steps are symmetrically modified.

The changes in nucleosome wrapping energy due to CpG methylation or hydroxymethylation can be explained not only by altered DNA stiffness, but also by modifications in its equilibrium configuration. For example, the roll, twist and slide inter base-pair coordinates are strongly affected when DNA wraps onto a nucleosome Giniūnaitė and Petkevičiūtė-Gerlach (2022) and they are all substantially modified in the linear ground state when cytosines are methylated or hydroxymethylated (Figure 3a). The linear ground state coordinates of the phosphates also change both when wrapping onto a nucleosome and with cytosine modification (Figure 3b), but this change is more dependent on the sequence context Sharma (2023). The same observation holds for intra base-pair coordinates (Suplementary Figure S7).

Effects of sequence context and epigenetic base modifications on the ground state shape of CpG steps. Bar plots of the ground state values of (a) six inter base-pair step and (b) six Watson phosphate coordinates for CpG steps i) averaged over sequence context with standard deviations in thin lines and ii) the extreme case of poly(CpG) (in hatch). In each case three versions corresponding to unmodified, methylated and hydroxymethylated steps. The standard deviations highlight the crucial role of non-local sequence dependence in the equilibrium structure of CpG/MpN/HpK steps. Analogous plots for the remaining intra base-pair coordinates and Crick phosphate coordinates are shown in Supplementary Figure S7.

We then compare the values of the logarithm of the probability density of the optimal nucleosomal configurations (6). The probability density is proportional to probability of DNA spontaneously acquiring its optimal nucleosomal configuration, estimated in a small domain around that configuration. It can also be regarded as a measure of DNA mechanical affinity to form nucleosomes, which includes the (negative) nucleosome wrapping and also approximates entropic effects or thermal fluctuations.

For our set of random sequences, the log probability density decreases with the growing number of unmodified CpG steps (Figure 2c). Cytosine methylation weakens the trend while also increasing the average log densities within each range of CpG count. In contrast, cytosine hydroxymethylation leads to a faster decrease in log densities with the growing CpG count.

To verify whether our observations for randomly generated sequences also hold for biologically more realistic sequences, we perform the same analysis for sequences obtained from the human genome. We consider four sub-ensembles of our human sequence fragments grouped by their numbers of CpG dinucleotides falling in the intervals that correspond to constrained numbers of CpG steps in our randomized sequence ensembles. Figure 2b demonstrates that for the human sequence ensembles, just as for the random sequence ensembles (Figure 2a), the nucleosome wrapping energy is not strongly affected with the number of unmodified CpG dinucleotide steps. Cytosine (hydroxy)methylation also increases the nucleosome wrapping energy for human genome sequences. However, some differences can be observed between the two ensembles. For most of the human sequence sub-ensembles, there are somewhat higher nucleosome wrapping energies and a sharper drop in log probability densities than for the comparable random ensembles. This observation remains unchanged after sub-sampling human genome sequences to have 1K data points in each CpG range, the same number as for random sequences (Supplementary Figure S1). We first hypothesised that different clustering features of CpG dinucleotides might explain these differences. To investigate this hypothesis, we looked at the distances along the sequences between CpG dinucleotides. But we did not observe any significant differences in the distributions of these distances between human genome and random sequence ensembles (Supplementary Figure S2)

These differences might instead be associated with non-random distribution of other dinucleotide steps in the two sets of sequence ensembles. Figure 4 gives the average number of all of the 16 possible dinucleotide steps in the random and human ensembles in the case of 10 CpGs for random and [5, 14] CpGs for human ensembles (other cases are provided in Supplementary Figures S3-S5). The distribution can be seen to be highly non-uniform for the human genome sequences. For example, one striking feature of the [5, 14] human sequence ensemble is the small number of ApT and TpA dinucleotides which are found to be most stiff and flexible dimer step in both experiments and simulationsYoung et al. (2022); Sharma (2023). It may well be that this depletion is a result of promoter sequences avoiding mechanistically extreme dimer steps.

Average number of instances of the 16 different dinucleotide steps for (a) 1000 random 147 bp sequences and for (b) our 147 bp human genome sequence ensemble, with [5, 14] CpGs (different colours correspond to fragments taken from different chromosomes). Dinucleotide steps are ordered next to their complements, with self-complementary steps listed on the right.

We further tested whether the non-uniform dinucleotide counts, as opposed to the specific arrangement of dinucleotides, is the key reason for the difference in energies and nucleosomal configuration probability densities between the human and random sequences. To this end, we explored the scenario in which we keep the same count of each dinucleotide step in each sequence in the [5, 14] human genome sequence ensemble, but we reordered the dinucleotide steps using the Altschul-Erickson dinucleotide shuffle algorithm Altschul and Erickson (1985). We observe that in this scenario the resulting distributions of nucleosome wrapping energies and of nucleosomal configuration log probability densities remain significantly more similar to that of the unshuffled human ensemble than to the analogous random sequence ensemble (Supplementary Figure S6). This observation suggests that the non-uniform count of dinucleotides is central in explaining the differences in wrapping energies and log probability densities between random and human genome sequence ensembles.

In fact the ground state configuration of the DNA in each junction has a quite strong dependence on sequence context beyond the junction dinucleotide. This phenomenon has been observed in MD simulation Pasi et al. (2014); Balaceanu et al. (2019) and crystallography experimentsYoung et al. (2022). It is also encapsulated in the cgNA+ model. It has further been observed Sharma (2023) that epigenetic base modifications lead to larger changes in the ground state configuration within CpG junctions when the two flanking bases in the tetramer context are C/G rich (also Figure 3). For instance, in an average context, hydroxy(methylation) of CpG step reduces its twist significantly. In contrast, when a poly-CpG sequence is hydroxy(methylated), the predicted twist of the CpG steps increases (Figure 3). Therefore, for assessing the effect of sequence shuffling on the ground shape of DNA, it is of interest to investigate the flanking context of the CpG dinucleotides. The tetranucleotide sequence logos over all CpG steps in three of our four sub-ensembles of human sequences are in fact rich in C/Gs, as shown in the sequence logos of Figure 5a, where the amount of the flanking enrichment depends on the four cases of ranges of numbers of CpG dinucleotide steps. It is also the case that the constraints on the elevated number of CpG steps in the fragments are strong enough that the tetranucleotide sequence logos remain essentially unaltered in each of the four cases for the sequence ensembles that arise after the dinucleotide step sequence shuffling algorithm is applied, Figure 5b. Nevertheless when comparing the logos in panels a) and b) in detail, there is a signal indicating that the flanking C/G enrichment is slightly stronger in the original human sequence ensemble, than it is after shuffling.

Sequence logos for tetramer flanking context of CpG dinucleotide steps for (a) all four sequence ensembles from the human genome with varying numbers of CpG junctions, and (b) all four sequence ensembles from the human genome after dinucleotide shuffling (but respecting the numbers of dinucleotide steps). Just specifying the numbers of CpG dinucleotide steps is a strong enough constraint to leave the tetramer sequence context logos largely unchanged after shuffling. The sequence logos in panel a) for the human sequence ensemble before sequence shuffling, suggest a slightly stronger C/G flanking enrichment than after shuffling.

Overlap of CpG islands and NMIs leads to the lowest probability densities of nucleosomal configurations

In this section we split the human genome into four regions based on data from Long et al. (2013b): A) Intersection of CpG islands (CGIs) and NMIs; B) NMIs that do not intersect with CGIs; C) CGIs that do not intersect with NMIs; D) Regions that intersect neither with CGIs nor with NMIs. The numbers of sequences in each sub-ensemble listed in Table 1.

The data presented in Figure 6a reveals that the nucleosome wrapping energies have similar distributions in all four regions, if we do not include methylation (round dots and solid error bars). If we include methylation everywhere in not NMIs (i.e. respecting the definition of NMI), there is an increase in the wrapping energy for sequences that are CGIs that are not NMIs (triangle dots and dashed error bars in Figure 6a red). Wrapping energies for sequences that belong neither to CGIs nor to NMIs, do not exhibit such a significant change upon methylation (green).

Spectra of (a) nucleosome wrapping energies and (b) log probability densities of the optimal nucleosomal configurations for 147 bp sequences drawn from four different regions of the human genome: (A) intersection of CGI and NMI, (B) NMI and not CGI, (C) CGI and not NMI, (D) not CGI and not NMI (Table 1). Dots averages, error bars standard deviation over sequence, solid and circles when CpG dinucleotides are not methylated, dashed and triangles when CpGs are methylated.

The log probability density of the optimal nucleosomal configuration has the lowest average value for sequences in the intersection of CGIs and NMIs. Even though methylation of the sequences that are CGIs but not NMIs increases the log probability density values, the highest densities are for sequences that are not CGIs but are NMIs (blue) or in the regions outside CGIs and NMIs (green).

It is important to note that the number of sequences drawn from the four different regions is not equidistributed. Table 1 shows that there are fewer sequences that are CGIs but not NMIs, i.e. they are methylated CGIs, than in the other three categories. Nevertheless, approximately 30% of CGIs are methylated, so it is reasonable to consider methylated CGIs as a separate category. Note that for practical restrictions on total computational resources we compute wrapping energies for only one 147 bp representative sequence drawn from each occurrence of each of the four types of regions over the entire genome. Table 1 reports the resulting numbers of fragments, i.e. the number of instances of each of the four types of regions. But the numbers in Table 1 do not reflect the total number of bp covered by each of the four types of region. In reality the number of base pairs in each occurrence of the regions that are neither CGI nor NMI is much higher than in the other three types, so that the union of all not CGI and not NMI regions covers by far most of the genome.

Nucleosome wrapping energies and probability densities of the optimal nucleosomal configurations compared with nucleosome occupancy scores

We now compare our wrapping energy predictions and DNA nucleosomal configuration log probability density predictions with experimentally measured nucleosome occupancy scores as reported in Schwartz et al. (2019) and Yazdi et al. (2015). We have extracted their reported occupancy scores for each of our selected 147 bp fragments and first grouped the data by the methylated and non-methylated regions (NMIs and not NMIs), then within each region according to the number of CpGs in the corresponding sequences.

Figure 7 shows that nucleosome occupancy is decreasing with increasing CpG count for both NMI and not NMI regions, with one exception of passing from [0;4] to [5;14] in the Yazdi et al. data. This trend is compatible with the increase of nucleosome wrapping energy for methylated sequences in Figure 2b and the decrease of log probability density for nucleosomal configurations in Figure 2d.

Spectra of nucleosome ocupancy scores for our 86,874 selected sequences, grouped by the genomic regions (NMI and *not* NMIs) and by indicated ranges of numbers of CpG dinucleotide steps: dots averages, error bars standard deviation in sequence. The number of sequences in each group is listed in Table 2. See also Figure 2d.

According to Yazdi et al. data, for the CpG count falling into the middle intervals, from 5 to 14 and form 15 to 24, methylated sequences have a higher average occupancy than unmethylated sequences. This difference is also observed in our log probability density predictions in Figure 2d. For the remaining two CpG count intervals and all the Schwartz et al. data, the occupancy for methylated sequences is lower than or very similar to unmethylated ones.

Numbers of human genome sequence fragments of length 147 bp taken from CGIs, non CGIs, NMIs and non NMIs grouped by the number of CpG dinucleotide steps in each of four intervals.
As expected CGI fragments have relatively more CpG junctions than non CGI fragments. NMIs also have more CpGs than non NMIs.

We then grouped our sequences according to the four genomic sub-regions. Figure 8a reveals that for the data extracted from Schwartz et al. (2019), methylated CGIs (red) have a higher average nucleosome occupancy than unmethylated CGIs (black), but smaller than the non-CGI regions. For Yazdi et al. (2015) data, the distribution of nucleosome occupancy scores is lowest for the intersection of CGIs and NMIs (black), and is highest for the intersection of CGIs and not NMIs (red), i.e. both the lowest and highest occupancy distributions arise for sequences drawn from CpG islands, with the lowest occupancies in unmethylated fragments and the highest in methylated fragments.

Normalised frequencies (each of the four histograms in each plot normalised independently) of experimental nucleosome occupancy scores for our 86,874 selected sequences grouped by each of the four types of regions in the genome (cf. Table 1). Average score for each region is indicated by a vertical dashed line of appropriate colour. The black and red (but not blue or green) histograms have significant spikes reflecting many instances of zero occupancy in the experimental data.

Boths sets of experimental data indicate that in CGIs the highest occupancies arise for the fragments that have methylated CpG dinucleotides and therefore higher nucleosome wrapping energies. This conclusion, in particular, apparently runs counter to the (perhaps naive) intuition that high nucleosome forming affinity should arise for fragments with low wrapping energy.

In order to further probe this observation we selected a 50K run of bp in the human genome. (Specifically from chromosome I, between genomic positions 850K and 900K, as this range contains the largest number of CGI and NMI intersections.) We then computed the the probability density of an optimal nucleosomal configuration for every possible 147 bp window at the resolution of 1 bp shifts. (These computations are quite intensive, requiring around 900 hours of CPU time for the relatively short 50K bp segment, which is why longer subsequences were not considered.) The resulting data is plotted in Figure 9 panel (a), with the CGIs indicated with magenta underlining and NMIs in cyan. On average, the lowest log probability densities arise at the intersections of CGIs and NMIs: the mean value of log probability density is 468.61 kT over the intersection of CGI and NMI regions, and 476.10 kT in the complementary regions.

(a) Predicted log probability density for an optimal nucleosomal configuration (b) nucleosome occupancy scores from ***Schwartz et al. (2019***) and (c) nucleosome occupancy scores from ***Yazdi et al. (2015***) for sequence positions 850K-900K of human chromosome I. In the regions corresponding to the intersection of CGIs and NMIs, both the mean log probability density (468.61) and mean scores (2.62 and 139.53) are smaller than outside of the intersection regions (476.10, 5.89 and 212.00 respectively).

Panels (b) and (c) of Figure 9 provide analogous plots for occupancy scores, again taken from Schwartz et al. (2019) and Yazdi et al. (2015) respectively. Again the lowest average values arise for sequences in the intersection of CGIs and NMIs: the average scores are 2.62 and 139.53 in the intersection of CGIs and NMIs, versus 5.89 and 212.53 outside of the intersection regions.

The observations about nucleosome occupancy should be regarded as preliminary, and be treated with caution, as they are based on experimental data obtained for the cancerous HeLa cells Schwartz et al. (2019) and human genome embryonic stem cells Yazdi et al. (2015), while for the classification of NMI and not NMI we use the data of Long et al. (2013b) obtained from human liver cells. With this caveat, we reach the conclusion that the lowest occupancies arise for sequences with the lowest log probability densities.

CONCLUSIONS and DISCUSSION

Conclusions

In this work, we studied the computed sequence-dependent mechanical nucleosome wrapping energy, required to deform a linear 147 bp DNA fragment to a configuration, where the appropriate 28 phosphates can bind to the histone core, as well as the probability density function, that can be regarded as proportional to the probability of linear DNA spontaneously reaching the nucleosomal configuration.

We explored sequence dependence of the energy and the probability density corresponding to our predicted optimal nucleosomal DNA configurations. Our analysis includes the effects of both methylation and hydroxymethylation epigenetic modifications of CpG dinucleotides. To achieve this, we used the newly developed computational method to solve the constrained minimisation problem (4) in terms of the cgNA+ energy (3) subject to constraints on the phosphates binding to histones in given ranges of configurations. The fact that the cgNA+ model includes an explicit description of the phosphate group configurations allows for a comparatively simple description of the DNA-histone binding site constraints, which we believe to be a significant improvement over prior rigid base-pair coarse grain DNA models used for nucleosome wrapping energy prediction Eslami-Mossallam et al. (2016); Chen et al. (2016); Liu et al. (2018); Neipel et al. (2020). We believe that our minimisation algorithm delivers an accurate ordering of sequence dependent wrapping energies and probability densities, given the accuracy of the cgNA+ energy (3). The cgNA+ probability density function (1) is itself known to deliver highly accurate sequence-dependent statistics of linear fragments compared to MD simulations carried out with the same protocol as the cgNA+ parameter set training data. However, a MD protocol perfectly emulating experimental conditions (which are often different in different experiments) is challenging and therefore, some approximations must be made. For example, the parameter set used here models DNA in 150mM KCl solution, whereas both ion type and concentration might be different in both experiment and in vivo.

Nucleosome wrapping energies, the corresponding optimal configurations and their probability densities could also be computed via approaches that adopt MD simulations directly, e.g. Ruscio and Onufriev (2006b); Ngo et al. (2016); Battistini et al. (2021). Along with accurate treatment of sequence-dependent mechanics of DNA, the key advantage of our coarse-grained approach is that it is computationally much more efficient, so that large numbers of sequences can be considered. For example, when epigenetic sequence variants are included, the data described in this article involves approximately 400K solves of the minimisation problem (4). And analogous numbers of MD simulations are currently unfeasible.

The minimisation principle (4) delivers not only a wrapping energy and a probability density, but also the detailed configuration w_opt realising the minimal wrapping energy. We compared our computed optimal configurations of DNA in a nucleosome with the experimental PDB structures and found significant similarities between the two configuration ensembles. Further and more detailed analysis is both feasible and interesting. For example, the roll and slide (inter base-pair coordinates) are strongly affected when DNA wraps onto a nucleosome Giniūnaitė and Petkevičiūtė-Gerlach (2022), and they are both substantially modified in the linear ground state when cytosines are methylated or hydroxymethylated Sharma (2023). The linear ground state coordinates of the phosphates also change with cytosine modification, but this change is more dependent on the sequence context Sharma (2023).

We then computed spectra of wrapping energies and the nucleosomal configuration probability densities for ensembles of 147 bp fragments with differing numbers of CpG dinucleotides, with sequences both generated artificially and drawn from the human genome. We concluded that for increasing numbers of CpG steps the wrapping energies increased substantially, but only for epigenetically modified CpGs. The effects on the wrapping energies of the two epigenetic modifications of methylation and hydroxymethylation are very similar. The nucleosomal configuration probability densities decreased with increasing CpG counts both for unmodified and (hydroxy)methylated DNA. However, for each CpG count interval, methylation increased and hydroxymethylation decreased the average probability densities.

As discussed fully in the main text, these trends were similar in both the artificial and human genome sequence ensembles, although there are perceptible differences, perhaps because of local and non-local sequence dependence in DNA. Notably, the two data sets have different flanking contexts, for example, the human genome sequences have a small bias towards having more C/G flanking bases in the tetramer context to central CpG dinucleotides, along with some highly nonuniform distributions of other dinucleotides, e.g. very low occurrences of ApT and TpA steps.

We then compared nucleosome wrapping energies, in both epigenetically unmodified and modified versions, for ensembles of DNA sequences constructed by drawing one representative from each instance in the human genome of the four region types CGI and NMI, CGI and not NMI, not CGI and NMI, and finally not CGI and not NMI. We were motivated to consider four types of region by the work of Long et al. (2013b) who demonstrated that NMIs cannot be reliably identified by CGIs algorithms and NMIs may have more biological significance. They also found that NMIs are consistent across species, and in warm-blooded organisms these regions coincide with transcription initiation sites. The assumption that CGIs never have epigenetically modified CpG dinucleotides is often made when analysing CGIs Ioshikhes and Zhang (2000); Hannenhalli and Levy (2001); Bock et al. (2007); Han and Zhao (2008), although the current definitions of CGIs do not actually entail this information, so that the studies often lack detail in this respect Long et al. (2013b). Accordingly we considered all four possibilities of intersections and disjunctions between CGIs and NMIs. Our main conclusion from studying wrapping energy spectra from the four types of region is that the lowest probability densities of nucleosomal configurations arise precisely for unmodified CGI sequences, that is sequences that are both CGI and NMI.

The restriction to drawing one representative from each instance of each of the four types of region was dictated merely to limit the necessary computations to a feasible magnitude. We did verify that our results were not sensitive to precisely how we chose the 147 bp representative from each region. Another limitation dictated by available computational resources is the focus on human genome data only. It would be interesting to explore the same data (CGIs and NMIs) for other warm and cold-blooded organisms which were also provided by Long et al. (2013b). That data might provide deeper insights because the regions of interest and their intersections differ vastly across different organisms.

Discussion

We believe that our predictive computational model of nucleosome wrapping energies and the nucleosomal configuration probability densities is (subject to the aforementioned caveats) both sufficiently accurate and efficient to explore biologically pertinent ensembles of sequences and compare model predictions with experimental observations. It is presumably the case that nucleosome wrapping energy will make a significant contribution to predicting nucleosome binding affinities at a particular site. Both stiffness and groundstate of DNA fragment (which are accurately captured in the cgNA+ model Sharma (2023); Sharma et al. (2023)) contribute to the sequence dependence of wrapping energy. At the same time, differences in stiffness also contribute to sequence dependent differences in fluctuations about the minimal energy wrapped configuration w_opt. Thus we believe that sequence (including epigenetic modifications) dependent entropy-like corrections are necessary to be able to accurately predict binding affinities from wrapping energies, and computing the probability densities of the optimal nucleosomal configurations is a way to account for those corrections.

Furthermore, the process of comparing the predicted densities with the nucleosome occupancy scores is fraught with many potential sources of inaccuracy. Firstly, any computation involving only the DNA takes no account of the possibly sequence-dependent contributions of the histone tails, epigenetically modified or not. Secondly, the probability densities are not probabilities of DNA wrapping into a nucleosomal configurations, but could be regarded as proportional to such, assuming that these probabilities can be approximated by a one-point integral over a small domain of the same volume for all the sequences. The validity of this assumption is not completely obvious.

Generally there have been opposing views in the literature about the relationship between nucleosome occupancy scores and sequence induced mechanical properties of DNA. Yoo et al. Yoo et al. (2021) claimed that nucleosome occupancy scores anticorrelate with the wrapping energy. Whereas, Collings et al. Collings and Anderson (2017) demonstrated that methylated regions, which are known to have high wrapping energy, are among the highest nucleosome occupied elements in the genome. It has also been shown that CGIs are five-fold depleted for observed nucleosome coverage Valouev et al. (2011) which suggests a positive correlation between nucleosome binding energy and nucleosome occupancy scores.

In this work, we contribute to this discussion by investigating the relations between our probability density predictions and the experimentally observed nucleosome occupancy scores from Schwartz et al. (2019) and Yazdi et al. (2015). Our predictions agree with both sets of data in concluding that methylation of CpG islands increase the probability of nucleosome formation. However, the precise ordering of the four genomic regions of CGI and NMI groups by nucleosome occupancy is different in all three cases (two experimental data sets and our predictions). This might be due to different methylation patterns for cancerous HeLa cells in Schwartz et al. (2019), human embryonic stem cells in Yazdi et al. (2015) and liver cells in Long et al. (2013b), used for identifying non methylated regions for our computations. Matched DNA modification to nucleosome occupancy experimental data and investigation of different cell-types will likely reveal more accurately how cells evolve nucleotide composition and modification patterns to reach optimal nucleosome occupancy in different genomic regions.

Data availability

CGI regions from the human genome can be obtained from the UCSC genome browser Kent et al. (2002), whereas experimentally identified NMIs for liver cells are provided in the SI of Long et al. (2013b). The human genome version used in these studies is Genome Reference Consortium Human Build 37 (GRCh37). The full listing of sequences used in our analysis are available in the github repository, at https://github.com/rginiunaite/CGI-NMI-sequences. The raw experimental occupancy data can be accessed from the SI of Schwartz et al. (2019) and Yazdi et al. (2015).

Acknowledgements

This work received financial support from the Research Council of Lithuania (LMTLT), agreement number S-MIP-21-5 (for D.P.-G. and R.G.), Marius Jakulis Jason fund (for R.G.), Swiss National Science Foundation, Grant Number 200020_182184 (for J.H.M. and R.S., as well as the EPFL Scitas computational resources used for this work) and Ludwig Cancer Research, Oxford (for S.K.).

References

1. Altschul SF
2. Erickson BW
1985Signiﬁcance of nucleotide sequence alignments: a method for random sequence permutation that preserves dinucleotide and codon usageMol Biol Evol 2:526–538Google Scholar
1. Andrews AJ
2. Luger K.
2011Nucleosome structure (s) and stability: variations on a themeAnnu Rev Biophys 40:99–117Google Scholar
1. Balaceanu A
2. Buitrago D
3. Walther J
4. Hospital A
5. Dans PD
6. Orozco M.
2019Modulation of the helical properties of DNA : next-to-nearest neighbour effects and beyondNucleic Acids Res 47:4418–4430Google Scholar
1. Battistini F
2. Dans PD
3. Terrazas M
4. Castellazzi CL
5. Portella G
6. Labrador M
7. Villegas N
8. Brun-Heath I
9. González C
10. Orozco M.
2021The Impact of the HydroxyMethylCytosine epigenetic signature on DNA structure and functionPLOS Comput Biol 17:1–24Google Scholar
1. Battistini F
2. Hunter CA
3. Gardiner EJ
4. Packer MJ
2010Structural Mechanics of DNA Wrapping in the NucleosomeJ Mol Biol 396:264–279Google Scholar
1. Berman HM
2. Westbrook J
3. Feng Z
4. Gilliland G
5. Bhat TN
6. Weissig H
7. Shindyalov IN
8. Bourne PE
2000The Protein Data BanNucleic Acids Res 28:235–242Google Scholar
1. Bird A
2. Taggart M
3. Frommer M
4. Miller OJ
5. Macleod D.
1985A fraction of the mouse genome that is derived from islands of nonmethylated, CpG-rich DNACell 40:91–99Google Scholar
1. Bock C
2. Walter J
3. Paulsen M
4. Lengauer T.
2007CpG island mapping by epigenome predictionPLoS Comput Biol 3:e110Google Scholar
1. Bruin LD
2. Maddocks JH
2018cgDNAweb: a web interface to the cgDNA sequence-dependent coarse-grain model of double-stranded DNANucleic Acids Res 46:W5–W10Google Scholar
1. Buitrago D
2. Labrador M
3. Arcon JP
4. Lema R
5. Flores O
6. Esteve-Codina A
7. Blanc J
8. Villegas N
9. Bellido D
10. Gut M
11. Dans PD
12. Heath SC
13. Gut IG
14. Brun Heath I
15. Orozco M.
2021Impact of DNA methylation on 3D genome structureNat Commun 12:3243Google Scholar
1. Case DA
2. Cheatham TE
3. Darden T
4. Gohlke H
5. Luo R
6. Merz KM
7. Onufriev A
8. Simmerling C
9. Wang B
10. Woods RJ
2005The Amber biomolecular simulation programsJ Comput Chem 26:1668–1688Google Scholar
1. Chen W
2. Feng P
3. Ding H
4. Lin H
5. Chou KC
2016Using deformation energy to analyze nucleosome positioning in genomesGenomics 107:69–75Google Scholar
1. Chen W
2. Luo L
3. Zhang L.
2010The organization of nucleosomes around splice sitesNucleic Acids Res 38:2788–2798Google Scholar
1. Choy JS
2. Wei S
3. Lee JY
4. Tan S
5. Chu S
6. Lee TH
2010DNA methylation increases nucleosome compaction and rigidityJ Am Chem Soc 132:1782–1783Google Scholar
1. Collings CK
2. Anderson JN
2017Links between DNA methylation and nucleosome occupancy in the human genomeEpigenetics & chromatin 10:1–19Google Scholar
1. Cooper DN
2. Taggart MH
3. Bird AP
1983Unmethlated domains in vertebrate DNANucleic Acids Res 11:647–658Google Scholar
1. Deaton AM
2. Bird A.
2011CpG islands and the regulation of transcriptionGenes Dev 25:1010–1022Google Scholar
1. Eslami-Mossallam B
2. Schram RD
3. Tompitak M
4. van Noort J
5. Schiessel H.
2016Multiplexing genetic and nucleosome positioning codes: a computational approachPloS one 11:e0156905Google Scholar
1. Gabdank I
2. Barash D
3. Trifonov EN
2009Nucleosome DNA bendability matrix (Celegans). J Biomol Struct 26:403–411Google Scholar
1. Gabdank I
2. Barash D
3. Trifonov EN
2010Single-base resolution nucleosome mapping on DNA sequencesJ Biomol Struct 28:107–121Google Scholar
1. Gardiner-Garden M
2. Frommer M.
1987CpG islands in vertebrate genomesJ Mol Biol 196:261–282Google Scholar
1. Giniūnaitė R
2. Petkevičiūtė-Gerlach D.
2022Predicting the conﬁguration and energy of DNA in a nucleosome by coarse-grain modellingPhys Chem Chem Phys 2022:26124–26133https://doi.org/10.1039/D2CP03553G Google Scholar
1. Gonzalez O
2. Petkevičiūtė D
3. Maddocks JH
2013A sequence-dependent rigid-base model of DNAJ Chem Phys 138Google Scholar
1. Gupta S
2. Dennis J
3. Thurman RE
4. Kingston R
5. Stamatoyannopoulos JA
6. Noble WS
2008Predicting human nucleosome occupancy from primary sequencePLoS Comput Biol 4:e1000134Google Scholar
1. Han L
2. Zhao Z.
2008Comparative analysis of CpG islands in four ﬁsh genomesComp Funct Genomics 2008:565631Google Scholar
1. Hannenhalli S
2. Levy S.
2001Promoter prediction in the human genomeBioinformatics 17:S90–S96Google Scholar
1. International Human Genome Sequencing Consortium
2001Initial sequencing and analysis of the human genomeNature 409:860–921Google Scholar
1. Ioshikhes IP
2. Albert I
3. Zanton SJ
4. Pugh BF
2006Nucleosome positions predicted through comparative genomicsNat Genet 38:1210–1215Google Scholar
1. Ioshikhes IP
2. Zhang MQ
2000Large-scale human promoter mapping using CpG islandsNat Genet 26:61–63Google Scholar
1. Ivani I
2. Dans PD
3. Noy A
4. Perez A
5. Faustino I
6. Hospital A
7. Walther J
8. Pau A
9. Goni R
10. Balaceanu A
11. Portella G
12. Battistini F
13. Gelp JL
14. Gonzalez C
15. Vendruscolo M
16. Laughton CA
17. Harris SA
18. Case DA
19. Orozco M.
2013Parmbsc1: a reﬁned force ﬁeld for DNA simulationsNat Meth 13:55–58Google Scholar
1. Jimenez-Useche I
2. Yuan C.
2012The effect of DNA CpG methylation on the dynamic conformation of a nucleosomeBiophys J 103:2502–2512Google Scholar
1. Jorgensen WL
2. Chandrasekhar J
3. Madura JD
4. Impey RW
5. Klein ML
1983Comparison of simple potential functions for simulating liquid waterJ Chem Phys 79:926–935Google Scholar
1. Joung IS
2. Cheatham III TE
2008Determination of alkali and halide monovalent ion parameters for use in explicitly solvated biomolecular simulationsJ Chem Phys B 112:9020–9041Google Scholar
1. Kent WJ
2. Sugnet CW
3. Furey TS
4. Roskin KM
5. Pringle TH
6. Zahler AM
7. Haussler D.
2002The human genome browser at UCSCGenome Res 12:996–1006Google Scholar
1. Larsen F
2. Gundersen G
3. Lopez R
4. Prydz H.
1992CpG islands as gene markers in the human genomeGenomics 13:1095–1107Google Scholar
1. Lavery R
2. Moakher M
3. Maddocks JH
4. Petkeviciute D
5. Zakrzewska K.
2009Conformational analysis of nucleic acids revisited: Curves+Nucleic Acids Res 37:5917–5929Google Scholar
1. Lee JY
2. Lee J
3. Yue H
4. Lee TH
2015Dynamics of nucleosome assembly and effects of DNA methylationJ Biol Chem 290:4291–4303Google Scholar
1. Lee JY
2. Lee TH
2012Effects of DNA methylation on the structure of nucleosomesJ Am Chem Soc 134:173–175Google Scholar
1. Li S
2. Peng Y
3. Panchenko AR
2022DNA methylation: Precise modulation of chromatin structure and dynamicsCurr Opin Struct Biol 75:102430Google Scholar
1. Liu G
2. Xing Y
3. Zhao H
4. Cai L
5. Wang J.
2018The implication of DNA bending energy for nucleosome positioning and slidingSci Rep 8Google Scholar
1. Long HK
2. Blackledge NP
3. Klose RJ
2013ZF-CxxC domain-containing proteins, CpG islands and the chromatin connectionBiochem Soc Trans 41:727–740Google Scholar
1. Long HK
2. Sims D
3. Heger A
4. Blackledge NP
5. Kutter C
6. Wright ML
7. Grützner F
8. Odom DT
9. Patient R
10. Ponting CP
11. et al.
2013Epigenetic conservation at gene regulatory elements revealed by non-methylated DNA proﬁling in seven vertebrateseLife 2:e00348Google Scholar
1. Lowary P
2. Widom J.
1998New DNA sequence rules for high affinity binding to histone octamer and sequencedirected nucleosome positioningJ Mol Biol 276:19–42Google Scholar
1. Neipel J
2. Brandani G
3. Schiessel H.
2020Translational nucleosome positioning: A computational studyPhys Rev E 101:022405Google Scholar
1. Ngo T
2. Yoo J
3. Dai Q
4. Zhang Q
5. He C
6. Aksimentiev A
7. Ha T.
2016Effects of cytosine modiﬁcations on DNA flexibility and nucleosome mechanical stabilityNature Communications 7:1–9Google Scholar
1. Olson WK
2. Bansal M
3. Burley SK
4. Dickerson RE
5. Gerstein M
6. Harvey SC
7. Heinemann U
8. Lu XJ
9. Neidle S
10. Shakked Z
11. Sklenar H
12. Suzuki M
13. Tung CS
14. Westhof E
15. Wolberger C
16. Berman HM
2001A Standard Reference Frame for the Description of Nucleic Acid Base-pair GeometryJ Mol Biol 313:229–237Google Scholar
1. Pasi M
2. Maddocks JH
3. Beveridge D
4. Bishop TC
5. Case DA
6. Thomas Cheatham I
7. Dans PD
8. Jayaram B
9. Lankas F
10. Laughton C
11. Mitchell J
12. Osman R
13. Orozco M
14. Pérez A
15. Petkevičiūtė D
16. Spackova N
17. Sponer J
18. Zakrzewska K
19. Lavery R.
2014μABC: a systematic microsecond molecular dynamics study of tetranucleotide sequence effects in B-DNANucleic Acids Res 42:12272–12283Google Scholar
1. Patelli AS
2019A sequence-dependent coarse-grain model of B-DNA with explicit description of bases and phosphate groups parametrised from large scale Molecular Dynamics simulationsPhD thesis, EPFL Google Scholar
1. Pearlman DA
2. Case DA
3. Caldwell JW
4. Ross WS
5. Cheatham TE
6. DeBolt S
7. Ferguson D
8. Seibel G
9. Kollman P.
1995AMBER, a package of computer programs for applying molecular mechanics, normal mode analysis, molecular dynamics and free energy calculations to simulate the structural and energetic properties of moleculesComput Phys Commun 91:1–41Google Scholar
1. Pérez A
2. Castellazzi CL
3. Battistini F
4. Collinet K
5. Flores O
6. Deniz O
7. Ruiz ML
8. Torrents D
9. Eritja R
10. Soler-López M
11. et al.
2012Impact of methylation on the physical properties of DNABiophys J 102:2140–2148Google Scholar
1. Petkevičiūtė D
2. Pasi M
3. Gonzalez O
4. Maddocks J.
2014cgDNA: a software package for the prediction of sequence-dependent coarse-grain free energies of B-form DNANucleic Acids Res 42:e153–e153Google Scholar
1. Ruscio JZ
2. Onufriev A.
2006A computational study of nucleosomal DNA flexibilityBiophys J 91:4121–4132Google Scholar
1. Ruscio JZ
2. Onufriev A.
2006A Computational Study of Nucleosomal DNA FlexibilityBiophys J 91:4121–4132Google Scholar
1. Schwartz U
2. Németh A
3. Diermeier S
4. Exler JH
5. Hansch S
6. Maldonado R
7. Heizinger L
8. Merkl R
9. Längst G.
2019Characterizing the nuclease accessibility of DNA in human cells to map higher order structures of chromatinNucleic Acids Res 47:1239–1254Google Scholar
1. Segal E
2. Fondufe-Mittendorf Y
3. Chen L
4. Thåström A
5. Field Y
6. Moore IK
7. Wang JPZ
8. Widom J.
2006A genomic code for nucleosome positioningNature 442:772–778Google Scholar
1. Sharma R.
2023cgNA+: A sequence-dependent coarse-grain model of double-stranded nucleic acidsPhD thesis, EPFL Google Scholar
1. Sharma R
2. Patelli AS
3. De Bruin L
4. Maddocks JH
2023cgNA+web: A visual interface to the cgNA+ sequence-dependent statistical mechanics model of double-stranded nucleic acidsJ Mol Biol :167978Google Scholar
1. Simpson RT
2. Stafford DW
1983Structural features of a phased nucleosome core particleProc Natl Acad Sci USA 80:51–55Google Scholar
1. Struhl K
2. Segal E.
2013Determinants of nucleosome positioningNat Struct Mol Biol 20:267–273Google Scholar
1. Valouev A
2. Johnson SM
3. Boyd SD
4. Smith CL
5. Fire AZ
6. Sidow A.
2011Determinants of nucleosome organization in primary human cellsNature 474:516–520Google Scholar
1. Yasuda T
2. Sugasawa K
3. Shimizu Y
4. Iwai S
5. Shiomi T
6. Hanaoka F.
2005Nucleosomal structure of undamaged DNA regions suppresses the non-speciﬁc DNA binding of the XPC complexDNA repair 4:389–395Google Scholar
1. Yazdi PG
2. Pedersen BA
3. Taylor JF
4. Khattab OS
5. Chen YH
6. Chen Y
7. Jacobsen SE
8. Wang PH
2015Nucleosome organization in human embryonic stem cellsPloS one 10:e0136314Google Scholar
1. Yoo J
2. Park S
3. Maffeo C
4. Ha T
5. Aksimentiev A.
2021DNA sequence and methylation prescribe the inside-out conformational dynamics and bending energetics of DNA minicirclesNucleic Acids Res 49:11459–11475Google Scholar
1. Young RT
2. Czapla L
3. Wefers ZO
4. Cohen BM
5. Olson WK
2022Revisiting DNA sequence-dependent deformability in high-resolution structures: Effects of flanking base pairs on dinucleotide morphology and global chain conﬁgurationLife 12:759Google Scholar

Article and author information

Author information

Rasa Giniūnaitė
Department of Applied Mathematics, Kaunas University of Technology, Studentų 50-318, 51368, Kaunas, Lithuania, Institute of Applied Mathematics, Vilnius University, Naugarduko 24, 03225, Vilnius, Lithuania
Rahul Sharma
Institut de Mathématiques, École Polytechnique Fédérale de Lausanne, EPFL SB MATH LCVMM„ Station 8, CH-1015 Lausanne Switzerland
John H. Maddocks
Institut de Mathématiques, École Polytechnique Fédérale de Lausanne, EPFL SB MATH LCVMM„ Station 8, CH-1015 Lausanne Switzerland
Skirmantas Kriaučionis
Ludwig Institute for Cancer Research Ltd, University of Oxford, Nuffield Department of Medicine, Old Road Campus Research Building, Roosevelt Drive, Oxford OX3 7DQ, UK
ORCID iD: 0000-0002-2273-5994
Daiva Petkevičiūtė-Gerlach
Department of Applied Mathematics, Kaunas University of Technology, Studentų 50-318, 51368, Kaunas, Lithuania
ORCID iD: 0000-0002-7646-2442
- For correspondence:⠀daiva.petkeviciute@ktu.lt (D.P.G.)

Version history

Sent for peer review: April 6, 2024
Preprint posted: April 10, 2024
Reviewed Preprint version 1: May 29, 2024
Reviewed Preprint version 2: August 14, 2025
Reviewed Preprint version 3: November 25, 2025
Version of Record published: December 24, 2025

Cite all versions

You can cite all versions using the DOI https://doi.org/10.7554/eLife.98468. This DOI represents all versions, and will always resolve to the latest one.

Copyright

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Not revised: This Reviewed Preprint includes the authors’ original preprint (without revision), an eLife assessment, and public reviews.

Reviewing Editor
Anna Panchenko
Queen's University, Kingston, Canada
Senior Editor
Yamini Dalal
National Cancer Institute, Bethesda, United States of America

Reviewer #1 (Public Review):

Summary:

In this manuscript, the authors used a coarse-grained DNA model (cgNA+) to explore how DNA sequences and CpG methylation/hydroxymethylation influence nucleosome wrapping energy and the probability density of optimal nucleosomal configuration. Their findings indicate that both methylated and hydroxymethylated cytosines lead to increased nucleosome wrapping energy. Additionally, the study demonstrates that methylation of CpG islands increases the probability of nucleosome formation.

Strengths:

The major strength of this method is that the model explicitly includes elastic constraints on the positions of phosphate groups facing a histone octamer, as DNA-histone binding site constraints. The authors claim that their model enhances the accuracy and computational efficiency and allows comprehensive calculations of DNA mechanical properties and deformation energies.

Weaknesses:

A significant limitation of this study is that the parameter sets for the methylated and hydroxymethylated CpG steps in the cgNA+ model are derived from all-atom molecular dynamics (MD) simulations that suggest that both methylated and hydroxymethylated cytosines increase DNA stiffness and nucleosome wrapping energy (Pérez A, et al. Biophys J. 2012; Battistini, et al. PLOS Comput Biol. 2021). It could predispose the coarse-grained model to replicate these findings. Notably, conflicting results from other all-atom MD simulations, such as those by Ngo T in Nat. Commun. 2016, shows that hydroxymethylated cytosines increase DNA flexibility, contrary to methylated cytosines. If the cgNA+ model was trained on these later parameters or other all-atom force fields, different conclusions might be obtained regarding the effects of methylated and hydroxymethylation on nucleosome formation.

Despite the training parameters of the cgNA+ model, the results presented in the manuscript indicate that methylated cytosines increase both DNA stiffness and nucleosome wrapping energy. However, when comparing nucleosome occupancy scores with predicted nucleosome wrapping energies and optimal configurations, the authors find that methylated CGIs exhibit higher nucleosome occupancies than unmethylated ones, which seems to contradict their findings from the same paper which showed that increased stiffness should reduce nucleosome formation affinity. In the manuscript, the authors also admit that these conclusions "apparently runs counter to the (perhaps naive) intuition that high nucleosome forming affinity should arise for fragments with low wrapping energy". Previous all-atom MD simulations (Pérez A, et al. Biophys J. 2012; Battistini, et al. PLOS Comput Biol. 202; Ngo T, et al. Nat. Commun. 20161) show that the stiffer DNA upon CpG methylation reduces the affinity of DNA to assemble into nucleosomes or destabilizes nucleosomes. Given these findings, the authors need to address and reconcile these seemingly contradictory results, as the influence of epigenetic modifications on DNA mechanical properties and nucleosome formation are critical aspects of their study.
Understanding the influence of sequence-dependent and epigenetic modifications of DNA on mechanical properties and nucleosome formation is crucial for comprehending various cellular processes. The authors' study, focusing on these aspects, will definitely garner interest from the DNA methylation research community.

https://doi.org/10.7554/eLife.98468.1.sa2

Reviewer #2 (Public Review):

Summary:

This study uses a coarse-grained model for double-stranded DNA, cgNA+, to assess nucleosome sequence affinity. cgNA+ coarse-grains DNA on the level of bases and accounts also explicitly for the positions of the backbone phosphates. It has been proven to reproduce all-atom MD data very accurately. It is also ideally suited to be incorporated into a nucleosome model because it is known that DNA is bound to the protein core of the nucleosome via the phosphates.

It is still unclear whether this harmonic model parametrized for unbound DNA is accurate in describing DNA inside the nucleosome. Previous models by other authors, using more coarse-grained models of DNA, have been rather successful in predicting base pair sequence-dependent nucleosome behavior. This is at least the case as far as DNA shape is concerned whereas assessing the role of DNA bendability (something this paper focuses on) has been consistently challenging in all nucleosome models, to my knowledge.

It is thus of major interest whether this more sophisticated model is also more successful in handling this issue. As far as I can tell the work is technically sound and properly accounts for not only the energy required in wrapping DNA but also entropic effects, namely the change in entropy that DNA experiences when going from the free state to the bound state. The authors make an approximation here which seems to me to be a reasonable first step.

Of interest is also that the authors have the parameters at hand to study the effect of methylation of CpG-steps. This is especially interesting as it allows us to study a scenario where changes in the physical properties of base pair steps via methylation might influence nucleosome positioning and stability in a cell-type-specific way.

Overall, this is an important contribution to the question of how the sequence affects nucleosome positioning and affinity. The findings suggest that cgNA+ has something new to offer. But the problem is complex, also on the experimental side, so many questions remain open.

Strengths:

The authors use their state-of-the-art coarse-grained DNA model which seems ideally suited to be applied to nucleosomes as it accounts explicitly for the backbone phosphates.

Weaknesses:

(1) According to the abstract the authors consider two "scalar measures of the sequence-dependent propensity of DNA to wrap into nucleosomes". One is the bending energy and the other, is the free energy. Specifically in the latter, the authors take the difference between the free energies of the wrapped and the free DNA. Whereas the entropy of the latter can be calculated exactly, they assume that the bound DNA always has the same entropy (independent of sequence) in its more confined state. The problem is the way in which this is written (e.g. below Eq. 6) which is hard to understand. The authors should mention that the negative of Eq. 6 is what physicists call free energy, namely especially the free energy difference between bound and free DNA.

(2) In Eq. 5 the authors introduce penalty coefficients c_i. They write that values are "set by numerical experiment to keep distances ... within the ranges observed in the PDB structure, while avoiding sterical clashes in DNA." This is rather vague, especially since it is unclear to me what type of sterical clashes might occur. Figure 1 shows then a comparison between crystal structures and simulated structures. They are reasonably similar but standard deviations in the fluctuations of the simulation are smaller than in the experiments. Why did the authors not choose smaller c_i-values to have a better fit? Do smaller values lead to unwanted large fluctuations that would lead to steric clashes between the two DNA turns? I also wonder what side views of the nucleosomes look like (experiments and simulations) and whether in this side view larger fluctuations of the phosphates can be observed in the simulation that would eventually lead to turn-turn clashes for smaller c_i-values.

https://doi.org/10.7554/eLife.98468.1.sa1

Reviewer #3 (Public Review):

Summary:

In this study, the authors utilize biophysical modeling to investigate differences in free energies and nucleosomal configuration probability density of CpG islands and nonmethylated regions in the genome. Toward this goal, they develop and apply the cgNA+ coarse-grained model, an extension of their prior molecular modeling framework.

Strengths:

The study utilizes biophysical modeling to gain mechanistic insight into nucleosomal occupancy differences in CpG and nonmethylated regions in the genome.

Weaknesses:

Although the overall study is interesting, the manuscripts need more clarity in places. Moreover, the rationale and conclusion for some of the analyses are not well described.

https://doi.org/10.7554/eLife.98468.1.sa0

Nucleosome wrapping energy in CpG islands and the role of epigenetic base modifications

Significance of findings

Strength of evidence

Abstract

Introduction

Total number of considered sequences from different regions of the human genome. Sequence listings available at https://github.com/rginiunaite/CGI-NMI-sequences.

Methods

The cgNA+ model

Nucleosome wrapping energy for a DNA sequence

Experimental data

Results

The spread of predicted DNA nucleosomal configurations is similar to that of experimental structures

CpG step (hydroxy)methylation affects DNA nucleosome wrapping energy and the probability density of nucleosomal configuration

Overlap of CpG islands and NMIs leads to the lowest probability densities of nucleosomal configurations

Nucleosome wrapping energies and probability densities of the optimal nucleosomal configurations compared with nucleosome occupancy scores

Numbers of human genome sequence fragments of length 147 bp taken from CGIs, non CGIs, NMIs and non NMIs grouped by the number of CpG dinucleotide steps in each of four intervals.

CONCLUSIONS and DISCUSSION

Conclusions

Discussion

Data availability

Acknowledgements

References

Article and author information

Author information

Rasa Giniūnaitė

Rahul Sharma

John H. Maddocks

Skirmantas Kriaučionis

Daiva Petkevičiūtė-Gerlach

Version history

Cite all versions

Copyright

Peer review process

Editors

Be the first to read new articles from eLife