Integrative analysis of DNA replication origins and ORC binding sites in human cells reveals a lack of overlap between them

  1. Center for Public Health Genomics, University of Virginia School of Medicine, Charlottesville, VA 22908;
  2. Department of Biochemistry and Molecular Genetics, University of Virginia School of Medicine, Charlottesville, VA 22908;
  3. Department of Genetics, University of Alabama at Birmingham, Birmingham, AL 35233;
  4. Department of Public Health Sciences, University of Virginia, Charlottesville, VA 22908

Peer review process

Not revised: This Reviewed Preprint includes the authors’ original preprint (without revision), an eLife assessment, public reviews, and a provisional response from the authors.


Editors

  • Reviewing Editor
    Bruce Stillman
    Cold Spring Harbor Laboratory, Cold Spring Harbor, United States of America
  • Senior Editor
    Kevin Struhl
    Harvard Medical School, Boston, United States of America

Reviewer #1 (Public Review):

In the best genetically and biochemically understood model of eukaryotic DNA replication, the budding yeast, Saccharomyces cerevisiae, the genomic locations at which DNA replication initiates are determined by a specific sequence motif. These motifs, or ARS elements, are bound by the origin recognition complex (ORC). ORC is required for loading of the initially inactive MCM helicase during origin licensing in G1. In human cells, ORC does not have a specific sequence binding domain and origin specification is not specified by a defined motif. There have thus been great efforts over many years to try to understand the determinants of DNA replication initiation in human cells using a variety of approaches, which have gradually become more refined over time.

In this manuscript Tian et al. combine data from multiple previous studies using a range of techniques for identifying sites of replication initiation to identify conserved features of replication origins and to examine the relationship between origins and sites of ORC binding in the human genome. The authors identify a) conserved features of replication origins e.g. association with GC-rich sequences, open chromatin, promoters and CTCF binding sites. These associations have already been described in multiple earlier studies. They also examine the relationship of their determined origins and ORC binding sites and conclude that there is no relationship between sites of ORC binding and DNA replication initiation. While the conclusions concerning genomic features of origins are not novel, if true, a clear lack of colocalization of ORC and origins would be a striking finding. However, the majority of the datasets used do not report replication origins, but rather broad zones in which replication origins fire. Rather than refining the localisation of origins, the approach of combining diverse methods that monitor different objects related to DNA replication leads to a base dataset that is highly flawed and cannot support the conclusions that are drawn, as explained in more detail below.

Methods to determine sites at which DNA replication is initiated can be divided into two groups based on the genomic resolution at which they operate. Techniques such as bubble-seq and OK-seq can localise zones of replication initiation in the range of ~50 kb. Such zones may contain many replication origins. Conversely, techniques such as SNS-seq and ini-seq can localise replication origins down to less than 1 kb. Indeed, the application of these different approaches has led to a degree of controversy in the field about whether human replication does indeed initiate at discrete sites (origins), or whether it initiates randomly in large zones with no recurrent sites being used. However, more recent work has shown that elements of both models are correct, i.e. there are recurrent and efficient sites of replication initiation in the human genome, but these tend to be clustered and correspond to the demonstrated initiation zones (Guilbaud et al., 2022).

These different scales and methodologies are important when considering the approach of Tian et al. The premise that combining all available data from five techniques will increase accuracy and confidence in identifying the most important origins is flawed for two principal reasons. First, as noted above, of the different techniques combined in this manuscript, only SNS-seq can actually identify origins rather than initiation zones. It is the former that matters when comparing sites of ORC binding with replication origin sites if a conclusion is to be drawn that the two do not co-localise.

Second, the authors give equal weight to all datasets. Certainly, in the case of SNS-seq, this is not appropriate. The technique has evolved over the years and some earlier versions have significantly different technical designs that may impact the reliability and/or resolution of the results, e.g. in Foulk et al. (2015), lambda exonuclease was added to single-stranded DNA from a total genomic preparation rather than to purified nascent strands, which may lead to significantly different digestion patterns (i.e. underdigestion). Curiously, the authors do not make the best use of the largest SNS-seq dataset (Akerman et al., 2020) by ignoring these authors' separation of core and stochastic origins. By blending all data together any separation of signal and noise is lost. Further, I am surprised that the authors have chosen not to use data and analysis from a recent study that provides subsets of the most highly used and efficient origins in the human genome, at high resolution (Guilbaud et al., 2022).

References:

Akerman I, Kasaai B, Bazarova A, Sang PB, Peiffer I, Artufel M, Derelle R, Smith G, Rodriguez-Martinez M, Romano M, Kinet S, Tino P, Theillet C, Taylor N, Ballester B, Méchali M (2020) A predictable conserved DNA base composition signature defines human core DNA replication origins. Nat Commun, 11: 4826

Foulk MS, Urban JM, Casella C, Gerbi SA (2015) Characterizing and controlling intrinsic biases of lambda exonuclease in nascent strand sequencing reveals phasing between nucleosomes and G-quadruplex motifs around a subset of human replication origins. Genome Res, 25: 725-735

Guilbaud G, Murat P, Wilkes HS, Lerner LK, Sale JE, Krude T (2022) Determination of human DNA replication origin position and efficiency reveals principles of initiation zone organisation. Nucleic Acids Res, 50: 7436-7450

Reviewer #2 (Public Review):

Tian et al. perform a meta-analysis of 113 genome-wide origin profile datasets in humans to assess the reproducibility of experimental techniques and shared genomics features of origins. Techniques to map DNA replication sites have quickly evolved over the last decade, yet little is known about how these methods fare against each other (pros and cons), nor how consistent their maps are. The authors show that high-confidence origins recapitulate several known features of origins (e.g., correspondence with open chromatin, overlap with transcriptional promoters, CTCF binding sites). However, surprisingly, they find little overlap between ORC/MCM binding sites and origin locations.

Overall, this meta-analysis provides the field with a good assessment of the current state of experimental techniques and their reproducibility, but I am worried about: (a) whether we've learned any new biology from this analysis; (b) how binding sites and origin locations can be so mismatched, in light of numerous studies that suggest otherwise; and (c) some methodological details described below.

Major comments:

-- Line 26: "0.27% were reproducibly detected by four techniques" -- what does this mean? Does the fragment need to be detected by ALL FOUR techniques to be deemed reproducible? And what if the technique detected the fragment in only 1 of N experiments conducted; does that count as "detected"? Later in Methods, the authors (line 512) say, "shared origins ... occur in sufficient number of samples" but what does *sufficient* mean? Then on line 522, they use a threshold of "20" samples, which seems arbitrary to me. How are these parameters set, and how robust are the conclusions to these settings? An alternative to setting these (arbitrary) thresholds and discretizing the data is to analyze the data continuously; i.e., associate with each fragment a continuous confidence score.

-- Line 20: "50,000 origins" vs "7.5M 300bp chromosomal fragments" -- how do these two numbers relate? How many 300bp fragments would be expected given that there are ~50,000 origins? (i.e., how many fragments are there per origin, on average)? This is an important number to report because it gives some sense of how many of these fragments are likely nonsense/noise. The authors might consider eliminating those fragments significantly above the expected number, since their inclusion may muddle biological interpretation.

-- Line 143: I'm not terribly convinced by the PCA clustering analysis, since the variance explained by the first 2 PCs is only ~25%. A more robust analysis of whether origins cluster by cell type, year etc is to simply compute the distribution of pairwise correlations of origin profiles within the same group (cell type, year) vs the correlation distribution between groups. Relatedly, the authors should explain what an "origin profile" is (line 141). Is the matrix (to which PCA is applied) of size 7.5M x 113, with a "1" in the (i,j) position if the ith fragment was detected in the jth dataset?

-- It's not clear to me what new biology (genomic features) has been learned from this meta-analysis. All the major genomic features analyzed have already been found to be associated with origin sites. For example, the correspondence with TSS has been reported before:

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6320713/
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6547456/

So what new biology has been discovered from this meta-analysis?

-- Line 250: The most surprising finding is that there is little overlap between ORC/MCM binding sites and origin locations. The authors speculate that the overlap between ORC1 and ORC2 could be low because they come from different cell types. Equally concerning is the lack of overlap with MCM. If true, these are potentially major discoveries that butt heads with numerous other studies that have suggested otherwise. More needs to be done to convince the reader that such a mismatch is true. Some ideas are below:

Idea 1) One explanation given is that the ORC1 and ORC2 data come from different cell types. But there must be a dataset where both are mapped in the same cell type. Can the authors check the overlap here? In Fig S4A, I would expect the circles to not only strongly overlap but to also be of roughly the same size, since both ORC's are required in the complex. So something seems off here.

Idea 2) Another explanation given is that origins fire stochastically. One way to quantify the role of stochasticity is to quantify the overlap of origin locations performed by the same lab, in the same year, in the same experiment, in the same cell type -- i.e., across replicates -- and then compute the overlap of mapped origins. This would quantify how much mis-match is truly due to stochasticity, and how much may be due to other factors.

Idea 3) A third explanation is that MCMs are loaded further from origin sites in human than in yeast. Is there any evidence of this? How far away does the evidence suggest, and what if this distance is used to define proximity?

Idea 4) How many individual datasets (i.e., those collected and published together) also demonstrate the feature that ORC/MCM binding locations do not correlate with origins? If there are few, then indeed, the integrative analysis performed here is consistent. But if there are many, then why would individual datasets reveal one thing, but integrative analysis reveal something else?

Idea 5) What if you were much more restrictive when defining "high-confidence" origins / binding sites. Does the overlap between origins and binding sites go up with increasing restriction?
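Ideas 3 and 5 can be combined into a single sweep: for each stringency threshold, compute the fraction of origins lying within a chosen distance of any binding site. The following is a minimal sketch under stated assumptions, not the authors' analysis: `near_fraction`, `sweep`, the midpoint representation, the single-chromosome input and the 1 kb default window are all hypothetical choices made for illustration, and the toy coordinates are invented.

```python
from bisect import bisect_left

def near_fraction(origin_mids, site_mids, window):
    """Fraction of origin midpoints within `window` bp of any binding-site midpoint."""
    sites = sorted(site_mids)
    if not origin_mids or not sites:
        return 0.0
    hits = 0
    for pos in origin_mids:
        i = bisect_left(sites, pos)
        # Only the nearest site on each side of the origin can be the closest one.
        neighbours = sites[max(i - 1, 0):i + 1]
        if any(abs(pos - s) <= window for s in neighbours):
            hits += 1
    return hits / len(origin_mids)

def sweep(origins, site_mids, thresholds, window=1_000):
    """Map each occupancy threshold to the fraction of retained origins near a site.

    origins: list of (midpoint, occupancy) pairs on one chromosome (assumed layout)
    """
    result = {}
    for t in thresholds:
        mids = [mid for mid, occupancy in origins if occupancy >= t]
        result[t] = near_fraction(mids, site_mids, window)
    return result

# Invented toy data: two well-supported origins near sites, two weak ones far away.
toy_origins = [(1_000, 30), (5_000, 25), (20_000, 5), (40_000, 2)]
toy_sites = [1_200, 5_500]
toy_sweep = sweep(toy_origins, toy_sites, thresholds=[1, 20])
```

The toy input is constructed so that the fraction rises with the threshold; whether the real datasets behave this way is exactly the open question. Widening `window` is one way to test the suggestion that MCMs are loaded some distance from origins.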

Overall, I have the sense that these experimental techniques may be producing a lot of junk. If true, this would be useful for the field to know! But if not, and there are indeed "unexplored mechanisms of origin specification" that would be exciting. But I'm not convinced yet.

-- It would be nice in the Discussion for the authors to comment about the trade-offs of different techniques; what are their pros and cons, which should be used when, which should be avoided altogether, and why? This would be a valuable prescription for the field.

Reviewer #3 (Public Review):

Summary: The authors present a thought-provoking and comprehensive re-analysis of previously published human cell genomics data that seeks to understand the relationship between the sites where the Origin Recognition Complex (ORC) binds chromatin, where the replicative helicase (Mcm2-7) is situated on chromatin, and where DNA replication actually begins (origins). The view that these should coincide is influenced by studies in yeast where ORC binds site-specifically to dedicated nucleosome-free origins where Mcm2-7 can be loaded and remains stably positioned for subsequent replication initiation. However, this is most certainly not the case in metazoans where it has already been reported that chromatin binding sites of ORC, Mcm2-7, and origins do not necessarily overlap, likely because ORC loads the helicase in transcriptionally active regions of the genome and, since Mcm2-7 retains linear mobility (i.e., it can slide), it is displaced from its original position by other chromatin-contextualized processes (for example, see Gros et al., 2015 Mol Cell, Powell et al., 2015 EMBO J, Miotto et al., 2016 PNAS, and Prioleau et al., 2016 G&D amongst others). This study reaches a very similar conclusion: in short, they find a high degree of discordance between ORC, Mcm2-7, and origin positions in human cells.

Strengths: The strength of this work is its comprehensive and unbiased analysis of all relevant genomics datasets. To my knowledge, this is the first attempt to integrate these observations and the analyses employed were suited for the questions under consideration.

Weaknesses: The major weakness of this paper is that this comprehensive view failed to move the field forward from what was already known. Further, a substantial body of relevant prior genomics literature on the subject was neither cited nor discussed. This omission is important given that this group reaches very similar conclusions as studies published a number of years ago. Further, their study seems to present a unique opportunity to evaluate and shape our confidence in the different genomics techniques compared in this study. This, however, was also not discussed.

Author Response

Reviewer #1 (Public Review):

In the best genetically and biochemically understood model of eukaryotic DNA replication, the budding yeast, Saccharomyces cerevisiae, the genomic locations at which DNA replication initiates are determined by a specific sequence motif. These motifs, or ARS elements, are bound by the origin recognition complex (ORC). ORC is required for loading of the initially inactive MCM helicase during origin licensing in G1. In human cells, ORC does not have a specific sequence binding domain and origin specification is not specified by a defined motif. There have thus been great efforts over many years to try to understand the determinants of DNA replication initiation in human cells using a variety of approaches, which have gradually become more refined over time.

In this manuscript Tian et al. combine data from multiple previous studies using a range of techniques for identifying sites of replication initiation to identify conserved features of replication origins and to examine the relationship between origins and sites of ORC binding in the human genome. The authors identify a) conserved features of replication origins e.g. association with GC-rich sequences, open chromatin, promoters and CTCF binding sites. These associations have already been described in multiple earlier studies. They also examine the relationship of their determined origins and ORC binding sites and conclude that there is no relationship between sites of ORC binding and DNA replication initiation. While the conclusions concerning genomic features of origins are not novel, if true, a clear lack of colocalization of ORC and origins would be a striking finding.

Thank you. That is where the novelty of the paper lies.

However, the majority of the datasets used do not report replication origins, but rather broad zones in which replication origins fire. Rather than refining the localisation of origins, the approach of combining diverse methods that monitor different objects related to DNA replication leads to a base dataset that is highly flawed and cannot support the conclusions that are drawn, as explained in more detail below.

We are using the narrowly defined SNS-seq peaks as the gold standard origins and making sure to focus in on those that fall within the initiation zones defined by other methods. The objective is to make a list of the most reproducible origins. Unlike what the reviewer states, this actually refines the dataset to focus on the SNS origins that have also been reproduced by the other methods in multiple cell lines. We will change the last box of Fig. 1A to say: Identify reproducible SNS-seq origins that are contained in IZs defined by Repli-seq, OK-seq and Bubble-seq. These are the “shared origins”. This and the Fig. 2B (as it is) will make our strategy clearer.

Methods to determine sites at which DNA replication is initiated can be divided into two groups based on the genomic resolution at which they operate. Techniques such as bubble-seq and OK-seq can localise zones of replication initiation in the range of ~50 kb. Such zones may contain many replication origins. Conversely, techniques such as SNS-seq and ini-seq can localise replication origins down to less than 1 kb. Indeed, the application of these different approaches has led to a degree of controversy in the field about whether human replication does indeed initiate at discrete sites (origins), or whether it initiates randomly in large zones with no recurrent sites being used. However, more recent work has shown that elements of both models are correct, i.e. there are recurrent and efficient sites of replication initiation in the human genome, but these tend to be clustered and correspond to the demonstrated initiation zones (Guilbaud et al., 2022).

These different scales and methodologies are important when considering the approach of Tian et al. The premise that combining all available data from five techniques will increase accuracy and confidence in identifying the most important origins is flawed for two principal reasons. First, as noted above, of the different techniques combined in this manuscript, only SNS-seq can actually identify origins rather than initiation zones. It is the former that matters when comparing sites of ORC binding with replication origin sites if a conclusion is to be drawn that the two do not co-localise.

Exactly. So the reviewer should agree that our method of finding SNS-seq peaks that fall within initiation zones actually refines the origins to find the most reproducible origins. We are not losing the spatial precision of the SNS-seq peaks.

Second, the authors give equal weight to all datasets. Certainly, in the case of SNS-seq, this is not appropriate. The technique has evolved over the years and some earlier versions have significantly different technical designs that may impact the reliability and/or resolution of the results, e.g. in Foulk et al. (2015), lambda exonuclease was added to single-stranded DNA from a total genomic preparation rather than to purified nascent strands, which may lead to significantly different digestion patterns (i.e. underdigestion). Curiously, the authors do not make the best use of the largest SNS-seq dataset (Akerman et al., 2020) by ignoring these authors' separation of core and stochastic origins. By blending all data together any separation of signal and noise is lost. Further, I am surprised that the authors have chosen not to use data and analysis from a recent study that provides subsets of the most highly used and efficient origins in the human genome, at high resolution (Guilbaud et al., 2022).

  1. We are using the data from Akerman et al., 2020: Dataset GSE128477 in Supplemental Table 1. We can examine the core origins defined by the authors to check its overlap with ORC binding.

  2. To take into account the refinement of the SNS-seq methods through the years, we included in our study only those SNS-seq studies published after 2018, well after the lambda exonuclease method was introduced. Indeed, all 66 of the SNS-seq datasets we used were obtained after the lambda exonuclease digestion step. To reiterate, we recognize that there may be many false positives in the individual origin mapping datasets. Our focus is on the True positives, the SNS-seq peaks that have some support from multiple SNS-seq studies AND fall within the initiation zones defined by the independent means of origin mapping (described in Fig. 1A and 2B). These True positives are most likely to be real and reproducible origins and should be expected to be near ORC binding sites.

We will change the last box of Fig. 1A to say: Identify reproducible SNS-seq origins that are contained in IZs defined by Repli-seq, OK-seq and Bubble-seq. These are the “Shared origins”.

Ini-seq by Torsten Krude and co-workers (Guilbaud et al., 2022) does NOT use Lambda exonuclease digestion. So using Ini-seq defined origins is at odds with the suggestion above that we focus only on SNS-seq datasets that use Lambda exonuclease. However, Ini-seq identifies a much smaller subset of SNS-seq origins, so we will do the analysis with just that smaller set in the revision of the paper.

References:

Akerman I, Kasaai B, Bazarova A, Sang PB, Peiffer I, Artufel M, Derelle R, Smith G, Rodriguez-Martinez M, Romano M, Kinet S, Tino P, Theillet C, Taylor N, Ballester B, Méchali M (2020) A predictable conserved DNA base composition signature defines human core DNA replication origins. Nat Commun, 11: 4826

Foulk MS, Urban JM, Casella C, Gerbi SA (2015) Characterizing and controlling intrinsic biases of lambda exonuclease in nascent strand sequencing reveals phasing between nucleosomes and G-quadruplex motifs around a subset of human replication origins. Genome Res, 25: 725-735

Guilbaud G, Murat P, Wilkes HS, Lerner LK, Sale JE, Krude T (2022) Determination of human DNA replication origin position and efficiency reveals principles of initiation zone organisation. Nucleic Acids Res, 50: 7436-7450

Reviewer #2 (Public Review):

Tian et al. perform a meta-analysis of 113 genome-wide origin profile datasets in humans to assess the reproducibility of experimental techniques and shared genomics features of origins. Techniques to map DNA replication sites have quickly evolved over the last decade, yet little is known about how these methods fare against each other (pros and cons), nor how consistent their maps are. The authors show that high-confidence origins recapitulate several known features of origins (e.g., correspondence with open chromatin, overlap with transcriptional promoters, CTCF binding sites). However, surprisingly, they find little overlap between ORC/MCM binding sites and origin locations.

Overall, this meta-analysis provides the field with a good assessment of the current state of experimental techniques and their reproducibility, but I am worried about: (a) whether we've learned any new biology from this analysis; (b) how binding sites and origin locations can be so mismatched, in light of numerous studies that suggest otherwise; and (c) some methodological details described below.

Major comments:

Line 26: "0.27% were reproducibly detected by four techniques" -- what does this mean? Does the fragment need to be detected by ALL FOUR techniques to be deemed reproducible?

If the reproducible SNS-seq peaks are included in the reproducible initiation zones found by the other methods, then we consider it reproducible across datasets. The strategy is to focus our analysis on the most reproducible SNS-seq peaks that happen to be in reproducible initiation zones. It is the best way to confidently identify a very small set of true positive origins.

And what if the technique detected the fragment in only 1 of N experiments conducted; does that count as "detected"?

An SNS-seq origin counts as reproducible if it is detected in at least 20 of the 66 SNS-seq datasets; this threshold gives an FDR of <0.1. This is explained in Fig. 2a and Supplementary Fig. S2. For the initiation zones, we considered a zone even if it appears in only 1 of N experiments, because N is usually small. This relaxed method for selecting the initiation zones gives the best chance of finding SNS-seq peaks that are reproduced by the other methods.

Later in Methods, the authors (line 512) say, "shared origins ... occur in sufficient number of samples" but what does sufficient mean?

Sufficient means that SNS-seq origin was reproducibly detected in ≥ 20 datasets and was included in any initiation zone defined by three other techniques.
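The definition above amounts to a two-step filter, which can be sketched as follows. This is an illustration only, not the authors' pipeline: the function name `shared_origins`, the tuple layouts and the toy coordinates are hypothetical, and "included in" is assumed here to mean that the whole origin lies inside a single initiation zone.

```python
from collections import defaultdict

MIN_OCCUPANCY = 20  # threshold quoted in the response

def shared_origins(origins, zones, min_occupancy=MIN_OCCUPANCY):
    """Return origins seen in >= min_occupancy datasets and lying inside any zone.

    origins: iterable of (chrom, start, end, occupancy)
    zones:   iterable of (chrom, start, end), pooled over the other techniques
    """
    zones_by_chrom = defaultdict(list)
    for chrom, start, end in zones:
        zones_by_chrom[chrom].append((start, end))

    kept = []
    for chrom, start, end, occupancy in origins:
        if occupancy < min_occupancy:
            continue
        # Assumption: "contained" means the whole origin lies inside one zone.
        if any(z_start <= start and end <= z_end
               for z_start, z_end in zones_by_chrom[chrom]):
            kept.append((chrom, start, end, occupancy))
    return kept

# Toy data, invented purely for illustration.
toy_origins = [
    ("chr1", 1_000, 1_300, 25),    # reproducible and inside a zone
    ("chr1", 90_000, 90_300, 40),  # reproducible but outside every zone
    ("chr1", 2_000, 2_300, 3),     # inside a zone but below the threshold
]
toy_zones = [("chr1", 0, 50_000)]
toy_shared = shared_origins(toy_origins, toy_zones)
```

The linear scan over zones keeps the sketch short; on genome-scale inputs an interval tree or a sorted sweep would be the usual choice.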

Then on line 522, they use a threshold of "20" samples, which seems arbitrary to me. How are these parameters set, and how robust are the conclusions to these settings? An alternative to setting these (arbitrary) thresholds and discretizing the data is to analyze the data continuously; i.e., associate with each fragment a continuous confidence score.

We explained Fig. 2a and Supplementary Fig. S2 in the text as follows: The occupancy score of each origin defined by SNS-seq (Supplementary Fig. 2a) counts the frequency at which a given origin is detected in the datasets under consideration. For the random background, we assumed that the number of origins confirmed by increasing occupancy scores decreases exponentially (see Methods and Supplementary Table 2). Plotting the number of origins with various occupancy scores when all SNS-seq datasets published after 2018 are considered together (the union origins) shows that the experimental curve deviates from the random background at a given occupancy score (Fig. 2a). The threshold occupancy score of 20 is the point where the observed number of origins deviates from the expected background number (with an FDR < 0.1) (Fig. 2a). In the Methods: In other words, the number of observed origins with occupancy score greater than 20 is 10 times more than expected in the background model. This approach is statistically sound and described by us in (Fang et al. 2020).
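One way to read this thresholding rule is sketched below. It is a minimal illustration under stated assumptions, not the procedure in the Methods or in Fang et al. 2020: here the exponential background is fitted by least squares on the log counts of the lowest occupancy scores (an assumption, as are the names `fit_exponential` and `occupancy_threshold`), and the threshold is the smallest score at which the expected number of origins at or above that score is less than one tenth of the observed number. The toy counts are invented.

```python
import math

def fit_exponential(counts, n_fit):
    """Least-squares fit of log(count) = intercept + slope * score on scores 1..n_fit."""
    xs = list(range(1, n_fit + 1))
    ys = [math.log(counts[s]) for s in xs]
    x_mean = sum(xs) / n_fit
    y_mean = sum(ys) / n_fit
    slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
             / sum((x - x_mean) ** 2 for x in xs))
    intercept = y_mean - slope * x_mean
    return intercept, slope

def occupancy_threshold(counts, n_fit=5, fdr=0.1):
    """Smallest score t with expected(>= t) / observed(>= t) < fdr, else None.

    counts: dict mapping occupancy score -> number of origins with that score
    """
    intercept, slope = fit_exponential(counts, n_fit)
    scores = sorted(counts)
    for t in scores:
        observed = sum(counts[s] for s in scores if s >= t)
        expected = sum(math.exp(intercept + slope * s) for s in scores if s >= t)
        if observed > 0 and expected / observed < fdr:
            return t
    return None

# Invented toy histogram: counts halve with each score, plus an excess at scores 8-10.
toy_counts = {1: 50000, 2: 25000, 3: 12500, 4: 6250, 5: 3125,
              6: 1562, 7: 781, 8: 5391, 9: 5195, 10: 5098}
toy_threshold = occupancy_threshold(toy_counts)
```

Because the ratio is computed on cumulative tails, the threshold can fall slightly below the first score that carries excess signal; a per-score ratio would be a stricter alternative.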

Line 20: "50,000 origins" vs "7.5M 300bp chromosomal fragments" -- how do these two numbers relate? How many 300bp fragments would be expected given that there are ~50,000 origins? (i.e., how many fragments are there per origin, on average)? This is an important number to report because it gives some sense of how many of these fragments are likely nonsense/noise. The authors might consider eliminating those fragments significantly above the expected number, since their inclusion may muddle biological interpretation.

I think we confused the reviewer by the way we wrote the abstract. The 50,000 origins mentioned in the abstract is the hypothetical expected number of origins that have to fire to replicate the whole 6x10^9 base diploid genome, based on the average inter-origin distance of 10^5 bases (as determined by molecular combing). The 7.5M 300 bp fragments are the genomic regions where the 7.5M union SNS-seq-defined origins are located. Clearly, that is a lot of noise, some of it technical and some due to the fact that origins fire stochastically. This is why our paper focuses on a smaller number of reproducible origins, the 20,250 shared origins. Our analysis is on the 20,250 shared origins, and not on all 7.5M union origins. Thus, we are not including the excess of non-reproducible (stochastic?) origins in our analysis.
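The relationship between these numbers can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted in this response; the variable names are illustrative.

```python
DIPLOID_GENOME_BP = 6e9   # diploid genome size quoted in the response
INTER_ORIGIN_BP = 1e5     # average inter-origin distance from molecular combing
UNION_ORIGINS = 7.5e6     # approximate number of union SNS-seq origins
SHARED_ORIGINS = 20_250   # shared origins analysed in the paper

# Origins that must fire per cell cycle if they are spaced ~100 kb apart.
origins_per_cycle = DIPLOID_GENOME_BP / INTER_ORIGIN_BP

# Share of union origins that survive as shared origins, in percent.
shared_fraction_pct = 100 * SHARED_ORIGINS / UNION_ORIGINS

print(f"origins firing per cell cycle: ~{origins_per_cycle:,.0f}")
print(f"shared / union origins: {shared_fraction_pct:.2f}%")
```

With the quoted inputs the first quantity comes to about 60,000, the same order of magnitude as the ~50,000 stated in the abstract, and the second reproduces the 0.27% figure.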

The revised abstract in the revised paper will say: “Based on experimentally determined average inter-origin distances of ~100 kb, DNA replication initiates from ~50,000 origins on human chromosomes in each cell-cycle. The origins are believed to be specified by binding of factors like the Origin Recognition Complex (ORC) or CTCF or other features like G-quadruplexes. We have performed an integrative analysis of 113 genome-wide human origin profiles (from five different techniques) and 5 ORC-binding site datasets to critically evaluate whether the most reproducible origins are specified by these features. Out of ~7.5 million union origins identified by 66 SNS-seq datasets, only 0.27% were reproducibly contained in initiation zones identified by three other techniques (20,250 shared origins), suggesting extensive variability in origin usage and identification in different circumstances.”

Line 143: I'm not terribly convinced by the PCA clustering analysis, since the variance explained by the first 2 PCs is only ~25%. A more robust analysis of whether origins cluster by cell type, year etc is to simply compute the distribution of pairwise correlations of origin profiles within the same group (cell type, year) vs the correlation distribution between groups. Relatedly, the authors should explain what an "origin profile" is (line 141). Is the matrix (to which PCA is applied) of size 7.5M x 113, with a "1" in the (i,j) position if the ith fragment was detected in the jth dataset?

The reviewer is correct about how we did the PCA, and we have now included the description in the Methods. We will also do the pairwise correlations the way the reviewer suggests (a) by techniques, (b) by cell types (SNS-seq), (c) by year of publication (SNS-seq).
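The within-group versus between-group comparison could look like the sketch below. It is a hedged illustration rather than the planned analysis: each origin profile is taken to be one binary column of the fragment-by-dataset matrix described by the reviewer, Pearson correlation is assumed as the similarity measure, and the helper names and toy profiles are invented.

```python
from itertools import combinations
from math import sqrt

def pearson(u, v):
    """Pearson correlation of two equal-length vectors (undefined for constant input)."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    return cov / sqrt(var_u * var_v)

def within_between(profiles, groups):
    """Split pairwise correlations by whether both datasets share a group label."""
    within, between = [], []
    for i, j in combinations(range(len(profiles)), 2):
        r = pearson(profiles[i], profiles[j])
        (within if groups[i] == groups[j] else between).append(r)
    return within, between

# Invented toy profiles: 4 datasets x 6 fragments, 1 = fragment detected.
toy_profiles = [
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1, 0],
]
toy_groups = ["A", "A", "B", "B"]  # e.g. cell type, technique or publication year
toy_within, toy_between = within_between(toy_profiles, toy_groups)
```

Comparing the two resulting distributions (for example with a rank-based test) then indicates whether datasets cluster by the chosen label, without relying on the variance captured by the first two principal components.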

It's not clear to me what new biology (genomic features) has been learned from this meta-analysis. All the major genomic features analyzed have already been found to be associated with origin sites. For example, the correspondence with TSS has been reported before:

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6320713/

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6547456/

So what new biology has been discovered from this meta-analysis?

The new biology can be summarized as: (a) We can identify a set of reproducible (in multiple datasets and in multiple cell lines) SNS-seq origins that also fall within initiation zones identified by completely independent methods. These may be the best origins to study in the midst of the noise created by stochastic origin firing. (b) The overlap of these True Positive origins with known ORC binding sites is tenuous. So either all the origin mapping data, or all the ORC binding data has to be discarded, or this is the new biological reality in mammalian cancer cells: on a genome-wide scale the most reproduced origins are not in close proximity to ORC binding sites, in contrast to the situation in yeast. (c) All the features that have been reported to define origins (CTCF binding sites, G quadruplexes etc.) could simply be from the fact that those features also define transcription start sites (TSS), and origins prefer to be near TSS because of the favorable chromatin state.

Line 250: The most surprising finding is that there is little overlap between ORC/MCM binding sites and origin locations. The authors speculate that the overlap between ORC1 and ORC2 could be low because they come from different cell types. Equally concerning is the lack of overlap with MCM. If true, these are potentially major discoveries that butt heads with numerous other studies that have suggested otherwise. More needs to be done to convince the reader that such a mismatch is true. Some ideas are below:

Idea 1) One explanation given is that the ORC1 and ORC2 data come from different cell types. But there must be a dataset where both are mapped in the same cell type. Can the authors check the overlap here? In Fig S4A, I would expect the circles to not only strongly overlap but to also be of roughly the same size, since both ORC's are required in the complex. So something seems off here.

We agree with the reviewer that there is something “off here”. Either the techniques that report these sites are all wrong, or the biology does not fit into the prevailing hypothesis. One secret in the ORC ChIP field that our lab has struggled with for quite some time is that the various ORC subunits do not necessarily ChIP-seq to the same sites. The poor overlap between the binding sites of subunits of the same complex suggests either that the subunits do not always bind to the chromatin as a six-subunit complex or that all the ChIP-seq data in the literature is suspect. We provide in Supplementary Fig. S4A examples of true positive complexes (SMARCA4/ARID1A, SMC1A/SMC3, EZH2/SUZ12), whose subunits ChIP-seq to a large fraction of common sites. As shown in Supplementary Fig. S4C, we do not have ORC1 and ORC2 ChIP-seq data from the same cell type. We have ORC1 ChIP-seq and SNS-seq data from HeLa cells and ORC2 ChIP-seq and origins from K562 cells, and so will add the proximity/overlap of the binding sites to the origins in the same cell type in the revision.

Idea 2) Another explanation given is that origins fire stochastically. One way to quantify the role of stochasticity is to quantify the overlap of origin locations performed by the same lab, in the same year, in the same experiment, in the same cell type -- i.e., across replicates -- and then compute the overlap of mapped origins. This would quantify how much mis-match is truly due to stochasticity, and how much may be due to other factors.

A given lab may have superior reproducibility compared to the entire field. But the notion of stochasticity is well accepted in the field because of this observation: the average inter-origin distance measured by single-molecule techniques like molecular combing is ~100 kb, but the average inter-origin distance measured on a population of cells (same cell line) is ~30 kb. The only explanation is that in a population of cells many origins can fire, but in a given cell on a given allele, only one-third of those possible origins fire. This is why we did not worry about the lack of reproducibility between cell lines, labs, etc., but instead focused on those SNS-seq origins that are reproducible over multiple techniques and cell lines.
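As a rough check of the arithmetic behind this argument, the fraction of potential origins that fire on a single molecule can be approximated as the ratio of the two inter-origin spacings quoted above. The function name and the assumption that potential origins are evenly spaced are illustrative only and are not part of the manuscript.

```python
def fraction_of_origins_fired(population_spacing_kb, single_molecule_spacing_kb):
    """Approximate fraction of potential origins used on one molecule.

    Assumes (for illustration) evenly spaced origins, so the number of
    origins per unit length is inversely proportional to the spacing.
    """
    return population_spacing_kb / single_molecule_spacing_kb


# ~30 kb spacing in the cell population vs ~100 kb on single molecules
fraction = fraction_of_origins_fired(30, 100)
print(f"{fraction:.2f}")  # 0.30, i.e. roughly one-third
```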

Idea 3) A third explanation is that MCMs are loaded further from origin sites in human than in yeast. Is there any evidence of this? How far away does the evidence suggest, and what if this distance is used to define proximity?

MCMs, of course, have to be loaded at an origin at the time the origin fires because MCMs provide the core of the helicase that starts unwinding the DNA at the origin. Thus, the lack of proximity of MCM binding sites with origins can be because the most detected MCM sites (where MCM spends the most time in a cell population) do not correspond to where it is first active to initiate origin firing. This has been discussed. MCMs may be loaded far from an origin site, but because of their ability to move along the chromatin, they have to move to the origin site at some point to fire the origin.

Idea 4) How many individual datasets (i.e., those collected and published together) also demonstrate the feature that ORC/MCM binding locations do not correlate with origins? If there are few, then indeed, the integrative analysis performed here is consistent. But if there are many, then why would individual datasets reveal one thing, but integrative analysis reveal something else?

We apologize for this oversight. In the revised manuscript we will discuss PMC3530669, PMC7993996, PMC5389698, PMC10366126. None of them has addressed what we are addressing, which is whether the small subset of the most reproducible origins is proximal to ORC or MCM binding sites, but the discussion is essential.

Idea 5) What if you were much more restrictive when defining "high-confidence" origins / binding sites. Does the overlap between origins and binding sites go up with increasing restriction?

We will make origins more restrictive by selecting those reproduced by 30-60 datasets. The number of origins will of course fall, but we will measure whether the proximity to ORC or MCM-binding sites increases/decreases in a statistically rigorous way.
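The kind of comparison described here could be sketched as follows. This is a minimal illustration, not the authors' pipeline: the data layout (origin midpoints annotated with the number of supporting datasets), the function names, and the 1 kb proximity window are all assumptions, and no statistical test is included.

```python
from bisect import bisect_left


def fraction_near(origins, sites, max_dist):
    """Fraction of origin midpoints within max_dist bp of a binding-site midpoint.

    origins and sites are lists of (chrom, midpoint) tuples (hypothetical layout).
    """
    by_chrom = {}
    for chrom, pos in sites:
        by_chrom.setdefault(chrom, []).append(pos)
    for positions in by_chrom.values():
        positions.sort()

    near = 0
    for chrom, pos in origins:
        positions = by_chrom.get(chrom, [])
        i = bisect_left(positions, pos)
        # Only the two sites flanking the origin can be the nearest one.
        candidates = positions[max(i - 1, 0):i + 1]
        if any(abs(pos - p) <= max_dist for p in candidates):
            near += 1
    return near / len(origins) if origins else 0.0


def proximity_by_threshold(origins, sites, thresholds, max_dist=1000):
    """Map each threshold to (number of origins kept, fraction near a site).

    origins: list of (chrom, midpoint, n_supporting_datasets).
    max_dist: proximity window in bp (1 kb is an assumed default).
    """
    result = {}
    for t in thresholds:
        kept = [(c, p) for c, p, n in origins if n >= t]
        result[t] = (len(kept), fraction_near(kept, sites, max_dist))
    return result


# Toy data, purely illustrative.
origins = [("chr1", 1000, 40), ("chr1", 50000, 10), ("chr2", 2000, 60)]
sites = [("chr1", 1500), ("chr2", 90000)]
result = proximity_by_threshold(origins, sites, thresholds=[1, 30])
```

With real data, the same table could be computed for thresholds of 30 to 60 datasets and the resulting fractions compared against a shuffled-coordinate background to judge significance.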

Overall, I have the sense that these experimental techniques may be producing a lot of junk. If true, this would be useful for the field to know! But if not, and there are indeed "unexplored mechanisms of origin specification" that would be exciting. But I'm not convinced yet.

It would be nice in the Discussion for the authors to comment about the trade-offs of different techniques; what are their pros and cons, which should be used when, which should be avoided altogether, and why? This would be a valuable prescription for the field.

Thanks for the suggestion. We will do what the reviewer suggests: use cell type-specific data wherever origins have been defined by at least two methods in the same cell type, specifically reporting the percent of shared origins amongst the datasets to compare whether some methods correlate better with each other. ORC ChIP-seq and MCM ChIP-seq data do not define origins: they define the binding sites of these proteins. Thus we will discuss why the ChIP-seq sites of these protein complexes should not be used to define origins.

Reviewer #3 (Public Review):

Summary: The authors present a thought-provoking and comprehensive re-analysis of previously published human cell genomics data that seeks to understand the relationship between the sites where the Origin Recognition Complex (ORC) binds chromatin, where the replicative helicase (Mcm2-7) is situated on chromatin, and where DNA replication actually begins (origins). The view that these should coincide is influenced by studies in yeast where ORC binds site-specifically to dedicated nucleosome-free origins where Mcm2-7 can be loaded and remains stably positioned for subsequent replication initiation. However, this is most certainly not the case in metazoans where it has already been reported that chromatin binding sites of ORC, Mcm2-7, and origins do not necessarily overlap, likely because ORC loads the helicase in transcriptionally active regions of the genome and, since Mcm2-7 retains linear mobility (i.e., it can slide), it is displaced from its original position by other chromatin-contextualized processes (for example, see Gros et al., 2015 Mol Cell, Powell et al., 2015 EMBO J, Miotto et al., 2016 PNAS, and Prioleau et al., 2016 G&D amongst others). This study reaches a very similar conclusion: in short, they find a high degree of discordance between ORC, Mcm2-7, and origin positions in human cells.

Strengths: The strength of this work is its comprehensive and unbiased analysis of all relevant genomics datasets. To my knowledge, this is the first attempt to integrate these observations and the analyses employed were suited for the questions under consideration.

Thank you for recognizing the comprehensive and unbiased nature of our analysis. That the major weakness identified is the failure of the comprehensive view to move the field forward is, in our view, actually a strength. It should be viewed in the light that we cannot even find evidence to support the primary hypothesis: that the most reproducible origins must be near ORC and MCM binding sites. This finding will prevent the unwise adoption of ORC or MCM binding sites as surrogate markers of origins and may perhaps stimulate the field to try to improve methods of identifying ORC or MCM binding until the binding sites are found to be proximal to the most reproducible origins. The last possibility is that there are ORC- or MCM-independent modes of defining origins, but we have no evidence of that.

Weaknesses: The major weakness of this paper is that this comprehensive view failed to move the field forward from what was already known. Further, a substantial body of relevant prior genomics literature on the subject was neither cited nor discussed. This omission is important given that this group reaches very similar conclusions as studies published a number of years ago. Further, their study seems to present a unique opportunity to evaluate and shape our confidence in the different genomics techniques compared in this study. This, however, was also not discussed.

We will do what the reviewer suggests: use cell type-specific data wherever origins have been defined by at least two methods in the same cell type, specifically reporting the percent of shared origins amongst the datasets to compare whether some methods correlate better with each other. Thanks for the suggestion. ORC ChIP-seq and MCM ChIP-seq data do not define origins: they define the binding sites of these proteins. Thus, we will discuss why the ChIP-seq sites of these protein complexes should not be used to define origins.

We do not cite the SNS-seq data before 2018 because of the concerns discussed above about the earlier techniques needing improvement. We will discuss other genomics data that we failed to discuss.

We will cite the papers the reviewer names:

Gros, Mol Cell 2015 and Powell, EMBO J. 2015 discuss the movement of MCM2-7 away from ORC in yeast and flies and will be cited. MCM2-7 binding to sites away from ORC and being loaded in vast excess of ORC was reported earlier on Xenopus chromatin in PMC193934, and will also be cited.

Miotto, PNAS, 2016: This paper publishes ORC2 ChIP-seq sites in HeLa (data we have used in our analysis), but does not measure ORC1 ChIP-seq sites. The authors say: “ORC1 and ORC2 recognize similar chromatin states and hence are likely to have similar binding profiles.” This is a conclusion based on the fact that the ChIP-seq sites in the two studies are in areas with open chromatin; it is not a direct comparison of the binding sites of the two proteins.

Prioleau, G&D, 2016: This is a review that compared different techniques of origin identification but has no primary data to say that ORC and MCM binding sites overlap with the most reproducible origins.
