Peer review process
Revised: This Reviewed Preprint has been revised by the authors in response to the previous round of peer review; the eLife assessment and the public reviews have been updated where necessary by the editors and peer reviewers.
Editors
- Reviewing Editor: Ziyue Gao, University of Pennsylvania, Philadelphia, United States of America
- Senior Editor: George Perry, Pennsylvania State University, University Park, United States of America
Reviewer #1 (Public review):
The fundamental claim of the manuscript is that rRNA genes experience substitutions much too quickly, given that they are a multi-copy gene system. As clarified by the authors in their response, and as I think is relatively clear in the manuscript, they are collapsing all copies of the rRNA array down. They first quantify polymorphism (in this expanded definition, where polymorphism means that a given site is variable across any copy). The authors find elevated levels of heterozygosity in rRNA genes compared to single-copy genes, which isn't surprising, given that there is a substantially larger target size; that being said, the increase in polymorphism is smaller than the increase in target size. They then look at substitutions between mouse species and also between human and chimp, and argue that the substitution rate is too fast compared to single-copy genes in many cases.
[Editors' note: we invite readers to consult the review in full from the previous version of the submission: https://doi.org/10.7554/eLife.99992.2.sa1]
Reviewer #2 (Public review):
This revision has further improved the clarity of the paper, better articulating the assumptions of the model and data analysis. I particularly appreciate the authors' thorough response to the eLife assessment. However, the authors did not provide a point-by-point response to the specific comments I made in the last round of review and did not revise the manuscript accordingly, so my major concerns remain.
At a conceptual level, my biggest concern with the model is the lack of constraint on V*(K), which makes the null neutral model too "liberal". On the one hand, the number of descendants of each gene copy must be non-negative; on the other hand, even if the homogenizing process within an individual is extremely strong, it cannot "spread" gene copies across individuals, so the maximum number of descendants of one gene copy cannot exceed the number of offspring that individual has times C. For these reasons, I believe there must be a theoretical upper bound on the value of V*(K), and the actual V*(K) is likely much smaller under a realistic strength of the homogenizing process. When I asked about modeling of the underlying homogenizing process, I did not mean that the authors need to include a specific molecular process in the model; instead, I am asking the authors to provide some realistic scenarios that can give rise to very large V*(K) values. Because the neutral model is so "liberal", although I do agree that rejection of the null provides stronger evidence for selection in human, it is unclear whether there is no evidence of selection in mouse. Please see below for my specific comments regarding the definition and assumptions of V*(K) (copied from the last round of review).
Regarding the data analysis, although I understand the authors' methodology and the rationale behind it, I am not convinced that high sequence similarity between rDNA copies guarantees no biases in alignment and variant calling. Furthermore, given the divergence between species, I am particularly concerned about the practice of aligning reads of different species to the human and Mus musculus reference sequences. A separate issue is the calculation of the divergence level. Instead of using Fst>0.8 as the criterion for calling fixed sites, the authors could calculate the pairwise average divergence between a random copy from one species and a random copy from another species. Mathematically, this could be calculated as p1(1-p2)+p2(1-p1), where p1 and p2 are the frequencies of a given allele in the two species. The observation that the estimated substitution rates for rDNA with and without CpG sites are so close seems to be an indication of technical error. Please also see below for my specific questions about data analysis (copied from the last round of review).
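The pairwise divergence measure suggested above can be sketched as follows. This is a minimal illustration, assuming biallelic sites and per-species frequencies of the same allele; the function names and the example frequencies are hypothetical and are not taken from the manuscript.

```python
def pairwise_divergence(p1: float, p2: float) -> float:
    """Probability that a random copy from species 1 and a random copy from
    species 2 carry different alleles at a biallelic site, where p1 and p2
    are the frequencies of the same allele in the two species."""
    for p in (p1, p2):
        if not 0.0 <= p <= 1.0:
            raise ValueError("allele frequencies must lie in [0, 1]")
    return p1 * (1.0 - p2) + p2 * (1.0 - p1)


def mean_divergence(freqs1, freqs2):
    """Average the per-site divergence over a set of sites."""
    if len(freqs1) != len(freqs2) or not freqs1:
        raise ValueError("need two non-empty frequency lists of equal length")
    total = sum(pairwise_divergence(a, b) for a, b in zip(freqs1, freqs2))
    return total / len(freqs1)


# Hypothetical frequencies at three sites (illustration only, not real data).
species1 = [1.0, 0.9, 0.5]
species2 = [0.0, 0.1, 0.5]
per_site = [pairwise_divergence(a, b) for a, b in zip(species1, species2)]
average = mean_divergence(species1, species2)
```

Unlike a threshold on Fst, this measure lets a nearly fixed difference contribute in proportion to how differentiated the two species actually are, rather than counting it as either 0 or 1.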
Specific comments from last round of review:
Questions regarding V*(K)
(1) Another key parameter V*(K) was still not defined within the paper. In response 9, the authors explained that V*(K) refers to "the number of progeny to whom the gene copy of interest is transmitted (K) over a specific time interval". However, the meaning of "progeny" remains unclear. Are the authors referring to the descendant copies of a gene copy, or the offspring individuals (i.e., the living organisms)? For example, if a variant spreads horizontally through homogenizing processes and transmits vertically to multiple offspring individuals, the number of descendant gene copies could differ substantially from the number of descendant individuals to whom a gene copy is transmitted. This distinction needs to be clarified and clearly stated in the paper.
(2) The authors state that V*(K)>=1 for rDNA genes because of the homogenizing processes (lines 139-141) without providing justification. It is unclear, at least to me, whether homogenizing processes are expected to increase or decrease the variance in "reproductive success" across gene copies. Moreover, the authors claim that V*(K) "can potentially reach values in the hundreds and may even exceed C, resulting in C*=C/V*(K)<1" (Response 7). This claim is unlikely to be true, as the minimum value of K is bounded by zero and E(K) is assumed to be 1. Even in the extreme case that 1% of gene copies leave large numbers of descendants while the others leave none, V*(K) would still be less than 100. Such an extreme case seems highly improbable, given realistic rates of the homogenizing processes.
(3) Regardless of how the authors define V*(K), it is not immediately clear why Equation 1 (N*=NC/V*(K)) holds. Both sides of the equation have their independent meanings, so the authors need to provide a step-by-step derivation demonstrating that they are equal. Only by doing this will the implicit underlying assumptions become clearer. I also strongly recommend that the authors conduct forward-in-time simulations with fixed N, C, V*(K) (however they define it) and μ to confirm that the right side of Equation 1 actually predicts the N* as calculated from the polymorphism level using the equation in line 165.
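The arithmetic behind the extreme case in point (2) can be checked with a short sketch. It assumes a two-class distribution of K, in which a fraction f of gene copies each leave 1/f descendants and all others leave none, so that E(K)=1; the function name is hypothetical.

```python
def variance_two_class(f: float) -> float:
    """Variance of K when a fraction f of gene copies each leave 1/f
    descendant copies and the remaining copies leave none, so E(K) = 1."""
    if not 0.0 < f <= 1.0:
        raise ValueError("f must lie in (0, 1]")
    k_success = 1.0 / f                 # descendants per successful copy
    mean = f * k_success                # equals 1 by construction
    second_moment = f * k_success ** 2  # the zero class contributes nothing
    return second_moment - mean ** 2    # algebraically 1/f - 1


# The reviewer's extreme case: 1% of copies leave all the descendants.
v_one_percent = variance_two_class(0.01)   # algebraically 99
```

Under this two-class assumption the variance is 1/f - 1, so V*(K) exceeds 100 only if fewer than about 1% of copies contribute all descendants in the time interval considered.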
Questions about Ne* for multi-copy system
(1) While Ne is clearly defined in the standard single-copy gene model as the reciprocal of genetic drift (i.e., the decay in heterozygosity), its meaning for multiple-copy genes is unclear. Based on the context, it appears that the authors define Ne as the parameter that fits the population polymorphism level (Hs) using the equation in line 165. This definition is reasonable, but it should be explicitly clarified in the text.
(2) Without providing justification, the authors assumed that a certain number N* exists for rRNA such that it fits both the polymorphism level (line 156) in recent timescales and divergence level in longer timescales (i.e., in the comparison between Tf and Td). However, if N, C or any other relevant parameters have varied substantially throughout evolution, N* is expected to vary with time, and the same value may not fit both polymorphism and divergence data simultaneously.
Questions about data analysis
(1) A significant issue with aligning reads to a single reference genome is reference bias, referring to the phenomenon that reads carrying the reference alleles tend to align more easily than those with one or more non-reference alleles, thus creating a bias in genotype calling or variant allele frequency quantification. As a result, there may be an underrepresentation of non-reference alleles in called variants or an underestimate of non-reference allele frequency, particularly in regions with high genetic diversity. Simply focusing on bi-allelic SNVs is insufficient to minimize reference bias. Given the fourfold increase in diversity within rDNA, the authors must either provide evidence that reference bias is not a significant concern or adopt graph-based reference genomes or more sophisticated alignment algorithms to address this issue.
(2) The potential for reference bias also renders the analysis of divergence sites unreliable, as aligning reads from one species (e.g. chimpanzee) to the reference of another species (e.g., human) is likely to introduce biases in variant calling between the two. One commonly adopted approach to address this imbalance is to align reads from both species to a third reference genome that is expected to be equidistantly related to both.
(3) Although it is somewhat reassuring that the estimated divergence rate of rDNA between human and macaque is comparable to that of the rest of the genome, there remains a concern that divergence in rDNA regions is underestimated because of reference bias. Note that while the "third genome" approach reduces the imbalance between the two genomes being compared, it may still underestimate the overall divergence level due to under-calling of non-reference variants.
(4) In response to my question about the similarity in rDNA substitution rates estimated with or without CpG sites, the authors suggest that this "may be due to strong homogenizing forces, which can rapidly fix or eliminate variants" (Response 17). However, this explanation is insufficient, because the observed substitution rate depends on the mutation rate multiplied by the fixation probability, and accelerated fixation or loss does not alter either. Unless the authors can provide a more convincing explanation, technical errors in the calling of fixed sites remain a concern.
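The reasoning in point (4) can be written out as a short derivation. This is a sketch under standard neutral assumptions that are not stated in the review itself: M denotes the total number of rDNA copies in the population, all copies are treated as exchangeable, and a new neutral variant therefore fixes with probability 1/M. The symbols M, u, and k are introduced here for illustration only.

```latex
% M   : total number of rDNA copies in the population (assumed notation)
% \mu : per-site mutation rate per copy per generation
% u   : fixation probability of a new neutral variant (assumed u = 1/M)
% k   : long-term substitution rate per site per generation
\begin{align}
  k &= (\text{new mutations per generation}) \times (\text{fixation probability}) \\
    &= M\mu \times u \;=\; M\mu \times \frac{1}{M} \;=\; \mu .
\end{align}
```

Under these assumptions, faster homogenization shortens the time to fixation or loss but leaves k equal to the mutation rate, so the higher mutation rate at CpG sites would still be expected to show up as a higher substitution rate when those sites are included.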
Minor points:
Line 157: The statement "where μ is the mutation rate of the entire gene" must be wrong, as the heterozygosity calculated with such μ would correspond to the chance of seeing two different haplotypes at gene level, which is incompatible with the empirical calculation specified in Equation 2. Instead, μ must represent the mutation rate per site averaged over the entire gene.
In response 22, the authors explained that the allele frequency spectrum shown in Fig 3 is folded, because the ancestral allele was not determined. However, this is inconsistent with the x-axis of Fig 3 ranging between 0 and 1. I suspect the x-axis represents the frequency of the alternative (i.e., non-reference) allele. If so, the reported correlation is inflated, as the choice of reference allele is somewhat random, and a variant at joint ALT allele frequencies of (0.9, 0.9) is no different from a variant at (0.1, 0.1). The proper way to calculate this correlation is to first determine the minor allele frequency across individuals and then calculate the correlation between minor allele frequencies.
Similarly, in response 14, it is unclear what the x-axis represents. Is it the ALT allele frequency or the derived allele frequency? If the former, why are only variants with AF>0.8 defined as fixed variants, while those with AF<0.2 are excluded? If it is the latter, please describe how the ancestral state is determined.
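The concern about the Fig 3 correlation can be illustrated with a small sketch. It follows one reading of the suggested procedure: at each site, the minor allele is taken to be the allele that is rarer on average across the two individuals, and both frequencies are re-expressed for that allele before correlating. The frequencies below are hypothetical and are not data from the manuscript; all function names are illustrative.

```python
from math import sqrt


def to_minor_allele(freq_a: float, freq_b: float):
    """Re-express a site's ALT frequencies in two individuals as frequencies
    of the allele that is rarer on average across the two individuals."""
    if (freq_a + freq_b) / 2 > 0.5:
        return 1.0 - freq_a, 1.0 - freq_b
    return freq_a, freq_b


def pearson(xs, ys) -> float:
    """Pearson correlation coefficient, written out to avoid dependencies."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical ALT allele frequencies at six sites in two individuals.
alt_a = [0.9, 0.1, 0.95, 0.05, 0.85, 0.2]
alt_b = [0.95, 0.05, 0.85, 0.15, 0.9, 0.1]

raw_r = pearson(alt_a, alt_b)

folded = [to_minor_allele(a, b) for a, b in zip(alt_a, alt_b)]
minor_a = [pair[0] for pair in folded]
minor_b = [pair[1] for pair in folded]
folded_r = pearson(minor_a, minor_b)
```

In this toy example the correlation of ALT frequencies is driven almost entirely by which allele happens to be labeled as reference; once the labeling is removed, the correlation between minor allele frequencies is much lower.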