1. Evolutionary Biology
  2. Genetics and Genomics
Download icon

Testis single-cell RNA-seq reveals the dynamics of de novo gene transcription and germline mutational bias in Drosophila

  1. Evan Witt
  2. Sigi Benjamin
  3. Nicolas Svetec
  4. Li Zhao  Is a corresponding author
  1. The Rockefeller University, United States
Research Article
  • Cited 1
  • Views 1,841
  • Annotations
Cite this article as: eLife 2019;8:e47138 doi: 10.7554/eLife.47138

Abstract

The testis is a peculiar tissue in many respects. It shows patterns of rapid gene evolution and provides a hotspot for the origination of genetic novelties such as de novo genes, duplications and mutations. To investigate the expression patterns of genetic novelties across cell types, we performed single-cell RNA-sequencing of adult Drosophila testis. We found that new genes were expressed in various cell types, the patterns of which may be influenced by their mode of origination. In particular, lineage-specific de novo genes are commonly expressed in early spermatocytes, while young duplicated genes are often bimodally expressed. Analysis of germline substitutions suggests that spermatogenesis is a highly reparative process, with the mutational load of germ cells decreasing as spermatogenesis progresses. By elucidating the distribution of genetic novelties across spermatogenesis, this study provides a deeper understanding of how the testis maintains its core reproductive function while being a hotbed of evolutionary innovation.

https://doi.org/10.7554/eLife.47138.001

Introduction

The testis is a highly transcriptionally active tissue whose core function of sperm production is conserved across kingdoms. In humans, flies, and mice, spermatogenesis consists of several key steps: (1) differentiation of germline stem cells into spermatogonia, (2) mitotic divisions of spermatogonia, which become spermatocytes, (3) meiotic divisions to generate primary spermatids, and (4) sperm maturation (Fuller, 1993; Jan et al., 2012; White-Cooper, 2010). Across animal species, the testis is unique from a transcriptomics perspective because it expresses more genes than any other tissue (Parisi et al., 2004). Genotypes and phenotypes associated with sex and reproduction diverge rapidly and may have important functional consequences (Lande, 1981). Despite evolutionary genetic hypotheses trying to explain the complexity of the testis transcriptome, it remains unclear why this tissue expresses a broader array of genes than any other tissue, including the brain, which is more phenotypically and structurally complex and contains more cell types (Parisi et al., 2004; Soumillon et al., 2013; Parisi et al., 2003).

Not only it is a highly transcriptionally active tissue, the testis is also a hotspot for newly originated genes (Long et al., 2003; Neme and Tautz, 2016). One hypothesis is that testis catalyzes the birth and retention of novel genes (Kaessmann, 2010). This hypothesis suggests that novel genes are likely to be born in testis due to a permissive chromatin state. Novel functional genes with beneficial products are selectively preserved and eventually evolve more refined regulatory programs (Bai et al., 2007; Kaessmann, 2010). In the past decade, many studies have found that young genes, including de novo originated genes (genes born from ancestrally noncoding DNA), tend to be biased towards the testis (Levine et al., 2006; Long et al., 2003; Brown et al., 2014; Ruiz-Orera et al., 2015; Tautz and Domazet-Lošo, 2011; Zhao et al., 2014). De novo genes may arise in two main ways: (1) an unexpressed DNA sequence gains expression, and the resulting transcript can acquire a function, coding potential or (2) a potential Open Reading Frame (ORF) gains expression and translation, and undergoes functional refinement (Carvunis et al., 2012; Durand et al., 2019; McLysaght and Hurst, 2016; Schlötterer, 2015). Natural selection may not only preserve testis-bias and function of novel genes, but also shape expression and function in somatic tissues for others (Chen et al., 2010; Chen et al., 2013; Zhou et al., 2008). Elucidating the biology of new-gene evolution therefore requires a comprehensive picture of spatio-temporal dynamics of testis gene regulation.

Spermatogenesis is a highly conserved process in many animal taxa and is well-understood from an anatomical and histological perspective, but its molecular foundations are still poorly understood (Birkhead et al., 2008; Demarco et al., 2014; Russell et al., 1993; White-Cooper, 2010). New analytical methods in genomics allow the quantification of expression biases of gene groups involved in various cellular processes (Jung et al., 2018; Lukassen et al., 2018; Stévant et al., 2018). From the prevalence of their transcripts, one can make inferences about the developmental timing of translation, DNA repair, nuclear export, and other processes. Moreover, these methods also make possible the identification of germline variants and the individual cells in which they occur.

Such methods include the recent advent of single-cell sequencing, a technology that may shed light on unknown aspects of germline mutation. For instance, it is known that the human mutation rate per base per generation ranges from 10E-7 to 10E-9 (Moorjani et al., 2016; Scally and Durbin, 2012), but this germline mutation rate is the result of an equilibrium between errors/lesions and repair. Substitutions that arise within an individual’s germline but do not reach mature gametes will not be passed to the next generation, meaning that population genetics approaches can only observe a subset of germline variants. Is the population-level mutation rate influenced by the mutation-repair equilibrium of spermatogenesis?

A roadblock to the answer of this question is the fact that any substitutions that prevent gamete maturation or fertilization will be lost from the population, meaning that the population-level mutation rate may vastly underestimate the germline variants propagating within individuals. Since male Drosophila do not undergo meiotic recombination, germ cell variants that occur in earlier developmental stages may not be repaired through recombination related mechanisms (Hunter, 2015). It is also known that different cell types in the testis accumulate DNA lesions at different rates (Gao et al., 2014), but it is unclear if the net mutational load varies during spermatogenesis. Single-cell RNA-seq can be used to infer mutational events within a whole tissue, even if such lesions would be repaired before gamete maturation. Unlike single-cell genome sequencing, this approach can infer the cell types associated with each variant, allowing estimation of the mutational load of cells as they progress through spermatogenesis. Due to its versatility, reproducibility, and wealth of useful data, single-cell RNA-seq is a powerful tool for the study of germline mutation.

We leveraged single-cell RNA-seq and unsupervised clustering to identify all the major cell classes of the sperm lineage, validated by previously studied marker genes. We identified populations of somatic cells, including cyst stem cells, hub cells, and terminal epithelial cells. We found that the overall gene expression is very active in early spermatogenesis and decreases throughout spermatogenesis. Lineage-specific de novo genes (genes derived from ancestrally noncoding DNA [Zhao et al., 2014]) showed expression in various cell types and are commonly expressed in spermatocytes. We also identified putative germline de novo substitutions from our population of cells and found that they decrease in relative abundance during spermatogenesis. We also found that the proportion of mutated cells decreases throughout spermatogenesis, a finding with possible implications for the study of male germline DNA repair. In an opposite pattern, DNA damage response genes are upregulated in early spermatogenesis, indicating a role for these genes in early spermatogenesis.

These patterns of mutation and de novo gene expression augment and enrich our current understanding of the male-specific evolutionary novelty. It was previously known that young de novo genes tend to be testis biased, and we have further traced the main source of this bias to spermatocytes. We uncover a compelling time course of mutational load throughout spermatogenesis, putting forward the Drosophila testis as a model for the study of spermatogenic mutational surveillance. Mutation and de novo gene evolution are critical components of the adaptive process, and our results demonstrate these processes in action during spermatogenesis.

Results

Unsupervised clustering elucidates the distribution of de novo genes across cell types

We prepared a single-cell suspension from freshly dissected testes of 48-hours-old D. melanogaster adult males (Figure 1—figure supplement 1, also see Materials and methods). The cell suspension was then made into a library and sequenced. We recovered 426,563,073 reads from a total of 5000 cells. On average, we mapped 85,312 reads per cell and detected the expression of an average of 4185 genes per cell. The dataset correlates well with bulk testis RNA-seq and a separate testis single-cell RNA-seq library, with a Pearson’s R of 0.97, indicating high reproducibility (Figure 1—figure supplement 2). Using t-Stochastic-Neighbor Embedding (t-SNE) in Seurat (Van Der Maaten and Hinton, 2008; Satija et al., 2015) we reduced the dimensionality of the gene/cell expression matrix to two primary axes and grouped cells by their similarity across their thousands of unique gene expression profiles. Grouping similar cells into clusters, we observed marker genes enriched in particular clusters, allowing us to infer the identity of the cells within each cluster (see Materials and methods).

Based on the clustering results, we inferred the presence of germline stem cells, spermatogonia, spermatocytes, and spermatids (germ cells) as well as cyst stem cells, terminal epithelial cells, and hub cells (somatic cells) (Figure 1A and B). We confirmed that the top 50 most highly enriched genes in cell clusters from each cell type (Supplementary file 1) were consistent with previous knowledge of marker genes. For instance, cup genes were biased toward late spermatids (Barreau et al., 2008), and Hsp23 and MtnA were highly expressed in the epithelial cells (Faisal et al., 2014; Michaud et al., 1997). Cell clusters from each developmental stage in the t-SNE map are near each other, suggesting that cell progression through spermatogenesis is a continuous process. The expression of marker genes confirmed the assignment of cell clusters (Figure 1C and D). Germline Stem Cells (GSCs) and early spermatogonia clustered together due to 1) high transcriptional similarity, 2) the relatively low numbers of GSCs within the tissue, and 3) the sparse expression of GSC-specific marker genes. Different types of somatic cells clustered close to each other in the t-SNE graph, suggesting distinct transcriptional patterns compared to germ cells. A principal component analysis of variable genes in the testis is presented in Figure 1—figure supplement 3.

Figure 1 with 3 supplements see all
Clustering and cell-type assignment of single cells in Seurat.

(A) An illustration of the major cell types in the testis, and the marker genes we used to identify them are in brackets. Somatic cells are hub, cyst, and epithelial cells. Spermatogenesis begins with germline stem cells which undergo mitotic divisions to form spermatogonia. These become spermatocytes which undergo meiosis and differentiate into spermatids. (B) A t-SNE projection of every cell type identified in the data. (C) Examples of marker genes that vary throughout spermatogenesis. His2Av is most active in early spermatogenesis, fzo and soti are active in intermediate and late stages, respectively, and p-cup is exclusively enriched in late spermatids. (D) Dotplot of scaled expression of marker genes in each inferred cell type. The size of each dot refers to the proportion of cells expressing a gene, and the color of each dot represents the calculated scaled expression value; blue is lowest, red is highest. 0 is the gene’s mean scaled expression across all cells and the numbers in the scale are z scores. The cutoffs shown here were chosen to emphasize cell-type-specific enrichment of key marker genes. The genes used to assign each cell type are detailed in the Materials and methods section.

https://doi.org/10.7554/eLife.47138.002

To gauge the accuracy of our cell type assignments, we queried if various cell types utilize biological pathways known to be important in spermatogenesis. Using a PANTHER Gene Ontology (GO) search of all significantly enriched genes for each cell type, we found that the most enriched GO terms for GSC, early and late spermatogonia tend to involve translation, transcription, and ATP synthesis (Supplementary file 2), supporting high levels of cellular activity. Early spermatocytes showed an enrichment for ubiquitin-independent proteasomal catabolism; late spermatocytes were enriched for genes involved in spermatid motility and differentiation (Supplementary file 2). Early spermatids were enriched in GO terms for spermatogenesis, gamete generation, and cellular movement, and late spermatids showed no enrichment in any GO terms (Supplementary file 2).

The average number of expressed genes per cell ranged between ~2000 genes for late spermatids and ~7000 genes for our late spermatogonia (Figure 2A). The number of genes expressed in early and late spermatids is lower than at any other point during the sperm lineage (Figure 2A), suggesting that post-meiotic transcription exists, but occurs at a lower level (Schultz et al., 2003). Consistently with this result, the cellular RNA content, measured by the number of Unique Molecular Indices (UMI) recovered per cell, is low in spermatids and high in spermatogonia and early spermatocyte (Figure 2B). The RNA content in the post-meiotic cells is five times lower (21%) than that of meiotic stages, inferred from the average number of UMIs per cell. Congruently, we noticed that spermatids express 53% of the total number of genes that spermatocytes express.

Gene expression and RNA content through spermatogenesis.

(A) Boxplots of the number of genes expressed in each cell, binned by assigned cell type. Late spermatogonia and early spermatocytes express the most genes, and spermatids the least. (B) The number of Unique Molecular Indices (UMIs) detected for each cell, a proxy of RNA content. By this metric RNA content peaks in early spermatocytes, and is reduced thereafter by post-meiotic transcriptional suppression. (C) The proportion of segregating de novo, fixed de novo, testis-specific, and all genes expressed in every cell. For each cell, we counted the number of each class of gene with non-zero expression and divided it by the total number of genes of that type, grouping by cell type. For every cell type except spermatocytes, segregating de novo genes are the least commonly expressed, fixed de novo genes are more commonly expressed and all genes are most commonly expressed. In every cell type except early spermatocytes, a smaller proportion of fixed de novo genes are expressed than testis-specific genes, but early and late spermatocytes express similar proportions of fixed de novo genes and testis-specific genes. It is important to note that this measure looks at the number of genes of each type detected in a cell, not the expression level of each, and does not distinguish between high and low expression.

https://doi.org/10.7554/eLife.47138.006

Since most de novo genes in Drosophila are expressed in the testis (Zhao et al., 2014), we asked whether they can be found uniformly across cell types, or whether they are enriched in particular stages of spermatogenesis. We detected expression of 87 segregating and 97 fixed de novo genes from Zhao et al. (2014) that are expected to have originated sometime after the divergence with D. simulans (Zhao et al., 2014 identified 142 segregating and 106 fixed genes, respectively). Consistent with our predicted expression patterns of functional novel genes, we found that de novo genes are expressed in various cell types and a large number of de novo genes are expressed in meiotic germ cells (Figure 2C).

After calculating the cell-type-specific expression profile for every detectable gene, we asked whether a given cell expresses similar proportions of de novo genes, testis-specific genes, and all other annotated genes. We observed that in most cell types, segregating de novo genes were the least commonly expressed group of genes, fixed de novo genes were more common, and testis-specific genes were most commonly expressed (Figure 2C). Early and late spermatocytes, however, express similar proportions of fixed de novo genes and testis-specific genes. Moreover, spermatocytes also show the highest relative abundance of segregating de novo genes compared to other cell types. Altogether, the high proportion of de novo genes expressed in spermatocytes suggests that such genes may play functional roles in these cells and development stage.

Developmental trajectories show de novo gene expression bias during spermatogenesis

To study the transcriptomic path that a progenitor cell would take during its differentiation process, we reconstructed the developmental trajectory of spermatogenesis using monocle (Trapnell et al., 2014), which uses a graph-based minimum-spanning tree to align cells along an inferred path called pseudotime (Figure 3A, Figure 3—figure supplement 1). Pseudotime does not correspond to the actual timing of developmental processes; rather, it is a roadmap of cell differentiation as a function of transcriptomic changes. As an initial step to verify the accuracy of our pseudotime map, we plotted the number of UMIs detected as a function of pseudotime as a proxy of RNA content throughout spermatogenesis (Figure 3B). We saw that the number of UMIs starts fairly low, increases dramatically, and then decreases towards the end of pseudotime. This is consistent with the known post-meiotic downregulation of most transcription during spermatogenesis (Barreau et al., 2008).

Figure 3 with 1 supplement see all
Pseudotime approximates the developmental trajectory of spermatogenesis.

(A) We aligned every cell from our testis sample along an unsupervised developmental trajectory. From the expression of marker genes, we found somatic cells (blue) which were forced onto the developmental trajectory. For further analysis we disregard this branch (See Materials and methods, Figure 3—figure supplement 1). Spermatogenesis begins at the far-left end of the trajectory. (B) The relative RNA content per cell peaks in mid-spermatogenesis, and declines during spermatid maturation, as approximated by the number of UMIs detected per cell. The number of genes expressed declines as well. The black line is a Loess-smoothed regression of the data, which should be thought of as a general trend among stochastic data and not a mathematical model. (C) Loess-smoothed expression of marker genes along the red germ cell lineage assigned in panel A. Along this lineage, the relative expression of marker genes is consistent with their temporal dynamics inferred from previous work. (D) Fixed de novo genes show a variety of expression patterns, including biphasic, early-biased, and late-biased. (E) Segregating de novo genes are often biased towards early/mid spermatogenesis.

https://doi.org/10.7554/eLife.47138.007

By plotting inferred gene expression in every cell as a function of pseudotime, we approximated the behavior of individual genes throughout spermatogenesis. Marker genes consistently show a similar profile in pseudotime and the Seurat analysis. For instance, bam, vas, and His2Av enrichment denote the beginnings of spermatogenesis, and fzo and twe denote early and late spermatocytes, respectively (Figure 3C). Confident that our calculated pseudotime is an accurate representation of spermatogenesis, we proceeded to use it to query how the expression of de novo genes changes throughout spermatogenesis.

If a given novel gene is functional, one would expect it to be biased towards meiotic cells, since germline stem cell-specific genes may not undergo long-term and recurrent positive selection (Choi and Aquadro, 2015). If these genes confer limited beneficial effects, we predict that they may show stochastic transcription pattern in a large variety of cell types. Consistent with our predicted expression patterns of functional novel genes, we found that a large number of de novo genes are expressed specifically in a stage-biased manner, with a significant bias towards meiotic germ cells. Fixed annotated de novo genes show a variety of expression patterns over pseudotime (Figure 3D), with some showing bias towards early stages (CG44174), some with a bimodal expression pattern (CG44329), and some biased towards late spermatogenesis (CR44412). The top five most differentially expressed segregating de novo genes show a variety of expression patterns, but four of the five are biased towards early/middle pseudotime (Figure 3E).

Gene age and mode of origination affects gene expression bias across cell types

Our prior observation that many de novo genes are enriched in GSC/early spermatogonia led us to ask whether the expression patterns of de novo genes differ from the expression patterns of other genes. Although individual de novo genes show a variety of expression patterns, we found that, compared to testis-specific genes, segregating de novo genes are less expressed in germline stem cells (p.adj = 9.35E-04) and slightly enriched in early spermatids (p.adj = 1.70E-02) (Figure 4A, Table 1). By contrast, the scaled expression of fixed de novo genes is not statistically different from that of testis-specific genes (Figure 4B, Table 1). These results suggest that cell-type expression patterns may impact the likelihood that a de novo gene will reach fixation.

Figure 4 with 1 supplement see all
Expression bias of young genes.

Spermatogenesis starts at GSC, early spermatogonia and proceeds rightward. (A) The scaled expression distribution of segregating de novo genes in each cell type, compared with the distribution of every other gene and testis-specific genes. For every gene, 0 is its mean scaled expression in a cell type, and the Y axis corresponds to Z scores of deviations higher or lower than that mean value. Asterixis represent Hochberg-corrected p values. The color of the asterixis indicates which gene set is being compared to de novo genes, and their placement above or below the boxplots indicates that gene set’s relationship (higher or lower) to de novo genes. By this measure, de novo genes are biased downwards in early spermatogenesis and upwards in early spermatids. (B) The scaled expression patterns of fixed de novo genes are typical of testis-specific genes. (C) The scaled expression of detected fixed de novo genes across pseudotime (left to right), clustered by monocle’s plot_pseudotime_heatmap function. While most de novo genes are biased towards intermediate cell-types, a small portion of de novo genes are most expressed during early and late spermatogenesis. (D) The scaled expression of melanogaster-specific duplicate genes over pseudotime. Despite being a similar evolutionary age to fixed de novo genes, young duplicate genes are more likely to be biased towards early and late spermatogenesis.

https://doi.org/10.7554/eLife.47138.009
Table 1
Adjusted p values and direction of bias for gene expression biases of selected gene groups in germ cells.

Spermatogenesis progresses downward from GSC/Early spermatogonia and ends in late spermatids. Upwards arrows indicate that the top group of genes is biased upwards compared to the bottom group, and downwards arrows indicate that it is biased downward according to a directional Hochberg test. For example, ribosomal proteins are more expressed in late spermatogonia than all other genes, with an adjusted p value of 1.24E-75. Note that while segregating de novo genes are expressed differently from testis-specific genes in GSC, early spermatogonia and early spermatids, fixed de novo genes do not significantly deviate from expression patterns of testis-specific genes in any cell type.

https://doi.org/10.7554/eLife.47138.011
Versus:Ribosomal protein genesSegregating de novo genesFixed de novo genesDNA repair genes
All other genesTestis-specific genesAll other genesTestis-specific genesAll other genesTestis-specific genesAll other genesTestis-specific genes
GSC, early spermatogonia↑ 1.13E-82↑ 1.44E-84↓ 1.46E-21↑ 9.35E-04↓ 2.92E-22ns↑ 4.69E-26↑ 8.14E-62
Late spermatogonia↑ 1.24E-75↑ 1.62E-74↓ 8.22E-19ns↓ 5.89E-18ns↑ 2.30E-20↑ 5.80E-44
Early spermatocytes↓ 2.53E-76↓ 1.08E-71↑ 4.08E-15ns↑ 5.31E-13ns↓ 6.50E-23↓ 1.75E-38
Late spermatocytes↓ 2.51E-57↓ 1.90E-58↑ 1.09E-10ns↑ 2.17E-15ns↓ 3.96E-09↓ 4.58E-29
Early spermatids↓ 8.94E-03↓ 1.57E-08ns↓ 1.70E-02nsns↑ 5.89E-08ns
Late spermatids↓ 7.40E-10↓ 1.70E-02nsnsnsnsnsns

We also asked whether this spermatocyte-biased expression is driven by segregating or fixed de novo genes. We quantified gene expression bias for segregating and fixed de novo genes separately and found that both groups of genes display the same direction of bias and a similar degree of statistical significance in every cell type (Table 1, Supplementary file 3). These results suggest that cell-type expression patterns do not impact the likelihood that a de novo gene will reach fixation, rather, the function and fitness effect may play an important role in the process of fixation.

Given a general trend for meiotic enrichment of de novo genes, we asked what proportion of de novo genes exhibit this pattern. Across pseudotime, we qualitatively estimated the relative expression biases of all de novo genes we could detect from Zhao et al. (2014). Overall, 55% of segregating and 62% of fixed genes were biased towards middle stages (spermatocytes), and 11% of both segregating and fixed genes showed high expression bias toward later stages (spermatids). Surprisingly, 29% of segregating genes and 26% of fixed genes showed high expression bias toward early stages (GSC and spermatogonia) (Figure 4C). While many segregating de novo genes are highly expressed in early spermatogenesis, our results from Figure 4A suggest that as a group they are less expressed than typical testis-specific genes in GSC and early spermatogonia. This variety of expression patterns in young de novo genes indicates functional diversification in short evolutionary timescales.

Given that de novo genes, like typical testis-specific genes, are usually maximally expressed during meiosis, we asked if the expression dynamics of recently duplicated genes, another class of young genes, are similar (Figure 4D). Using a list of D. melanogaster-specific ‘child’ genes and their parental copies (Zhou et al., 2008), we queried expression of the parental and derived copies of duplicated genes over pseudotime (Figure 4—figure supplement 1). We classified gene expression patterns into ‘early’, ‘late’ ‘middle’ and ‘bimodal’ for each group. Only 2/14 ‘child’ genes whose expression could be detected in testis had the same expression pattern as their parental copy, indicating that most derived gene copies are regulated by different mechanisms than their parental copy. All parental genes exhibited an early or late expression pattern, but child genes were a mixture of early, late, middle and bimodal expression patterns. (Table 2, Figure 4—figure supplement 1).

Table 2
Frequency of pseudotime expression patterns for melanogaster-specific fixed de novo genes and melanogaster-specific duplicate genes.

For genes in Figure 4C and D, we counted the number of genes showing a strong bias for early pseudotime, late pseudotime, mid-pseudotime, or a bimodal expression pattern. Fixed de novo genes are most frequently biased towards mid-pseudotime and the plurality of melanogaster-specific child duplicate genes show a bimodal expression pattern. Pseudotime expression plots of the parent-child duplicate gene pairs used in this analysis are in Figure 4—figure supplement 1. Proportions are rounded to two decimal places and may not add up to 1.

https://doi.org/10.7554/eLife.47138.012
PatternFixed de novo
proportion
Parental duplicate
proportion
melanogaster-specific
child duplicate
proportion
Early0.260.370.14
Mid0.620.000.21
Late0.110.630.21
Bimodal0.010.000.43

Bimodal expression (a peak in early and middle/late stages) is the most frequent expression pattern for child genes (43%), a pattern we did not observe for any parental genes. It is possible that these bimodal genes were originally expressed with the same pattern as their parental copy and later acquired expression in a different stage, consistent with neofunctionalization (Ding et al., 2010; Lynch and Conery, 2000).

We also observed strikingly different expression patterns for young genes depending on their mode of origination (duplication vs. de novo). To compare young genes of a similar age group, we quantified expression patterns for fixed melanogaster-specific de novo genes from Zhao et al. (2014), and melanogaster-specific gene duplicates from Zhou et al. (2008). We found that fixed de novo genes are most frequency biased towards mid-spermatogenesis (Table 2), and melanogaster-specific duplicate genes are most commonly bimodally expressed. This result indicates that a gene’s expression pattern is influenced by its mode of origination. De novo genes often build regulatory sequences from scratch, but young gene duplicates may co-opt flanking promoter and enhancer sequences from their parental copy.

Mutational load decreases throughout spermatogenesis

Since evolutionary innovations largely depend on novelties occurring at the DNA sequence level, we asked if the mutational load of germ cells varies during the process of sperm development. From our single-cell RNA-seq data, we identified 73 high-confidence substitutions that likely arose de novo. While the reference allele for every variant was present in somatic cells, the variant form of each of them was exclusively found in germ cells, and each inferred substitution is unlikely to be an RNA editing event or unrepaired transcriptional error (see Materials and methods). These substitutions were not present in population-level genome sequencing or previous whole-tissue RNA-seq of RAL517 testis, and the variant form of each substitution was also not present in any of our 3 types of somatic cells. We observed several instances of tightly clustered substitutions (<20 bp apart) present in the same cells, which we interpreted as single mutational events (Supplementary file 2, Figure 5—figure supplement 1). These substitution clusters could be the result of replicative errors resulting from the misincorporation of bi-nucleotides or multi-nucleotides, or due to the recruitment of an error-prone repair pathway at a double-strand break or bulky lesion. After counting clustered mutations as one mutational event, we obtained 44 mutational events present in one or more cell types (Figure 5A).

Figure 5 with 1 supplement see all
Abundance of putative de novo germline mutations.

(A) For every cell type, the total number of high-quality polymorphisms identified. Out of 2590 candidate variants, we excluded all substitutions that could be found in any somatic cell, leaving 73 variants. We then counted clustered polymorphisms as single mutational events and removed variants that could have resulted from RNA editing. See Materials and methods for details. (B) Dividing the number of polymorphisms in a cell type by the number of cells of that type, and the number of bases covered with at least 10 reads in that cell type (Supplementary file 5) yields an approximate relative substitution frequency for each cell type. By this metric, substitutions are most prevalent in early spermatogenesis, and decrease in relative abundance during spermatid development. This could be due to the apoptosis of mutated cells, or the systematic repair of DNA lesions during spermatogenesis. (C) The proportion of cells of each type with at least one identified germline lesion. Error bars are the 95 percent confidence intervals for each proportion. A Chi-square test for trend in proportions gives a p value of 2.20E-16, indicating strong evidence of a linear downward trend. (D) DNA repair genes are generally biased towards early spermatogenesis, statistically enriched compared to the distribution of all other genes. (Wilcoxon adjusted p value < 0.05).

https://doi.org/10.7554/eLife.47138.013

Putative de novo mutations are each likely unique to an individual. If a mutation were found in multiple individuals, it would likely be an inherited somatic variant and we would catch such mutated alleles in somatic cells. For each of the mutations, we identified reads from somatic cells with the WT allele at that position, and the mutated allele is only present in germ cells. Each variant is also supported by multiple germ cell reads with different UMIs.

To approximate per-base mutation load of each cell type, we accounted for two factors. Firstly, we are more likely to call a mutation event in more abundant cell types. Secondly, we are only able to detect mutation events in transcribed regions, so cells with a larger breadth of transcribed regions will likely yield more events. To control for these variables, we calculated the approximate per-base mutation load of each cell type by dividing the number of detected substitutions by the number of cells and the number of bases covered by 10 or more reads in that cell type, finding a decrease in the relative abundance of substitutions during the progression of spermatogenesis (Figure 5B).

Importantly, while we detected 30% (22/73) of inferred germline substitutions in early spermatids, we detected no germline variants in late spermatids. This means that either 1) most lesions are corrected by this stage, or 2) cells with lesions were removed by programmed cell death, or 3) that we captured insufficient quantities of mature spermatid mRNA to detect remaining variants. Although we found that early and late spermatids have similar RNA content, the low abundance of late spermatids makes either explanation possible. Since we observed a steady downward trend of mutation abundance during the progression of spermatogenesis, it is reasonable to infer that late spermatids have a mutational burden equal to or less than that of early spermatids. We counted the number of cells of each type carrying mutations throughout spermatogenesis. We observed that the relative proportion of cells carrying mutations drops consistently throughout spermatogenesis (Figure 5C), indicating that mutational load decreases during spermatogenesis. A chi-square test of the trend in proportions shows that the relative numbers of mutated cells follow a linear trend (p value = 2.20E-16). This result is highly statistically significant and lends credence to our other observations of dwindling mutational load during spermatogenic progression. This trend could be the result of active lesion repair, or the death of cells carrying unrepaired lesions.

DNA repair genes and ribosomal protein genes show an early expression bias

We asked whether two key programs, DNA repair, and translation, show signatures of expression bias during spermatogenesis. We hypothesized that both programs are critical to the production of healthy spermatids, which must undergo heavy periods of growth and division without accumulating mutations.

Ribosomal protein genes appear to be strongly biased towards early spermatogenesis (Table 1, Supplementary file 3). Compared to testis-specific genes or all other genes, they are upregulated in GSCs/early spermatogonia and late spermatogonia, and downregulated in early spermatocytes, late spermatocytes, early spermatids, and late spermatids. Our results indicate that translation is required at the very beginning of spermatogenesis, possibly to build cellular machinery during a period of rapid growth and division. Interestingly, recent studies suggest that ribosomes play an important role in regulating stem cell fate and homeostasis (Nagy et al., 1993; Turner, 2008). The observed abundance of those ribosome protein genes is consistent with ribosome loading playing an important role in stem cell differentiation and germ cell differentiation. Translation is important for later spermatogenesis as well, and our results indicate that the ribosomal machinery may be built early and stored for use in later developmental stages.

We hypothesized that since replication and transcription are very active in early spermatogenesis, DNA repair-gene expression may be biased towards early spermatogenesis. We quantified the expression pattern of 211 DNA repair genes in the testis (DNA repair genes were taken from Svetec et al., 2016). We found that, compared to testis-specific genes, DNA repair genes were upregulated in GSCs/early spermatogonia and late spermatogonia (corr. p values = 8.14E-62, 5.80E-44, respectively), and depleted in early and late spermatocytes (corr. p values = 1.75E-38, 4.58E-29, respectively) (Figure 5D, Table 1). We reason that DNA repair genes are transcribed early because the DNA repair machinery must be assembled early in order to repair mutations as soon as they occur.

Discussion

Our findings provide an unprecedented perspective on evolutionary novelty within the testis. We have developed a simple but robust method to quantify gene expression bias in a cell-type specific manner in single-cell data. It revealed the presence of expression biases in DNA repair genes, segregating de novo genes, and other gene groups. Zhao et al. (2014) characterized de novo genes as lowly expressed from bulk RNA-seq data, but our data demonstrate that de novo genes show various expression patterns among all cell types and are commonly expressed in spermatocytes. Our other observation that segregating de novo genes exceed the expected post-meiotic expression of testis-specific genes is also intriguing. One possibility is that some de novo transcripts may escape RNA degeneration and have a long lifespan after meiosis. Over time, if the products of de novo transcripts are selected and modified by natural selection, the regulatory sequences and resulting expression pattern will be refined. Since fixed de novo genes show similar scaled expression patterns to older testis-specific genes, it is possible that certain expression patterns common to older testis-specific genes increase the likelihood of a segregating gene reaching fixation, or that many of the de novo genes function similarly to testis-specific genes.

Our finding that young duplicated genes have different expression patterns than de novo genes merits further study. Young duplicated genes are more likely to be bimodally expressed than de novo genes of a similar age. While it appears that de novo genes are often highly expressed during meiosis, many duplicated genes display a bimodal expression pattern. While de novo genes may have relatively simple minimal promoters, young duplicated genes may maintain much of the regulatory sequence of their parent copy. However, since only 2 out of 14 melanogaster-specific child duplicates have the same general expression pattern as their parent genes, it seems that their regulatory sequences are often modified to produce alternative expression patterns. This observation suggests that some young duplicated genes have relaxed selective pressure to perform the ancestral function, allowing for neofunctionalization.

We also developed a method to extract mutational information from single-cell RNA-seq data, which can provide information about germline or de novo DNA lesions present in a sample without the need for DNA sequencing. While our method cannot identify variants in untranscribed regions, introns, or sense strands, our method approximates the relative mutational load of different cell types in a sample. Variation in RNA content between cell types may decrease our power to detect substitutions in less transcriptionally-active cells such as late spermatids, although our calculated mutational load in Figure 5B accounts for this. Despite the lack of data for late spermatids, our results suggest that many errors are at least partially repaired before the completion of spermatid maturation. Alternatively, cell death could have removed mutated cells before spermatid maturation if a lesion could not be repaired. Our data cannot distinguish between the death of mutated cells and successful repair of lesions.

The variants we have found are either DNA lesions that have escaped repair, or the lesions that have been selected through competition among cells (Loewe and Hill, 2010). It is likely that some of these substitutions would result in inviable offspring and would never be observed in an adult population. Our result suggests that mutational load varies between different cell divisions, consistent with previous work that suggests a variable lesion rate between cell types (Gao et al., 2011; Gao et al., 2014). Mutational load is the net product of damage and repair, and further characterization of how lesions occur and accumulate in the germline is needed to better understand the evolutionary ramifications of this process (Moorjani et al., 2016).

It appears that the cell types with the highest mutational load are germline stem cells/early spermatogonia, the earliest germ cells. This indicates that by the time germline stem cells enter spermatogenesis, they carry a relatively high mutational burden. This could be due to the fact that germline stem cells cycle many times, dividing asymmetrically to produce spermatogonia and a replenished germline stem cell. This cycle, repeated enough times, could cause a buildup of variants in germline stem cells as they continue to produce spermatogonia. Such a scenario would necessitate a mechanism to remove high level of lesions from maturing gametes. This mechanism must be an equilibrium removing enough lesions to prevent the accumulation of harmful phenotypes. However, the population-level mutation rate never reaches zero, otherwise adaptive evolution will cease (Lynch, 2010).

Since we observed that the transcription of 211 DNA repair genes drops during meiotic stages, we suspect that DNA repair gene products are translated early and continue to repair lesions throughout spermatogenesis. After meiosis, however, the gametes become haploid, and there is no longer a template strand to facilitate homology-directed repair. This should constrain the types of DNA damage repair available in late spermatogenesis. It is also important to note that male Drosophila do not undergo meiotic recombination, meaning that the DNA repair events that occur during spermatogenesis are likely due to replicative or transcriptional forces, not recombination. Transcription-coupled repair during spermatogenesis is apparent in mouse and humans, as variants on the template strand and the coding strand of testis-expressed genes are asymmetrical (Xia et al., 2018). Our finding that the number of variants decreases throughout spermatogenesis is consistent with the results of Xia et al., who posit a generalized genomic surveillance function of spermatogenesis. Future work should use single-cell genome sequencing on FACS-purified subpopulations of testis cells to identify germline variants and calculate their relative abundance. Additionally, our work necessitates comparison of the relative mutational burden of older flies to younger flies. If DNA lesions accumulate in cycling germline stem cells over time, spermatogenic mutational surveillance may less efficiently compensate for more lesions in sperm from older individuals. Our result indicates that cell-type specific mutational load can be estimated from single-cell RNA-seq data with reasonable accuracy. Overall, we provide novel insights into the dynamics of mutation, repair, and de novo gene expression profiles in the male germline.

Materials and methods

Key resources table
Reagent type
(species) or resource
DesignationSource or
reference
IdentifiersAdditional
information
Strain, strain background (Drosophila melanogaster, male)RAL517Mackay et al., 2012BDSC:25197
Commercial assay or kit10X chromium
3' kit V2
10X genomics10X genomics product number CG00052
Chemical compound, drugGibco Collagenase, type IThermo FisherThermo Fisher catalog number 17018029
Chemical compound, drugTrypsin LEThermo FisherThermo Fisher catalog number 12605036
Software, algorithmCellranger10X genomics
Software, algorithmHisat2PMID:25751142
Software, algorithmStringtiePMID:25690850
Software, algorithmSeuratSatija et al., 2015
Software, algorithmbcftoolsPMID:28205675
Software, algorithmsamJDKLindenbaum and Redon, 2018
Software, algorithmMonocleTrapnell et al., 2014

Preparation and sequencing of testis single-cell RNA-seq libraries

Request a detailed protocol

We used 2- to 3- day-old DGRP-RAL517 flies in this study (Mackay et al., 2012). Testes from 50 male flies were dissected in cold PBS. The resulting 50 testes were de-sheathed in 200 μl of lysis buffer (Trypsin LE + 2 mg/ml collagenase). The samples were incubated in lysis buffer for 30 min at room temperature with gentle vortex mixing every 10 min. The samples were filtered through a 30 μm tissue culture filter followed by a 7 min centrifugation at 1200 rpm. The cells were washed with 200 μl of cold HBSS and pelleted again for 7 min at 1200 rpm. The resulting cell preparation was re-suspended in 20 μl of HBSS before further processing. For cell counting, 5 μl of the single cell suspension were mixed with 5 μl of the exclusion dye trypan blue and the total cell number as well as the ratio between live and dead cells were analyzed using an automated cell counter (Logos Biosystems). For imaging, 15 μl of the cell suspension were transferred to a slide and imaged in a Zeiss upright light microscope. This method yielded high numbers of single cells with an average of 93–96% viability. We then submitted 8000 cells (sequenced 5000 cells) for library preparation with the 10X Genomics chromium 3’ kit, followed by sequencing with Illumina Nextseq 98 bp paired-end chemistry.

Preparation of custom annotation file for de novo gene analysis

Request a detailed protocol

We analyzed de novo genes identified in Zhao et al. (2014), by converting the gene coordinates to D. melanogaster version six reference genome with FlyBase coordinates converter. Strand data and splicing information is not present for those reference genes, so we chose to proceed only with genes whose expression could be detected in our D. melanogaster testis single-cell sequencing data. Using whole-tissue RNA-seq data from multiple strains of Drosophila testis, we used Stringtie merged to create a merged transcriptome GTF containing unannotated transcripts and used BLAST to compare the novel transcripts against converted coordinates for the Zhao et al. (2014) genes. For genes with a match between the converted 2014 coordinates and the new merged transcriptome, we added the coordinates from the merged GTF to the FlyBase dmel_r6.15 reference GTF. Since a single-exon de novo genes could be on either strand, we created a plus and minus strand version of every verified de novo gene. Our custom annotation file thus contains all the standard FlyBase dmel_r6.15 genes, plus a set of assembled transcripts known to correspond to de novo genes.

Our study only seeks to analyze previously characterized de novo genes, and will inherit the limitations of identification of de novo genes using bulk RNA-seq data. Zhao et al. (2014), the source paper for these segregating and fixed de novo genes, detected de novo genes from bulk testis RNA-seq of multiple D. melanogaster strains, meaning that de novo genes that are enriched in a rare cell type may not be counted as de novo genes if their expression in the whole tissue does not reach a certain threshold. Despite this possibility we still observe many de novo genes with maximum expression in rare cell types such as germ line stem cells and spermatogonia.

Quantification of reproducibility

Request a detailed protocol

If single-cell suspension results to relatively unbiased ratios of cell types compared to in vivo cell types, one would expect a relatively high correlation of single-cell RNA-seq and bulk RNA-seq data. To verify this, we aligned the single-cell RNA-seq reads and bulk RNA-seq reads of RAL517 separately to the reference genome using Hisat2, calculated gene TPMs with Stringtie, and then used DEseq2 to regularized-log transform the TPM values from both datasets. After that, we plot the correlation of normalized gene expression and calculated the Pearson’s R (Figure 1—figure supplement 2A). Using the same method, we also compared our dataset to a second single-cell library prepared from testis of a wild D. melanogaster strain from our lab (Figure 1—figure supplement 2B).

Processing of single-cell data

Request a detailed protocol

Illumina BCL files were converted into fastq files using Cellranger mkfastq. A reference genome was created with Cellranger mkref, with all genes from the FlyBase D. melanogaster reference. To this reference, we added all segregating and fixed de novo genes from Zhao et al. (2014). We used the custom reference to run Cellranger count, which demultiplexed the single cell reads into a usable format for Seurat. Going forward, we kept all genes expressed in at least 3 cells and all cells with at least 200 genes expressed. We ran Seurat ScaleData and NormalizeData with default parameters. According to the Seurat documentation, ‘Feature counts for each cell are divided by the total counts for that cell and multiplied by the scale.factor (default = 10,000). This is then natural-log transformed using log1p.’ We then ran Seurat’s default t-SNE function and found clusters based on the first nine principal components (resolution = 2). Of the parameters we tried, most produced a similar t-SNE clustering pattern, but nine principal components generated the best separation between different cell types.

Identification of cell types in single-cell RNA-seq data

Request a detailed protocol

We used marker genes to infer the predominant cell type within each cluster in Seurat. Aubergine (aub) is a marker of germline stem cells (Rojas-Ríos et al., 2017), and Bag of Marbles (bam) is a marker of spermatogonia (Kawase et al., 2004). A cluster enriched in vasa (Ohlstein and McKearin, 1997) and bam but not aub was annotated as late spermatogonia. Clusters most enriched in fuzzy onions (fzo) were inferred to be early spermatocytes (Hwa et al., 2002), and clusters with enrichment of twine (twe) but not fzo were inferred to be late spermatocytes (Courtot et al., 1992). The literature is clear that transcription of fzo and twe peaks in spermatocytes, but it is less clear which marker denotes early and late spermatocytes, respectively. To resolve this ambiguity, we used monocle (Trapnell et al., 2014) to align our cells on a developmental trajectory called pseudotime (rho = 68, delta = 5, ordered using the top 1000 differentially expressed genes). We found that twe expression peaked later in spermatogenesis than fzo, and concluded that clusters expressing twe but not fzo were late spermatocytes. Epithelial cells were defined based on enrichment of MntA and Hsp23, Hub cells were defined based on Fas3, and Cyst cells were defined by enrichment of zfh1 (Zhao et al., 2010). Late spermatids were marked by p-cup, a post-meiotically transcribed gene.

Analysis of the spermatogenic developmental trajectory

Request a detailed protocol

The adult testis contains both somatic and germ cells, but lacks the common progenitor cells for each lineage. Therefore, when constructing a lineage tree for all cells in our tissue, we would expect to see a separate branch containing somatic cells erroneously branching from somewhere along the inferred lineage of more common germ cells. In the somatic cell branch from Figure 3A, MtnA is enriched (Figure 3—figure supplement 1), leading us to infer that this state is mainly somatic cells. As such, we ignored this branch for our analysis of gene expression during germ cell development in Figure 3. One should not interpret this result as evidence that somatic cells in the testis arose from germ cell progenitors, rather, this is a consequence of Monocle’s algorithm that forces a minimum spanning tree for all cells in a sample, regardless of their real cell-type of origin. Since the original common progenitor for the germ lineage and somatic testis cells is not present in adult tissues, Monocle placed the somatic cells to their closest germ cell neighbors, in this case late germ cells. As shown in Figure 3, there is a gap between the group of somatic cells and the tightly clustered lineage of germ cells, indicating that the cells are indeed from a different lineage.

To construct the trajectory, we used the following parameters:

  • >reduceDimension(max_components = 2, num_dim = 3, norm_method = 'log', reduction_method = 'tSNE')

  • >clusterCells(my_cds, rho_threshold = 55, delta_threshold = 10, frequency_thresh = 0.1)

To order the cells, we used the top 1000 genes with the highest q value of being differentially expressed between clusters.

Calculation of cell-type bias of gene groups

Request a detailed protocol

Testing whether gene expression is biased across cell types requires overcoming two challenges. Firstly, different cell types have varying levels of RNA and global transcription, so it is important to account for the behavior of other genes in a cell type when calculating expression bias of a group of genes. Additionally, the calculated expression values for different groups of genes will vary by orders of magnitude. Expression values must be scaled across the dataset on a per-gene basis, with 0 representing a gene’s mean expression across the tissue, and positive or negative values corresponding to the Z-score of a calculated expression value. To address the confounding effect of global variation in gene expression, we compared groups of genes against all other genes within a cell type, and asked if some groups of genes behave as outliers in a given cell type. For de novo genes, we compared the scaled, average expression of putative de novo genes to every other gene within a cell type using a signed Wilcoxon test (Wilcoxon, 1945).

For groups of genes (e.g. de novo genes, DNA repair genes), we asked whether their scaled expression distribution in a cell type was statistically different from that of other genes. For every gene, we calculated its average scaled expression within each cell type, and then performed a Wilcoxon signed test to determine if the mean scaled expression of genes in the cell type was statistically higher or lower than all other genes in the same cell type. For each gene group and cell type, we adjusted the resulting p-values with Hochberg’s correction (Haynes, 2013). This shows the direction and statistical significance of each cell-type specific bias of a gene group. For the raw and adjusted p values of every gene group tested, please refer to Supplementary file 3. For germline cells, the direction of bias and adjusted p values are given in Table 1. Gene lists used are in Supplementary file 6.

Calculation of base substitution rate for individual cells

Request a detailed protocol

Using the demultiplexed, aligned reads generated by Cellranger, we ran bcftools mpileup (Narasimhan et al., 2016) with a minimum quality cutoff of 25 to find nucleotide polymorphisms from our RNA-seq data. We filtered the calls to exclude variants known to be segregating in populations of D. melanogaster DGRP-RAL517 (Mackay et al., 2012). We also filtered the variant calls against a D. melanogaster DGRP-RAL517 population genome dataset we generated recently. We also excluded variants whose read coverage for the reference allele was less than 10. With the remaining 2590 polymorphisms, we used samjdk (Lindenbaum and Redon, 2018) to extract reads containing the variant allele and match the cell barcode to the cell identities from our Seurat analysis. To remove variants that likely arose prior to the collection of this data, we excluded variants found in somatic cells (hub, cyst, epithelial cells). The numbers of variants remaining after each filtering step is given in Supplementary file 4.

We found a number of substitutions clustered together in close proximity and expressed in the same cells (Figure 5—figure supplement 1). We treated these clusters as single mutational events to avoid biasing our calculated mutational abundance. After counting the total variants detected within each cell type, we subtracted polymorphisms found within 10 bp of each other in the same cells so that each cluster of variants counted as one mutation event. To approximate cell-type specific substitution rate, the number of mutational events detected in each cell type was divided by the number of cells and the number of bases covered by at least 10 reads by all cells of a type using samtools. Number of cells, mutational events and covered bases are given in Supplementary file 5.

To ensure that our inferred mutations are not uncorrected transcriptional errors, we made sure each variant followed at least 2 of following criteria: (1) The alternate allele for most of our inferred mutations was found in multiple germ cells (but not somatic cells). A transcriptional error is unlikely to happen at the same position in multiple cells. (2) In every cell where a mutation was identified, the reference allele was either completely absent (possible homozygote) or present with as many or fewer reads as the alternate allele (possible heterozygote). (3) For substitutions found in only one cell, the alternate allele was present on multiple mRNA molecules (different UMIs). A transcriptional error is unlikely to produce the same change at the same position multiple times.

We performed the following steps to remove possible RNA editing events from our samples. Recurrent RNA editing events would be present in whole-tissue RNA-seq data, so we ran bcftools mpileup with the same parameters on whole-tissue testis RNA seq data of D. melanogaster RAL 517. Four of our seventy-seven inferred germline variants were present in the whole-tissue data, so we removed them for downstream steps. The final list does not show a high level of A > G substitution, which would be expected from RNA editing (Tan et al., 2017).

Calculation of the proportion of mutated cells, by type

Request a detailed protocol

We manually checked every SNP with every cellular barcode within which the alternate allele was found. Using cellular identities that we inferred using Seurat, we counted the number of cells of each type containing at least one substitution. This number, divided by the total cells identified as that type, yields the proportion of mutated cells shown in Figure 5C and Supplementary file 5.

References

  1. 1
  2. 2
  3. 3
    Sperm Biology
    1. T Birkhead
    2. D Hosken
    3. S Pitnick
    (2008)
    Academic Press.
  4. 4
  5. 5
  6. 6
  7. 7
  8. 8
  9. 9
    The Drosophila cdc25 homolog twine is required for meiosis
    1. C Courtot
    2. C Fankhauser
    3. V Simanis
    4. CF Lehner
    (1992)
    Development 116:405–416.
  10. 10
  11. 11
  12. 12
  13. 13
  14. 14
    Spermatogenesis
    1. MT Fuller
    (1993)
    In: M Bate, A Martinez-Arias, editors. The Development of Drosophila melanogaster. Cold Spring Harbor Laboratory Press. pp. 71–148.
  15. 15
  16. 16
  17. 17
    Holm’s Method
    1. W Haynes
    (2013)
    In: W Dubitzky, O Wolkenhauer, K. -H Cho, H Yokota, editors. Encyclopedia of Systems Biology. New York: Springer. pp. 902–903.
    https://doi.org/10.1007/978-1-4419-9863-7_1214
  18. 18
  19. 19
  20. 20
  21. 21
  22. 22
  23. 23
  24. 24
  25. 25
  26. 26
  27. 27
    The population genetics of mutations: good, bad and indifferent
    1. L Loewe
    2. WG Hill
    (2010)
    Philosophical Transactions of the Royal Society B: Biological Sciences 365:1153–1167.
    https://doi.org/10.1098/rstb.2009.0317
  28. 28
  29. 29
  30. 30
  31. 31
  32. 32
  33. 33
  34. 34
    Cell-specific expression and heat-shock induction of hsps during spermatogenesis in Drosophila melanogaster
    1. S Michaud
    2. R Marin
    3. JT Westwood
    4. RM Tanguay
    (1997)
    Journal of Cell Science 110:1989–1997.
  35. 35
  36. 36
  37. 37
  38. 38
  39. 39
    Ectopic expression of the Drosophila Bam protein eliminates oogenic germline stem cells
    1. B Ohlstein
    2. D McKearin
    (1997)
    Development 124:3651–3662.
  40. 40
  41. 41
  42. 42
  43. 43
  44. 44
  45. 45
  46. 46
  47. 47
  48. 48
  49. 49
  50. 50
  51. 51
  52. 52
  53. 53
  54. 54
  55. 55
  56. 56
    Visualizing high-dimensional data using t-sne
    1. LJP Van Der Maaten
    2. GE Hinton
    (2008)
    Journal of Machine Learning Research 9:2579–2605.
  57. 57
  58. 58
  59. 59
  60. 60
  61. 61
  62. 62

Decision letter

  1. Christian R Landry
    Reviewing Editor; Université Laval, Canada
  2. Patricia J Wittkopp
    Senior Editor; University of Michigan, United States
  3. Helen White-Cooper
    Reviewer; Cardiff University, United Kingdom

In the interests of transparency, eLife includes the editorial decision letter and accompanying author responses. A lightly edited version of the letter sent to the authors after peer review is shown, indicating the most substantive concerns; minor comments are not usually included.

Thank you for submitting your article "Mutational and transcriptional dynamics across Drosophila spermatogenesis at the single-cell level" for consideration by eLife. Your article has been reviewed by three peer reviewers, one of whom is a member of our Board of Reviewing Editors, and the evaluation has been overseen by Patricia Wittkopp as the Senior Editor. The following individual involved in review of your submission has agreed to reveal her identity: Helen White-Cooper (Reviewer #3).

The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission.

Summary:

This is a very interesting study that reports single cell mRNA sequencing on male Drosophila gonads to look at the spermatogenesis developmental program. The authors also examine the expression timing of young genes that evolved de novo or that duplicated recently. The reviewers were enthusiastic about the work but they raised major concerns that would need to be addressed. All of these concerns except one require only partial re-analysis et re-interpretation of the results. One of them concerns the integrity of the material subjected to single-cell sequencing. Although we understand that it is not possible to repeat the entire protocol, we agreed on the fact that a critical assessment of the staging and assignment to categories was required. We would also like to see a repeat of the cell preparation so that the reader can have more confidence in what they are seeing for different cell types.

Essential revisions:

1) For non-drosophilist readers, it is very important that the authors explain early on that male Drosophila do not typically undergo meiotic recombination. This has important implications for the interpretation of some of the results (especially the section on mutational load).

2) The single cell sequencing experiments were conducted on a pool of testes from 50 flies. We can appreciate that single cells are being profiled and this is some kind of within experiment replication, but this does not provide us with information on how reproducible the entire experiment is. The authors may have access to more data or independent data that would allow to demonstrate the reproducibility of the analyses. If it is the case, it would be useful to provide an analysis of such data. In addition, it is unclear to me how single-cell RNA-seq data is normalized (or not) and what the variance level is among cells. This seems like an important point that needs to be considered (whether the authors think this is a problem or not). If spermatocytes have overall higher transcriptional activity than other cell types (which could be the case considering more genes and mRNA molecular are detected, Figure 2A,B), this could in turn increase power to detect de novo genes, especially if these tend to be expressed at low levels (and affect some conclusions drawn, e.g. de novo genes to be expressed in higher proportion at the early spermatocyte stage, Figure 2C).

3) The identification of de novo mutations needs to be supported or discussed. Two reviewers were convinced but another one raised these points. It is surprising that many mutations are identified. Also, the analysis needs to be refined. The tissues examined are from a pool of flies. Are mutations from different individuals? 73 substitutions means about 1.5 substitution per fly since 50 flies are studied. Are mutations identified at one stage carried on to the next stage in the data? The per-base mutation rate would need to take into account the number of cell divisions, no? Are all stages going through cell division?

4) Could the authors make a prediction as to what active lesion repair vs death of cells carrying unrepaired lesions would look like in their data? I think the latter is more likely than the former given the observations (and that this is a very important and exciting result that could stand out more), because I predict that lesions would be difficult to transcribe (and perhaps lead to various nt misincorporation?), but I could be wrong. If correct, this result has important implications with respect to selective pressure within the testis.

5) The critical point of this paper in terms of determining whether the conclusions are supported by the data is the generation and interpretation of the single cell sequencing data, and one expert reviewer remains to be convinced by all of it. She would be much more reassured that the cells are healthy and intact if pictures were shown of them after the isolation, preferably after sorting. Is the sorting system capable of dealing with cells that are 1mm long? The RNA content of spermatids is reported to be low in this data set. It could be, but it could also be that these cells are not intact in the sample, and only a fragment of the cell has made it into the sequencing. Clearly recovery of late spermatids has been inefficient in this sequencing as there are only 84 such cells from the 5000 cells sequenced. About half of the germline cells in the testis are post-meiotic, so a much higher number of spermatids is to be expected. Similarly, what happens to the cyst cells after they have been dissociated? Their normal morphology is very flat (and concave, wrapped around the germline). The results suggest that these cells express many genes, but again "cells" in this class could also include fragments of other cell types – the relatively high expression of fzo, twe, soti, Dpy-30L2 and p-cup, all genes that are reported to be germline specific in this class is therefore of concern.

6) The integrity of the starting cell population probably does not majorly impact the most abundant cell types spermatogonia and early vs late spermatocytes, however I am not sure the transitions between these cell types are accurately annotated. – "We found that twe expression peaked later in spermatogenesis (546) than fzo, and concluded that clusters expressing twe but not fzo were late spermatids." No, both these transcripts are present in early-mid and late primary spermatocytes, indeed fzo is not translated until the cells are secondary spermatocytes. twe translation is somewhat earlier. These two genes have very similar expression patterns according to RNA in situ. The data for twe and fzo in particular in Figure 3 do not show a strong distinction. Most of the cells in the dataset have reasonable levels of both genes suggesting that most cells are really primary spermatocytes. It would be better to use a gene that is transcribed earlier in spermatocytes, and declines as spermatocytes mature, to distinguish these cell types – eg aly or rye (Taf12L, PC2 contributor from Figure 1—figure supplement 1), or the ribosomal proteins.

7) We do not dispute the pseudotime trajectory that is calculated, it is the boundaries between classes that we are not sure of. In Figure 3 the transition between state 1 and state 2 seems to be at the very end of differentiation, forced on the timeline by the artefactual branch of somatic cells. There are surely more real biological states in this series, and the actual boundaries could be more accurately mapped. E.g., the rapid decline in the number of transcripts present per cell at about 60% of pseudotime likely reflects the transition from spermatocytes through meiosis to early spermatid elongation – a phase at which many transcripts are degraded.

8) The class named as mature spermatids are probably not fully mature.

9) Figure 4 would be much easier to interpret if it also included z-score tracks on the pseudotime plot for the markers used in assigning cell type for reference (i.e., bam, vasa etc).

10) Figure 4—figure supplement 1 is not possible to interpret as shown. The order of genes appears to be different for the parental vs child sets. What this figure needs to show is each pair of genes and their expression in pseudotime, aligned with each other, not clustered within their own class.

11) Abstract. "the distribution of new genes across cell types" is a misleading phrasing. All cell types have all genes. We are not sure exactly what is meant here. Introduction " The testis is a highly heterogeneous tissue", is it really more heterogenous than other tissues (gut, brain, etc?).

12) Results paragraph five and the following: please explain on what basis these de novo genes were identified. Were they identified from transcript in the male gonads? If yes, they are not a random set of de novo genes and may be biased depending on how the mRNA data was collected in the original study. It would be more useful to identify de novo genes from the data itself and look at their age, population frequency, stage at which they are seen, etc. If the original paper just looked at bulk testis expression, the most abundant cell types in the extract will be overrepresented, biasing the discovery of de novo genes in the stages with more coverage. Since the main result of the paper relates to de novo genes, this part of the analysis needs to be strengthened.

13) Results paragraph six. The authors show that de novo genes are initially expressed in some specific cell types and they conclude that spermatocyte expression is an important step for de novo origination. Important is vague term and the reader does not understand what is being shown and supported.

14) Figure 2C: The authors write: "For every cell type except spermatocytes, segregating de novo genes are the least commonly expressed". It seems that segregating de novo genes are always the least expressed, including in the spermatocytes. Also, this analysis shows that in early spermatocytes, fixed de novo genes are more expressed that all other genes. This is an example of bias that I was mentioning in my second point. If this stage is more represented in the study that identified de novo genes, it is not surprising that the de novo genes studied here are more expressed there because they needed to be highly expressed to be discovered in the first place.

15) The comparison with gene duplicates needs more work. The authors need to make sure the duplicates are matched with the de novo genes in terms of their age and maybe also size since detectability by RNAseq could be biased towards longer transcripts. Finally, the mechanisms of duplication needs to be taken into account (RNA based gene duplication versus tandem duplication versus other mechanisms). Very young duplicates are also probably difficult to differentiate at the RNA level.

16) The high proportion of fixed de novo genes expressed in spermatocytes is not unexpected, and fits with the theory that spermatocyte chromatin is permissive for transcription. Most of these genes are testis-specifically expressed, and most testis-specifically expressed genes are expressed in spermatocytes. The "all genes" class includes genes that are expressed earlier in the germline and in somatic cells. If testis-specifically expressed old genes were analysed they would likely have the same pattern as the de novo genes.

https://doi.org/10.7554/eLife.47138.027

Author response

Essential revisions:

1) For non-drosophilist readers, it is very important that the authors explain early on that male Drosophila do not typically undergo meiotic recombination. This has important implications for the interpretation of some of the results (especially the section on mutational load).

Thank you for bringing up this point. A sentence clarifying this point has been added to the Introduction, and Discussion subsection “Gene age and mode of origination affects gene expression bias across cell types”.

2) The single cell sequencing experiments were conducted on a pool of testes from 50 flies. We can appreciate that single cells are being profiled and this is some kind of within experiment replication, but this does not provide us with information on how reproducible the entire experiment is. The authors may have access to more data or independent data that would allow to demonstrate the reproducibility of the analyses. If it is the case, it would be useful to provide an analysis of such data.

Two analyses demonstrating the high degree of reproducibility of our experiments were added in the new version of the manuscript. First, we checked the expression correlation between a bulk RAL517 RNA-seq data (Zhao et al., 2014) and the “bulked” RAL517 single-cell RNA-seq data (treating single-cell RNA seq data as bulk data), these data correlate with a Pearson’s R of 0.97, indicating that the single-cell sequencing data has accurately captured the testis transcriptome in a reproducible fashion. This suggest that although there may be bias for single-cell sorting, this bias is very small and may largely be biased against late spermatid and mature sperm. In addition, we correlated TPM’s from our sample with a separate library from a different D. melanogaster strain from our lab, also with a Pearson’s R of 0.97. As such, our data are robust with respect to biological and technical variation. This also suggest that although there might be some degree of batch effects for single-cell RNA-seq data, this is unlikely impact RNA expression pattern globally. This result was added to Figure 1—figure supplement 2.

In addition, it is unclear to me how single-cell RNA-seq data is normalized (or not) and what the variance level is among cells. This seems like an important point that needs to be considered (whether the authors think this is a problem or not).

To normalize our data, we used Seurat’s normalizeData function with default parameters. Regarding normalization, the program used log-normalization. Specifically, according to Seurat’s documentation, “Feature counts for each cell are divided by the total counts for that cell and multiplied by the scale.factor (default=10,000). This is then natural-log transformed using log1p.” This log-normalization and a subsequent scaling step ensure that higher transcriptional activity of a given cell type does not bias our ability to infer relative gene expression across cell types. We expanded the Materials and method.

If spermatocytes have overall higher transcriptional activity than other cell types (which could be the case considering more genes and mRNA molecular are detected, Figure 2A,B), this could in turn increase power to detect de novo genes, especially if these tend to be expressed at low levels (and affect some conclusions drawn, e.g. de novo genes to be expressed in higher proportion at the early spermatocyte stage, Figure 2C).

We acknowledge that our study only seeks to analyze previously characterized de novo genes, and will inherit the limitations of Zhao et al., 2014, the source paper for these segregating and fixed de novo genes. Zhao et al., 2014 detected de novo genes from bulk testis RNA-seq of multiple D. melanogaster strains, meaning that de novo genes that are enriched in a rare cell type may not be counted as de novo genes if their expression in the whole tissue does not reach a certain threshold. Despite this possibility we still observe many de novo genes with maximum expression in rare cell types such as germ line stem cells and spermatogonia. This shows that we are able to detect de novo genes from less abundant/transcriptionally active cells. In future studies, when the cost of single-cell sequencing comes down, we hope to use it to identify de novo genes that may be more cell-type specific than the genes currently known. These points are addressed in subsection “Preparation of custom annotation file for de novo gene analysis”.

3) The identification of de novo mutations needs to be supported or discussed. Two reviewers were convinced but another one raised these points. It is surprising that many mutations are identified. Also, the analysis needs to be refined. The tissues examined are from a pool of flies. Are mutations from different individuals? 73 substitutions means about 1.5 substitution per fly since 50 flies are studied. Are mutations identified at one stage carried on to the next stage in the data? The per-base mutation rate would need to take into account the number of cell divisions, no? Are all stages going through cell division?

Putative de novo mutations are each likely unique to an individual. If a mutation were found in multiple individuals, it would likely be an inherited somatic variant and we would catch such variant alleles in somatic cells. For each of the mutations, we identified reads from somatic cells with the WT allele at that position, and the mutated allele is only present in germ cells. Each variant is also supported by multiple germ cell reads with different UMIs.

Since the data is a snapshot of every cell in the testis at the same time, we would expect de novo variants to be present in a single cell type since cells from a common progenitor differentiate simultaneously (except one daughter cell from GSC division). In reality, we observe some variants present in adjacent cell types, reflective of the fact that transitions between these cell types are a continuous, not binary phenomenon.

We agree with the reviewer that “73 substitutions mean about 1.5 substitution per fly since 50 flies are studied”. However, because 73 substitutions are unlikely to reflect the total mutations in all 50 flies given that we only sequenced a small subset of the cells from the 50 flies, we think it would not be the most meaningful way to report the data. Additionally, some of the 77 positional variants are linked, leaving 44 total mutational events, not 73 (this is already accounted for in our previous analysis). We are hesitant to include the number of cell divisions in our calculated mutational load because based on the current resolution it is impossible to know during which cell division a variant arose, and some of our assigned cell types encompass multiple cell division cycles. We do agree with the reviewer that it is an interesting and important question to pursue in the future when the technique is advanced enough to provide such resolution. We have updated the manuscript to emphasize these points in subsection “Mutational load decreases throughout spermatogenesis” paragraph two.

4) Could the authors make a prediction as to what active lesion repair vs death of cells carrying unrepaired lesions would look like in their data? I think the latter is more likely than the former given the observations (and that this is a very important and exciting result that could stand out more), because I predict that lesions would be difficult to transcribe (and perhaps lead to various nt misincorporation?), but I could be wrong. If correct, this result has important implications with respect to selective pressure within the testis.

It is a very exciting and important question to consider, and we also hypothesize that the second model is likely. However, although we tried to identify differentially expressed cell-death related genes in the mutated cells, we were not able to distinguish these two models based on current knowledge. One of the major challenges is that there are not enough marker genes (especially cell-death related genes in Drosophila testis) to be used to provide a statistical power. We are not aware of other ways to distinguish between active lesion repair and death of cells carrying unrepaired lesions, since either will result in removal of mutated cells. In the future, we or other colleagues in the field might be able to do a lineage-tracing experiment to show that cells transcribing a certain transgenic marker acquire and lose mutations during spermatogenesis. To make this point clearer, we expanded it in the Discussion subsection “Gene age and mode of origination affects gene expression bias across cell types”.

5) The critical point of this paper in terms of determining whether the conclusions are supported by the data is the generation and interpretation of the single cell sequencing data, and one expert reviewer remains to be convinced by all of it. She would be much more reassured that the cells are healthy and intact if pictures were shown of them after the isolation, preferably after sorting.

High-quality data is the most important first step for this work. We appreciate the caution and the detailed point from the reviewer. In the original submission, we reported that the single cells with an average of 93-96% viability after sorting, this is a very high number of live cells for single-cell suspension. The other WT strain suspension yielded a similar viability of ~95%. To address her concern, we have included pictures of the single-cell suspension of the sequenced strains created with our protocol showing that cells are intact and form a range of sizes (Figure 1—figure supplement 1). A very important point about the figures is that cells in vitro do not have the same observed size as in vivo, because cells are rarely perfectly round in vivo. Without physical forces from nearby cells, the vast majority of the cells in vitro are round cells.

Is the sorting system capable of dealing with cells that are 1mm long? The RNA content of spermatids is reported to be low in this data set. It could be, but it could also be that these cells are not intact in the sample, and only a fragment of the cell has made it into the sequencing. Clearly recovery of late spermatids has been inefficient in this sequencing as there are only 84 such cells from the 5000 cells sequenced. About half of the germline cells in the testis are post-meiotic, so a much higher number of spermatids is to be expected. Similarly, what happens to the cyst cells after they have been dissociated? Their normal morphology is very flat (and concave, wrapped around the germline). The results suggest that these cells express many genes, but again "cells" in this class could also include fragments of other cell types – the relatively high expression of fzo, twe, soti, Dpy-30L2 and pcup, all genes that are reported to be germline specific in this class is therefore of concern.

As mentioned in the above paragraph, we see variation in cell size – though the cell size differences are not as dramatic as in vivo – but not shape, and see no clusters of adhered cells or intact cysts. We did not observe evidence of fragmented cells or cell debris. It seems clear that the cytoplasmic bridges between conjoined germ cells have been broken and these have been separated from cyst cells. The trypsin, collagenase and filtration steps appear to have coerced these cell types present in our sample into a spherical shape, and none appeared larger than 30 uM, the maximum size allowed by the chromium controller. We saw spermatid-like cells with identifiable tails, and in our sequencing data we see clear evidence of their presence- the cluster-specific enrichment of p-cup and other late-spermatid genes. Their long shape means that spermatids must go through the Chromium controller channel in a precise orientation to be captured, resulting in a lower capture efficiency than other cells. However, although Drosophila spermatids have long tails, they are not necessarily linear in solution, and the tail sometimes wraps around the cell body in a ball shape outside the testis. Although it is likely that the capture of these late-stage cells is inefficient, the captured cells show reasonable expression profiles of spermatid-specific genes. We agree with the reviewer that the testis contains many more spermatids than we have captured. In combination with response (2) that the single-cell sequencing result is highly consistent with bulk RNA-seq data, these results suggest that while some degree of bias is likely, the impact of this bias is relatively small.

Related to the shape of cyst cells, because the cells are not attached to each other, they assume a round or near round shape within the cell suspension. Regarding the expression of fzo, twe, soti, DPY-30L2 and p-cup in cyst cells, we agree that expression of these genes is unexpected, which was largely due to the data presentation of Figure 1D. The clusters marked as cyst cells show specific enrichment of zfh1, indicating that this cluster definitely contains cyst cells. The original scaled color may be misleading, and we adjusted the scaled expression color pattern in Figure 1D to better show cluster-specific enrichment of key marker genes, and found that cyst cells show that fzo and twe were somewhat expressed in this cluster, but not Dpy-30L2 or p-cup compared to other non-spermatid clusters, consistent with Figure 1C. This indicates that this cluster could contain cyst cells as well as some transcriptionally similar or physically adhered spermatocytes.

Our assigned spermatocyte clusters show no enrichment of Fas3, indicating that we have been conservative in our assignment of germ cells at the expense of some false negatives. While our assigned germ cell clusters appear to be relatively free of somatic cells, the somatic cell clusters may contain some misidentified germ cells. Our difficulties identifying true cyst cells indicate that for single-cell analysis of the cyst cell lineage, a future study might need to perform a FACS-purification step to remove germ cells or doublets.

6) The integrity of the starting cell population probably does not majorly impact the most abundant cell types spermatogonia and early vs late spermatocytes, however I am not sure the transitions between these cell types are accurately annotated. – "We found that twe expression peaked later in spermatogenesis (546) than fzo, and concluded that clusters expressing twe but not fzo were late spermatids." No, both these transcripts are present in early-mid and late primary spermatocytes, indeed fzo is not translated until the cells are secondary spermatocytes. twe translation is somewhat earlier. These two genes have very similar expression patterns according to RNA in situ. The data for twe and fzo in particular in Figure 3 do not show a strong distinction. Most of the cells in the dataset have reasonable levels of both genes suggesting that most cells are really primary spermatocytes. It would be better to use a gene that is transcribed earlier in spermatocytes, and declines as spermatocytes mature, to distinguish these cell types – eg aly or rye (Taf12L, PC2 contributor from Figure 1—figure supplement 1), or the ribosomal proteins.

Thank you for this comment. We apologize for a typo- “clusters expressing twe but not fzo were late spermatids” should have read “late spermatocytes.” We have since corrected it, and given that our observations about the peak expression patterns of fzo and twe have not been previously reported, we have sought to strengthen our early/late spermatocyte assignments with another gene, aly, which is known to decline throughout spermatogenesis. We have added aly to Figure 1D. As predicted by the reviewer, it is less commonly expressed in our assigned late spermatocytes than in early spermatocytes, strengthening our confidence in our cell-type predictions.

7) We do not dispute the pseudotime trajectory that is calculated, it is the boundaries between classes that we are not sure of. In Figure 3 the transition between state 1 and state 2 seems to be at the very end of differentiation, forced on the timeline by the artefactual branch of somatic cells. There are surely more real biological states in this series, and the actual boundaries could be more accurately mapped. E.g., the rapid decline in the number of transcripts present per cell at about 60% of pseudotime likely reflects the transition from spermatocytes through meiosis to early spermatid elongation – a phase at which many transcripts are degraded.

We completely agree with this comment and suggestion that state 2 is not a real biological state. However, assigning discrete biological states along a pseudotime trajectory is tricky since the cell states exist not in discrete blocks, but in a continuum of gene expression. The most real, discrete biological state in our pseudotime trajectory is the difference between somatic cells and germ cells, so we have recolored Figure 3A to reflect this.

8) The class named as mature spermatids are probably not fully mature.

We revised this term, and these cells are now called “late spermatids”.

9) Figure 4 would be much easier to interpret if it also included z-score tracks on the pseudotime plot for the markers used in assigning cell type for reference (i.e., bam, vasa etc).

This is a great suggestion and we have added a marker track to Figure 4 as a visual aid.

10) Figure 4—figure supplement 1 is not possible to interpret as shown. The order of genes appears to be different for the parental vs child sets. What this figure needs to show is each pair of genes and their expression in pseudotime, aligned with each other, not clustered within their own class.

Figure 4—figure supplement 1 now includes all D. melanogaster-specific duplicate genes detected in our library, aligned with their parental copies.

11) Abstract. "the distribution of new genes across cell types" is a misleading phrasing. All cell types have all genes. We are not sure exactly what is meant here. Introduction " The testis is a highly heterogeneous tissue", is it really more heterogenous than other tissues (gut, brain, etc?).

We revised the sentences to “To investigate the expression patterns of genetic novelties across different testis cell types, we performed single-cell RNA-sequencing of adult Drosophila testis. We found that new genes were expressed in various cell types, the patterns of which may be influenced by their mode of origination” and “The testis is a highly transcriptionally active tissue”.

12) Results paragraph five and the following: please explain on what basis these de novo genes were identified. Were they identified from transcript in the male gonads? If yes, they are not a random set of de novo genes and may be biased depending on how the mRNA data was collected in the original study. It would be more useful to identify de novo genes from the data itself and look at their age, population frequency, stage at which they are seen, etc. If the original paper just looked at bulk testis expression, the most abundant cell types in the extract will be overrepresented, biasing the discovery of de novo genes in the stages with more coverage. Since the main result of the paper relates to de novo genes, this part of the analysis needs to be strengthened.

In Zhao et al., 2014, segregating and fixed de novo genes were identified based on transcription in the testis of several strains of D. melanogaster. They were characterized as segregating or fixed based on their frequency across different strains. Since our single cell preparation correlates extremely well with whole-tissue RNA sequencing (Figure 1—figure supplement 2), it is reasonable to examine previously identified de novo genes from whole tissue. Identifying segregating de novo genes from a single-cell preparation would be ideal, but is not possible now because it requires libraries from multiple D. melanogaster strains and libraries from outgroup species with ultra-deep sequencing that is beyond the scope of this study. Until the costs of single-cell sequencing come down, using previously identified de novo genes from bulk testis RNA-seq data is the best available option. We do completely agree with the reviewers that genes that are expressed in the less abundant cell types are difficult to identify and study, which is a challenge using the RNA-seq method. We discussed this point in the Materials and methods.

13) Results paragraph six. The authors show that de novo genes are initially expressed in some specific cell types and they conclude that spermatocyte expression is an important step for de novo origination. Important is vague term and the reader does not understand what is being shown and supported.

We toned down the statement. Given our new finding that early spermatocytes express a greater fraction of fixed de novo genes than testis-specific genes or all other genes, this section now reads: “The high proportion of de novo genes expressed in spermatocytes suggests that such genes may play functional roles in these cells and development stage.”

14) Figure 2C: The authors write: "For every cell type except spermatocytes, segregating de novo genes are the least commonly expressed". It seems that segregating de novo genes are always the least expressed, including in the spermatocytes. Also, this analysis shows that in early spermatocytes, fixed de novo genes are more expressed that all other genes. This is an example of bias that I was mentioning in my second point. If this stage is more represented in the study that identified de novo genes, it is not surprising that the de novo genes studied here are more expressed there because they needed to be highly expressed to be discovered in the first place.

We apologize for the misunderstanding. Figure 2C shows that a lower proportion of segregating de novo genes are detected in a given cell type or cluster, but this does not indicate that they were lowly expressed in a given cell type. We added a sentence reading “It is important to note that this measure looks at the number of genes of each type detected, not the expression level of each, and does not distinguish between high and low expression”. Related to the last point that we may have more power to detect gene expression and expression bias for enriched cell types, we overall agree with the reviewer’s point, which was discussed and addressed in response 12.

15) The comparison with gene duplicates needs more work. The authors need to make sure the duplicates are matched with the de novo genes in terms of their age and maybe also size since detectability by RNAseq could be biased towards longer transcripts. Finally, the mechanisms of duplication needs to be taken into account (RNA based gene duplication versus tandem duplication versus other mechanisms). Very young duplicates are also probably difficult to differentiate at the RNA level.

This is a great point and we have revised Figure 4—figure supplement 1, and accompanying text in the revised version. We agree that it is important to compare young duplicate genes with de novo genes of the same age, and have accordingly corrected Figure 4 to show only melanogaster-specific de novo and duplicate genes. We can only detect the expression of 14 of these duplicate genes, however, and 12 of them are tandem duplicate genes, and 2 are retroposed genes. This means that while we can make a fair comparison of expression patterns between melanogaster-specific duplicate genes and de novo genes, just using melanogaster-specific genes we cannot make inferences on the relationship between duplication mechanism and testis expression pattern.

If child duplicate genes are indistinguishable from their parental copies at the sequence level, the two copies would show the same general expression pattern across spermatogenesis. We only observed this for a small number of parental/child gene duplicates, giving us confidence that this is not a significant source of bias in our data (Figure 4—figure supplement 1).

16) The high proportion of fixed de novo genes expressed in spermatocytes is not unexpected, and fits with the theory that spermatocyte chromatin is permissive for transcription. Most of these genes are testis-specifically expressed, and most testis-specifically expressed genes are expressed in spermatocytes. The "all genes" class includes genes that are expressed earlier in the germline and in somatic cells. If testis-specifically expressed old genes were analysed they would likely have the same pattern as the de novo genes.

This is a keen observation, and we have therefore revised the analysis to account for such bias by adding testis-specific genes to our scaled expression comparisons in Figures 2, 4, and 5. We defined testis-specific genes as any gene that, in FlyAtlas2 data, had FPKM >2 in testis and <1 in every other tissue (except whole male). In Figure 4A we can now see that in most cell types de novo genes tend to show scaled expression between all other genes and testis-specific genes. In late spermatogonia and early spermatocytes, fixed de novo genes are expressed similarly to testis-specific genes, while segregating de novo genes are rarer.

https://doi.org/10.7554/eLife.47138.028

Article and author information

Author details

  1. Evan Witt

    Laboratory of Evolutionary Genetics and Genomics, The Rockefeller University, New York, United States
    Contribution
    Data curation, Software, Formal analysis, Validation, Investigation, Visualization, Methodology, Writing—original draft
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0003-2973-6946
  2. Sigi Benjamin

    Laboratory of Evolutionary Genetics and Genomics, The Rockefeller University, New York, United States
    Contribution
    Resources, Investigation, Methodology
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-6411-5339
  3. Nicolas Svetec

    Laboratory of Evolutionary Genetics and Genomics, The Rockefeller University, New York, United States
    Contribution
    Conceptualization, Resources, Investigation, Methodology, Writing—original draft
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0001-9617-2752
  4. Li Zhao

    Laboratory of Evolutionary Genetics and Genomics, The Rockefeller University, New York, United States
    Contribution
    Conceptualization, Resources, Data curation, Software, Supervision, Funding acquisition, Writing—original draft
    For correspondence
    lzhao@rockefeller.edu
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0001-6776-1996

Funding

Robertson Foundation

  • Li Zhao

Monique Weill-Caulier Trust (Monique Weill-Caulier Career Scientist Award)

  • Li Zhao

Alfred P. Sloan Foundation (Research Fellowship)

  • Li Zhao

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

We thank Connie Zhao and Nneka Nnatubeugo for the help on the single-cell sequencing experiment. We thank Kristofer Davie for the suggestions on single-cell suspension. We thank members of the Zhao laboratory for helpful discussions during the work. We are grateful to Mia Levine, Leslie Vosshall, David Begun, Ziyue Gao, Molly Przeworski, Xiaolan Zhao, and Sohail Tavazoie for critical reading of an earlier version of the manuscript.

Senior Editor

  1. Patricia J Wittkopp, University of Michigan, United States

Reviewing Editor

  1. Christian R Landry, Université Laval, Canada

Reviewer

  1. Helen White-Cooper, Cardiff University, United Kingdom

Publication history

  1. Received: March 25, 2019
  2. Accepted: July 6, 2019
  3. Version of Record published: August 16, 2019 (version 1)

Copyright

© 2019, Witt et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Metrics

  • 1,841
    Page views
  • 269
    Downloads
  • 1
    Citations

Article citation count generated by polling the highest count across the following sources: Crossref, PubMed Central, Scopus.

Download links

A two-part list of links to download the article, or parts of the article, in various formats.

Downloads (link to download the article as PDF)

Download citations (links to download the citations from this article in formats compatible with various reference manager tools)

Open citations (links to open the citations from this article in various online reference manager services)

  1. Further reading

Further reading

    1. Evolutionary Biology
    2. Plant Biology
    Yibo Dong et al.
    Research Article Updated
    1. Developmental Biology
    2. Evolutionary Biology
    Lotta Salomies et al.
    Research Article Updated