Abstract

The de novo emergence of new genes has been well documented through genomic analyses. However, a functional analysis, especially of very young protein-coding genes, is still largely lacking. Here, we identify a set of house mouse-specific protein-coding genes and assess their translation by ribosome profiling and mass spectrometry data. We functionally analyze one of them, Gm13030, which is specifically expressed in females in the oviduct. The interruption of the reading frame affects the transcriptional network in the oviducts at a specific stage of the estrous cycle. This includes the upregulation of Dcpp genes, which are known to stimulate the growth of preimplantation embryos. As a consequence, knockout females have their second litters after shorter times and have a higher infanticide rate. Given that Gm13030 shows no signs of positive selection, our findings support the hypothesis that a de novo evolved gene can directly adopt a function without much sequence adaptation.

https://doi.org/10.7554/eLife.44392.001

eLife digest

Different species have specific genes that set them apart from other species. Yet exactly how these species-specific genes originate is not fully known. The traditional view is that existing old genes are duplicated to make a ‘spare’ copy, which can change through mutations into a new gene with a new role gradually over time. Despite there being lots of evidence supporting this theory, not all new genes found in recent years can be traced back to older genes. This led to an alternative view – that recently evolved genes can also appear ‘de novo’, and come from regions of random DNA sequences that did not previously code for a protein.

So far, the possibility of genes forming de novo during evolution has largely been supported by comparing and analyzing the genomes of related species. However, very little is known about the biological role these de novo genes play. Now, Xie et al. have generated a list of recently evolved de novo mouse genes, and carried out a detailed analysis of one de novo gene expressed in females at the time when embryos implant into the uterus wall.

To study the role of this gene, Xie et al. created a strain of knock-out mice that have a defunct version of the protein coded by the gene. Loss of this protein caused female mice to have their second litter after a shorter period of time and increased the likelihood that female mice would terminate their newborn pups. This suggests that this newly discovered de novo gene is involved in regulating the female reproductive cycles of mice.

Further analysis showed that this de novo gene counteracts the action of an older gene that promotes the implantation of embryos. This gene has therefore likely evolved due to the benefit it offers mothers, as it protects them from experiencing the increased physiological stress caused by a premature second pregnancy.

These findings support the idea that genes which have evolved de novo can have an essential biological purpose despite coming from random DNA sequences. This establishes that de novo evolution of genes is the second major mechanism of how new genes with significant biological roles can form in the genome.

https://doi.org/10.7554/eLife.44392.002

Introduction

The evolution of new genes through duplication-divergence processes is well understood (Chen et al., 2013; Kaessmann, 2010; Long et al., 2013; Tautz and Domazet-Lošo, 2011). But the evolution of new genes from non-coding DNA has been little considered for a long time (Tautz, 2014). However, with the increasing availability of comparative genome data from closely related species, more and more cases of unequivocal de novo gene emergence have been described (McLysaght and Hurst, 2016; Schlötterer, 2015; Tautz, 2014; Tautz and Domazet-Lošo, 2011). These analyses have shown that de novo gene emergence is a very active process in all evolutionary lineages analyzed. A comparative analysis of closely related mouse species has even suggested that virtually the whole genome is ‘scanned’ by transcript emergence and loss within about 10 million years of evolutionary history (Neme and Tautz, 2016).

But unlike the detection of the transcriptional and translational expression of de novo genes, functional studies of such genes have lacked behind. In yeast, the de novo evolved gene BSC4 was found to be involved in DNA repair (Cai et al., 2008) and MDF1 (Li et al., 2010; Li et al., 2014) was found to suppress mating and to promote fermentation. Knockdown of candidates of de novo genes in Drosophila have suggested effects on viability and male fertility (Chen et al., 2010; Reinhardt et al., 2013). Male fertility was also found to be affected for Pldi in mice, which codes for a lncRNA. In this case the knockout was shown to affect sperm motility and testis weight (Heinen et al., 2009). There is generally a tendency to focus on male testis effects for newly evolved genes. However, considering that the mammalian females have complex reproduction cycles, including morphology, physiology and behavior relating to mate choice, pregnancy, and parenting, de novo genes in mammals should also be expected to have a function in female-specific organs and affect female fertility and reproductive behavior as well.

Here, we have first generated a list of candidate genes that have evolved in the lineage of mice, after they split from rats. We have analyzed ribosome profiling and mass spectrometry data for these and find that most of them are translated. From this list, we have then chosen a gene specifically expressed in the female reproductive system to address the question of the role of de novo gene evolution in this as yet little studied context. We used a knockout line for the reading frame of the gene, created through CRISPR/Cas9-mediated frameshift mutagenesis, and subjected it to extensive molecular and phenotypic analysis. We conclude that it functions in the oviduct and affects female fertility cycles and that its emergence may have been driven by an evolutionary conflict situation. Given that we find no measurable acceleration of sequence evolution in the gene, we conclude that it became directly functional after its open reading frame became functional. These results support the notion that random protein sequences have a good probability for conveying evolutionarily relevant functions (Neme et al., 2017).

Results

De novo evolved genes in the mouse genome

To identify candidates for recently evolved de novo genes, we have applied a combined phylostratigraphy and synteny-based approach. We were able to identify 119 predicted protein-coding genes from intergenic regions that occur only in the mouse genome, but not in rats or humans. We re-assembled their transcript structures and estimated their expression levels using available ENCODE RNA-Seq data in 35 tissues from the mouse (Figure 1, Figure 1—source data 1). To validate that their predicted open reading frames (ORFs) are indeed translated, we have searched ribosome profiling and peptide mass spectrometry datasets (Figure 1—source data 1). We found for 110 out of the 119 candidate genes direct evidence for translation.

Transcriptional abundance and structural features of 119 candidate de novo genes in the mouse lineage.

(A) Transcriptional abundance in each mouse tissue, represented as the sum of log-transformed FPKM values of each transcript: sum(log2(FPKM + 1)). Details on tissue designations and RNA samples are provided in Figure 1—source data 1. The five tissues with the highest fractions are highlighted in red and the lowest ones in blue. (B) Comparison of overall expression levels (represented as the highest FPKM values in the 35 tissues) between de novo and all other protein-coding genes (‘De novo’ and ‘Other’ on the x-axis). (C) Comparison of averages of intrinsic structural disorder scores between de novo and all other protein-coding genes. (D) Comparison of fractions of sequence covered by hydrophobic clusters between de novo and all other protein-coding genes.

https://doi.org/10.7554/eLife.44392.003
Figure 1—source data 1

Excel file with five tabs, providing (i) the legends for the tissue sources, (ii) the table for the gene lists, (iii) the information for the reassembled transcripts, (iv) the proteomic evidence shown in detail, and (v) the accession numbers for the ENCODE data.

https://doi.org/10.7554/eLife.44392.004

Expression of these genes is found throughout all tissues analyzed, with notable differences. Testis and brain express the relatively largest abundance of these candidate de novo genes, while the digestive system and liver express the lowest (Figure 1A). Expression levels of these genes are generally lower than those of other protein-coding genes (FPKM medians: 0.63 vs. 8.18; two-tailed Wilcoxon rank sum test, p-value<2.2 × 10−16; Figure 1B). Most overall molecular patterns are similar to previous findings (Neme and Tautz, 2013; Schmitz et al., 2018; Wilson et al., 2017). They have fewer exons (medians: 2 vs. 7; two-tailed Wilcoxon rank sum test, p-value<2.2 × 10−16) and fewer coding exons than other protein-coding genes (medians: 1 vs. 6; two-tailed Wilcoxon rank sum test, p-value<2.2 × 10−16). The lengths of their proteins are shorter than those of other proteins (medians: 125 vs. 397; two-tailed Wilcoxon rank sum test, p-value<2.2 × 10−16). However, their proteins are predicted to be less disordered than other proteins (medians: 0.20 vs. 0.27; two-tailed Wilcoxon rank sum test, p-value=0.0024; Figure 1C) and equally hydrophobic to other proteins (medians: 0.56 vs. 0.57; two-tailed Wilcoxon rank sum test, p-value=0.52; Figure 1D), but note that the two sets of values show a broad distribution.

Analysis of a female expressed gene

To study the function of a gene expressed in the female reproductive tract, we picked Gm13030 (Figure 2) from the above list for in-depth analyses, including evolutionary history, reading-frame knockout, transcriptomic studies and phenotyping. According to the ENCODE RNA-Seq data, Gm13030 is only expressed in two tissues, the ovary of 8 weeks old females (FPKM 0.135), as well as the subcutaneous adipose tissue of 8 weeks old animals (FPKM 0.115) (Figure 1—source data 1). Given that the ovary is a small organ, with closely attached tissues, such as oviduct and gonadal fat pad, there could be contamination between these different tissue types. Hence, we were interested whether there is specificity for one of them. We used reverse transcription PCR on RNA from the respective carefully prepared tissue samples, to trace the expression of Gm13030 and a control gene (Uba1). We found that Gm13030 is not expressed in the ovary, but predominantly in the oviduct with only a weak signal from the adjacent fat pad (Figure 2B).

General information of Gm13030, expression, and knockout strategy.

(A) General information on transcript ID, location and protein characteristics. (B) Reverse transcription PCR across intron junctions to study Gm13030 expression in gonadal fat pad, ovary, oviduct, and uterus. Fat: gonadal fat pad; M: marker (from top to bottom: 1500 bp, 850 bp, 400 bp, 200 bp, 50 bp); U: Uba1 (control gene, 255 bp); j1: Gm13030 junction 1 (161 bp); j2: Gm13030 junction 2 (209 bp). (C) Transcript structure, DNA target, guide RNA, and depiction of the deletion created by the CRISPR/Cas9 knockout of Gm13030. The 20-nt guide sequence is colored blue and the 3-nt PAM is colored red. The induced deletion was verified by sequencing.

https://doi.org/10.7554/eLife.44392.005

Evolutionary analysis of Gm13030

To trace the evolutionary emergence of Gm13030, we used available whole genome information of different mouse species to generate alignments, combined with Sanger sequencing data of PCR fragments from mouse populations, subspecies, and related species from the genus Mus. We found the respective genomic region covering the ORF in all mouse species analyzed. It is not possible to identify an unequivocal orthologous region in the rat, because the unique genomic region in the mouse matches with multiple diverged genomic fragments in the rat reference genome, and all these fragments overlap only marginally with the mouse region.

The alignments for the whole coding region allowed us to infer mutations that have led to the opening of the reading frame (enabler mutations), as well as further substitutions and secondary disablers along the tree topology (Figure 3, Figure 3—figure supplement 1). The most distant species in which we can trace the orthologous genomic region, M. pahari, lacks part of the coding region. Two further outgoup species, M. matheyi and M. caroli have an orthologous genomic region that spans the whole reading frame, but harbor stop codons at position 204 and 258 of the alignment (Figure 3, Figure 3—figure supplement 1). At position 258 we find a change from TGA to TGC in all ingroup species, that is this is a clear enabler mutation. The same change is seen at position 204, but some of the ingroup species that show also secondary disablers (see below) retain the TGA. But since both enabler mutations are at least seen in M. spicilegus, we place the emergence of the Gm13030 ORF at this node, that is between 2–4 million years ago. Figure 3 includes all coding and non-coding substitutions that have occurred beyond this node. This includes secondary disablers in M. spretus, as well as M. m. domesticus. Most notably, all three M. m. domesticus populations carry a 17nt deletion that leads to a disruption of the reading frame. They share also several other substitutions, not only among them, but also with M. spretus and M. spicilegus, suggesting a secondary introgression effect (Figure 3, Figure 3—figure supplement 1). Hence, after the emergence of the Gm13030 ORF, only the M. m. musculus and M. m. castaneus populations have retained it.

Figure 3 with 3 supplements see all
Evolutionary history of the Gm13030 ORF.

The tree is based on the alignments shown in Figure 3—figure supplement 1, with only M. caroli included as the outgroup. The relevant substitutions at the different nodes are shown in boxes. Numbers refer to coding:non-coding substitutions, ‘stop’ refers to a mutation that creates a stop codon in the reading frame, ‘DEL’ refers to a deletion, ‘INT’ to an assumed introgression. 3-letter codes on the tips refer to the different populations of the respective sub-species. Expected substitutions on the top are inferred from whole genome distances and represent the approximately neutral number of substitutions for the respective comparisons (Figure 3—figure supplement 2).

https://doi.org/10.7554/eLife.44392.006

When focusing on the substitutions that occurred within the lineage towards M. m. musculus, we find a total of 7 coding and six non-coding substitutions. Hence, the total number of substitutions is slightly higher than the 6–7 expected for approximately neutral substitutions from a genomic average between these populations (indicated on top of Figure 3), but there is no bias towards coding mutations. Overall, there are too few mutations to apply a dN/dS test and the ratios of non-coding to coding mutations are all non-significant (Figure 3—figure supplement 3). Hence, we conclude that there is no traceable signal of positive selection on the protein after the emergence of the ORF.

Generation of gene knockout and off-target analysis

For the further functional characterization of Gm13030, we obtained a knockout line with a frameshift in the ORF through CRISPR/Cas9 mutagenesis. The knockout line is from a laboratory strain that is nominally derived from Mus musculus domesticus (C57BL/6N). However, as stated above, Mus musculus domesticus populations have disabling mutations. But C57BL/6N is known to carry also alleles from Mus musculus musculus (Yang et al., 2011) and the Gm13030 allele represents indeed the non-interrupted version that is found in M. m. musculus and M. m. castaneus. The CRISPR/Cas9 treatment introduced a 7 bp deletion at the beginning of the ORF (position 41–47) causing a frameshift and a premature stop codon in exon 2 (Figure 2C).

The CRISPR/Cas9 experiment to generate our knockout line might have generated potential off-target mutations. In order to rule out this possibility, we performed whole genome sequencing on both animals of our founding pair. The female and male of our founding pair were selected from the first-generation offspring of the mating among mosaic and wildtype mice which were directly developed from the zygotes injected. Each of them contained the 7 bp deletion allele described above and a wildtype allele. If there were any off-target sites, they should exist as heterozygous or homozygous indels or single nucleotide variants. However, in our genome sequencing results, we found no variant located in the 100 bp regions around the genome-wide 343 predicted off-target sites. Further, we manually checked the reads mapped to the regions around the top 20 predicted sites in both samples and none of them yielded an indication of variants.

Knockout effect on the transcriptome

The Gm13030 knockout line is homozygous viable and fertile. We were therefore interested in studying the impact on the transcriptional network in the tissue in which Gm13030 is predominantly expressed. Given the observation that Gm13030 is specifically expressed in adult oviducts, we focused the RNA-Seq analysis on the oviducts of 12 homozygous knockout and 12 wildtype females (10–11 weeks old). There were on average 75.9 million unique mapped reads per sample (range from 57.5 to 93.0 million reads; Figure 4—figure supplement 1). The genotypes of the 24 samples were further confirmed by the reads covering the sites in which the 7-bps deletion locates (Figure 4—figure supplement 1). In the initial analysis involving all samples, we found no differentially expressed gene between knockouts and wildtypes.

However, given that the expression in oviducts should be fluctuating according to estrous cycle, we clustered the transcriptomes of the individuals based on both principle component analysis (PCA) and hierarchical clustering methods, which allowed to distinguish three major clusters (Figure 4A and B). To confirm that these correspond to three different phases of the estrous cycle, we analyzed the expression of three known cycle dependent genes in the respective clusters, progesterone receptor (Pgr) and estrogen receptors (Esr1 and Gper1). We found that these genes change indeed in the expected directions, both in the wildtype as well as the knockout animals (Figure 4C–E).

Based on this finding, we performed the differential expression analysis on the three clusters separately. We found 21 differentially expressed genes in cluster 1 (DESeq2, adjusted p-value≤0.01; fold changes range from 0.75 to 1.59; Table 1), but still none for clusters 2 and 3. The 21 differentially expressed genes in cluster 1 do not include the genes neighboring Gm13030 (Pla2g2e and Pla2g5). This suggests that Gm13030 acts during the phase of high progesterone receptor and estrogen receptor 1 expression, and low G protein-coupled estrogen receptor 1 expression, corresponding to proestrus or the starting of estrus, that is, the phase where females start to become receptive for implantation.

Figure 4 with 3 supplements see all
Clusters and expression levels in the 24 RNA-Seq samples of oviducts.

(A) PC1 values from the PCA analysis, (B) hierarchical clustering result. Sample codes and genotypes are listed along X-axis. The 24 samples are assigned into three clusters accordingly. (C-E) The expression levels of three sex hormone receptor genes (Pgr, Esr1, Gper1) are shown by box plots.

https://doi.org/10.7554/eLife.44392.010
Table 1
Differentially expressed genes in oviduct cluster 1.
https://doi.org/10.7554/eLife.44392.014
Gene IDGene nameBase meanaFold changeAdjusted P-Value
ENSMUSG00000057417Dcpp337001.590.0000
ENSMUSG00000096278Dcpp24271.470.0000
ENSMUSG00000096445Dcpp14151.450.0000
ENSMUSG00000034009Rxfp144101.350.0003
ENSMUSG00000022206Npr33491.360.0011
ENSMUSG00000035864Syt16661.340.0011
ENSMUSG00000070348Ccnd173820.800.0012
ENSMUSG00000058897Col25a116051.340.0015
ENSMUSG00000059908Mug12681.350.0015
ENSMUSG00000063130Calml36981.310.0018
ENSMUSG00000015966Il17rb6370.750.0025
ENSMUSG00000022358Fbxo3236141.310.0038
ENSMUSG00000040724Kcna28950.750.0038
ENSMUSG00000061477Rps762471.200.0052
ENSMUSG00000067786Nnat6581.320.0052
ENSMUSG00000019987Arg112081.320.0068
ENSMUSG00000079017Ifi27l2a10651.320.0073
ENSMUSG00000028031Dkk26781.310.0077
ENSMUSG00000022037Clu171391.220.0086
ENSMUSG00000033715Akr1c14238791.210.0086
ENSMUSG00000034039Prss291761.290.0086
  1. aThe mean of the normalized read counts for all cluster one samples.

The top three differentially expressed genes belong all to a single young gene family, namely Dcpp1, Dcpp2 and Dcpp3, all three of which were significantly up-regulated in the knockout samples (DESeq2, fold changes: 1.45 for Dcpp1, 1.47 for Dcpp2, and 1.59 for Dcpp3, Figure 4—figure supplement 2). These genes are expressed in female and male reproductive organs and the thymus, and were previously found to function in oviducts to stimulate pre-implantation embryo development (Lee et al., 2006). Given the special importance of the expression differences for the Dcpp genes, we confirmed them by a quantitative PCR assay (Figure 4—figure supplement 2). The fourth gene in the list of significantly changed expression is Rxfp1, the receptor for the pregnancy hormone relaxin. Relaxin signaling is involved in a variety of cellular processes (Valkovic et al., 2019), whereby the regulation of the reproductive cycle is one of them (Anand-Ivell and Ivell, 2014).

Knockout phenotype

Given that the Dcpp genes are more highly expressed in the knockouts, one could predict a higher implantation frequency of embryos, as it has been shown through experimental manipulation of Dcpp levels (Lee et al., 2006). We assessed the litters of pairs that were produced from our normal breeding stocks and found that the first litters from homozygous knockout females were produced after the same time as those from wildtype or heterozygous females (medians: 23 vs. 22 days, Figure 5—source data 1). However, we saw a major difference with respect to the second litter. Homozygous knockout females tended to produce this faster than wildtype or heterozygous females (medians: 23 vs. 38 days, Figure 5—source data 1). To test this observation directly, we set up additional 10 mating pairs of homozygous knockout females with wildtype males and 10 wildtype pairs for control, all at approximately the same age at the start (8–9 weeks old). We found that the knockout and wildtype pairs had their first litter after the same time (medians: 23 vs. 22 days, Figure 5—source data 1), while the knockout females had their second litter after a shorter time (medians: 24 vs. 36 days, Figure 5—source data 1), thus confirming the initial observation. However, the data have to be seen in the context of the non-continuous nature of pregnancy, caused by the ovulation cycles of females. Females can ovulate within a day of giving birth, but if no successful mating occurs at that time, ovulation is suppressed while the female is lactating. This results in a delay in the timing of the next pregnancy. Figure 5 shows that this pattern is also evident in our data.

Distributions of the time from the first litter to the second litter.

Time points of the second litter are plotted for the different genotypes, with box plots marked. A bimodal distribution becomes evident, as discussed in the text.

https://doi.org/10.7554/eLife.44392.015
Figure 5—source data 1

Details of the fertility scores for the different groups of mice.

The mice listed under the UC codes are part of the standard stock breeding, the mice listed under the WT and KO codes were specifically set up to confirm the observations from the stock breeding.

https://doi.org/10.7554/eLife.44392.016

We found that the times to the second litter were either smaller than or equal to 25 days (early group) or larger than or equal to 35 days (late group) for both the homozygous knockout females and the wildtype or heterozygous females. But in the homozygous knockouts, we saw more in the early group, leading to the median values having a big difference. When using the two-tailed Wilcoxon rank sum test which does not require the assumption of a normal distribution, we found that this difference is significant when calculated across all breeding data (p-value=0.042).

Interestingly, we found not only a timing difference for the second litter but also infanticide in about a quarter of the litters (4 out of 16) from homozygous females, but none in wildtype or heterozygous females (two-tailed Fisher's exact test, p-value=0.031, Figure 5—source data 1). This could indicate that when the second litter follows too quickly, the females may be under stronger postpartum stress resulting in partial killing of pups. In addition, one could also have expected to see homozygous knockout females having larger litter sizes than those of wildtype or heterozygous females, but they were almost the same (medians: 6.5 vs. 7.0 for littler 1 and 6.5 vs. 7.5 for litter 2, Figure 5—source data 1). One possible explanation is that considering the high infanticide rate for homozygous knockouts, more pups from homozygous knockout females were eaten before being observed.

These results suggest that the loss of Gm13030 should be detrimental to the animals in the wild. Still, we see that the M. m. domesticus populations have secondarily lost this gene (Figure 3). Intriguingly, when inspecting the copy number variation data that we have produced previously (Pezer et al., 2015), we found that Dcpp3 was also lost in M. m. domesticus populations (Figure 4—figure supplement 3). Under the assumption that this results in an overall lowered expression of Dcpp RNAs, it could be considered to compensate for the loss of Gm13030.

Discussion

The aim of this study was to trace the possible functions of a gene that has evolved only very recently out of an intergenic region. Out of a list of 119 candidate genes that have evolved de novo ORFs in the mouse lineage, we have chosen a gene specifically expressed in the female reproductive system for detailed molecular and functional analysis. We have used CRISPR/Cas9 induced frameshift mutation within the ORF to obtain the knockout line. This implies that it is indeed the protein, rather than the RNA that is functional. We find that the knockout has an impact on the transcriptome in the oviduct only at a specific stage of the female estrous cycle, and we also find a unique female-specific phenotype. Hence, we propose to give a formal name to Gm13030. We name it after a female figure, Shiji (Shj), who was born from stone (de novo) as a mythology character in the Chinese traditional novel Investiture of the Gods (Fengshen Yanyi), which was published in the 16th century.

Transcriptome and phenotype changes

The knockout line did not show an overt phenotype, but we considered this also as a priori unlikely, given that a de novo evolved gene is expected to be only added to an existing network of genes (Zhang et al., 2015). But given the observed transcriptome changes in the oviducts, we were encouraged to apply the fertility test. We identified a possible direct link between the identified phenotype of a shorter interval to second birth in the knockouts and the transcriptomic changes. We found that the expression level of all three copies of Dcpp genes in C57BL/6N mice is enhanced in the Shj knockout animals. Dcpp expression is induced in the oviduct by pre-implantation embryos and is then secreted into the oviduct. This in turn stimulates the further maturation of the embryos and eventually the implantation (Lee et al., 2006). Hence, this is a system where a selfish tendency for Dcpp expression favoring embryo implantation could develop, in expense of the interest of the mother that wants to build up new resources first. Accordingly, Shj could have found its function in controlling this expression, that is, ‘defending’ the interests of the mother. Intriguingly, the secondary loss of Shj in M. m. domesticus populations is accompanied by a loss of Dccp3 in the same populations. This is compatible with the notion that an evolutionary conflict of interest exists for these interactions, whereby it remains open whether the loss of Dcpp3 preceded the loss of Shj or vice versa. We note that Shj inactivation alleles segregate also in the populations of the other subspecies (M. m. musculus and M. m. castaneus) in low frequency, implying that the evolutionary process of fully integrating this new gene is still ongoing.

Male bias versus female bias

There has so far been much focus on de novo genes and other new genes to have male-biased expression and to affect male fertility (Chen et al., 2013; Ellegren and Parsch, 2007; Heinen et al., 2009; Kaessmann, 2010; Long et al., 2013; Reinhardt et al., 2013; Zhao et al., 2014). Only recently, one of a pair of duplicated genes in Drosophila, Arts, has been shown to have high expression in the ovary and to affect fertility (VanKuren and Long, 2018). Here we have shown that a de novo gene specifically expressed in the female reproductive tract affects the female fertility cycle. Female reproduction should be subject to accelerated evolution patterns, especially in mammals which have high complexities in female reproduction, including mate choice, pregnancy, and parenting, which has been neglected so far. One reason is that the estrous cycle in females adds to the complexity of the analysis. Our clustering analysis of the transcriptomic data, which considers the stages of estrous cycle, provides an approach for studying biased gene expression in female mammals as well. Another reason for the current focus on males is the large number of new genes that are transcribed in testis. However, this is due to the promiscuous phase of expression in meiotic cells, where many genes use alternative promotors (Kleene, 2001). These meiotic cells are abundant in testis, but are difficult to analyze in ovaries. Hence, it is still open whether there might be a similar phase of over-expression of new genes in female meiotic stages as well.

Shj exerts its effects in somatic cells, that is, independent of a possible expression in meiosis, but in the context of a possible selfish gene conflict situation, which has so far been ascribed mostly to the male reproductive system (Kleene, 2005). Hence, we expect that a better analysis of female-specific expression of genes should reveal more evolutionary interesting insights in the future.

Functional de novo gene emergence

It has long been assumed that the emergence of function out of non-coding DNA regions must be rare, and if it occurs, the resulting genes would be far away from assuming a function. Our results do not support these assumptions. It is possible to find many well supported transcripts that could be considered to be true de novo genes. We have shown here that Shj has functions on the transcriptome and the phenotype. In fact, we have initial data for two additional de novo genes expressed in the brain and limbs, where knockouts produce an effect on the transcriptome and show subtle phenotypes (data available on bioRxiv doi.org/10.1101/510214). However, since lacZ replacement constructs were used instead of CRISPR-induced knockouts, it remains still open whether the effects are due to the new ORFs or to chromatin effects caused by the deletion constructs. This will need further analysis.

The Shj ORF has acquired only a small number of additional substitutions, both coding and non-coding after it emerged. This suggests that it did not need additional adaptation of the protein sequence to become functional. This is in line with a similar analysis on a larger set of de novo ORFs in the mouse (Ruiz-Orera et al., 2018). Hence, this raises the question whether we should necessarily expect signatures of positive selection around de novo genes as part of proof that it is a true gene (McLysaght and Hurst, 2016). Alternatively, given the observation that a large set of expressed random sequences can exert phenotypes (Bao et al., 2017; Neme et al., 2017), it would seem more likely that the conversion of a non-coding region into a coding one would already be sufficient to create a gene function. In the early phase of evolution, such genes would likely be frequently subject to secondary loss (Palmieri et al., 2014), but they could eventually also become fixed and then further evolutionarily optimized.

Materials and methods

Key resources table
Reagent type (species)
or resource
DesignationSource or
reference
IdentifiersAdditional
information
Gene (Mus musculus)Gm13030; ShjNAEnsembl:ENSMUSG00000078518
Genetic reagent (M. musculus)Gm13030 linethis paperGenerated from C57BL/6N line by introducing a 7 bp deletion using CRISPR/Cas9 at Mouse Biology Program (MBP). See detail in Materials and methods.
Sequence-based reagentReverse transcription PCR primersthis paperSee Materials and methods.
Sequence-based reagentPCR and Sanger sequencing primersthis paperSee Materials and methods.
Sequence-based reagentGenotyping primersthis paperSee Materials and methods.
Sequence-based reagentDroplet digital PCR primers and probesthis paperSee Materials and methods.

Ethics statement

Request a detailed protocol

The mouse studies were approved by the supervising authority (Ministerium für Energiewende, Landwirtschaftliche Räume und Umwelt, Kiel) under the registration numbers V244-71173/2015, V244-4415/2017 and V244-47238/17. Animals were kept according to FELASA (Federation of European Laboratory Animal Science Association) guidelines, with the permit from the Veterinäramt Kreis Plön: 1401–144/PLÖ−004697. The respective animal welfare officer at the University of Kiel was informed about the sacrifice of the animals for this study.

Genome-wide identification of de novo genes

Request a detailed protocol

We modified previous phylostratigraphy and synteny-based methods to identify Mus-specific de novo protein-coding genes from intergenic regions. Note that while the phylostratigraphy based approach was criticized to potentially include false positives (Moyers and Zhang, 2015), we have shown that the problem is relatively small and that it is in particularly not relevant for the most recently diverged lineages within which de novo gene evolution is traced (Domazet-Lošo et al., 2017). We started with mouse proteins annotated in Ensembl (Version 80) (Zerbino et al., 2018) (1) with protein length not smaller than 30 amino acids, (2) with a start codon at the beginning of the ORF, (3) with a stop codon at the end of the ORF, (4) without stop codons within the annotated ORF. For the phylostratigraphy-based strategy, in order to save computational time, we first used NCBI BLASTP (2.5.0+) to align low complexity region masked mouse protein sequences to rat protein sequences annotated in Ensembl (Version 80) and filtered out the mouse sequences having hits with E-values smaller than 1 × 10−7. Next we used NCBI BLASTP (2.5.0+) to align the remaining low complexity region masked sequences to NCBI nr protein sequences (10 Nov. 2016) (O'Leary et al., 2016) and filtered out the mouse sequences having non-genus Mus hits with E-values smaller than 1 × 10−3 according to Neme and Tautz (2013).

The genes remaining after these filtering steps are the candidates for the de novo evolved genes. In order to deal also with proteins having low complexity regions, we further applied a synteny-based strategy on the rest proteins by taking advantage of the Chain annotation from Comparative Genomics of UCSC Genome Browser (http://genome.ucsc.edu/) (Kent et al., 2002). We filtered out the proteins encoded on unassembled scaffolds because their chromosome information is not compatible between Ensembl and UCSC annotations. We only compared rat and human proteins with mouse proteins because their genomes are well assembled and genes are well annotated. We performed the same procedures on rat and human data separately, and used ‘mm10.rn5.all.chain’ and ‘rn5ToRn6.over.chain’ from UCSC and gene annotation from Ensembl (Version 80) for rat, and ‘mm10.hg38.all.chain’ from UCSC and gene annotation from Ensembl (Version 80) for human. For each mouse gene, if its ORF overlaps with any ORFs in the rat or human mapping regions in Chain annotation, we aligned its protein sequence to those protein sequences with program water from EMBOSS (6.5.7.0) (Rice et al., 2000); if one of the alignment scores is not smaller than 40, we filtered out the protein. The remaining 119 genes are the candidates for the following analysis and the pool for us to select the gene for further functional experiments.

ENCODE RNA-Seq analysis

Request a detailed protocol

We downloaded the raw read files of 135 strand-specific paired-end RNA-Seq samples generated by the lab of Thomas Gingeras, CSHL from ENCODE (ENCODE Project Consortium, 2012; Sloan et al., 2016) including 35 tissues from different organs and different developmental stages, and each of them had multiple biological or technical replicates. We trimmed the raw reads with Trimmomatic (0.35) (Bolger et al., 2014), and only used paired-end reads left for the following analyses. We mapped the trimmed reads to the mouse genome GRCm38 (Waterston et al., 2002; Zerbino et al., 2018) with HISAT2 (2.0.4) (Kim et al., 2015) and SAMtools (1.3.1) (Li et al., 2009), and took advantage of the mouse gene annotation in Ensembl (Version 80) by using the --ss and --exon options of hisat2-build. We assembled transcripts in each sample, and merged annotated transcripts in Ensembl (Version 80) and all assembled transcripts with StringTie (1.3.4d) (Pertea et al., 2015). Then we estimated the abundances of transcripts, FPKM values, in each sample with StringTie (1.3.4d). For each tissue, we summarized the FPKM values of each transcript by averaging the values from multiple biological or technical replicates; and if a gene has multiple transcripts, we assigned the summary of the FPKM values of the transcripts as the transcriptional abundance of the gene.

Ribosome profiling and proteomics analysis

Request a detailed protocol

We downloaded the datasets that included both strand-specific ribosome profiling (Ribo-Seq) and RNA-Seq experiments of the same mouse samples from Gene Expression Omnibus (Barrett et al., 2013) under accession numbers GSE51424 (Gonzalez et al., 2014), GSE72064 (Cho et al., 2015), GSE41426 (Djiane et al., 2013), GSE22001 (Guo et al., 2010), GSE62134 (Diaz-Muñoz et al., 2015), and GSE50983 (Castañeda et al., 2014), which corresponded to brain, hippocampus, neural ES cells, heart, skeletal muscle, neutrophils, splenic B cells, and testis. Ribo-seq datasets were depleted of possible rRNA contaminants by discarding reads mapped to annotated rRNAs, and then the rest reads were mapped to GRCm38 (Waterston et al., 2002; Zerbino et al., 2018) with Bowtie2 (2.1.0) (Langmead and Salzberg, 2012). RNA-Seq reads were mapped to the mouse genome GRCm38 with TopHat2 (2.0.8) (Kim et al., 2013). Then we applied RiboTaper (1.3) (Calviello et al., 2016) which used the triplet periodicity of ribosomal footprints to identify translated regions to the bam files. Mouse GENCODE Gene Set M5 (Ensembl Version 80) (Mudge and Harrow, 2015) was used as gene annotation input. The Ribo-seq read lengths to use and the distance cutoffs to define the positions of P-sites were determined from the metaplots around annotated start and stop codons as shown below.

SampleRead lengthsOffsets
Brain29,3012,12
Hippocampus29,3012,12
Neural ES cells27,28,29,3012,12,12,12
Heart29,3012,12
Skeletal muscle29,3012,12
Neutrophils25,26,27,28,29,30,31,32,3312,12,12,12,12,12,12,12,12
Splenic B cells30,3112,12
Testis2812

All mouse peptide evidence from large-scale mass spectrometry studies was retrieved from PRIDE (09 Aug. 2015) (Vizcaino et al., 2016) and PeptideAtlas (31 Jul. 2015) (Desiere, 2006) databases. We performed the same procedures on PRIDE and PeptideAtlas data separately following the method described in Xie et al. (2012). In brief, if the whole sequence of a peptide was identical to one fragment of the tested de novo protein sequence, and had at least two amino acids difference compared to all the fragments of other protein sequences in the mouse genome, the peptide was considered to be convincing evidence for the translational expression of the respective de novo protein.

Molecular patterns of de novo genes

Request a detailed protocol

The exon number of a gene was assigned as the exon number of the transcript having highest FPKM value among all the transcripts of the gene. The intrinsic structural disorder of proteins was predicted using IUPred (Dosztányi et al., 2005), long prediction type was used. The intrinsic structural disorder score of a protein was assigned as the average of the scores of all its amino acids. The hydrophobic clusters of proteins were predicted using SEG-HCA (Faure and Callebaut, 2013), and then the fraction of the sequence covered by hydrophobic clusters for each protein was calculated. ‘Other’ genes used to compare against the de novo protein-coding genes were the protein-coding genes annotated in Ensembl (Version 80) excluding the de novo genes.

Reverse transcription PCR

Request a detailed protocol

The ovaries, oviducts, uterus, and gonadal fat pad from the females from the Gm13030 line were carefully collected and immediately frozen in liquid nitrogen. Total RNAs from those tissues were purified using QIAGEN RNeasy Microarray Tissue Mini Kit (Catalog no. 73304), and the genomic DNAs were removed using DNase I, RNase-free (Catalog no. 74106). The first strand cDNAs were synthesized using the Thermo Scientific RevertAid First Strand cDNA Synthesis Kit (Catalog no. K1622) by targeting poly-A mRNAs with oligo dT primers. Two pairs of primers targeted on the two junctions of Gm13030 gene structure and a pair of primers targeted on a control gene Uba1 were used. The sequences of the primers are shown below. PCR was done under standard conditions for 38 cycles.

Primer nameSequence (5’>3’)
junc1_FGGACACAGGCCAGGGAAATG
junc1_RCCTTAGGCCTTGCGAAGGAA
junc2_FGCCTGCTTTCACCATTTCAGG
junc2_RTATGAAAGGCTGGGTGAGGTG
Uba1_FGAAGATCATCCCAGCCATTG
Uba1_RTTGAGGGTCATCTCCTCACC

Genomic DNA sequences of the Gm13030 locus

Request a detailed protocol

The genomic sequences from wild mice M. spretus (eight individuals), M. m. castaneus (TAI, 10 individuals), M. m. musculus from Kazakhstan (KAZ, eight individuals), M. m. musculus from Afghanistan (AFG, six individuals), M. m. musculus from Czech Republic (CZE, eight individuals), M. m. domesticus from Iran (IRA, eight individuals), M. m. domesticus from Germany (GER, 11 individuals), and M. m. domesticus from France (FRA, eight individuals) were retrieved from the whole genome sequencing data in Harr et al. (2016). The genomic sequences from mouse strains CAROLI/EiJ (M. caroli) and PAHARI/EiJ (M. pahari) were retrieved from the whole genome sequencing data in Thybert et al. (2018). For all these sequences, we manually checked and corrected the substitutions based on the original mapped reads.

The genomic sequences from wild mice M. mattheyi (four individuals) and M. spicilegus (four individuals) were determined by Sanger sequencing of the PCR fragments from the genomic DNAs purified with salt precipitation. The PCR primers listed below were designed according to the whole genome sequencing data in Neme and Tautz (2016).

FragmentDirectionSequence (5’>3’)
1ForwardCAATATACAGACTTATACCAATGAAAACC
ReverseTGGGATCCTTAAGGTTCATTGTG
2ForwardCCAGAGACCTCTGGATTTGC
ReverseAAGGCACATCTCAAAGTAAAAGC

Molecular distance analysis

Request a detailed protocol

Whole genome sequencing data in Harr et al. (2016) and Neme and Tautz (2016) were used to obtain the average distances for the taxa in this analysis. For each individual, the mean mapping coverage was calculated using ANGSD (0.921–10-g2d8881c) (Korneliussen et al., 2014) with the options ‘-doDepth 1 -doCounts 1 -minQ 20 -minMapQ 30 -maxDepth 99999’. Then, ANGSD (0.921–10-g2d8881c) was used to extract the consensus sequence for each population accounting for the number of individuals and the average mapping coverage per population (mean + three times standard deviation) with the options “-doFasta 2 -doCounts 1 -maxDepth 99999 -minQ 20 -minMapQ 30 -minIndDepth 5 -setMinDepthInd 5 -minInd X1 -setMinDepth X2 -setMaxDepthInd X3 -setMaxDepth X4’. X1, X2, X3, and X4 are listed below. The consensus sequences of the mouse populations were used to calculate the Jukes-Cantor distances for 10,000 random non-overlapping 25 kbp windows from the autosomes with APE (5.1, ‘dist.dna’ function) (Paradis et al., 2004). The average distances obtained in this way are provided in Figure 3—figure supplement 2. The expected distances for Gm13030 were calculated by multiplying the length of the gap-free alignment with the average distances. The observed values were retrieved from the distance table of the alignments using Geneious (11.1.2).

Pairwise substitution comparisons for the Gm13030 reading frame were calculated with DnaSP (Librado and Rozas, 2009). For this, indels were excluded, stop codons were treated as 21st amino acid following the settings of the program. The results are included in Figure 3—figure supplement 3.

PopulationMean coverageStandard deviation of coverageX1X2X3X4
M. mattheyi23.30483.02815273273
M. spicilegus25.13824.62715100100
M. spretus24.88514.2164206854
M. m. castaneus14.0157.57352537370
M. m. musculus from Afghanistan17.76858.55131559354
M. m. musculus from Kazakhstan25.12315.97542074592
M. m. musculus from Czech Republic24.33814.10342067536
M. m. domesticus from Iran20.2499.82042050400
M. m. domesticus from Germany21.63910.51842054432
M. m. domesticus from France21.49910.02742052416

Gm13030 knockout line

Request a detailed protocol

Gm13030 was originally targeted by the Knock-Out Mouse Project (KOMP), but the line was lost. Hence, we obtained a custom-made CRISPR/Cas9 line from the Mouse Biology Program (MBP). The guide RNA was designed to target the beginning of the ORF in the second coding exon and away from the splicing site (genomic DNA target: 5’ TGCTCCATCTGCTTTTCAGG 3’). We obtained three mosaic frameshift knockout mice (genetic background: C57BL/6N). Then we mated them with the wildtypes from the same litters to have heterozygous pups, and selected one female and one male with a heterozygous 7 bp deletion (chr4:138,873,545–138,873,551) as the founding pair for further breeding and experiments. Primers for genotyping are listed below.

Allele
(Fragment length)
DirectionSequence (5’>3’)
KO
(502 bp)
ForwardCCTACCACATTGGGGCCATC
ReverseTACAAGCCATAAAACCTCCTGGAT
WT
(353 bp)
ForwardTTTTCTGCTCCATCTGCTTTTCA
ReverseAGTCACAGAGAAGGGGACGA

Whole genome sequencing of the founding pair and off-target analysis

Request a detailed protocol

The genomic DNAs from the founding pair were purified with salt precipitation. Then the samples were prepared with Illumina TruSeq Nano DNA HT Library Prep Kit (Catalog no. FC-121–4003), and sequenced on HiSeq 2500 with TruSeq PE Cluster Kit v3-cBot-HS (Catalog no. PE-401–3001) and HiSeq Rapid SBS Kit v2 (500 cycles) (Catalog no. FC-402–4023). The reads were 2 × 250 bp in order to have good power to detect indels.

We followed GATK Best Practices (Van der Auwera et al., 2013) to call variants. Specifically, we mapped the reads to mouse genome GRCm38 (Waterston et al., 2002; Zerbino et al., 2018) with BWA (0.7.15-r1140) (Li and Durbin, 2009), and marked duplicates with Picard (2.9.0) (http://broadinstitute.github.io/picard), and realigned around the indels founded in C57BL/6NJ line (Keane et al., 2011) with GATK (3.7), and recalibrated base quality scores with GATK (3.7) using variants founded in C57BL/6NJ line (Keane et al., 2011) to get analysis-ready reads. We assessed coverage with GATK (3.7) and SAMtools (1.3.1) (Li et al., 2009), and the coverage of female was 35.48 X and the one of male was 35.09 X. High coverages also provided good power to detect indels. We called variants with GATK (3.7), and applied generic hard filters with GATK (3.7): "QD <2.0 || FS >60.0 || MQ <40.0 || MQRankSum <−12.5 || ReadPosRankSum <−8.0 || SOR > 3.0’ for SNVs and ‘QD <2.0 || FS >200.0 || ReadPosRankSum <−20.0 || SOR > 10.0’ for indels. We found 80375 SNVs and 73387 indels in the female and 81213 SNVs and 71857 indels in the male.

347 potential off-target sites were predicted on http://crispr.mit.edu:8079/ based on mouse genome mm9. 343 of them still existed in mouse genome mm10 (GRCm38) after converting by liftOver (26 Jan. 2015) (Kent et al., 2002), and the four missing sites were ranked low anyway: 131, 132, 143, and 200. GATK (3.7) was used to look for variants found in the whole genome sequencing in the 100 bp regions around the 343 sites. In addition, the reads mapped to the regions around the top 20 sites were manually checked in both samples.

RNA-Seq and data analysis

Request a detailed protocol

The oviducts of 10–11 weeks old females from the Gm13030 line were carefully collected and immediately frozen in liquid nitrogen. Then, total RNAs were purified using QIAGEN RNeasy Microarray Tissue Mini Kit (Catalog no. 73304), and prepared using Illumina TruSeq Stranded mRNA HT Library Prep Kit (Catalog no. RS-122–2103), and sequenced using Illumina NextSeq 500 and NextSeq 500/550 High Output v2 Kit (150 cycles) (Catalog no. FC-404–2002). All procedures were performed in a standardized and parallel way to reduce experimental variance.

Raw sequencing outputs were converted to FASTQ files with bcl2fastq (2.17.1.14), and reads were trimmed with Trimmomatic (0.35) (Bolger et al., 2014). Only paired-end reads left were used for following analyses. We mapped the trimmed reads to mouse genome GRCm38 (Waterston et al., 2002; Zerbino et al., 2018) with HISAT2 (2.0.4) (Kim et al., 2015) and SAMtools (1.3.1) (Li and Durbin, 2009), and took advantage of the mouse gene annotation in Ensembl (Version 86) by using the --ss and --exon options of hisat2-build. We counted fragments mapped to the genes annotated by Ensembl (Version 86) with HTSeq (0.6.1p1) (Anders et al., 2015), and performed differential expression analysis with DESeq2 (1.14.1) (Love et al., 2014). Principle component analysis and hierarchical clustering with Euclidean distance and complete agglomeration method on the variance stabilized transformed fragment counts were also performed using DESeq2 (1.14.1) to assign the 24 samples into three clusters.

Droplet digital PCR (the quantitative PCR assay)

Request a detailed protocol

The relative expression levels of three Dcpp genes (Dcpp1, Dcpp2, and Dcpp3) in the six cluster 1 samples of the oviducts were further validated by droplet digital PCR. For each sample, 20 µl first strand cDNA solution was obtained using the Thermo Scientific RevertAid First Strand cDNA Synthesis Kit (Catalog no. K1622) by targeting poly-A mRNAs with oligo dT primers from 1 μg RNAs. Then the cDNA samples were diluted with water 1:400 for the PCR reactions. The information of probes and primers for three Dcpp genes, and Uba1 (the reference gene) are listed below. The sequences of the probe and primers for Dcpp genes were carefully designed to target all three genes at the same time. All PCR reactions were run with the same master mix and in the same plate. The PCR reaction mixture was prepared from 12.5 μL Bio-Rad ddPCR Supermix for Probes (Catalog no. 1863010), 1.25 μL oligo mix (5 μM probes, 18 μM forward primers, and 18 μM reverse primers) for Dcpp genes, 1.25 μL oligo mix for Uba1, and 10 μL cDNA dilution. The oil droplets containing 20 μL of the reaction mixture for each sample were generated by Bio-Rad QX100 Droplet Generator (Catalog no. 1863002). After droplet generation, the plate was sealed with a pierceable foil heat seal using Bio-Rad PX1 PCR Plate Sealer (Catalog no. 1814000) and then placed on a thermal cycler for amplification. The thermal cycling conditions were: 95°C for 10 min (one cycle), 94°C for 30 s and 56°C for 60 s (40 cycles), 98°C for 10 min (one cycle). After PCR, the 96-well PCR plate was loaded into Bio-Rad QX100 Droplet Reader (Catalog no. 1863003) which reads the signals in the droplets. Raw data were analyzed with Bio-Rad QuantaSoft analysis software provided with the Bio-Rad QX100 Droplet Reader. The relative expression level of Dcpp genes in each sample was calculated by dividing the concentration of Dcpp genes by the concentration of Uba1. For each sample, two independent technical replicates were performed.

GeneOligo5’ modificationSequence (5’>3’)3’ modification
Three
Dcpp genes
ProbeFAMGGACGGTCAAGTGTATGGCTBHQ1
Forward primerGATTATCATGGTCCAGAAGTTGGA
Reverse primerATGTGCTCTTCCTTAGACAGTCTG
Uba1ProbeHEXCTGAACCTCTTGCTGCACCTBHQ2
Forward primerGAAGATCATCCCAGCCATTG
Reverse primerTTGAGGGTCATCTCCTCACC

Fertility test

Request a detailed protocol

In addition to using the fertility data from the stock breeding of the Gm13030 animals, dedicated mating pairs were set up for the fertility test. The female and male in each pair were 8–9 weeks old when the mating was started. All the males were wildtype, and 10 females were homozygous knockout and the other 10 were wildtype. The time (days) until having the first and second litters, the numbers of pups of the first and second litters, and whether the pups were eaten later for each mating pair were carefully observed and recorded by animal caretakers who were blind for the genotypes. Figure 5—source data 1 provides the details of the mice, the individual phenotype scores and the notes on the losses of litters, both for the stock breeding, as well as the specifically set up pairs.

Data availability

The ENA BioProject accession number for the sequencing data reported in this study is PRJEB28348.

The following data sets were generated
    1. Xie C
    2. Kuenzel S
    3. Tautz D
    (2018) European Nucleotide Archive
    ID PRJEB28348. RNA-Seq and whole genome sequencing of the samples from three de novo gene knockout mouse lines and transfected MEFs on C57BL/6 background.

References

    1. Waterston RH
    2. Lindblad-Toh K
    3. Birney E
    4. Rogers J
    5. Abril JF
    6. Agarwal P
    7. Agarwala R
    8. Ainscough R
    9. Alexandersson M
    10. An P
    11. Antonarakis SE
    12. Attwood J
    13. Baertsch R
    14. Bailey J
    15. Barlow K
    16. Beck S
    17. Berry E
    18. Birren B
    19. Bloom T
    20. Bork P
    21. Botcherby M
    22. Bray N
    23. Brent MR
    24. Brown DG
    25. Brown SD
    26. Bult C
    27. Burton J
    28. Butler J
    29. Campbell RD
    30. Carninci P
    31. Cawley S
    32. Chiaromonte F
    33. Chinwalla AT
    34. Church DM
    35. Clamp M
    36. Clee C
    37. Collins FS
    38. Cook LL
    39. Copley RR
    40. Coulson A
    41. Couronne O
    42. Cuff J
    43. Curwen V
    44. Cutts T
    45. Daly M
    46. David R
    47. Davies J
    48. Delehaunty KD
    49. Deri J
    50. Dermitzakis ET
    51. Dewey C
    52. Dickens NJ
    53. Diekhans M
    54. Dodge S
    55. Dubchak I
    56. Dunn DM
    57. Eddy SR
    58. Elnitski L
    59. Emes RD
    60. Eswara P
    61. Eyras E
    62. Felsenfeld A
    63. Fewell GA
    64. Flicek P
    65. Foley K
    66. Frankel WN
    67. Fulton LA
    68. Fulton RS
    69. Furey TS
    70. Gage D
    71. Gibbs RA
    72. Glusman G
    73. Gnerre S
    74. Goldman N
    75. Goodstadt L
    76. Grafham D
    77. Graves TA
    78. Green ED
    79. Gregory S
    80. Guigó R
    81. Guyer M
    82. Hardison RC
    83. Haussler D
    84. Hayashizaki Y
    85. Hillier LW
    86. Hinrichs A
    87. Hlavina W
    88. Holzer T
    89. Hsu F
    90. Hua A
    91. Hubbard T
    92. Hunt A
    93. Jackson I
    94. Jaffe DB
    95. Johnson LS
    96. Jones M
    97. Jones TA
    98. Joy A
    99. Kamal M
    100. Karlsson EK
    101. Karolchik D
    102. Kasprzyk A
    103. Kawai J
    104. Keibler E
    105. Kells C
    106. Kent WJ
    107. Kirby A
    108. Kolbe DL
    109. Korf I
    110. Kucherlapati RS
    111. Kulbokas EJ
    112. Kulp D
    113. Landers T
    114. Leger JP
    115. Leonard S
    116. Letunic I
    117. Levine R
    118. Li J
    119. Li M
    120. Lloyd C
    121. Lucas S
    122. Ma B
    123. Maglott DR
    124. Mardis ER
    125. Matthews L
    126. Mauceli E
    127. Mayer JH
    128. McCarthy M
    129. McCombie WR
    130. McLaren S
    131. McLay K
    132. McPherson JD
    133. Meldrim J
    134. Meredith B
    135. Mesirov JP
    136. Miller W
    137. Miner TL
    138. Mongin E
    139. Montgomery KT
    140. Morgan M
    141. Mott R
    142. Mullikin JC
    143. Muzny DM
    144. Nash WE
    145. Nelson JO
    146. Nhan MN
    147. Nicol R
    148. Ning Z
    149. Nusbaum C
    150. O'Connor MJ
    151. Okazaki Y
    152. Oliver K
    153. Overton-Larty E
    154. Pachter L
    155. Parra G
    156. Pepin KH
    157. Peterson J
    158. Pevzner P
    159. Plumb R
    160. Pohl CS
    161. Poliakov A
    162. Ponce TC
    163. Ponting CP
    164. Potter S
    165. Quail M
    166. Reymond A
    167. Roe BA
    168. Roskin KM
    169. Rubin EM
    170. Rust AG
    171. Santos R
    172. Sapojnikov V
    173. Schultz B
    174. Schultz J
    175. Schwartz MS
    176. Schwartz S
    177. Scott C
    178. Seaman S
    179. Searle S
    180. Sharpe T
    181. Sheridan A
    182. Shownkeen R
    183. Sims S
    184. Singer JB
    185. Slater G
    186. Smit A
    187. Smith DR
    188. Spencer B
    189. Stabenau A
    190. Stange-Thomann N
    191. Sugnet C
    192. Suyama M
    193. Tesler G
    194. Thompson J
    195. Torrents D
    196. Trevaskis E
    197. Tromp J
    198. Ucla C
    199. Ureta-Vidal A
    200. Vinson JP
    201. Von Niederhausern AC
    202. Wade CM
    203. Wall M
    204. Weber RJ
    205. Weiss RB
    206. Wendl MC
    207. West AP
    208. Wetterstrand K
    209. Wheeler R
    210. Whelan S
    211. Wierzbowski J
    212. Willey D
    213. Williams S
    214. Wilson RK
    215. Winter E
    216. Worley KC
    217. Wyman D
    218. Yang S
    219. Yang SP
    220. Zdobnov EM
    221. Zody MC
    222. Lander ES
    223. Mouse Genome Sequencing Consortium
    (2002) Initial sequencing and comparative analysis of the mouse genome
    Nature 420:520–562.
    https://doi.org/10.1038/nature01262

Decision letter

  1. George H Perry
    Reviewing Editor; Pennsylvania State University, United States
  2. Detlef Weigel
    Senior Editor; Max Planck Institute for Developmental Biology, Germany
  3. George H Perry
    Reviewer; Pennsylvania State University, United States
  4. Douglas B Menke
    Reviewer; University of Georgia, United States

In the interests of transparency, eLife includes the editorial decision letter and accompanying author responses. A lightly edited version of the letter sent to the authors after peer review is shown, indicating the most substantive concerns; minor comments are not usually included.

Thank you for submitting your article "Studying the dawn of de novo gene emergence in mice reveals fast integration of new genes into functional networks" for consideration by eLife. Your article has been reviewed by four peer reviewers, including George H Perry as the Reviewing Editor and Reviewer #1, and the evaluation has been overseen by Detlef Weigel as the Senior Editor. The following individual involved in review of your submission has agreed to reveal his identity: Douglas B Menke (Reviewer #4).

The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission.

Summary:

You present a knockout study of three putative mouse lineage-specific de novo (not originating from duplication) emergent protein-coding genes, with an integrative transcriptomic and phenotypic analysis. The reviewers praised the aims of the study; this is a potentially fundamentally insightful research program into the functional mechanics of an important evolutionary process that has never previously been studied at nearly this depth in vertebrates. However, we also collectively identified several significant limitations of the current study design and presentation of the manuscript, some of which could potentially be addressed in a revision, although included among the comments is one request for new experimental data that may not be feasible in a reasonable timeframe (but we choose to present the request and the option, in case something can be done in a cell line or in case additional data are already in hand). For another major concern, related to the specificity of two of the three knockouts, we suggest a revised manuscript presentation that would focus on the third knockout (otherwise, updating and repeating the experiments for the first two knockouts would be well beyond a reasonable scope of revision; although note that following a straightforward analytical verification, we suggest that one of these two experiments could still be secondarily described and discussed in the paper). We very much hope that you will consider such a strategy. If you choose to address our essential revisions and resubmit your manuscript to eLife, a thorough re-evaluation would be required since the perceived definitiveness of the results could change depending on the revised results.

Essential revisions:

1) For two of the gene knockout models used for this study, Udng1 and Udng2, the introduced mutations are not small and they overlap and/or are nearby different annotated transcripts. While the specific details and genomic contexts are not provided by the authors (this information should be provided and illustrated in the revision, and carefully discussed), for example the Undg1 knockout mutation used in this study is a deletion >3kb that also removes part of the annotated ORF A930004D18Rik. Udng2 is located directly upstream of the zinc finger protein encoding Zpf169. Therefore, based on current results it is uncertain whether reported effects are explained by the knockout of Udng1 and 2, knockout of (in one case) the overlapping transcript, removal of functional regulatory sequence for neighboring transcripts, or combinations thereof.

2) The CRISPR/Cas9 induced Udng3 knockout is a 7bp deletion, less likely than the Udng1 and 2 knockouts to have incidental effects on the expression of other genes which could instead explain the findings. We suggest that the authors focus the manuscript on Udng3 primarily, but with two additional requests.

i) Verify perturbed Dcpp expression with Udng3 knockout by an independent experimental analysis (this is the additional experimental requirement mentioned above). The knockout effects on expression levels of Dcpp family genes, alongside the interesting discussion context of the pseudogenization of both Udng3 and Dcpp3 in M. m. domesticus, are potentially exciting and valuable components of the study, but one that reviewers determined should be verified given questions about other results.

ii) Verify that the expression levels of genes neighboring Udng3 are not affected by the 7 bp deletion.

3) If desired, more speculative results on Udng2 could still be offered secondarily, as long as the authors verify that expression levels of the neighboring zinc finger gene are not altered by the deletion.

4) The manuscript should be revised (from the interpretation through to the discussion) to acknowledge that prior to the origin of these ORFs, the non-coding sequences might have had some function and, in fact, that any such ancestral non-coding function might still exist.

5) The phenotyping and behavioral study designs, data, and results are under-reported, and the reviewers raised questions about the statistical analyses performed on these data. How many (and which specific) morphological phenotypes and behavioral traits were measured and tested for each knockout experiment? What were the methods/experimental designs for how each measurement was collected or observation recorded? Importantly, the authors should introduce a multiple test-correction framework and revise their discussion of the results accordingly. Please also provide the individual-level data for each variable tested.

6) A specific concern was raised about the reproductive phenotype highlighted from the Udng3 experiment, which needs to be addressed. Specifically, a large timing difference between the birth of the first and second litters (23 days for KO vs 36 for WT) is reported. Yet WT females often ovulate within a day of giving birth; if a male is present, mating can produce a second litter within 21 days. If the female does not ovulate or mating is not successful, then ovulation is suppressed while the female is lactating, causing a substantial delay in the timing of the next pregnancy. Given the established non-continuous nature of pregnancy intervals in this model system, it is unclear that the authors have robustly identified a reproductive phenotype from these data.

7) The evolutionary analysis of gene gain needs to be strengthened beyond the current 'intact/not intact' assessment presented in Figure 2, and should also include an expanded set of available genomes in the comparison. At present it is assumed that the functional genes emerged in the M. musculus lineage following divergence from common ancestors shared with other studied taxa, yet without more detailed analysis it remains possible that these genes emerged earlier and then accumulated pseudogenizing mutations in the other lineage(s). Here, the analysis should include the specific pattern of ancestral vs. derived mutations that open up (or maintain as closed) the reading frame.

In addition, the authors use an older version of the mouse genome annotation, Ensembl 80 (from 2015), while the current is Ensembl 95. Between these two versions a new set of mouse species annotations have become available, which should be included in the revised Figure 2 analysis and will help inform the study. For example, the 3 genes studied in this paper have annotated protein coding orthologs in Mus caroli (3 MY divergence with M. musculus) and M. Pahari (6 MY), suggesting earlier emergence timing for these genes than currently presented in the paper.

8) In the initial analysis of the ENCODE RNA-seq data a very low transcription threshold of 0.1 FPKM was used, whereas most mouse protein coding genes have expression levels higher than 5 FPKM in the 35 tissues investigated. With such a low threshold, pervasive transcription of much of the genome could potentially be observed. The authors should comment on this issue, their choice of threshold, and consider further analyses to justify / confirm that transcripts would not be spuriously identified.

The process used to select the three genes for detailed study from among the larger set of candidates should be detailed more precisely, especially given that the expression levels for two of the loci are relatively low. Were these the only three candidates meeting all of the (general) criteria mentioned? Are here any additional data (e.g. antibody staining of tissues) that protein products are produced?

9) The transcriptome analysis from the knockout experiments should be strengthened. Especially, it would be more convincing that the magnitude of the observed differences and the specific results themselves do not simply reflect experimental noise (i.e. on a genomic scale, sets of differentially expressed transcripts are always observed even between sets of technical replicates) if the authors were to compare two sets of control experiments to each other, or for example to randomly select WT and KO sample size-matched sets of individuals from the whole sample (which can then be permuted), and evaluate accordingly.

[Editors' note: further revisions were requested prior to acceptance, as described below.]

Thank you for submitting your article "Studying the dawn of de novo gene emergence in mice reveals fast integration of new genes into functional networks" for consideration by eLife. Your article has been reviewed by six peer reviewers, including George H Perry as the Reviewing Editor and Reviewer #1, and the evaluation has been overseen by Detlef Weigel as the Senior Editor. The following individuals involved in review of your submission have agreed to reveal their identity: Douglas B Menke (Reviewer #4); Julian Christians (Reviewer #5).

The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission.

Summary:

Your revised manuscript was first reviewed by the four reviewers of the original submission. While stark improvements on multiple fronts were noted, three major concerns – one related to each of the three de novo gene experiments – remained. These concerns are detailed below. In two cases, we sought specific input from reviewers with appropriate expertise to help finalize our decision.

First, the 3kb Undg1 knockout also removes two strongly conserved non-coding sequences (e.g. chr2:18,026,405-18,026,620 of the mouse mm10 assembly), the potentially-relevant functional consequences of which are not mentioned. Concern about the inability of this experimental design to account for the consequences of removal of DNA sequence involved in gene regulatory processes was mentioned briefly in the previous decision letter, with deeper investigation now revealing these specific conserved sequences.

As an editor, while I appreciate that efficient techniques to effectively limit knockout footprints have only recently become practicable, the appropriate interpretative approach should still be to attribute any observed phenotypic consequences to simultaneously affected, deeply-conserved sequences than to the absence of a protein encoded by a de novo gene. This null and conservative hypothesis should only be rejected alongside convincing evidence to the contrary. In the absence of any understanding of the biological functions of the de novo gene (i.e. one of the goals of this study) this is quite a difficult proposition within the bounds of the present experiment. Since your original interpretations may ultimately be proven correct, and because these experimental results could be a key part of that process and of developing understandings of de novo gene evolution, I still encourage the inclusion of the Undg1 experiment as a secondary and tentative result, with caveats leading the way.

Second, the proximity of the Undg2 knockout to the zinc finger protein-encoding Zpf169 gene led to concern expressed in the previous decision letter that expression levels of this gene might be affected, which could thus be a more likely explanation for the phenotypic observations. We had requested your evaluation of Zpf169 expression levels from the RNA-seq data. Indeed, Zpf169 expression is significantly reduced (with approximately 0.75-fold expression) in the Undg2 knockout experiment. In the revision, you conclude that an expression difference of this magnitude would be very unlikely to result in the observed phenotypic effects because Zpf169 is a KRAB-Zn-finger protein, family members of which bind to transposable elements to silence them. After seeking further input, we conclude that this statement is unwarranted. Even if Zpf169 is ultimately confirmed to be TE-controlling, such loci can have marked transcriptome effects across various developmental stages and cell types.

Third, your revised analysis of the reproductive phenotype data from the Undg3 targeted knockout experiment was considered a substantial improvement. An additional reviewer described the revised analytical approach as appropriate and considered the results compelling, helping to resolve our remaining questions concerning this experiment.

Based on this evaluation process, I am recommending that your manuscript be considered further for publication if you choose to make the following essential revisions. The first three essential revision points are summary conclusions based on the above detailed report of the review process.

Essential revisions:

1) Focus the manuscript on the Undg3 experiments and results.

2) Remove the Undg2 knockout experiment and results from the main text and figures of the paper. If presented in supplementary information to help support the description of the overall process of the investigation, the strong (and null hypothesis) possibility that phenotypic observations are explained by the Zpf169 expression change should be used to help frame the presentation and discussion.

3) The Undg1 knockout experiment and results could be presented as secondary in the main text or as part of the supplementary information. In either case the approach described above should be followed.

4) The selection process for the de novo genes chosen for detailed examination (e.g. as discussed in the responses to reviewer comments) should be presented more fully in the main text of the paper.

5) Figure 2 should differentiate between the total absence of a gene from the presence of a disabled version of the gene.

[Editors' note: a further round of revisions were suggested.]

Thank you for resubmitting your work entitled "A de novo evolved gene in the house mouse regulates female pregnancy cycles" for further consideration at eLife. Your revised article has been evaluated by Detlef Weigel (Senior Editor), George Perry (Reviewing Editor), and four reviewers.

You extensively revised your manuscript in response to reviewer and editor feedback, and when I read the manuscript I am impressed with the current approach and results. I see a thorough, multi-step, novel investigation into a difficult-to-study phenomenon (de novo gene emergence), with analyses spanning from genome-scale to locus specific, and approaches across the spectrum from transcriptomic to detailed molecular biology to in vivo.

While the reviewers were generally pleased with how the paper was revised, there remains some skepticism concerning the ultimate in vivo reproductive phenotype (and they also raise some specific concerns and questions, including on the newly-presented data, that require your attention; see below). However, I do recognize the multiple in vitro results that altogether help strengthen that case, and I (George Perry) take responsibility for the decision to proceed with the understanding that (as always) future experiments may further clarify the biology. Overall, I conclude that the multiple layers of valuable insight that will provided by this eventual publication are likely to represent a substantial advance in this field of research, and thus I would encourage you to consider the below remaining issues to be addressed before acceptance.

Essential revisions from reviewer comments:

1) A data reporting / statistical inconsistency was identified by that needs to be addressed: The text states that they set up 10 mating pairs with homozygous mutant females and 10 mating pairs with WT females. However, Figure 5 (and the supplemental data) show that they actually collected data from 16 mating pairs with homozygous females, 13 matings pairs with WT females, and 7 mating pairs with heterozygous females. The p-value of 0.042 was obtained by combining the WT and Het data and comparing it against the homozygous mutant data. I get the same p-value (0.042) when I used the same statistical test used by the authors (two-tailed Wilcoxon rank sum test). When only WT is compared against mutant, the p-value is 0.12.

2) The authors have observed that deletion of Gm13030 affects gene expression in only one phase of the estrous cycle (when there is high progesterone receptor and estrogen receptor 1 expression, and low G protein-coupled estrogen receptor 1 expression). How does this phase correspond to the period between ovulation and implantation (when the deletion would presumably exert its effect)?

3) Subsection “Transcriptome and phenotype changes”: "We identified a possible direct link between the identified phenotype of a shorter gestation length in the knockouts" Gestation length (i.e., between fertilization and birth) was not measured; a difference was observed in the timing between the birth of the first and second litter.

4) The authors suggest that Gm13030 may be involved in a parent-offspring conflict of interest, i.e., "this is a system where a selfish tendency of embryos in expense of the resources of the mothers could develop." (subsection “Transcriptome and phenotype changes” and related statements in this section and “Male bias versus female bias”). However, it does not appear that there is a conflict of interest here, and it is not clear how such a conflict might work. The pups that are born earlier are more likely to be cannibalized/ neglected by the mother, in addition to being more likely to compete with older siblings. Thus, there is potentially a trade-off between having a second litter quickly (more reproduction sooner) vs waiting (when second litters will be more likely to survive), but this isn't a conflict between parent and offspring. Parent-offspring conflicts generally involve situations where individual offspring would like the parent to invest more in them, whereas the parent would like to preserve resources for other offspring. While traveling down the oviduct, it is not clear how conceptuses could extract more resources from the mother in a way that compromised the mother's future reproduction without impacting their own survival.

5) Provide a clearer justification/statement explaining why you selected Gm13030 for detailed analysis from the broader set of candidates and/or why you desired to study the function of a gene expressed in the female reproductive tract.

6) Figure 3 describes the expected number of substitutions expected between species. However, the number of coding and non-coding substitutions should be provided separately to support the statement "there is no bias towards coding mutations".

https://doi.org/10.7554/eLife.44392.023

Author response

We appreciate the decision and the suggestion to put less emphasis on two of the three genes studied. We have carefully considered this, but in balance, we believe that the breadth of the results for all three genes is sufficiently broad that they should also be fully reported. However, we have added additional analyses and descriptions for these genes. In addition, we would like to point out that our knockout method for Udng1 and Udng2 was the most widely used method before CRISPR/Cas9 occurring which allowed the discovery of lots of true knowledge, and it is still valuable now due to that CRISPR/Cas9 also has its own limitation. Besides, previous functional studies of de novo genes were mainly using RNAi knockdown method, our knockout procedure for Udng1 and Udng2 is already one step forward.

Essential revisions:

1) For two of the gene knockout models used for this study, Udng1 and Udng2, the introduced mutations are not small and they overlap and/or are nearby different annotated transcripts. While the specific details and genomic contexts are not provided by the authors (this information should be provided and illustrated in the revision, and carefully discussed), for example the Undg1 knockout mutation used in this study is a deletion >3kb that also removes part of the annotated ORF A930004D18Rik. Udng2 is located directly upstream of the zinc finger protein encoding Zpf169. Therefore, based on current results it is uncertain whether reported effects are explained by the knockout of Udng1 and 2, knockout of (in one case) the overlapping transcript, removal of functional regulatory sequence for neighboring transcripts, or combinations thereof.

We apologize that we have not provided sufficiently clear details on the genes and the knockouts. We have now added appropriate sketches and further details. We have also clarified the annotation for Udng1 in the database. Note that we cannot confirm the annotated first exon of A930004D18Rik (Udng1) from any of the available transcriptome samples. Its annotation appears to be derived from an EST and a cDNA from retina sample. But we realized now that the new transcript that we had reconstructed has the capacity to code for an ORF that we had not considered before. It turns out that the ribosome profiling data support the translation of this ORF, as well as the other ORF that we had analyzed before. The knockout construct deletes both ORFs. We have now added all these descriptions in detail.

For Udng2, we had indeed failed to point out that it is an example for a de novo gene having apparently emerged from a bidirectional promotor and we have added this information now. It is known that this is one of the emergence mechanisms for novel transcripts.

Zfp169 (chr13:48,487,647-48,513,451) is indeed expressed lower in the Udng2 knockout mice (0.78 fold in males, 0.766 fold in females). However, ZFP169 belongs to the KRAB-Zn-finger proteins, which bind to transposable elements to silence them. Hence, it is very unlikely that the small expression difference could cause the effects we see for the Udng2 knockout. We added this information to the text.

2) The CRISPR/Cas9 induced Udng3 knockout is a 7bp deletion, less likely than the Udng1 and 2 knockouts to have incidental effects on the expression of other genes which could instead explain the findings. We suggest that the authors focus the manuscript on Udng3 primarily, but with two additional requests.

i) Verify perturbed Dcpp expression with Udng3 knockout by an independent experimental analysis (this is the additional experimental requirement mentioned above). The knockout effects on expression levels of Dcpp family genes, alongside the interesting discussion context of the pseudogenization of both Udng3 and Dcpp3 in M. m. domesticus, are potentially exciting and valuable components of the study, but one that reviewers determined should be verified given questions about other results.

We note that our originally presented results had already provided a kind of replicated evidence in the sense that all three Dcpp genes were affected in the same way. We have now added a quantitative PCR experiment to confirm this (data are summarized in Figure 4—figure supplement 1).

ii) Verify that the expression levels of genes neighboring Udng3 are not affected by the 7 bp deletion.

The expression differences for the neighboring genes are indeed only very minor and not significant (fold changes: upstream gene Pla2g2e = 0.95, downstream gene Pla2g5 = 1.06). We added a note to to the text that these genes are not among the differentially expressed genes.

3) If desired, more speculative results on Udng2 could still be offered secondarily, as long as the authors verify that expression levels of the neighboring zinc finger gene are not altered by the deletion.

As pointed out above, the ZFP169 expression is slightly lowered in the knockouts, but the gene is not expected to act as a regulator of transcription.

4) The manuscript should be revised (from the interpretation through to the discussion) to acknowledge that prior to the origin of these ORFs, the non-coding sequences might have had some function and, in fact, that any such ancestral non-coding function might still exist.

This is probably true for many genes. But we acknowledge this now and emphasize that the genes are predicted to function via their proteins. We also discuss specifically the possibility (and evidence) for prior emergence as non-coding RNAs. Further we have removed sections that relate to the question of protein function. And we have added a section in the discussion addressing the question of coding versus non-coding RNAs.

5) The phenotyping and behavioral study designs, data, and results are under-reported, and the reviewers raised questions about the statistical analyses performed on these data. How many (and which specific) morphological phenotypes and behavioral traits were measured and tested for each knockout experiment?

We had indeed done several other pre-tests on smaller number of animals, mostly on the Udng2 line, to see whether any of them would show a tendency. We then continued only with the test that showed a tendency and used double the number of animals to confirm this tendency. Now we fully added this information in the text, including a new table (Table 4) and a new supplemental table (Table 4—source data 1). We have also improved our statistical analysis by adding the analysis only using the expanded samples. Last but not least, we added a statement in the text that we are following the RRR principles of animal experiments in reducing them to the minimally necessary numbers, even with the risk that we might have missed a phenotype.

What were the methods/experimental designs for how each measurement was collected or observation recorded? Importantly, the authors should introduce a multiple test-correction framework and revise their discussion of the results accordingly. Please also provide the individual-level data for each variable tested.

We have further updated the Materials and methods section. Now we have added also the multiple testing corrections for all phenotyping results, and revised the text accordingly. The individual level data were included in a supplementary file, but this was not very clear. We have now reorganized the supplements to make this clearer.

6) A specific concern was raised about the reproductive phenotype highlighted from the Udng3 experiment, which needs to be addressed. Specifically, a large timing difference between the birth of the first and second litters (23 days for KO vs 36 for WT) is reported. Yet WT females often ovulate within a day of giving birth; if a male is present, mating can produce a second litter within 21 days. If the female does not ovulate or mating is not successful, then ovulation is suppressed while the female is lactating, causing a substantial delay in the timing of the next pregnancy. Given the established non-continuous nature of pregnancy intervals in this model system, it is unclear that the authors have robustly identified a reproductive phenotype from these data.

We appreciate that the reviewer pointed out the non-continuous nature of pregnancy intervals. We have now specifically analyzed our data for this. Our data show indeed the non-continuous nature of the time to the next litter. It is either smaller than or equal to 25 days (early group) or larger than or equal to 35 days (late group). We have now added additional text and an additional figure (Figure 5) to make this point clear.

7) The evolutionary analysis of gene gain needs to be strengthened beyond the current 'intact/not intact' assessment presented in Figure 2, and should also include an expanded set of available genomes in the comparison. At present it is assumed that the functional genes emerged in the M. musculus lineage following divergence from common ancestors shared with other studied taxa, yet without more detailed analysis it remains possible that these genes emerged earlier and then accumulated pseudogenizing mutations in the other lineage(s). Here, the analysis should include the specific pattern of ancestral vs. derived mutations that open up (or maintain as closed) the reading frame.

We acknowledge that we have too much focused the text on presence-absence descriptions. Further, the sequence alignments originally provided in the data were based on direct PCR and re-sequencing, since short read sequences from genome projects are not fully reliable. We have now added also the genomic information into the alignments (Figure 2—figure supplements 3-6) and redesigned Figure 2. We have also extended the text to make clear that the ORFs could indeed have emerged earlier, but carry partially secondary disabling mutations. Still, rat and Apodemus clearly do not show the ORFs, i.e. the genes are very young.

In addition, the authors use an older version of the mouse genome annotation, Ensembl 80 (from 2015), while the current is Ensembl 95. Between these two versions a new set of mouse species annotations have become available, which should be included in the revised Figure 2 analysis and will help inform the study. For example, the 3 genes studied in this paper have annotated protein coding orthologs in Mus caroli (3 MY divergence with M. musculus) and M. Pahari (6 MY), suggesting earlier emergence timing for these genes than currently presented in the paper.

As pointed out above, we added this genomic information now. In addition, we note that although these are denoted as coding orthologs, the Ensembl website (Version 95) lists the reading frames as interrupted:

>MGP_CAROLIEiJ_T0049751.1 (Udng1 ORF2, Mus caroli)

VDRQVTLIFPSSVRMPKSCFKAFL*LESKL*LFFIIRK*L*AWDEIGQSRNLLIPSKFVN

RCTYYRTGPRVQPVWDPRLAPSATSARCYPILFFSSFFLESTDKHMKTLGSVLHELQLLG

GNPMNEQVKFREVSVYSLSVGAWEHGAADCPWTMAHP

>MGP_PahariEiJ_T0060766.1 (Udng1 ORF2, Mus pahari)

VDRQVTLIFPSSVRGCLNLASRGFYNWKANFYFSLLYVNDCEHGMRLANLRIFSLQQTRG

QVHVLEDWTTRPACVGSTSCIFSDLCSVLSYTFFFLSFLNRLINI*ELEVSSTNFSFLEE

TL*MSK*SSARSQSVPSRLAPGSMGLQTVPGLWHTH

>MGP_CAROLIEiJ_T0034271.1 (Udng2, Mus caroli)

MGKHCTRKQWRNISDVDNK*SEQRTPLVRNRSGTKQRRESRARPGGQPETKPGPWGNQGS

LSSKDSTKDQRNPQRCSGPFLGRSPNTDSTGHTAAPSQGKSPGENVSGNKGGEEQHLFL*

GHKAFRVVHNIWKWFHLDRKTRWGP*AFLVSPKKMQN*A*KRSNCLQV

>MGP_CAROLIEiJ_T0065417.1 (Udng3, Mus caroli)

MCRFHLLQAIKPPEKQMEQKTSALGSIMKLSQRHATETTWVFPSQGLRAYLLHPACFHHF

RKEEKPD*RPANMIYGFDKIHPRSC*TVLLVQPCLLMLSRDLGPEQLQ*LQLIPDDITSS

SLSYGSSQNLSQALNFPKHVDTG

(There are no annotated coding orthologs for Udng1 ORF1 in Mus caroli and Mus pahari in Ensembl 95; there are no annotated coding orthologs for Udng2 and Udng3 in Mus pahari in Ensembl 95.)

8) In the initial analysis of the ENCODE RNA-seq data a very low transcription threshold of 0.1 FPKM was used, whereas most mouse protein coding genes have expression levels higher than 5 FPKM in the 35 tissues investigated. With such a low threshold, pervasive transcription of much of the genome could potentially be observed. The authors should comment on this issue, their choice of threshold, and consider further analyses to justify / confirm that transcripts would not be spuriously identified.

This is a continuous issue in the de novo gene research field: new genes are known to be expressed at lower levels, i.e., the thresholds that are used for well conserved genes are not the best to be applied. But we agree that no matter how to select the threshold, it will be artificial, and there is no gold standard. FPKM value itself should not be a cutoff to distinguish real transcripts and transcriptional noise, because even a low expressed transcript could be well supported by reads if the sequencing depth is deep enough. We chose 0.1 as the threshold because we found through manual checking that the mapped RNA-Seq reads from ENCODE data for some low expressed de novo genes (FPKM value between 0.1 and 0.2), including Udng3, and their gene structures are well supported. We also kept minimum usage of this cutoff, i.e., only for the result showed in Figure 1B.

The process used to select the three genes for detailed study from among the larger set of candidates should be detailed more precisely, especially given that the expression levels for two of the loci are relatively low. Were these the only three candidates meeting all of the (general) criteria mentioned? Are here any additional data (e.g. antibody staining of tissues) that protein products are produced?

Yes, other genes could also have been used, but we had to take a decision. Many considerations went into this decision, a major one was the availability of targeted lines in public resources (EMMA and KOMP). Note that also Udng3 had been targeted, but could not be revived. This is why we then decided to use a CRISPR approach.

There are no antibody data available for any of these genes. But by today´s standards, an antibody staining would anyway be seen very critically, due to the possibility of unaccounted cross-reactions.

9) The transcriptome analysis from the knockout experiments should be strengthened. Especially, it would be more convincing that the magnitude of the observed differences and the specific results themselves do not simply reflect experimental noise (i.e. on a genomic scale, sets of differentially expressed transcripts are always observed even between sets of technical replicates) if the authors were to compare two sets of control experiments to each other, or for example to randomly select WT and KO sample size-matched sets of individuals from the whole sample (which can then be permuted), and evaluate accordingly.

We include now also the permutation analysis as requested to support that our differential expression signals are valid (Figure 3—figure supplement 1). But we need to point out that we have also taken utmost care and experimental effort (much more than usual) to distinguish noise from real signal.

[Editors' note: further revisions were requested prior to acceptance, as described below.]

Essential revisions:

1) Focus the manuscript on the Undg3 experiments and results.

2) Remove the Undg2 knockout experiment and results from the main text and figures of the paper. If presented in supplementary information to help support the description of the overall process of the investigation, the strong (and null hypothesis) possibility that phenotypic observations are explained by the Zpf169 expression change should be used to help frame the presentation and discussion.

3) The Undg1 knockout experiment and results could be presented as secondary in the main text or as part of the supplementary information. In either case the approach described above should be followed.

4) The selection process for the de novo genes chosen for detailed examination (e.g. as discussed in the responses to reviewer comments) should be presented more fully in the main text of the paper.

5) Figure 2 should differentiate between the total absence of a gene from the presence of a disabled version of the gene.

We have followed your advice to concentrate the presentation on only one of the genes (formerly called Udng3, we use now the ENSEMBL designation Gm13030). This has led to a major rearrangement of the manuscript, including a new title, as well as an additional discussion on female-specific genes. Further, material that was previously in the supplementary files has partly been moved into the main text. We have also added a new figure showing the evolutionary tree and substitutions of the gene (Figure 3). Apart of this, we have not included any new analyses.

As discussed with the editors, our only reference to the two other genes is now in the discussion. We plan to do CRSPR knockouts for these to confirm the initial findings and we would hope that these can then be published as a Research Advance to the present paper.

[Editors' note: a further round of revisions were suggested.]

While the reviewers were generally pleased with how the paper was revised, there remains some skepticism concerning the ultimate in vivo reproductive phenotype (and they also raise some specific concerns and questions, including on the newly-presented data, that require your attention; see below). However, I do recognize the multiple in vitro results that altogether help strengthen that case, and I (George Perry) take responsibility for the decision to proceed with the understanding that (as always) future experiments may further clarify the biology. Overall, I conclude that the multiple layers of valuable insight that will provided by this eventual publication are likely to represent a substantial advance in this field of research, and thus I would encourage you to consider the below remaining issues to be addressed before acceptance.

We appreciate your positive decision and careful comments. After revisions following your suggestions, we are submitting the revised manuscript, with detailed point-to-point responses.

We submit the revised version of the manuscript plus an annotated version that shows all the changes. In addition, we replaced Figure 3 with a slightly optimized version (the contours in the tree were optimized), as well as a new file as Figure 3—figure supplement 3 which is a table that shows the statistics for the coding/noncoding substitutions.

Essential revisions from reviewer comments:

1) A data reporting / statistical inconsistency was identified by that needs to be addressed: The text states that they set up 10 mating pairs with homozygous mutant females and 10 mating pairs with WT females. However, Figure 5 (and the supplemental data) show that they actually collected data from 16 mating pairs with homozygous females, 13 matings pairs with WT females, and 7 mating pairs with heterozygous females. The p-value of 0.042 was obtained by combining the WT and Het data and comparing it against the homozygous mutant data. I get the same p-value (0.042) when I used the same statistical test used by the authors (two-tailed Wilcoxon rank sum test). When only WT is compared against mutant, the p-value is 0.12.

We apologize that this was a bit confusing, since we combined the data from two breeding rounds into one table and had described only one of them in the Materials and methods. But as described in the text, we analyzed first the data that we had already obtained from the normal stock breeding (coded under UC in the table, and they include also heterozygous animals). To confirm them, we set up another 10 dedicated breeding pairs (coded under WT and KO in the table). The P-Value was then indeed calculated across both sets of results, using a test that can be applied for such a data structure. We have clarified this in the text and the Materials and methods.

2) The authors have observed that deletion of Gm13030 affects gene expression in only one phase of the estrous cycle (when there is high progesterone receptor and estrogen receptor 1 expression, and low G protein-coupled estrogen receptor 1 expression). How does this phase correspond to the period between ovulation and implantation (when the deletion would presumably exert its effect)?

This is the period of proestrus or the starting of estrus, i.e., when the females start to become receptive for implantation. We have clarified this in the text.

3) Subsection “Transcriptome and phenotype changes”: "We identified a possible direct link between the identified phenotype of a shorter gestation length in the knockouts" Gestation length (i.e., between fertilization and birth) was not measured; a difference was observed in the timing between the birth of the first and second litter.

We corrected this into “interval to second birth”.

4) The authors suggest that Gm13030 may be involved in a parent-offspring conflict of interest, i.e., "this is a system where a selfish tendency of embryos in expense of the resources of the mothers could develop." (subsection “Transcriptome and phenotype changes” and related statements in this section and “Male bias versus female bias”). However, it does not appear that there is a conflict of interest here, and it is not clear how such a conflict might work. The pups that are born earlier are more likely to be cannibalized/ neglected by the mother, in addition to being more likely to compete with older siblings. Thus, there is potentially a trade-off between having a second litter quickly (more reproduction sooner) vs waiting (when second litters will be more likely to survive), but this isn't a conflict between parent and offspring. Parent-offspring conflicts generally involve situations where individual offspring would like the parent to invest more in them, whereas the parent would like to preserve resources for other offspring. While traveling down the oviduct, it is not clear how conceptuses could extract more resources from the mother in a way that compromised the mother's future reproduction without impacting their own survival.

We agree that this argument was presented a bit too superficially. It is not the embryo that exerts the conflict, but the selfish interest of Dcpp genes. As described in the text, they are induced by the embryo passage and stronger expression favors embryo implantation. In a wildtype situation an early birth could indeed be an advantage, given the high predation pressure, i.e., the mother does often not live long enough for a second birth. But this is rather speculative, i.e., we have corrected only the wording with respect to pointing out that it is the selfish nature of Dcpp expression that causes the conflict.

5) Provide a clearer justification/statement explaining why you selected Gm13030 for detailed analysis from the broader set of candidates and/or why you desired to study the function of a gene expressed in the female reproductive tract.

We had related to this in the Introduction: “There is generally a tendency to focus on male testis effects for newly evolved genes. However, considering that the females have much more complex reproduction than that in other organisms, including the morphology, physiology and behavior responding to mating choice, pregnancy, and parenting, de novo genes in mammals should also be expected to have a function in female-specific organs and affect female fertility and reproductive behavior as well.” which provides the framework for the choice of the gene. We have now modified this statement slightly and added: “From this list, we have then chosen a gene specifically expressed in the female reproductive system to address the question of the role of de novo gene evolution in this as yet little studied context.”

6) Figure 3 describes the expected number of substitutions expected between species. However, the number of coding and non-coding substitutions should be provided separately to support the statement "there is no bias towards coding mutations".

The expected number of substitutions refers to near neutral substitutions derived from whole genome comparisons, i.e., independent of them being coding or non-coding. We have clarified this. We have now also added a supplementary Table (Figure 3—figure supplement 3) that lists in all relevant pairwise comparisons the coding and non-coding substitutions for the given reading frame, as well as the corresponding P-Values (which are all non-significant).

https://doi.org/10.7554/eLife.44392.024

Article and author information

Author details

  1. Chen Xie

    Department of Evolutionary Genetics, Max Planck Institute for Evolutionary Biology, Plön, Germany
    Contribution
    Conceptualization, Data curation, Formal analysis, Validation, Investigation, Visualization, Methodology, Writing—original draft, Writing—review and editing
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-6183-7301
  2. Cemalettin Bekpen

    Department of Evolutionary Genetics, Max Planck Institute for Evolutionary Biology, Plön, Germany
    Contribution
    Formal analysis, Investigation, Methodology
    Competing interests
    No competing interests declared
  3. Sven Künzel

    Department of Evolutionary Genetics, Max Planck Institute for Evolutionary Biology, Plön, Germany
    Contribution
    Investigation, Methodology
    Competing interests
    No competing interests declared
  4. Maryam Keshavarz

    Department of Evolutionary Genetics, Max Planck Institute for Evolutionary Biology, Plön, Germany
    Contribution
    Formal analysis, Investigation, Methodology
    Competing interests
    No competing interests declared
  5. Rebecca Krebs-Wheaton

    Department of Evolutionary Genetics, Max Planck Institute for Evolutionary Biology, Plön, Germany
    Contribution
    Formal analysis, Investigation, Methodology
    Competing interests
    No competing interests declared
  6. Neva Skrabar

    Department of Evolutionary Genetics, Max Planck Institute for Evolutionary Biology, Plön, Germany
    Present address
    International Centre for Genetic Engineering and Biotechnology, Trieste, Italy
    Contribution
    Formal analysis, Investigation, Methodology
    Competing interests
    No competing interests declared
  7. Kristian Karsten Ullrich

    Department of Evolutionary Genetics, Max Planck Institute for Evolutionary Biology, Plön, Germany
    Contribution
    Formal analysis, Investigation, Methodology
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0003-4308-9626
  8. Diethard Tautz

    Department of Evolutionary Genetics, Max Planck Institute for Evolutionary Biology, Plön, Germany
    Contribution
    Conceptualization, Formal analysis, Supervision, Funding acquisition, Writing—review and editing
    For correspondence
    tautz@evolbio.mpg.de
    Competing interests
    Senior editor, eLife
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-0460-5344

Funding

Max Planck Institute for Evolutionary Biology (Open-access funding)

  • Diethard Tautz

H2020 European Research Council (NewGenes - 322564)

  • Diethard Tautz

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

The authors are grateful to R Neme for generating a first version of the candidate gene list; J Ruiz-Orera for generating the bam files from Ribo-Seq datasets; C Pfeifle, A Vock, A Jonas, H Harre, C Medina for keeping the mice used in this project; E Blohm-Sievers for helping mouse genotyping and Sanger sequencing; C Burghardt, E McConnell, and H Buhtz for helping Illumina sequencing. The Gm13030 knockout line used for this project was obtained from MBP. This work was supported by a European Research Council advanced grant to DT (NewGenes - 322564).

Ethics

Animal experimentation: The behavioral studies were approved by the supervising authority (Ministerium für Energiewende, Landwirtschaftliche Räume und Umwelt, Kiel) under the registration numbers V244-71173/2015, V244-4415/2017 and V244-47238/17. Animals were kept according to FELASA (Federation of European Laboratory Animal Science Association) guidelines, with the permit from the Veterinäramt Kreis Plön: 1401-144/PLÖ-004697. The respective animal welfare officer at the University of Kiel was informed about the sacrifice of the animals for this study.

Senior Editor

  1. Detlef Weigel, Max Planck Institute for Developmental Biology, Germany

Reviewing Editor

  1. George H Perry, Pennsylvania State University, United States

Reviewers

  1. George H Perry, Pennsylvania State University, United States
  2. Douglas B Menke, University of Georgia, United States

Version history

  1. Received: December 13, 2018
  2. Accepted: August 21, 2019
  3. Accepted Manuscript published: August 22, 2019 (version 1)
  4. Version of Record published: September 25, 2019 (version 2)

Copyright

© 2019, Xie et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Metrics

  • 3,302
    Page views
  • 383
    Downloads
  • 18
    Citations

Article citation count generated by polling the highest count across the following sources: Crossref, PubMed Central, Scopus.

Download links

A two-part list of links to download the article, or parts of the article, in various formats.

Downloads (link to download the article as PDF)

Open citations (links to open the citations from this article in various online reference manager services)

Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)

  1. Chen Xie
  2. Cemalettin Bekpen
  3. Sven Künzel
  4. Maryam Keshavarz
  5. Rebecca Krebs-Wheaton
  6. Neva Skrabar
  7. Kristian Karsten Ullrich
  8. Diethard Tautz
(2019)
A de novo evolved gene in the house mouse regulates female pregnancy cycles
eLife 8:e44392.
https://doi.org/10.7554/eLife.44392

Further reading

    1. Ecology
    2. Evolutionary Biology
    Hannah J Williams, Vivek H Sridhar ... Amanda D Melin
    Review Article

    Groups of animals inhabit vastly different sensory worlds, or umwelten, which shape fundamental aspects of their behaviour. Yet the sensory ecology of species is rarely incorporated into the emerging field of collective behaviour, which studies the movements, population-level behaviours, and emergent properties of animal groups. Here, we review the contributions of sensory ecology and collective behaviour to understanding how animals move and interact within the context of their social and physical environments. Our goal is to advance and bridge these two areas of inquiry and highlight the potential for their creative integration. To achieve this goal, we organise our review around the following themes: (1) identifying the promise of integrating collective behaviour and sensory ecology; (2) defining and exploring the concept of a ‘sensory collective’; (3) considering the potential for sensory collectives to shape the evolution of sensory systems; (4) exploring examples from diverse taxa to illustrate neural circuits involved in sensing and collective behaviour; and (5) suggesting the need for creative conceptual and methodological advances to quantify ‘sensescapes’. In the final section, (6) applications to biological conservation, we argue that these topics are timely, given the ongoing anthropogenic changes to sensory stimuli (e.g. via light, sound, and chemical pollution) which are anticipated to impact animal collectives and group-level behaviour and, in turn, ecosystem composition and function. Our synthesis seeks to provide a forward-looking perspective on how sensory ecologists and collective behaviourists can both learn from and inspire one another to advance our understanding of animal behaviour, ecology, adaptation, and evolution.

    1. Evolutionary Biology
    John S Favate, Kyle S Skalenko ... Premal Shah
    Research Article

    Changes in an organism’s environment, genome, or gene expression patterns can lead to changes in its metabolism. The metabolic phenotype can be under selection and contributes to adaptation. However, the networked and convoluted nature of an organism’s metabolism makes relating mutations, metabolic changes, and effects on fitness challenging. To overcome this challenge, we use the long-term evolution experiment (LTEE) with E. coli as a model to understand how mutations can eventually affect metabolism and perhaps fitness. We used mass spectrometry to broadly survey the metabolomes of the ancestral strains and all 12 evolved lines. We combined this metabolic data with mutation and expression data to suggest how mutations that alter specific reaction pathways, such as the biosynthesis of nicotinamide adenine dinucleotide, might increase fitness in the system. Our work provides a better understanding of how mutations might affect fitness through the metabolic changes in the LTEE and thus provides a major step in developing a complete genotype–phenotype map for this experimental system.