Allele-specific gene expression can underlie altered transcript abundance in zebrafish mutants
Abstract
In model organisms, RNA-sequencing (RNA-seq) is frequently used to assess the effect of genetic mutations on cellular and developmental processes. Typically, animals heterozygous for a mutation are crossed to produce offspring with different genotypes. Resultant embryos are grouped by genotype to compare homozygous mutant embryos to heterozygous and wild-type siblings. Genes that are differentially expressed between the groups are assumed to reveal insights into the pathways affected by the mutation. Here we show that in zebrafish, differentially expressed genes are often over-represented on the same chromosome as the mutation due to different levels of expression of alleles from different genetic backgrounds. Using an incross of haplotype-resolved wild-type fish, we found evidence of widespread allele-specific expression, which appears as differential expression when comparing embryos homozygous for a region of the genome to their siblings. When analysing mutant transcriptomes, this means that the differential expression of genes on the same chromosome as a mutation of interest may not be caused by that mutation. Typically, the genomic location of a differentially expressed gene is not considered when interpreting its importance with respect to the phenotype. This could lead to pathways being erroneously implicated or overlooked due to the noise of spurious differentially expressed genes on the same chromosome as the mutation. These observations have implications for the interpretation of RNA-seq experiments involving outbred animals and non-inbred model organisms.
Editor's evaluation
Zebrafish strains are typically considerably polymorphic. White and colleagues tested the hypothesis that genes in linkage with a mutant allele might show allele-specific expression differences and thus potentially confound the interpretation of mutant effects. Using a variety of mutant and wild-type alleles with sophisticated analysis of RNA-seq data in zebrafish embryos they demonstrate over-representation of expression changes of genes in linkage with the mutant allele on the same chromosome. The authors provide Gene Ontology analyses to demonstrate how the allele-specific expression differences may impact on the interpretation of differential gene expression caused by a mutation. The data are extensive, carefully analysed and of sufficient depth and quality to support their main claim of frequent occurrence of allele-specific gene expression in outcross experiments. The findings of this study will be of interest to geneticists working with zebrafish strains or with strains of other polymorphic species.
https://doi.org/10.7554/eLife.72825.sa0Introduction
Large-scale genetic screens to identify gene function by randomly introducing mutations have been a staple of zebrafish genetics for several decades (Driever et al., 1996; Haffter et al., 1996; Kettleborough et al., 2013). The advent of RNA-sequencing (RNA-seq) has enabled investigators to estimate the location of such mutations in the genome, while also providing information regarding gene expression levels and affected cellular pathways in the mutants. The bioinformatics pipelines which process RNA-seq data to generate gene expression information focus on transcript abundance, differential splicing, and gene set enrichments, and, in general, genomic location is not considered when assessing genes that are differentially expressed (DE) in a mutant context. Here, we report that physical location can impact a gene’s likelihood of being DE in mutant zebrafish.
In the typical protocol for introducing random point mutations, male zebrafish from a laboratory wild-type strain are treated with N-ethyl-N-nitrosourea (ENU) to mutagenise sperm (Kettleborough et al., 2011; Mullins et al., 1994). The mutagenised fish (G0) are mated with wild-type females to produce F1 offspring, each heterozygous at random novel mutation sites. F1 fish are outcrossed with wild types to produce clutches of F2 offspring, which are subsequently incrossed to produce F3 embryos. The F3 clutches contain the novel mutations in Mendelian ratios, and in a forward genetics approach are screened for recessive phenotypes of interest which appear in approximately 25% of embryos (Mullins et al., 1994). These embryos are referred to as ‘mutants’ whereas those without phenotypes are ‘siblings’.
Mutant embryos are homozygous for a novel allele (the ‘causative mutation’) and due to genetic linkage, they are likely to be homozygous for alleles physically nearby on the chromosome. The location encompassing the causative mutation therefore lies in a region which is highly homozygous in mutants, yet heterozygous in siblings. This is referred to as linkage disequilibrium (LD). The region of high LD can be mapped using high-throughput sequencing and bioinformatics pipelines (Mackay and Schulte-Merker, 2014; Minevich et al., 2012; Obholzer et al., 2012) whereas prior efforts involved painstaking genotyping of simple sequence length polymorphisms and genome walks using bacterial or P1 artificial chromosome libraries or subsequently, microarrays (Stickney et al., 2002; Zhang et al., 1998).
All mapping processes rely on identification of polymorphic loci throughout the genome. Laboratory zebrafish strains have a high degree of intra-strain polymorphism (Guryev et al., 2006), but mapping is aided by the introduction of alleles from other strains. Thus, mutagenised males are often paired with females from a different strain. As a result, in a mapping cross, alleles in the mutants and siblings are inherited from two different strains. This remains true throughout the multiple generations that a mutant line is maintained in a laboratory.
In this study, we report that the highly polymorphic nature of zebrafish strains can lead to gene expression differences between mutant and sibling embryos through allele-specific expression (ASE). The effect of ASE is well documented across many species, and can be tissue- and condition-specific (Ayroles et al., 2009; Doss et al., 2005; Fu et al., 2009; Battle et al., 2017; Huang et al., 2015; Kim-Hellmuth et al., 2020; Storey et al., 2005). Here, this phenomenon manifests as a cluster of DE genes located near to the causative mutation site in many different unrelated mutant lines. The differential transcript levels of these local genes are likely due to expression differences between wild-type strains rather than altered transcription due to the mutation. We confirm the high prevalence of ASE in zebrafish in the SAT line which is derived from only two haplotypes. This observation has implications for researchers attempting to use differential expression to explain phenotypes of interest, not only in zebrafish, but also in other outbred model organisms, as these local genes may simply be a red herring.
Results
Differentially expressed genes are often enriched on the mutant chromosome
To map the causal mutations for a number of different mutants from forward genetic screens, we used RNA-seq and LD mapping, based on Cloudmap (Minevich et al., 2012). A representative LD mapping plot (taken from the mutant line u426) is shown in Figure 1. We observed a high degree of LD on chromosome 7 at approximately 22 Mbp, suggesting the phenotype-causing mutation is near this position. DESeq2 reported 209 genes as DE (adjusted p-value < 0.05) between mutants and siblings. Annotating the LD mapping plot with the position of these genes showed a cluster of DE genes near the LD mapping peak on chromosome 7. Indeed, we found 15 DE genes in an arbitrarily sized 20 Mbp window centred on the mapping peak at 22 Mbp, representing 7% of all DE genes. For comparison, a 20 Mbp window randomly sampled (1000 iterations) from the zebrafish genome contains approximately 1.4% of known genes.
We then used a logistic regression model to examine the effect of LD on the probability of an individual gene being DE. A summary of each line and the regression results are presented in Table 1. Of nine mutant lines analysed (Supplementary file 1), seven samples showed a significant, positive effect of LD (Benjamini/Hochberg adjusted p-value < 0.05). To help visualise the effect of LD on DE probability, we calculated an odds ratio for each sample by comparing the DE probability at the site of maximum LD with the probability at a site of median LD. In the most extreme case (the sample nl14), the likelihood of finding a DE gene near to the mutation site was over 100-fold higher than the likelihood of finding one at a random other location in the genome.
In parallel, we were analysing a separate catalogue of 3’ tag sequencing experiments of zebrafish mutant lines (115 experiments), most of which were generated and made available as part of the Zebrafish Mutation Project (Collins et al., 2015; Dooley et al., 2019; Kettleborough et al., 2013). These were analysed for differential expression, producing a large collection of DE gene lists. We noticed that, often, the mutant chromosome had a large proportion of the total number of DE genes in the experiment. For example, comparing mitfaw2/w2 embryos to siblings produces 116 DE genes, 48 of which are present on chromosome 6, which is the chromosome where mitfa is located (Figure 2A).
-
Figure 2—source data 1
- https://cdn.elifesciences.org/articles/72825/elife-72825-fig2-data1-v2.txt
-
Figure 2—source data 2
- https://cdn.elifesciences.org/articles/72825/elife-72825-fig2-data2-v2.xlsx
-
Figure 2—source data 3
- https://cdn.elifesciences.org/articles/72825/elife-72825-fig2-data3-v2.xlsx
To investigate this, we tested for chromosomes that had an enrichment of DE genes under the null hypothesis that they are randomly distributed across the genome. In all, 60 chromosomes from 37 lines had a statistically significant enrichment of the DE genes (binomial test, Bonferroni adjusted p < 0.05). Of these, 44 were on the chromosome carrying the mutation being investigated in the experiment (Supplementary file 2). Of the other 16, 7 had an enrichment on chromosome 9. This was driven by expression of γ-crystallin genes (Supplementary file 3), which are expressed in the lens and present in a cluster on chromosome 9 (Greiling et al., 2009) that we have previously observed as being co-regulated (White et al., 2017). This suggests that the eyes are affected in some of the analysed mutants. Whether there was an enrichment of DE genes on the mutant chromosome did not depend on the total number of DE genes found in the experiment, although experiments with very high numbers of DE genes tended not to show an enrichment (Figure 2B).
In one experiment, we noticed that the differential expression of some genes was linked to one of the wild-type chromosomes in the experiment. This experiment was an intercross of two different sox10 alleles, t3 (Dutton et al., 2001) and baz1 (Carney et al., 2006) sampled at 24 hr post-fertilisation (hpf). Embryos were genotyped for both sox10 alleles, which allowed us to also track the wild-type chromosomes in the cross. We noticed that two of the genotypes had expression levels for some genes on the same chromosome as sox10 that were different from the other two genotypes (Figure 2C). The groups with higher expression shared the wild-type chromosome from the baz1/+ parent (Figure 2C, yellow chromosome) whereas the others shared the chromosome carrying the baz1 allele (Figure 2C, blue chromosome). One explanation for this is that there is higher expression from the si:ch73–308m11.1 allele on the wild-type chromosome (Figure 2C, yellow chromosome), which led us to hypothesise that the enrichment of DE genes on the mutant chromosome is not necessarily dependent on the mutant gene.
Our hypothesis is that ASE, that is, polymorphism-driven variation in expression levels of genes, is common across the genome. This would manifest as differential expression when a genomic locus is driven to homozygosity in some individuals and the expression levels of genes in this locus are compared to those in individuals that are heterozygous, or homozygous for the other allele.
ASE is common in a wild-type cross
To test the hypothesis that the over-representation of DE genes on the mutant chromosome can be driven by ASE independently of the mutated gene, we investigated gene expression in wild-type fish with defined haplotypes to enable easy identification of the different alleles in the cross. We used the SAT line, which was generated from an intercross of one fully sequenced double haploid AB fish and one fully sequenced double haploid Tübingen fish (Howe et al., 2013). This means that for any position in the genome there are up to two possible alleles. The original haplotypes have recombined through the generations that the SAT line has been maintained by incrossing.
We incrossed two SAT fish, fin-clipped them to isolate DNA for exome sequencing, collected 96 morphologically normal embryos at 5 days post-fertilisation (dpf), extracted RNA from the individual embryos, and did RNA-seq on the 96 samples. We used the exome sequence of the SAT parent fish for this cross to call SNPs and identify regions that are either homozygous for the AB haplotype, homozygous for the Tübingen haplotype, or heterozygous. Using the RNA-seq reads and SNPs identified in the parental exome data, we genotyped the embryos at locations that distinguish the AB and Tübingen haplotypes. Aggregating these data in 1 Mbp regions allowed us to determine the haplotypes of each individual embryo. We identified regions of the parental genomes where at least two genotypes, and thus potentially ASE, are possible in the offspring (informative regions) and where we had sufficient read depth to unambiguously identify the haplotypes in the offspring. We grouped the 96 RNA-seq samples according to their haplotype in that region (Figure 3A–B). Across the genome, this resulted in 82 different groupings of embryos according to local genotype. Embryos that had evidence of a recombination event within the informative region were assigned to a genotype group according to the largest contiguous section of the region.
-
Figure 3—source data 1
- https://cdn.elifesciences.org/articles/72825/elife-72825-fig3-data1-v2.xlsx
-
Figure 3—source data 2
- https://cdn.elifesciences.org/articles/72825/elife-72825-fig3-data2-v2.xlsx
-
Figure 3—source data 3
- https://cdn.elifesciences.org/articles/72825/elife-72825-fig3-data3-v2.xlsx
-
Figure 3—source data 4
- https://cdn.elifesciences.org/articles/72825/elife-72825-fig3-data4-v2.xlsx
Differential gene expression analysis on each different embryo grouping revealed DE genes located in or close to the region of the genome that was used to define the embryo groups (Figure 3 and Figure 3—figure supplement 1, Supplementary file 4). The log2(fold changes) of affected genes varied widely but had an absolute mean of 0.5 for the homozygous versus homozygous comparison (Figure 3E). This demonstrates that genes can show ASE in a wild-type context (Figure 3C–E).
Through these analyses, it was also possible to see the consequences of meiotic recombination in individual embryos (Figure 3B–C). For example, two samples (C7 and E5) showed recombination in the 31–37 Mbp region of chromosome 5 (red ovals in Figure 3B). The genotypes near the myhc4 gene were the opposite of that called for the whole region and this is evident in the count plot – C7 has expression comparable with the Tu/Tu haplotype, despite being assigned AB/Tu, and E5 has expression similar to the AB/Tu samples despite being assigned Tu/Tu based on the entire 31–37 Mbp region (Figure 3C).
ASE can alter interpretation of experiments
To assess what impact ASE might have on the interpretation of RNA-seq experiments, we looked at the effect on Gene Ontology (GO) enrichments if DE genes on the same chromosome as the mutation were removed from the DE gene list. To do this, we ran GO enrichment on two different gene lists for each experiment. The first list comprised all the DE genes and the second excluded genes on the same chromosome as the mutation. The gene harbouring the mutation was not removed if it was DE. It is important to note that removing all the genes on the same chromosome potentially removes genes that are misregulated due to the mutation as well as those caused by mutation-independent ASE; for almost all experiments it is not possible to distinguish between the two without further experimental analyses (see next section). The enrichment of GO terms from the two lists was compared using the Jaccard similarity coefficient (Jaccard, 1912).
These analyses showed that ASE could affect enriched GO terms, but that this effect was very variable (Figure 4A). This is not unexpected and will depend on how many of the DE genes are on the same chromosome as the mutation and whether the genes on the same chromosome contribute to any of the enriched GO terms using the full list. Experiments where there wasn't an enrichment of DE genes on the mutant chromosome generally did not show as large an effect, which again makes sense as the DE genes linked to the mutation were a smaller fraction of the gene list.
Overall, experiments with fewer DE genes showed larger effects. However, there were experiments with hundreds to thousands of DE genes where only 50% of GO terms were shared between both sets. For example, in a sox10 t3/baz1 intercross at 36 hpf, 302 genes were DE, 32 of which were on chromosome 3 (the same chromosome as sox10). GO term enrichment using the full list of genes produced 92 enriched GO terms, only 49 of which were also enriched if the genes on chromosome 3 had been removed from the list (Figure 4B–C). In addition, 28 extra GO terms were enriched using the shorter gene list.
Distinguishing response genes from ASE
Having established that ASE is widespread and can significantly alter the transcriptional profiles of mutant zebrafish, we wondered whether there is a way to distinguish potential ‘true’ response genes located on the same chromosome as the mutation, that is, those that change expression due to the altered function of the mutated gene, from those DE genes that arise through ASE. We went back to the expression data from the compound heterozygous sox10t3;sox10baz1 cross and found that the genes that were DE between sox10t3/baz1 individuals and their siblings and located on chromosome 3 fell into different groups with respect to their expression levels across the four different genotypes (Figure 5). Ten genes showed expression patterns as shown in Figure 2C, where increased expression was linked to the presence of a specific allele (Figure 5A and C). Only one gene (ENSDARG00000110416) located on another chromosome, encoding an miRNA, showed a similar pattern (Figure 5—figure supplement 1). By contrast, the other 15 DE genes (excluding sox10 itself) on chromosome 3 showed genotype-dependent transcript levels that were consistent with (though do not prove) a response to loss of sox10 function, that is, the wild types and the compound heterozygous individuals had opposing expression levels whereas both heterozygous genotypes had intermediate levels or the same as wild types (Figure 5B and C). Sox10 is a key regulator of neural crest development, so we looked for published spatial expression data at 24 hpf on ZFIN (zfin.org). Of the genes we speculated to be downstream of sox10, all four with data on ZFIN are expressed in neural (kctd13 and cygb1) and neural crest (syngr1a and vasnb) derivatives, whereas the three ASE candidates with available data are not spatially restricted (traf7, polr3h, and polr2f). Consequently, for genes showing single allele-linked expression patterns, it is likely that ASE is the primary driver of their differential expression and that they are probably red herrings.
-
Figure 5—source data 1
- https://cdn.elifesciences.org/articles/72825/elife-72825-fig5-data1-v2.xlsx
-
Figure 5—source data 2
- https://cdn.elifesciences.org/articles/72825/elife-72825-fig5-data2-v2.xlsx
-
Figure 5—source data 3
- https://cdn.elifesciences.org/articles/72825/elife-72825-fig5-data3-v2.xlsx
Discussion
Transcriptional profiling is a powerful and popular technique to investigate the gene expression changes resulting from organismal insults such as drug treatments, infections, or altered gene function. To gain mechanistic insight into gene regulatory events affected by a particular mutation, it is paramount to distinguish specific responses due to altered function of the mutated genes from other causes that change transcript abundance, such as developmental delay or technical artefacts such as batch effects. In this work, we describe a previously under-appreciated effect of ASE on the transcriptomes of zebrafish mutants. In 51 out of 124 transcriptional profiling experiments comparing zebrafish mutants and siblings at different stages of development, we found a statistically significant enrichment of DE genes on the same chromosome as the mutated gene. In a previous study using RNA-seq to map ENU mutations (Miller et al., 2013), it was noted that very few genes were detected as being DE in regions linked to the mutation. This difference is likely the result of methodological differences between the two studies, the most significant of which is the sample size. Miller et al. used one mutant and one wild-type sample, whereas our study has a median sample size of 10 per condition.
The physical arrangement of genes in an organism’s genome is not random. Co-expression of functionally related genes using shared regulatory elements and/or transcription factors provides an evolutionary pressure to keep these genes clustered in physical proximity within a chromosome (Thévenin et al., 2014). Consequently, it is possible that a mutation affecting one gene could alter expression levels of nearby genes if they form a functionally related cluster. However, the neighbouring DE genes in the tested mutant lines showed no obvious functional connections. Of note, 7/16 chromosomal enrichments that were not linked to the mutated genes affected a chromosome 9 cluster of crystallin genes that are expressed in the eye. Instead we found that the enrichments were driven by ASE, which has been widely demonstrated across different tissues and organisms (Battle et al., 2017; Huang et al., 2015; Kim-Hellmuth et al., 2020) and can play a role in developmental and disease processes (Libioulle et al., 2007; Moffatt et al., 2007; Nicolae et al., 2010).
ASE is often tissue-dependent and the average log2(fold change) between alleles in human ASE is about 0.6 as measured in different tissues (Battle et al., 2017). Here, we have observed ASE at similar magnitudes even when averaged across all tissues through whole embryo RNA-seq. This suggests that the expression differences between alleles would be even larger when looking at individual tissues.
Zebrafish wild-type ‘strains’ are not strains in the same sense as the well-characterised inbred lines in mouse or medaka, for example. Zebrafish are highly polymorphic, such that ASE is evident even in lines that were not outcrossed to another genetic background before the experiment. Consequently, ASE could potentially impact the penetrance or expressivity of phenotypes caused by the same mutation in different genetic backgrounds (Sanders and Whitlock, 2003; Sheehan-Rooney et al., 2013; Young et al., 2019). Indeed, Sheehan-Rooney et al., 2013, showed that the expression of ahsa1a differed by more than threefold in two different backgrounds (WIK and EkkWill) and was responsible for a difference in severity of the phenotype caused by a mutation in gata3. The effect of ASE is expected to be much less pronounced in RNA-seq data from inbred mouse strains in which allelic polymorphism is much less common. Indeed, in our work on RNA-seq data from mouse knockouts (Collins et al., 2019), we did not observe enrichment of DE genes on the mutant chromosome. However, ASE should be considered when working with wild mouse strains, crosses between different genetic backgrounds, or indeed with any organism that isn’t fully inbred.
Given that ASE can lead to differential expression between mutants and siblings, can we correct for it in transcript profiling experiments? The solution is not as simple as removing any DE genes in the same region of the chromosome as the mutation being studied. This is because the DE genes on the same chromosome as the mutation are likely to be a mix of genuine responses to the mutation and linkage of ASE unrelated to the mutation. One way to resolve this would be to use two different mutant alleles of the same gene to generate compound heterozygotes and enable tracking of parental alleles. This would allow genotyping for both alleles and the ability therefore to also identify the different wild-type chromosomes in the cross. As shown in Figure 5, this makes it possible to distinguish between potential genuine responses to the mutation and spurious ones. Another possibility would be to identify an informative SNP in the wild-type alleles of the mutant gene being studied to allow genotyping of both the mutation and the wild-type alleles. There are also complementary approaches to investigate gene function that avoid the confounding effects of ASE. Transgenic overexpression of the gene of interest could validate true target gene responses and should leave ASE genes unaffected. Alternatively, analysing morpholino- or CRISPR/Cas9-injected G0 embryos (Eisen and Smith, 2008; Kroll et al., 2021; Wu et al., 2018) should avoid the ASE effect as the injected embryos will have a mix of background alleles. Note that although using G0 CRISPR/Cas9 mutants avoids the effect of ASE, F2 fish carrying stable CRISPR/Cas9-induced mutations will again show the effects of ASE when comparing homozygous mutants to siblings.
All these methods involve extra effort and expense, as well as having their own specific caveats and drawbacks (such as off-targets effects and mosaicism), and so would need careful consideration with respect to the need to validate specific gene expression changes for the conclusions of the study. As a first step, we recommend that, whatever analysis pipeline is used, the output of DE genes contains the locations of the genes, making it possible to easily see which genes are on the same chromosome as the mutation and therefore candidates for ASE.
Materials and methods
RNA-seq and LD mapping
Request a detailed protocolEight independent mutant fish lines under study by groups at UCL (zebrafishucl.org) were analysed by RNA-seq in order to simultaneously gain gene expression data and to measure alleles across the genome in order to help map the causative mutation. Seven of these lines were the product of ENU random mutagenesis, one was created by a random viral insertion, and one by a targeted CRISPR insertion. An additional sample was taken from the literature (Armant et al., 2016) at random by searching Pubmed for papers where RNA-seq data had been uploaded to the European Nucleotide Archive.
In preparation for RNA-seq, embryos or larvae were sorted into two groups based on their phenotypes (mutant and sibling), each comprising three pools of at least 15 individuals. RNA was extracted from these six samples and sequenced using the IIlumina NextSeq platform (2 × 75 bp reads, approximately 75 million reads per sample). Reads were aligned to the GRCz10 genome using HISAT2 (Kim et al., 2019). To measure differential expression, transcripts were counted from the aligned RNA-seq reads using featureCounts (Liao et al., 2014) and compared using DESeq2 (Love et al., 2014). A gene was considered DE if the adjusted p-value from DESeq2 was below 0.05.
To perform LD mapping, the three samples in each group were analysed as a single pooled sample for single nucleotide polymorphisms (SNPs) by BCFtools (Li, 2011), calculating the allele ratio at each SNP location. SNPs which appeared in only one of the two genotype pools were filtered out, as were those with a quality score below 100. The absolute difference between a given SNP’s mutant and sibling allele ratio indicates the degree of segregation of that allele (Mackay and Schulte-Merker, 2014). These values can be smoothed using LOESS, producing maps of the genome showing regions of high LD (Minevich et al., 2012). The physical location of each gene’s start codon in the GRCz10 genome assembly was downloaded from Ensembl BioMart and appended to the DESeq2 table. The LD value was estimated at each gene’s position based on interpolation of the LOESS-smoothed SNP data. Finally, a logistic regression model was used to test the effect of LD on a gene’s probability of being DE. This was performed using the Logit function of the Python module statsmodels.
DeTCT sequencing
Request a detailed protocolDeTCT libraries were generated, sequenced, and analysed as described previously (Collins et al., 2015). The resulting genomic regions and putative 3′ ends were filtered using DeTCT’s filter_output (v0.2.0)script (https://github.com/iansealy/DETCT/blob/master/script/filter_output.pl, Sealy, 2020) in its --strict mode. --strict mode removes 3’ ends in coding sequence, transposons, if nearby sequence is enriched for As or if not near a primary hexamer. Regions not associated with 3′ ends are also removed. Differential expression analysis was done using DeTCT’s run_pipeline (v0.2.0)script (https://github.com/iansealy/DETCT/blob/master/script/run_pipeline.pl) using DESeq2 (Love et al., 2014) with an adjusted p-value cut-off of 0.05. Sequence data were deposited in the European Nucleotide Archive (ENA) under accessions ERP001656, ERP004581, ERP006132, ERP003802, ERP004579, ERP005517, ERP008771, ERP005564, ERP009868, ERP006133, ERP009078, and ERP013835. Details on the experiments are in Supplementary file 5.
DNA sequencing
Request a detailed protocolDouble haploid AB and Tübingen fish were produced and sequenced as described in Howe et al., 2013. Whole genome sequencing data (SRA Study: ERP000232) was downloaded from the European Nucleotide Archive. Exome sequencing on parents for the wild-type SAT cross was done as described (Kettleborough et al., 2013). Reads were mapped to the GRCz11 genome assembly using BWA (Li and Durbin, 2010, v0.5.10) and duplicates were marked with biobambam (Tischler and Leonard, 2014). SNPs were called using a modified version of the 1000 Genomes Project variant calling pipeline (Abecasis et al., 2010). Initial calls were done by SAMtools mpileup (Li, 2011), QCALL (Le and Durbin, 2011), and the GATK Unified Genotyper (DePristo et al., 2011). SNPs not called by all three callers were removed from the analysis, along with any SNP that did not pass a caller’s standard filters. Additionally, SNPs were removed where the genotype quality was lower than 100 for GATK and lower than 50 for QCALL and SAMtools mpileup and where the mean read depth per sample was less than 10. These SNP calls were then filtered for positions that are informative of the parental background in the SAT cross, that is, ones that are homozygous reference in one double haploid fish and homozygous alternate in the other.
RNA-seq of wild-type SAT embryos
Request a detailed protocolRNA was extracted from 5 dpf larvae as described previously (Wali et al., 2022). Briefly, RNA was extracted from individual embryos by mechanical lysis in RLT buffer (Qiagen) containing 1 μl of 14.3 M β-mercaptoethanol (Sigma). The lysate was combined with 1.8 volumes of Agencourt RNAClean XP (Beckman Coulter) beads and allowed to bind for 10 min. The plate was applied to a plate magnet (Invitrogen) until the solution cleared and the supernatant was removed without disturbing the beads. This was followed by washing the beads three times with 70% ethanol. After the last wash, the pellet was allowed to air-dry for 10 min and then resuspended in 50 μl of RNAse-free water. RNA was eluted from the beads by applying the plate to the magnetic rack. Samples were DNase-I treated to remove genomic DNA. RNA was quantified using Quant-IT RNA assay (Invitrogen). Stranded RNA-seq libraries were constructed using the Illumina TruSeq Stranded RNA protocol after treatment with Ribozero. Libraries were pooled and sequenced on six Illumina HiSeq 2500 lanes in 75 bp paired-end mode. Sequence data were deposited in ENA under accession ERP011556. Reads for each sample were aggregated across lanes (median reads per embryo = 18.1 M) and mapped to the GRCz11 zebrafish genome assembly using TopHat (Kim et al., 2013, v2.0.13, options: --library-type fr-firststrand). The data were assessed for technical quality (GC-content, insert size, proper pairs, etc.) using QoRTs (Hartley and Mullikin, 2015). Counts for genes were produced using htseq-count (Anders et al., 2015, v0.6.0 options: --stranded = reverse) with the Ensembl v97 annotation as a reference. Differential expression analysis was done in R (R Development Core Team, 2019) with DESeq2 (Love et al., 2014) using a cut-off for adjusted p-values of 0.05.
The samples were genotyped at the positions that were determined to be informative using the double haploid sequence using GATK’s SplitNCigarReads tool followed by the HaplotypeCaller (Poplin et al., 2017) on the RNA-seq data. The genotype calls were converted to their strain of origin (either DHAB or DHTu) and haplotypes were called by taking the most frequent genotype call in 1 Mbp windows. Any haplotypes that were not consistent with the parental haplotypes were removed.
Data availability
Sequencing data have been deposited in ENA under the accessions shown in the Materials and Methods. Differentially expressed gene lists for all the experiments are available at https://doi.org/10.6084/m9.figshare.15082239.
-
ENAID ERP011556. Transcriptome_profiling_of_zebrafish_embryos_from_the_SAT__Sanger_AB_T_bingen__strain.
-
ENAID ERP003802. Transcriptome_profiling_of_zebrafish_neural_crest_mutants.
-
ENAID ERP004579. Transcriptome_profiling_of_zebrafish_neural_crest_mutants.
-
ENAID ERP005517. Transcriptome_profiling_of_zebrafish_neural_crest_mutants.
-
ENAID ERP008771. Transcriptome_profiling_of_zebrafish_neural_crest_mutants.
-
ENAID ERP001656. Transcriptome_profiling_of_mutants_from_the_zebrafish_genome_project.
-
ENAID ERP004581. Transcriptome_profiling_of_mutants_from_the_zebrafish_genome_project.
-
ENAID ERP006132. Transcriptome_profiling_of_embryos_collected_for_one_or_more_alleles_identified_by_the_zebrafish_mut.
-
ENAID ERP005564. Transcriptome_profiling_of_zebrafish_muscle_mutants.
-
ENAID ERP009868. Transcriptome_profiling_of_zebrafish_muscle_mutants.
-
ENAID ERP006133. Transcriptome_profiling_of_embryos_genotyped_for_one_or_more_alleles_in_genes_involved_in_DNA_methyl.
-
ENAID ERP009078. Transcriptome_profiling_of_zebrafish_metabolic_mutants.
-
ENAID ERP013835. Transcriptome_profiling_of_zebrafish_hesx1_knockout_embryos.
-
ENAID ERP000232. The Sequence of the Two Most Common Zebrafish Laboratory Strains: AB and Tuebingen.
References
-
HTSeq--A Python framework to work with high-throughput sequencing dataBioinformatics (Oxford, England) 31:166–169.https://doi.org/10.1093/bioinformatics/btu638
-
Systems genetics of complex traits in Drosophila melanogasterNature Genetics 41:299–307.https://doi.org/10.1038/ng.332
-
A direct role for Sox10 in specification of neural crest-derived sensory neuronsDevelopment (Cambridge, England) 133:4619–4630.https://doi.org/10.1242/dev.02668
-
Cis-acting expression quantitative trait loci in miceGenome Research 15:681–691.https://doi.org/10.1101/gr.3216905
-
A genetic screen for mutations affecting embryogenesis in zebrafishDevelopment (Cambridge, England) 123:37–46.https://doi.org/10.1242/dev.123.1.37
-
Zebrafish colourless encodes sox10 and specifies non-ectomesenchymal neural crest fatesDevelopment (Cambridge, England) 128:4113–4125.https://doi.org/10.1242/dev.128.21.4113
-
Controlling morpholino experiments: don’t stop making antisenseDevelopment (Cambridge, England) 135:1735–1743.https://doi.org/10.1242/dev.001115
-
System-wide molecular evidence for phenotypic buffering in ArabidopsisNature Genetics 41:166–167.https://doi.org/10.1038/ng.308
-
The zebrafish lens proteome during development and agingMolecular Vision 15:2313–2325.
-
High-throughput target-selected gene inactivation in zebrafishMethods in Cell Biology 104:121–127.https://doi.org/10.1016/B978-0-12-374814-0.00006-9
-
Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotypeNature Biotechnology 37:907–915.https://doi.org/10.1038/s41587-019-0201-4
-
Cell type-specific genetic regulation of gene expression across human tissuesScience (New York, N.Y.) 369:eaaz8528.https://doi.org/10.1126/science.aaz8528
-
Fast and accurate long-read alignment with Burrows-Wheeler transformBioinformatics (Oxford, England) 26:589–595.https://doi.org/10.1093/bioinformatics/btp698
-
featureCounts: An efficient general purpose program for assigning sequence reads to genomic featuresBioinformatics (Oxford, England) 30:923–930.https://doi.org/10.1093/bioinformatics/btt656
-
A statistical approach to mutation detection in zebrafish with next-generation sequencingJournal of Applied Ichthyology 30:696–700.https://doi.org/10.1111/jai.12528
-
Yap and Taz regulate retinal pigment epithelial cell fateDevelopment (Cambridge, England) 142:3021–3032.https://doi.org/10.1242/dev.119008
-
Rapid positional cloning of zebrafish mutations by linkage and homozygosity mapping using whole-genome sequencingDevelopment (Cambridge, England) 139:4280–4290.https://doi.org/10.1242/dev.083931
-
SoftwareR: A language and environment for statistical computingR Foundation for Statistical Computing, Vienna, Austria.
-
Phenotype of the zebrafish masterblind (mbl) mutant is dependent on genetic backgroundDevelopmental Dynamics 227:291–300.https://doi.org/10.1002/dvdy.10308
-
Rapid mapping of zebrafish mutations with SNPs and oligonucleotide microarraysGenome Research 12:1929–1934.https://doi.org/10.1101/gr.777302
-
biobambam: tools for read pair collation based algorithms on BAM filesSource Code for Biology and Medicine 9:13.https://doi.org/10.1186/1751-0473-9-13
-
A Rapid Method for Directed Gene Knockout for Screening in G0 ZebrafishDevelopmental Cell 46:112–125.https://doi.org/10.1016/j.devcel.2018.06.003
Article and author information
Author details
Funding
Medical Research Council (MR/L003775/1)
- Stephen W Wilson
Medical Research Council (MR/T020164/1)
- Stephen W Wilson
Wellcome Trust (095722/Z/11/Z)
- Stephen W Wilson
Wellcome Trust (206194)
- Richard J White
- Elisabeth M Busch-Nentwich
The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.
Acknowledgements
We thank G Gestri, L Tucker, R Martinho, A Faro, G Powell, M Khosravi, I Barlow, J Rihel and T Hawkins for providing RNA-seq datasets from mutant lines, Alex Nechiporuk for providing mutant lines, technical staff in our fish facilities for animal care, and Ian Sealy and Munise Merteroglu for critical reading of the manuscript. This study was supported by MRC Programme Grant support to Gaia Gestri and SW (MR/L003775/1 and MR/T020164/1), and a Wellcome Trust Investigator Award to SW (095722/Z/11/Z). EBN and RJW were supported by core funding to the Sanger Institute by the Wellcome Trust (206194).
Copyright
© 2022, White et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.
Metrics
-
- 2,980
- views
-
- 220
- downloads
-
- 10
- citations
Views, downloads and citations are aggregated across all versions of this paper published by eLife.
Download links
Downloads (link to download the article as PDF)
Open citations (links to open the citations from this article in various online reference manager services)
Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)
Further reading
-
- Chromosomes and Gene Expression
To function effectively as an integrated system, the transcriptional and post-transcriptional machineries must communicate through mechanisms that are still poorly understood. Here, we focus on the zinc-finger Sfp1, known to regulate transcription of proliferation-related genes. We show that Sfp1 can regulate transcription either by binding to promoters, like most known transcription activators, or by binding to the transcribed regions (gene bodies), probably via RNA polymerase II (Pol II). We further studied the first mode of Sfp1 activity and found that, following promoter binding, Sfp1 binds to gene bodies and affects Pol II configuration, manifested by dissociation or conformational change of its Rpb4 subunit and increased backtracking. Surprisingly, Sfp1 binds to a subset of mRNAs co-transcriptionally and stabilizes them. The interaction between Sfp1 and its client mRNAs is controlled by their respective promoters and coincides with Sfp1’s dissociation from chromatin. Intriguingly, Sfp1 dissociation from the chromatin correlates with the extent of the backtracked Pol II. We propose that, following promoter recruitment, Sfp1 accompanies Pol II and regulates backtracking. The backtracked Pol II is more compatible with Sfp1’s relocation to the nascent transcripts, whereupon Sfp1 accompanies these mRNAs to the cytoplasm and regulates their stability. Thus, Sfp1’s co-transcriptional binding imprints the mRNA fate, serving as a paradigm for the cross-talk between the synthesis and decay of specific mRNAs, and a paradigm for the dual-role of some zinc-finger proteins. The interplay between Sfp1’s two modes of transcription regulation remains to be examined.
-
- Chromosomes and Gene Expression
Previously, we showed that the germ cell-specific nuclear protein RBMXL2 represses cryptic splicing patterns during meiosis and is required for male fertility (Ehrmann et al., 2019). Here, we show that in somatic cells the similar yet ubiquitously expressed RBMX protein has similar functions. RBMX regulates a distinct class of exons that exceed the median human exon size. RBMX protein-RNA interactions are enriched within ultra-long exons, particularly within genes involved in genome stability, and repress the selection of cryptic splice sites that would compromise gene function. The RBMX gene is silenced during male meiosis due to sex chromosome inactivation. To test whether RBMXL2 might replace the function of RBMX during meiosis we induced expression of RBMXL2 and the more distantly related RBMY protein in somatic cells, finding each could rescue aberrant patterns of RNA processing caused by RBMX depletion. The C-terminal disordered domain of RBMXL2 is sufficient to rescue proper splicing control after RBMX depletion. Our data indicate that RBMX and RBMXL2 have parallel roles in somatic tissues and the germline that must have been conserved for at least 200 million years of mammalian evolution. We propose RBMX family proteins are particularly important for the splicing inclusion of some ultra-long exons with increased intrinsic susceptibility to cryptic splice site selection.