Refining the resolution of the yeast genotype-phenotype map using single-cell RNA-sequencing

Arnaud N’Guessan; Wen Yuan Tong; Hamed Heydari; Alex N Nguyen Ba

doi:10.7554/eLife.93906.1

eLife assessment

This useful study describes expression profiling by scRNA-seq of thousands of cells of recombinant yeast genotypes from a system that models natural genetic variation. The rigorous new method presented here holds promise for improving the efficiency of genotype-to-phenotype mapping in yeast, providing convincing evidence for its efficacy. This manuscript focuses on overcoming technical challenges with this approach. It currently offers somewhat limited new biological insights to place the work within the broader context of the field. Doing so would broaden the interest of the work for all geneticists and evolutionary biologists.

https://doi.org/10.7554/eLife.93906.1.sa2

Significance of findings

useful: Findings that have focused importance and scope

landmark
fundamental
important
valuable
useful

Strength of evidence

convincing: Appropriate and validated methodology in line with current state-of-the-art

exceptional
compelling
convincing
solid
incomplete
inadequate

During the peer-review process the editor and reviewers write an eLife assessment that summarises the significance of the findings reported in the article (on a scale ranging from landmark to useful) and the strength of the evidence (on a scale ranging from exceptional to inadequate). Learn more about eLife assessments

Abstract

Genotype-phenotype mapping (GPM) or the association of trait variation to genetic variation has been a long-lasting problem in biology. The existing approaches to this problem allowed researchers to partially understand within- and between-species variation as well as the emergence or evolution of phenotypes. However, traditional GPM methods typically ignore the transcriptome or have low statistical power due to challenges related to dataset scale. Thus, it is not clear to what extent selection modulates transcriptomes and whether cis- or trans-regulatory elements are more important. To overcome these challenges, we leveraged the cost efficiency and scalability of single-cell RNA sequencing (scRNA-seq) by collecting data from 18,233 yeast cells from 4,489 segregants of a cross between the laboratory strain BY4741 and the vineyard strain RM11-1a. More precisely, we performed eQTL mapping with the scRNA-seq data to identify single-cell eQTL (sc-eQTL) and transcriptome variation patterns associated to fitness variation inferred from the segregants’ bulk fitness assay. Due to the larger scale of our dataset, we were able to recapitulate results from decades of work in GPM from yeast bulk assays while revealing new associations between phenotypic and transcriptomic variations. The multidimensionality of this dataset also allowed us to measure phenotype and expression heritability and partition the variance of cell fitness into genotype and expression components to highlight selective pressure at both levels. Altogether these results suggest that integrating large-scale scRNA-seq data into GPM improves our understanding of trait variation in the context of transcriptomic regulation.

Introduction

The process by which DNA encodes proteins via transcription and translation has been studied for decades to make sense of organisms’ phenotypes. However, being able to explain organisms’ phenotypes from their genetic material, i.e. genotype-phenotype mapping (GPM), has been a long-lasting problem with important applications (1,2). Indeed, making sense of genetic variation at the phenotypic level enables the understanding of trait variation between and within species as well as the emergence and evolution of phenotypes (3). For instance, reverse genetics approaches, e.g. gene knockout or transgenic technologies, and forward genetics approaches like GWAS and QTL mapping helped in determining the function of multiple genes and the effects of mutations on growth in different environments (4). However, reverse genetics approaches typically fail to account for natural variation and forward genetics approaches like QTL mapping typically focus on genetic and phenotypic variation so they cannot highlight selection on the transcriptome.

An essential characteristic of this problem is the multi-layered organization of the GPM. Indeed, GPM is not strictly restricted to the direct association between genotypes and phenotypes. This association is better resolved and complemented by understanding the intermediary transcriptome layer, e.g. cell mechanisms at the transcriptomic level are involved in diseases and pathogenicity (2,5–8). However, it is not clear to what extent transcriptomic changes relate to phenotypic changes or selection. Pioneering work from Mary-Claire King and Allan Charles Wilson set the tone for investigating this question by proposing that variations in morphological and behavioral traits arise more often through gene expression regulation than evolution at the protein-coding level (9). François Jacob then postulated an essay that stemmed from this theory in which he highlights how evolution acts as a tinkerer that works from already available material, i.e. through regulation of gene expression, to create new adaptations (10). This constituted the core of the evolutionary developmental biology which matured into the still-debated claim that new adaptations mainly emerge through cis-regulation of gene expression, i.e. through noncoding DNA regulating a neighbor gene contrarily to trans-regulators acting on distant genes (11–14). This debate has been reinforced by the technical difficulties and complexity of assessing the evolution and outcome of mutations in non-coding regions (11,12). Advances in sequencing technologies have clarified some of these hypotheses, particularly in the context of transcriptome analyses of the model organism Saccharomyces cerevisiae. For instance, Brem et al (2002) used microarray technology to relate the gene expression profiles of 40 yeast segregants from a lab (BY) and natural vineyard strain (RM) to their genetic markers (15). They found that cis-acting modulation is the main mechanism for regulating gene expression. Nearly two decades later, by greatly increasing statistical power, Albert and collaborators (2018) found that most of the expression variation arise through trans-regulation using non-multiplexed RNA-seq to analyze 5,720 genes in 1,012 yeast segregants generated by a crossing between RM and BY (16). The analysis method they used, i.e. expression quantitative trait loci (eQTL) mapping, consists in correlating allele frequencies to gene expression levels to find the loci modulating expression.

Although eQTL mapping is a traditional GPM analysis that accounts for the transcriptomic layer, it is typically realized through non-multiplexed RNA-seq which tends to have low statistical power due to challenges with experimental scale and confounding factors (17,18). Thus, eQTL mapping traditionally cannot identify significant low-effect regulatory mutations that are important for understanding the genetic bases of complex traits and diseases (19,20). Furthermore, most eQTL studies only assess the average transcriptomic profile of bulk populations without being able to capture the profile of rare cell lineages within a population. This is a critical limitation in heterogenous populations such as cancer or microbial populations where rare lineages can drive relapse or drug resistance (21).

Here, we sought to circumvent the challenges of non-multiplexed bulk RNA-seq imposed by the scale and population heterogeneity by performing eQTL mapping through single-cell RNA sequencing (scRNA-seq) of a pool of ∼4500 well-characterized F1 segregants of a yeast cross (16,22,23). In the same way that combinatorial indexing/barcoding and multiplexing enable the collection of large-scale fitness and genotype data (24), we hypothesized that scRNA-seq can help us collect both genotype and expression data on a large pool of segregants. We employ several strategies to overcome previous obstacles of eQTL mapping studies: i) we pool cells from thousands of segregants during the growth step and perform a single scRNA-seq run on the culture to account for environmental effects, and ii) from the exome sequencing data of single-cells we take advantage of the reference panel to validate that we accurately infer the genotype of each cell from extremely low number of reads mapping to polymorphic sites per cell (effectively ∼0.2x coverage).

Using this approach, we integrated the resulting transcriptomic data from growth in rich media with a pre-existing yeast GPM. We estimated the heritability of the transcriptome and the extent at which transcriptome is associated with fitness. We show that this increased scale from scRNA-seq enables eQTL mapping directly without the use of a reference genotype panel, and relate identified single-cell eQTL (sc-eQTL) to previously identified QTL. We also exploit the identified sc-eQTL to analyze the patterns of cis- and trans- regulation in the GPM.

Our single-cell RNA-seq approach is consistent with yeast GPM results from non-multiplexed assays

We initially aimed to show that performing scRNA-seq at a large scale can generate data that are consistent with non-multiplexed DNA and RNA sequencing. To do so, we analyzed a dataset of thousands of yeast lineages generated by Nguyen Ba and collaborators (2022) (24). To understand the yeast GPM, they collected fitness and genotype data from ∼100,000 segregants of an F1 cross between a laboratory strain of yeast (BY) and a natural vineyard strain (RM) (Figure 1A).

Yeast segregants datasets.
A) Reference panel from the barcoded bulk sequencing. The 99,995 yeast segregants in the reference panel come from a F1 cross between a laboratory strain of yeast (BY) and a natural vineyard strain (RM) (24). Thus, they only have 2 possible alleles at each of the 41,594 polymorphic sites. The lineages barcodes enabled fitness estimation from competition assays in 18 environments recapitulating the adaptation to temperature gradients, the ability to process different sources of carbon and the resistance to antifungal compounds. B) Pooled scRNA-seq dataset from a single batch. We performed scRNA-seq of the first batch of segregants (n=4,489) to obtain genotypes that are similar to the reference panel and single cell’s expression profiles. Non-covered sites, sequencing errors and the presence of reads in the wrong library (index swapping) are corrected for using the HMM described in Figure S1.

Using this approach named barcoded bulk QTL mapping or BB-QTL mapping, they revealed the complex polygenic and pleiotropic nature of phenotypes as well as an unprecedented number of pairwise epistatic interactions. To integrate transcriptomic data to that GPM, we performed scRNA-seq using the 10X Genomics Chromium microfluidics platform and obtained both genotype and expression profiles from 18,233 cells of the first batch of segregants (Figure 1B). This short-read scRNA-seq method comes with challenges like low-coverage sites due to technical sequencing biases and low sequencing depth in some cells (25,26). To overcome these challenges, the unique molecular identifiers (UMIs) of the 10X Genomics platform provide a control for technical biases by quantifying gene expression from unique transcribed molecule counts instead of reads counts (25). In addition, Hidden Markov Models (HMMs) can infer accurate genotype data even at sequencing depths as low as 0.1x (24). Nguyen Ba and collaborators (2022) designed an HMM to infer the segregants genotypes from the observed reads at low depth of DNA sequencing by accounting for sequencing error rate, recombination rate and index swapping rate (24). As there are only two ancestral lineages, there are only two possible alleles for the strains at each of the 41,594 polymorphic sites. Thus, the genotype of the segregants can be represented by the frequency of only one of the parental alleles, which is RM in the dataset. Applying this model to low-coverage segregants yielded genotypes that are significantly similar to high-coverage replicates (24). We sought to use a similar model to infer genotypes from scRNA-seq data, but we anticipated that some of these parameters may differ due to increased error rate of the reverse-transcriptase, increased index swapping due to pooled-reaction, etc (Figure S1). In Nguyen Ba et al, those rates were heuristically determined, but here we estimated these from the read mapping data and found that re-estimated parameters from data increase the proportion of recovered strains in the single cell data from 58.6% to 72.0%.

After adapting the HMM to the scRNA-seq data, we sought to validate that the resulting cell genotypes relate well to their corresponding strain in the reference panel obtained by non-multiplexed DNA sequencing strategies. Ideally, each single-cell barcode (from 10x Genomics Chromium) should be associated with a single cell and a cell should have a clear match with a unique strain in the reference panel. However, several factors can obscure these associations, e.g. a single-cell droplet containing cells from 2 different strains, a low-coverage cell, uncertainty in the allele of the reference genotype, etc. Thus, we designed an approach to clearly assign cells to the correct reference panel strain (see Methods). This approach relies on two metrics of similarity between the cells and the strains’ genotypes, i.e. the expected distance between them, which should be minimized for the best match, and the relatedness (R²). The statistical significance of the relatedness between single cells and reference lineages was determined by a permutation test (Figure S2). From the read mapping alone, we obtained a mean R² of 0.59 (σ = 0.19 and median = 0.64), which was significantly improved after applying our HMM to correct for mis-identified alleles and imputing data in low-coverage sites using recombination probability. Indeed, the single-cell HMM genotypes yield a mean R² of 0.73 (σ = 0.18 and median = 0.81; Figure2A). We found that the distribution of relatedness after HMM was still left-skewed, with many cells statistically significantly assigned to a reference genotype despite having what appeared to be low relatedness. Upon investigation, it was found that these could be explained by genotyping uncertainty either in the single-cell and/or in the reference panel genotype (s) (Table S1).

To further establish that the genotyping obtained from scRNA-seq data was comparable to previous non-multiplexed genotyping of the reference genotype panel, we estimated the contribution of genetic variation to the phenotypic variation, i.e. fitness heritability. Nguyen Ba and collaborators (2022) estimated the narrow- and broad-sense heritabilities of complex phenotypes associated with temperature gradient, carbon source and chemical resistance for which RM and BY segregants exhibit a significant level of diversity (24). We used our lineage assignment to that panel to obtain fitness but used our single-cell genotyping to perform this association. Encouragingly, most GCTA-REML estimates of narrow-sense heritability are within the confidence intervals of Nguyen Ba and collaborators (2022) estimates (Figure 2B).

Single-cell RNA-seq data recapitulate bulk DNA and RNA assays results.
A) Effect of the HMM on the relatedness between single cell genotypes and their closest reference lineage. The single-cell original genotype represents the genotype of the cells before the correction with the HMM. The relatedness to the closest lineage in batch1 has been measured with the adjusted R². To control for genotype uncertainty, only the 13,069 barcodes with a significant lineage assignment (lineage-barcode genotype correlation FDR<0.05) and a reference lineage with a lower uncertainty than the single cell HMM are selected, which represents 72.2% of the barcodes. We then rounded the genotypes to remove the uncertainty during the comparison. Wilcoxon signed test p-value is indicated above the violin plots. B) Narrow-sense heritability measured with non-multiplexed DNA sequencing and scRNA-seq. The grey bars represent the scRNA-seq estimates of narrow-sense heritability while the red dots represent the estimates from bulk DNA sequencing. The interval of confidence of the bulk DNA sequencing is indicated by the red line around the red dot and was obtained from genotype and phenotype measurement error in the BB-QTL paper (24). The 23C-37C represents the temperature for the competition assay in YPD media while the other phenotypes represent growth on YNB, molasses (mol), mannose (Mann) or raffinose (raff) and chemical resistance to copper sulfate (Cu), ethanol (eth), guanidinium chloride (gu), lithium acetate (Li), Sodium dodecyl sulfate (SDS) and suloctidil (suloc) (24).

Although the variance partitioning is consistent with previous studies, it only provides a broad view of the genotype-phenotype map as it does not allow to identify the loci that significantly explain phenotype variation. If the genotypes obtained by scRNA-seq were of high-quality, then we would expect that a QTL mapping model from scRNA-seq would yield a similar model than non-multiplexed DNA sequencing data. To do so, we used a cross-validated stepwise forward linear regression on the strain fitness and consensus genotypes data from single-cells that shared the same lineage assignment (Methods). Performing the QTL mapping on the batch 1 scRNA-seq dataset enabled the identification of 29 QTL compared to 31 QTL identified with the bulk barcoded approach (Tables S2 and S3) (24). These QTL were largely similar as shown by the non-significant difference between the effect sizes (Wilcoxon signed rank test p = 0.29) and by a model similarity metric (24) that considers the recombination distance between matched QTL, the similarity of the effect sizes and the allele frequencies (Methods). Using this approach, we estimated that the similarity score between the batch 1 single cells QTL and the batch 1 BB-QTL is 86.2% while each model respectively had a similarity score of 78.7% and 78.2% with the full BB-QTL mapping performed on 99,950 segregants (24) (Figure S4). The QTL identified from the scRNA-seq dataset also recapitulated several important biological features of the reference panel such as an enrichment of non-synonymous and disordered region QTL (24) (Figure S5).

Finally, the variance partitioning model can also be modified to include gene expression as the response variable and cell genotypes as the only random effect (Methods). This enables the quantification of expression heritability, i.e. the variance of expression explained by genotype. Using this approach, we estimated that genotype explains 72.3% of expression variance, which is consistent with results from previous non-multiplexed eQTL mapping studies. Indeed, Albert and collaborators (2018) estimated that genotype explains 70% of expression variance using a dataset of 5720 genes in 1012 yeast segregants generated by the same parental strains (RM and BY).

Integrating scRNA-seq data to an existing GPM highlights selection on the transcriptome

Having shown that scRNA-seq is consistent with non-multiplexed assays while being more scalable, we next sought to highlight new associations within the BY/RM GPM. Selection is often highlighted at the genotype level through convergent evolution, increase in allele frequency within a population or population genetics metric (26–28). However, the central dogma of molecular biology and evolution tinkering entail that phenotype variation should be linked to transcriptomic variation. As our dataset included all these variables, we sought to provide a variance partitioning framework to evaluate the association between the transcriptome and trait variation (Methods) with the 30C phenotype as an example (Figure 3).

Variance partitioning of the 30C phenotype from scRNA-seq data.
The percentages represent the proportion of fitness variance (whole rectangle area) explained by the components. The ellipse area represents the phenotype variance explained by genotype variation and the circle area represents the phenotype variance explained by expression variation. The black area of the rectangle represents the residual of the model while the other colored areas represent the shared and exclusive components explaining fitness variation.

The components of this variance partitioning all relate to at least one biological phenomenon. Indeed, the portion of trait variation explained exclusively by the genotype variation (red in Figure 3) represents the effect of mutations on fitness via several biological phenomena such as protein stability, enzymatic function etc, independent of expression level. For the 30C phenotype, this component explains 31.2% of the fitness variation in the BY/RM background which is similar to the 29.7% explained by the shared component between phenotype, genotype and expression variations (purple in Figure 3). The latter represents the association between selection (fitness) and the transcriptome either through loci influencing fitness via expression directly or through loci affecting expression via an effect on cell fitness (indirectly) (29,30). Its considerable association to fitness variation thus supports the evolution tinkering model. As for the phenotype variation explained exclusively by gene expression (blue in Figure 3), it could represent epigenetics and stochastic gene expression, which weakly explain variations in the 30C phenotype.

Although this model accurately estimates the narrow-sense heritability of 30C, the residuals still represent 37.7% of fitness variation. This could be explained by unmeasured factors like high-order epistasis, mitochondrial mutations or protein properties but the broad-sense heritability of this phenotype is similar to the narrow-sense heritability, suggesting that the residuals are mostly not explained by genotype and expression (24). Nguyen Ba et al. (2022) also estimated that epistasis only explained around 5% of fitness (24). These results suggest that a single run of scRNA-seq on a single batch of yeast segregants converge with bulk DNA sequencing results while revealing previously hidden components of the GPM.

Revealing hidden components of the yeast GPM with scRNA-seq

Our integrative scRNA-seq approach is not limited to enabling the quantification of the association between transcriptomic changes and trait variation. Indeed, the same approach we used to identify QTL can be used to detect loci regulating gene expression which can reveal the cell mechanisms underlying trait variation through transcriptomic changes. We thus modified the QTL mapping framework such that the response variable is the level of expression of a single gene in the single cells (Methods). This approach is a cost-efficient way to perform eQTL mapping from the expression profile and genotype of cells from thousands of lineages in a multiplexed way (sc-eQTL mapping).

Consistent with yeast non-multiplexed eQTL results, the genes with the highest expression heritability are enriched in functions related to carbohydrate catabolic process (GO:0016052) and cellular biosynthetic process (translation GO:0006412, organelle assembly GO:0070925, ribosome biogenesis GO:0042254 and gene expression GO:0010467) (Fisher’s exact test FDR<0.05; Methods). In both datasets, these genes are also highly expressed, which reflects the positive correlation between expression heritability and expression levels (R² = 0.66 and p < 2.2e-16). Conversely, genes with the lowest expression heritability observed in the RM/BY background, which we defined as the bottom 10% expression heritability, are enriched in functions related to the cell cycle biological process (GO:0007049, Fisher’s exact test FDR<0.05) (16,31).

Because of the increased scale of our collection, our approach is more powered to estimate the gene heritability. We were thus able to detect new overrepresented biological processes, i.e. DNA metabolic process (GO:0006259) and the response to nutrient levels (GO:0031667), for which the variation of expression levels is weakly associated to the genetic variation observed across the RM/BY segregants.

The functional enrichment analysis using scRNA-seq data revealed new associations between expression heritability and biological processes in the RM/BY genetic background. However, while it suggests that many eQTL are also QTL, it cannot accurately point to the specific loci involved in trait variation and cannot address whether mutations on regulatory hubs have stronger effects on traits. To investigate this, we mapped the QTL to hotspots of gene regulation (or regulatory hubs), which we defined as 25 kb genomic windows that were repeatedly identified in the eQTL mapping procedure (for different genes). This was done to acknowledge the uncertainty in the exact position of the eQTL due to linkage disequilibrium and power. We then ranked the 30C QTL identified by Nguyen Ba and collaborators (2022) based on their absolute effect size and correlated it to the rank of the eQTL hotspots based on the number of regulated genes. This resulted in a positive correlation (Spearman ρ = 0.33 and p = 5.21e-5), suggesting that larger effects on the regulatory network translate into larger trait variation. Indeed, we observed that some previously reported high-effect-size QTL genes are located in eQTL hotspots, eg MKT1, HAP1, and IRA2 (Figure 4A).

eQTL features underlying trait variation across the BY/RM segregants.
A) Mapping of the 30C QTL in the eQTL hotspots. We represent the hotspots of expression regulation as genomic windows (25 kb) to acknowledge the uncertainty around the real position of the eQTL due to linkage disequilibrium. We annotated the 5 top eQTL hotspots and the eQTL hotspots in which the top additive QTL identified by the BB-QTL mapping of the 30C phenotype are located. In these regions, we represented the most affected trans-regulated genes in red, the most affect cis- regulated gene in blue and the genes of the top QTL in black. The double quotation characters represent the absence of such genes in the associated region. We also represented the rank of the QTL in the set of 159 QTL of the 30C phenotype. B) Partitioning of the expression heritability or explained variance (R²) among cis- and trans-eQTL. Each pair of points connected by a line represents a gene. Green lines represent the genes that are only have trans-eQTL and orange lines represent the genes that have both trans- and cis-eQTL. C) Comparison of the mean effect size between cis- and trans-eQTL. Each pair of points connected by a line represents a gene. The ratio of the average effect size between cis- and trans-eQTL is represented by the line color. The sample size of each eQTL category is represented in the x axis. This is the number of trans-eQTL and cis- eQTL used for calculating the average effect sizes per gene not the number of points per distribution.

Performing this rank-test on individual genes also yielded the result that eQTL effect is correlated with fitness effect for 35.1% of the genes (permutation test p < 0.05, see Methods). Although this correlation does not apply to most genes, it reveals potential regulatory mechanisms explaining the importance of the strongest growth loci or QTL. For instance, MKT1, i.e. the strongest growth loci, is part of a regulation hotspot affecting genes that are important for yeast growth like ENP1 which is involved in RNA processing and HXT6 which is involved in glucose uptake (32–34). Among the strongest growth loci, VPS70 is part of a hotspot of regulation that strongly affects the expression of RSF2, a zinc-finger protein regulating glycerol-based growth and respiration (35). Furthermore, the highest peak for expression regulation contains important growth loci in chromosome IV around the mating type loci. This suggests the presence of cells with different mating types in the dataset which we confirmed from the read mapping to Mat-a and Mat-α genes. This is consistent with previous budding yeast eQTL mapping and is also expected because the mating types in yeast express sets of genes that are “turned off” in other mating types (15,16,36). This peak of expression regulation is also responsible for regulating TDH3 which is involved in glycolysis and glucogenesis and can have important effect on fitness (37).

These hotspots suggest that expression differences in BY/RM would predominantly be due to mutations in trans-regulatory elements. To test this, we partitioned the variation in gene expression between cis- and trans- regulatory loci for each gene (see Methods). This analysis revealed that all the genes are affected by at least one polymorphic trans-regulatory locus and that these polymorphic trans-regulatory loci explain most of that gene’s expression (Figure 4B). It is well known that mutations in promoters and nearby enhancers can influence gene expression (38,39). Indeed, we identified many genes that contained an allele in a cis-regulatory element that strongly explain that gene’s expression variation (n=750 genes out of 6088, Figure 4B). As expected, mutations in cis-regulatory elements were of stronger effect size than trans-eQTL individually, but the cumulative aggregate effect of all trans-eQTL acting on that gene was comparable to the few cis-eQTL they had (Figure 4C). This can be explained by the fact that there are more opportunities for mutations to arise in trans-regulatory elements. Finally, we found that trans-eQTL have two times higher odds of affecting cell fitness than cis-eQTL (χ² p = 0.01). Taken together, the link between the genetic basis of transcription variation across RM/BY segregants and fitness could only be revealed by integrating large-scale transcriptomic data to an existing GPM, which scRNA-seq facilitates.

Conclusion

By leveraging the scalability of scRNA-seq, we obtained thousands of transcriptomes from a reference pool of strains in a single experiment. This enabled the analysis of association between genotype, transcriptome, and phenotype at an unprecedented scale. Questions surrounding transcriptomic variation and phenotypic variation have been at the center of many previous quantitative genetics studies (15,16,22,36,40). These ideas and discoveries all support the fact that researchers can gain valuable insight about the evolution of traits by integrating the transcriptome in GPM analyses, which can translate into fundamental knowledge or other important applications where phenotypes evolve.

In this study, we took advantage of a previously characterized BY/RM cross where the genetic basis of growth in various environments was examined in detail (24). By integrating transcriptomic data in this genotype-phenotype map, we revealed how transcriptomic components are involved in trait variation. Similar to a previous study, which obtained transcriptomes by individual strain sequencing, we found that gene expression is highly heritable. Further, our study design also allowed us to conclude that gene expression contributes to a significant portion of the phenotypic variation in this strain collection.

This finding is corroborated by our findings that most eQTL detected in our study were previously shown to be QTL. This is perhaps not surprising given that QTL in this cross were previously inferred to be in regulatory genes, but this provides a more mechanistic view of the effect of an allele on phenotype. Indeed, we find a bias for trans-regulation for generating transcription innovation where the cumulative effect of trans-eQTL on gene expression are significant. That is not to say that cis-regulatory alleles are dispensable as cis-regulatory alleles often have large effect on gene expression. This genome-wide view of the genetic basis of transcriptional variation has consequences for the evolution of phenotypes, as the target size afforded by trans-eQTL is far larger than cis-eQTL. Thus, adaptation to small and fluctuation environmental changes may proceed preferentially through allelic changes or recombination of many small-effect trans-eQTL, but large expression changes are likely to require some cis-eQTL.

In this study, we leveraged the fact that our pool of strain was previously genotyped and phenotyped. This was obtained by liquid handling robotics and pooled competitive growth assay with barcode sequencing. While this was performed on a very large scale, it was essentially obtained by brute-force and through approaches that are not necessarily applicable to other systems. While it is clear from our results that genotyping single-cells can achieve the same genotype quality as single-reaction genotyping, it is much harder to obtain phenotyping data from scRNA-seq. Thus, our framework might not be readily translatable to other systems where similar studies on the GPM are desirable. However, two observations from this cross can be used to suggest an experimental approach. First, while epistasis is important, it contributes to a relatively small portion of the phenotypic variance. Further, transcriptomic variation contributes little to the missing heritability. Thus, it may be possible to use predicted fitness instead of observed fitness and recapitulate essentially similar results as this study. Predicted fitness could be obtained from bulk-segregant analysis where the additive effect of loci can be inferred from whole-genome sequencing (23,41). While it is not clear if these observations are generalizable, it may be possible to verify this for a study system of interest with some modest time-course single-cell based sequencing where low-coverage genotyping is possible.

However, despite the study’s limitation on generalizability, our scRNA-seq framework helps bridge understanding of how genetic variation influences transcriptomic variation. Our framework relies on identifying the genome of single-cells from the transcriptome, which is going to be possible from low-coverage sequencing when genetic variation within the pool is high (such as this cross, microbiome sequencing, or cancer cells with extensive copy number variation), and from low cell diversity with sufficient transcriptomic variation such that aggregation of single-cells with similar transcriptomes can afford pseudo-high coverage sequencing. Thus, integrating genotype, transcriptome, and phenotype using scRNA-seq data can be particularly efficient for developing a more fundamental understanding of other important traits or diseases.

Material and methods

Yeast strains and segregants

We analyzed cells from a single batch (batch 1) of 4,489 segregants obtained from a F1 cross between the yeast laboratory strain BY4741 and the vineyard strain RM11-1a generated in a previous study (24). These strains have been selected to generate this collection of segregants because they exhibit differences in multiple phenotypes including the adaptation to temperature, the ability to process different sources of carbon and the ability to resist antifungal compounds. Therefore, the genetic variation observed across the segregants can be correlated to the differences in growth rate observed in the 18 environments recapitulating these phenotypes in the Nguyen et al (2022) study (24). The selection of the batch is random and the fact that we performed the analyses on a single batch eliminates batch effects that could obscure variable associations. Genotypes and fitness data used were the same ones obtained in the previous study.

Yeast growth and single-cell RNA-sequencing protocol

To prepare strains for scRNA sequencing, we unfroze the batch of segregants and inoculated approximately 5*10^6 cells in YPD (1% Yeast Extract, 2% Peptone, 2% Dextrose) to saturation. The next day, about 10^7 cells were passaged to 5 mL of fresh YPD and grew for 4 hours to bring cells to log-phase. We then pelleted 100 ul of cells and resuspended them in spheroplasting solution (5 mg/mL zymolyase 20T, 10 mM DTT, 1 M Sorbitol, 100 mM Sodium Phosphate pH 7.4) at a concentration of 10^7 cells/mL. The cells were incubated at 37 degrees Celcius for approximately 10 minutes at which point spheroplasting was verified by mixing a small aliquot of cells with detergent to observe lysis. The cells at this point were quantified using a hemocytometer and prepared using the standard 10x Genomics Gel Beads-in-emulsion (GEM) protocol. We used the Chromium Next GEM Single-cell 3’ Reagent Kit to prepare the sequencing libraries and sequenced on a NextSeq 500 high-output flow cell.

We note that the cells analyzed here were grown in bulk and assayed for their transcriptome in log-phase. Our fitness data was obtained from competitive bulk fitness assays which includes several whole growth cycle over multiple days and thus captures lag phase, exponential growth, and saturation. Nevertheless, previous experiments had shown that fitness was mostly determined by exponential growth which suggests that our analysis is adequate even if the cells were prepared for sequencing at a single time point.

Single-cell RNA-sequencing data parsing

From the scRNA-seq reads, we obtained gene expression levels and allele counts using the pipeline count from CellRanger version 3.1.0 (42). For each of the ancestral strain, i.e RM11-1a and BY4741, the pipeline mapped the scRNA-seq reads to the reference genome, filtered the barcodes by comparing the UMI count per barcode distribution to a background model of empty gel-bead in-emulsion, and counted the number of UMI per gene per barcode. The barcode filtering retained 18,233 barcodes. For each barcode, we then counted the number of RM and BY alleles at each polymorphic site by parsing the RM and BY bam files using a python script (https://github.com/arnaud00013/sc-eQTL/tree/main/II_scRNA-seq_genotyping). This script only keeps reads that mapped at the same loci on both reference genomes to increase the level of confidence of the mapping.

Correction and imputation of single-cell genotypes with a Hidden Markov Model

Because there are only two possible alleles at each polymorphic sites of the RM/BY segregants, their genotype can be recapitulated by a quantitative variable measuring the proportion of reads from one of the parental strains, which is RM in our dataset. The raw allele count data provides a first estimate of this RM allele frequency at each polymorphic site. However, due to the low mean depth of coverage of scRNA-seq data (0.2x), the absence of reads in some polymorphic sites and the biases introduced during sequencing like index hopping/swapping, we expect that the raw data can be imputed and corrected for errors and uncertainty in the observed alleles. Therefore, we applied a Hidden Markov Model (HMM) on the observed allele count. Such model can infer accurate genotype data at sequencing depths as low as 0.1x (24,25,43). Nguyen Ba and collaborators (2022) designed an HMM to infer the segregants genotypes from bulk DNA sequencing by accounting for sequencing error rate, recombination rate and index swapping rate (24). Because scRNA-seq uses the reverse transcriptase, which has a higher error rate, and because it is a pooled assay with higher chances of index swapping, we expected the HMM parameter to differ for the single cell data. Therefore, we adapted the HMM to scRNA-seq data by measuring its parameters in our dataset (Figure S1). The scripts are available on GitHub (https://github.com/arnaud00013/sc-eQTL/tree/main/II_scRNA-seq_genotyping).

Assigning single cells to the reference panel strains

To evaluate the level of relatedness between the reference panel strains and the imputed single cell genotypes, we used the expected distance to identify the strain that best relate to each single cell:

where 𝑔_𝑐 is the cell genotype and 𝑔_𝑠 is the strain genotype. Next, we assigned the single cell to its best match in the studied batch of 4,489 trains only if this match is better than the best match in randomly generated batches of the same size (Figure S2). This procedure is implemented and available at https://github.com/arnaud00013/sc-eQTL/tree/main/III_Genotype_analysis).

Partitioning the phenotypic variance into genetic and transcriptomic components

To analyze the yeast GPM at a broad scale and to evaluate the association between selection and the transcriptome, we estimated the contribution of genetic and transcriptomic variations to phenotypic variation from scRNA-seq data. More precisely, we performed a Genome-wide Complex Trait Analysis (GCTA) by fitting a linear mixed model to the data using the restricted maximum-likelihood (REML) method (44):

where y is the fitness vector for the n cells, X is the nxk matrix of k fixed effects, 𝛽 is the vector of k coefficients of the fixed effects, 𝑊_𝑔 is the nxp genotype matrix, 𝑢_𝑔 is the vector of p SNP effects, 𝑊_𝑒is the nxm expression matrix, 𝑢_𝑒 is the vector of m gene expression effects and 𝜀 is the error term. Because the dataset does not include fixed effects, we set the fixed effect to a vector of ones such that its coefficients represent the mean fitness while the genotype and expression data are the random effects that explain the fitness variance along with the error terms. The REML solution assumes that the data follow a Gaussian distribution, so the data are standardized before fitting the model. We also divided the standardized expression counts by the cell sum of expression counts to control for molecule count biases across cells. The cell fitness is based on the fitness of the closest segregant in batch 1 as measured by the expected distance. Because this model is linear and additive, it can be compared to the estimates of narrow-sense heritability obtained by Nguyen Ba and collaborators (2022) (24). The difference between the variance explained in equation 4 and equations 2 or 3 allow to infer the variance explained only by the genotype or the expression component of the model. The code for the variance partitioning is available on GitHub (https://github.com/arnaud00013/sc-eQTL/tree/main/IV_variance_partitioning).

Estimating the expression heritability from scRNA-seq

To obtain this estimate from scRNA-seq data, we needed to consider the fact that GCTA-REML only takes a vector as a response variable while the gene expression matrix is multi-dimensional. To solve this, we orthogonalized the gene expression matrix using principal component analysis (PCA), and used each of the PC one at a time as a response variable of the model. Indeed, if the expression PCs recapitulate the total expression variance and are orthogonal or independent to each other, then the sum of the PCs variance explained by genotype should be the expression heritability. To save time, we only used the 898 expression PCs that explain 99% of expression variance:

QTL mapping

To identify the loci that influence cell fitness, we performed a linear regression on the consensus genotypes of the strains from the single cell data and the strain fitness. We decided to use the consensus genotypes of the strains as they relate better to the bulk segregant genomes. To build the consensus genotypes, we defined cells from the same lineage as the ones that shared the same closest segregant in batch 1. Next, we used the median to obtain cells’ consensus genotypes as it is less sensitive to outliers and because it yields the best relatedness to the batch 1 reference genotypes (median R² = 87.0%; μ=79.5%; σ=18.2; Figure S3). We selected the QTL in the linear models using cross-validation on the scRNA-seq data. This analysis consists in dividing the dataset into 10 random partitions of similar sample sizes and running a cross-validated stepwise forward linear regression on each partition. For each partition, the model starts with no QTL and a linear model “Fitness ∼ Genotype” is fitted using the genotype data at each polymorphic site, where the correlation coefficient represents the effect size of the SNP. Then, the forward search starts and at each iteration, a new locus with the minimum linear model residual sum of squares (RSS) is added to the QTL model, which is updated with new effect sizes after the addition of a new SNP. Because the order of addition of QTL matters in the forward search and because some QTL are linked or collinear, the model can be refined by exploring different QTL around the local optima. These steps are repeated until the model RSS cannot be improved anymore or until the number of QTL reaches an arbitrary maximum far from the cross-validated number of QTL. After the forward search is completed in each partition, the algorithm calculates the optimal λ values that minimizes the objective function 𝐹_𝑜:

where 𝛽 is the vector of SNP effect sizes in the QTL model, ‖𝑌 − 𝑋𝛽‖² is the RSS of the linear QTL model, λ defines the penalty for adding a new SNP to the model and ‖𝛽‖₀ is the number of SNPs in the QTL model. This objective function has the property to add sparsity in the QTL model and thus avoid overestimating the number of QTL while being consistent (24). The optimal λ has a minimum of log(n) which corresponds to the Bayesian Information Criterion (BIC), which is known to yield correct models asymptotically (45). This allows to consider the possibility that a sparser model than the one found using the BIC could yield better predictive power on a test set while avoiding overfitting. The optimal λ values found in all the partitions are then averaged and the resulting mean λ is used to solve the objective function in the full dataset, which yields the optimal QTL model. The cross-validation assumes that the partitions are independent, such that the variance explained by the model and the number of relevant QTL are unbiased estimates.

Highlighting hotspots of gene regulation through eQTL mapping

To identify the loci regulating gene expression regulation, we adapted the QTL mapping framework using expression as the predicted phenotype. Because this approach had to be repeated for each of the 6,240 genes, we needed to modify it so that the execution time is convenient. To do so, the parameter λ was not estimated using cross validation but rather from the Bayesian Inference Criterion (BIC), i.e. λ = log(n) where n is the number of cells. We found that the BIC was often selected by the cross-validation procedure when tested on a few genes and thus we do not believe that this approach will significantly change our results.

To acknowledge the uncertainty around the exact position of eQTL due to linkage disequilibrium, we define eQTL hotspots as 25 kb genomic windows that were repeatedly identified in the eQTL mapping procedure. The code for the single cell eQTL mapping is available on GitHub (https://github.com/arnaud00013/sc-eQTL/tree/main/V_sc_eQTL_mapping).

Functional enrichment analysis by gene ontology annotation

To highlight gene functions enriched at different levels of expression or expression heritability, we performed the panther database binomial test for statistical overrepresentation of gene ontology biological processes (31,46). A low level was defined as within the 25% bottom part of the distribution (<Q1) while a high level was defined as within the top 25% part of the distribution (>Q3). The p-values were corrected for multiple testing using the false discovery rate correction (FDR).

Matching QTL to eQTL

To evaluate the contribution of gene expression regulation to fitness variation, we created a model to match QTL and eQTL based on the similarity of loci and the similarity of predicted effect on gene expression. More precisely, for each of the 6,088 genes for which we could detect eQTL, we performed a new eQTL model by correlating the expression level of the gene to the genetic variation at QTL positions. This allowed us to measure the predicted effect of the QTL on gene expression. We then calculated the distance between the QTL and the real eQTL of the gene based on recombination distance within each chromosome, which decreases exponentially with genetic distance, and the difference in the predicted effect on the gene expression using the formulation developed by Nguyen Ba et al (2022) (24). Next, we used the same Needleman-Wunsch algorithm to find the most likely set of pairing between QTL and eQTL, where an unmatched QTL is also possible but penalized. Finally, we determined the proportion of genes for which gene expression regulation is associated with higher fitness. To do so, for each gene, we performed a permutation test by comparing the average rank of the matched QTL of the gene to the average rank of 999 random subsets of unmatched QTL of the same size. The p-value is the proportion of random subsets of unmatched QTL with a higher average QTL rank than the set of matched QTL.

Comparing cis- and trans-eQTL contribution to expression variation

We used the definition of local eQTL in Albert et al. (2018) to define cis-eQTL, i.e. any eQTL between 1,000 bp upstream of the gene and 200 bp downstream of the gene. Thus, we defined trans-eQTL as the eQTL that do not follow this criterion. For each gene, we then performed variance partitioning using the GCTA:

where y is the vector of expression level of the gene across the n cells, X is the nxk matrix of k fixed effects, 𝛽 is the vector of k coefficients of the fixed effects, 𝑊_{𝑔_𝑐𝑖𝑠}is the nxp cis-eQTL genotype matrix, 𝑢_{𝑔_𝑐𝑖𝑠} is the vector of p cis-eQTL effects on expression, 𝑊_{𝑔_𝑡𝑟𝑎𝑛𝑠} is the nxm trans-eQTL expression matrix, 𝑢_𝑒is the vector of m trans-eQTL effects on expression and 𝜀 represent the error terms. Because the dataset does not include fixed effects, we set the fixed effect to a vector of ones such that its coefficients represent the mean expression level while the cis- eQTL and trans-eQTL genotypes are the random effects that explain the expression variance along with the error terms. We can infer the variance explained by the cis-eQTL by the difference in variance explained between the models in equations 9 and 8. Likewise, the difference of variance explained by the models in equations 9 and 7 can help us estimate the variance explained by the trans-eQTL. Finally, we estimate the effect sizes using the absolute value of the correlation coefficients of each loci and compare the mean between the cis- and trans-eQTL from the same gene (paired data) with a Wilcoxon signed rank test.

Data availability

The code used for this study is available and explained at https://github.com/arnaud00013/sc-eQTL and the original single-cell reads from the pooled segregants scRNA-seq assay have been uploaded in the NCBI BioProject database with the accession number PRJNA1022775. The single-cell barcodes expression data are also available at https://github.com/arnaud00013/sc-eQTL as an archive file named Matrix_gene_expression_barcodes_1_to_9000.csv.tar.gz or Matrix_gene_expression_barcodes_9001_to_18233.csv.tar.gz.

Supporting information

Supplementary Tables S1, S2 and S3

Supplementary Figures S1 to S5

Acknowledgements

ANNB acknowledges support from the Natural Science and Engineering Research Council of Canada (NSERC RGPIN-2021–02716 and DGECR-2021–00117) and AN acknowledges support from the Natural Science and Engineering Research Council (NSERC CGS-D App Id: 569340-2022, UTF Fellowship from the University of Toronto and Rustom H. Dasture Graduate Scholarship in Cell and Systems Biology). The computations in this paper were enabled by resources provided by Compute Canada. We also thank the Centre for the Analysis of Genome Evolution and Function (CAGEF) and TCAG at SickKids for sequencing support.

References

1.
1. Bartoli C
2. Roux F
2017Genome-Wide Association Studies In Plant Pathosystems: Toward an Ecological Genomics ApproachFront Plant Sci [Internet 8Google Scholar
2.
1. Ferreira MA
2. Gamazon ER
3. Al-Ejeh F
4. Aittomaki K
5. Andrulis IL
6. Anton-Culver H
7. et al.
2019Genome-wide association and transcriptome studies identify target genes and risk loci for breast cancerNat Commun [Internet 10Google Scholar
3.
1. Aguet F
2. Alasoo K
3. Li YI
4. Battle A
5. Im HK
6. Montgomery SB
7. et al.
2023Molecular quantitative trait lociNat Rev Methods Primer 3:1–22Google Scholar
4.
1. Tarantino LM
2. Eisener-Dorman AF
2012Forward Genetic Approaches to Understanding Complex BehaviorsCurr Top Behav Neurosci 12:25–58Google Scholar
5.
1. Casamassimi A
2. Federico A
3. Rienzo M
4. Esposito S
5. Ciccodicola A
2017Transcriptome Profiling in Human Diseases: New Advances and PerspectivesInt J Mol Sci 18:1652Google Scholar
6.
1. Wainberg M
2. Sinnott-Armstrong N
3. Mancuso N
4. Barbeira AN
5. Knowles DA
6. Golan D
7. et al.
2019Opportunities and challenges for transcriptome-wide association studiesNat Genet 51:592–9Google Scholar
7.
Transcriptome: Connecting the Genome to Gene FunctionLearn Science at Scitable https://www.nature.com/scitable/topicpage/transcriptome-connecting-the-genome-to-gene-function-605/
8.
1. Williams CG
2. Lee HJ
3. Asatsuma T
4. Vento-Tormo R
5. Haque A
2022An introduction to spatial transcriptomics for biomedical researchGenome Med 14:68Google Scholar
9.
1. King MC
2. Wilson AC
1975Evolution at Two Levels in Humans and ChimpanzeesScience 188:107–16Google Scholar
10.
1. Jacob F
1977Evolution and TinkeringScience 196:1161–6Google Scholar
11.
1. Hoekstra HE
2. Coyne JA
2007The Locus of Evolution: Evo Devo and the Genetics of AdaptationEvolution 61:995–1016Google Scholar
12.
1. Kratochwil CF
2. Meyer A.
2015Evolution: Tinkering within Gene Regulatory LandscapesCurr Biol 25:R285–8Google Scholar
13.
1. Primig M
2. Williams RM
3. Winzeler EA
4. Tevzadze GG
5. Conway AR
6. Hwang SY
7. et al.
2000The core meiotic transcriptome in budding yeastsNat Genet 26:415–23Google Scholar
14.
1. Cavalieri D
2. Townsend JP
3. Hartl DL
2000Manifold anomalies in gene expression in a vineyard isolate of Saccharomyces cerevisiae revealed by DNA microarray analysisProc Natl Acad Sci U S A 97:12369–74Google Scholar
15.
1. Brem RB
2. Yvert G
3. Clinton R
4. Kruglyak L.
2002Genetic Dissection of Transcriptional Regulation in Budding YeastScience https://www.science.org/doi/pdf/10.1126/science.1069516
16.
1. Albert FW
2. Bloom JS
3. Siegel J
4. Day L
5. Kruglyak L.
6. Wittkopp PJ
2018Genetics of trans-regulatory variation in gene expressioneLife 7:e35471Google Scholar
17.
1. Schwarz T
2. Boltz T
3. Hou K
4. Bot M
5. Duan C
6. Loohuis LO
7. et al.
2022Powerful eQTL mapping through low-coverage RNA sequencingHum Genet Genomics Adv 3:100103Google Scholar
18.
1. Fan Y
2. Zhu H
3. Song Y
4. Peng Q
5. Zhou X
2020Efficient and effective control of confounding in eQTL mapping studies through joint differential expression and Mendelian randomization analysesBioinformatics 37:296–302Google Scholar
19.
1. Bush WS
2. Moore JH
2012Chapter 11: Genome-Wide Association StudiesPlos Comput Biol [Internet 8Google Scholar
20.
1. Lorenz K
2. Cohen BA
2012Small- and Large-Effect Quantitative Trait Locus Interactions Underlie Variation in Yeast Sporulation EfficiencyGenetics 192:1123Google Scholar
21.
1. Shaffer SM
2. Dunagin MC
3. Torborg SR
4. Torre EA
5. Emert B
6. Krepler C
7. et al.
2017Rare cell variability and drug- induced reprogramming as a mode of cancer drug resistanceNature 546:431Google Scholar
22.
1. Bloom JS
2. Ehrenreich IM
3. Loo WT
4. Lite TLV
5. Kruglyak L
2013Finding the sources of missing heritability in a yeast crossNature 494:234–7Google Scholar
23.
1. Ehrenreich IM
2. Torabi N
3. Jia Y
4. Kent J
5. Martis S
6. Shapiro JA
7. et al.
2010Dissection of genetically complex traits with extremely large pools of yeast segregantsNature 464:1039–42Google Scholar
24.
1. Nguyen Ba AN
2. Lawrence KR
3. Rego-Costa A
4. Gopalakrishnan S
5. Temko D
6. Michor F
7. et al.
8. Verstrepen KJ
2022Barcoded Bulk QTL mapping reveals highly polygenic and epistatic architecture of complex traits in yeasteLife 11:e73983Google Scholar
25.
1. Hwang B
2. Lee JH
3. Bang D
2018Single-cell RNA sequencing technologies and bioinformatics pipelinesExp Mol Med [Internet 50Google Scholar
26.
1. Bergström A
2. Simpson JT
3. Salinas F
4. Barré B
5. Parts L
6. Zia A
7. et al.
2014A High-Definition View of Functional Genetic Variation from Natural Yeast GenomesMol Biol Evol 31:872–88Google Scholar
27.
1. Good BH
2. McDonald MJ
3. Barrick JE
4. Lenski RE
5. Desai MM.
2017The dynamics of molecular evolution over 60,000 generationsNature 551:45–50Google Scholar
28.
1. Johnson MS
2. Gopalakrishnan S
3. Goyal J
4. Dillingham ME
5. Bakerlee CW
6. Humphrey PT
7. et al.
8. Verstrepen KJ
9. Wittkopp PJ
10. Verstrepen KJ
11. Hodgins-Davis A
2021Phenotypic and molecular evolution across 10,000 generations in laboratory budding yeast populationseLife 10:e63910Google Scholar
29.
1. Sun XM
2. Bowman A
3. Priestman M
4. Bertaux F
5. Martinez-Segura A
6. Tang W
7. et al.
2020Size-Dependent Increase in RNA Polymerase II Initiation Rates Mediates Gene Expression Scaling with Cell SizeCurr Biol 30:1217–1230Google Scholar
30.
1. Marguerat S
2. Bähler J
2012Coordinating genome expression with cell sizeTrends Genet 28:560–5Google Scholar
31.
1. Mi H
2. Muruganujan A
3. Huang X
4. Ebert D
5. Mills C
6. Guo X
7. et al.
2019Protocol Update for large-scale genome and gene function analysis with the PANTHER classification system (v.14.0)Nat Protoc. 14:703–21Google Scholar
32.
1. Roos J
2. Luz JM
3. Centoducati S
4. Sternglanz R
5. Lennarz WJ
1997ENP1, an essential gene encoding a nuclear protein that is highly conserved from yeast to humansGene 185:137–46Google Scholar
33.
1. Chen W
2. Bucaria J
3. Band DA
4. Sutton A
5. Sternglanz R
2003Enp1, a yeast protein associated with U3 and U14 snoRNAs, is required for pre-rRNA processing and 40S subunit synthesisNucleic Acids Res 31:690–9Google Scholar
34.
1. Roy A
2. Dement AD
3. Cho KH
4. Kim JH
2015Assessing Glucose Uptake through the Yeast Hexose Transporter 1 (Hxt1)PLOS ONE 10:e0121985Google Scholar
35.
1. Lu L
2. Roberts GG
3. Oszust C
4. Hudson AP
2005The YJR127C/ZMS1 gene product is involved in glycerol-based respiratory growth of the yeast Saccharomyces cerevisiaeCurr Genet 48:235–46Google Scholar
36.
1. Brem RB
2. Storey JD
3. Whittle J
4. Kruglyak L
2005Genetic interactions between polymorphisms that affect gene expression in yeastNature 436:701–3Google Scholar
37.
1. Vande Zande P
2. Hill MS
3. Wittkopp PJ
2022Pleiotropic effects of trans-regulatory mutations on fitness and gene expressionScience 377:105–9Google Scholar
38.
1. Mattioli K
2. Oliveros W
3. Gerhardinger C
4. Andergassen D
5. Maass PG
6. Rinn JL
7. et al.
2020Cis and trans effects differentially contribute to the evolution of promoters and enhancersGenome Biol 21:210Google Scholar
39.
1. Romero IG
2. Ruvinsky I
3. Gilad Y
2012Comparative studies of gene expression and the evolution of gene regulationNat Rev Genet 13:505–16Google Scholar
40.
1. Bloom JS
2. Boocock J
3. Treusch S
4. Sadhu MJ
5. Day L
6. Oates-Barker H
7. et al.
8. Landry CR
9. Barkai N
2019Rare variants contribute disproportionately to quantitative trait variation in yeasteLife 8:e49212Google Scholar
41.
1. Brauer MJ
2. Christianson CM
3. Pai DA
4. Dunham MJ
2006Mapping novel traits by array-assisted bulk segregant analysis in Saccharomyces cerevisiaeGenetics 173:1813–6Google Scholar
42.
1. Zheng GX
2. Terry JM
3. Belgrader P
4. Ryvkin P
5. Bent ZW
6. Wilson R
7. et al.
2017Massively parallel digital transcriptional profiling of single cellsNat Commun. 8:14049Google Scholar
43.
1. Stark R
2. Grzelak M
3. Hadfield J
2019RNA sequencing: the teenage yearsNat Rev Genet 20:631–56Google Scholar
44.
1. Yang J
2. Lee SH
3. Goddard ME
4. Visscher PM.
2011GCTA: a tool for genome-wide complex trait analysisAm J Hum Genet 88:76–82Google Scholar
45.
1. Harrell FE.
2001Regression Modeling Strategies: With Applications to Linear Models, Logistic Regression, and Survival AnalysisNew York, NY: Springer http://link.springer.com/10.1007/978-1-4757-3462-1 Google Scholar
46.
PANTHER: Making genome-scale phylogenetics accessible to allhttps://onlinelibrary.wiley.com/doi/epdf/10.1002/pro.4218
47.
1. Kanehisa M
2. Goto S
2000KEGG: Kyoto Encyclopedia of Genes and GenomesNucleic Acids Res 28:27–30Google Scholar

Article and author information

Author information

Arnaud N’Guessan
Department of Cell and Systems Biology, University of Toronto, Ramsay Wright Laboratories, 25 Harbord St, M5S3G5, Toronto, Ontario, Canada
ORCID iD: 0000-0002-3385-725X
Wen Yuan Tong
Department of Biology, University of Toronto at Mississauga, 3359 Mississauga Rd, L5L 1C5, Mississauga, Ontario, Canada
- Center of Molecular and Cellular Oncology, Yale University, 300 George Street Suite 6400, New Haven, CT
Hamed Heydari
Department of Molecular Genetics, University of Toronto, Medical Science Building, Room 4386, 1 King’s College Cir, M5S1A8, Toronto, Ontario, Canada, Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, 160 College Street, Room 230, M5S3E1, Toronto, Ontario, Canada
ORCID iD: 0000-0001-6708-9116
Alex N Nguyen Ba
Department of Cell and Systems Biology, University of Toronto, Ramsay Wright Laboratories, 25 Harbord St, M5S3G5, Toronto, Ontario, Canada, Department of Biology, University of Toronto at Mississauga, 3359 Mississauga Rd, L5L 1C5, Mississauga, Ontario, Canada
ORCID iD: 0000-0003-1357-6386
- Corresponding author; email: alex.nguyenba@utoronto.ca

Version history

Preprint posted: October 21, 2023
Sent for peer review: November 21, 2023
Reviewed Preprint version 1: March 20, 2024
Reviewed Preprint version 2: September 20, 2024
Reviewed Preprint version 3: May 21, 2025
Reviewed Preprint version 4: July 7, 2025
Version of Record published: July 28, 2025

Cite all versions

You can cite all versions using the DOI https://doi.org/10.7554/eLife.93906. This DOI represents all versions, and will always resolve to the latest one.

Copyright

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Not revised: This Reviewed Preprint includes the authors’ original preprint (without revision), an eLife assessment, and public reviews.

Reviewing Editor
Vaughn Cooper
No institution, No City, United States of America
Senior Editor
Detlef Weigel
Max Planck Institute for Biology Tübingen, Tübingen, Germany

Reviewer #1 (Public Review):

In this paper, N'Guessan et al report a study of expression QTL (eQTL) mapping in yeast using single cells. The authors make use of advances in single-cell RNAseq (scRNAseq) in yeast to increase the efficiency with which this type of analysis can be undertaken. Building on prior research led by the senior author that entailed genotyping and fitness profiling of almost 100,000 cells derived from a cross between two yeast strains (BY and RM) they performed scRNAseq on a subset of 4,489 individual cells. To address the sparsity of genotype data in the expression profiling they used a Hidden Markov Model (HMM) to infer genotypes and then identify the most likely known lineage genotype from the original dataset. To address the relationship between variance in fitness and gene expression the authors partition the variance to investigate the sources of variation. They then perform eQTL mapping and study the relationship between eQTL and fitness QTL identified in the earlier study.

This paper seeks to address the challenging question of how quantitative trait variation and expression variation are related. scRNAseq represents an appealing approach to eQTL mapping as it is possible to simultaneously genotype individual cells and measure expression in the same cell. As eQTL mapping requires large sample sizes to identify statistical relationships, this approach is likely to dramatically increase the statistical power of such studies. However, there are several technical challenges associated with scRNAseq and the authors' study is focused on addressing those challenges. Although the authors present results suggesting the feasibility of the approach there are limitations in the conclusions that can be drawn in the current study owing to the lack of clarity in the presentation of the results. Ultimately, this study presents a proof of concept with limited novel biological insights that would nonetheless make a useful contribution to the literature if the following major points were addressed:

(1) There is insufficient information provided about the nature of data. At a minimum, the following information should be provided to enable assessment of the study: What is the total library size, how many genes are identified per cell, how many UMIs are found per cell, what is the doublet rate, and how are doublets identified (e.g. on the basis of heterozygous calls at polymorphic loci?), how many times is each genotype observed, and how many polymorphic sites are identified per cell that are the basis of genotype inferences?

(2) The prior study analyzed 18 different conditions, whereas this study only assays expression in a single condition. However, the power of the authors' approach is that its efficiency enables testing eQTLs in multiple conditions. The study would be greatly strengthened through analysis of at least one more condition, and ideally several more conditions. The previous fitness study would be a useful guide for choosing additional conditions as identifying those conditions that result in the greatest contrasts in fitness QTL would be best suited to testing the generalizations that can be drawn from the study.

(3) Alternatively, the authors could demonstrate the power of their approach by applying it to a cross between two other yeast strains. As the cross between BY and RM has been exhaustively studied, applying this approach to a different cross would increase the likelihood of making novel biological discoveries.

(4) Figure 1 is misleading as A presents the original study from 2022 without important details such as how genotypes were identified. It is unclear what the barcode is in this study and how it is used in the analysis. Is the barcode for each lineage transcribed so that it is identified in the scRNAseq data? Or, does the barcode in B refer to the cell index barcode? A clearer presentation and explanation of terms are needed to understand the method.

(5) The rationale for the analysis reported in Figure 2B is unclear. The fitness data are from the previous study and the goal is to estimate the heritability using the genotyping data from the scRNAseq data. What is the explanation for why the data don't agree for only one condition, i.e. 37C? And, what are we to understand from the overall result?

(6) Figure 3 presents an analysis of variance partitioning as a Venn diagram. This summarized result is very hard to understand in the absence of any examples of what the underlying raw data look like. For example, what does trait variation look like if only genotype explains the variance or if only gene expression explains the variance? The presented highly summarized data is not intuitive and its presentation is poor - the result that is currently provided would be easier to read in a table format, but the reader needs more information to be able to interpret and understand the result.

(7) I am concerned about the conclusions that can be drawn about expression heritability. The authors claim that expression heritability is correlated with expression levels. It seems likely that this reflects differing statistical power. How can this possibility be excluded?

(8) Conversely, the authors claim that the genes with the lowest heritability are genes involved in the cell cycle. However, uniquely in scRNAseq, cell cycle regulated genes appear to have the highest variance in the data as they are only expressed in a subset of cells. Without incorporating this fact one would erroneously conclude that the variation is not heritable. To test the heritability of cell cycle regulation genes the authors should partition the cells into each cell cycle stage based on expression.

(9) I do not understand Figure S5 and how eQTL sites are assigned to these specific classes given that the authors say that causative variation cannot be resolved because of linkage disequilibrium.

(10) The paragraph starting at line 305 is very confusing. In particular, the authors state that they identify a hotspot of regulation at the mating type locus. It is not obvious why this would be the case. Moreover, they claim that they find evidence for both MATa and MATalpha gene expression. Information is not provided about how segregants were isolated, but assuming that the authors did not dissect 25,000 tetrads to obtain 100,000 segregants I would infer that random spore using SGA was used. In that case, all cells should be MATa. The authors should clarify and explain this observation.

(11) Ultimately, it is not clear what new biological findings the authors have made. There are no novel findings with respect to causative variation underlying eQTLs and I would encourage the authors to make clearer statements in their abstract, introduction, and conclusion about the key discoveries. E.g. What are the "new associations between phenotypic and transcriptomic variations" mentioned in the abstract?

The following minor points should be addressed:

(1) The segregants should be referred to as F2 segregants as they are derived from an F1 cross.

(2) The connections to eQTLs in other organisms should be addressed in the introduction and conclusion. For example, in humans, there has been little evidence for trans eQTLs in contrast to what has been found in yeast.

(3) The authors state that an advantage of scRNAseq over bulk is that it captures rare cell populations (line 79), but this advantage is not exploited in this study.

(4) The authors use ~5% of the lineages from the original study. There is no rationale for why this is an appropriate sample size. Is there an argument for using more cells in eQTL mapping or conversely could the authors ask if fewer cells would provide similar conclusions by downsampling?

(5) I do not agree that the use of UMIs overcomes the challenges of low sequencing depth. UMIs mitigate the possible technical artifacts due to massive PCR amplification.

(6) There is an inadequate reference to prior work on scRNAseq in yeast that established the methods used by the authors and eQTL mapping in human cells using scRNAseq.

(7) The use of empty quotes in Figure 4A is confusing and an alternative presentation method should be used.

(8) The authors speculate about the use of predicted fitness instead of observed fitness, but this is something they could explicitly address in their current study.

https://doi.org/10.7554/eLife.93906.1.sa1

Reviewer #2 (Public Review):

Summary:

The experiments and analysis appear to be carefully done. My concerns center on the impact of the work in its current form on the research community.

The focal yeast cross here has been the subject of many previous publications (for smaller sets of recombinant progeny), by the last author and others, including phenotyping, genotyping, transcriptomics, and proteomics. This mini-literature has proven relevant to the community because it has empirically pinpointed exactly how many variants underlie a given trait, both molecular and cellular. That is, whereas in more complex organisms we try our best to estimate/infer the full genetic architecture of varying traits from the results of mapping of necessarily weaker power, the highly-powered yeast system can access a more comprehensive mapping of the dozens of loci impinging on a given trait and learn from it. The question is what exactly we learn from the current study?

Strengths and weaknesses:

Most of the figures center on methods development and validation for the authors' single-cell RNA-seq in the yeast cross, including generating the large raw data set; analysis pipelines for mapping and genotyping (Figure 1); and higher-level analyses that recapitulate previously reported trends in heritability (Figure 2) and eQTL mapping (Figure 3 and Figure 4B-C). One potential novelty of the study is the methods per se: that is, showing that scRNA-seq works for concomitant genotyping and gene expression profiling in the natural variation context. The authors' rigor and effort notwithstanding: in my view, this can be described as modest in terms of principles. That is, the authors did a good job putting the scRNA-seq idea into practice, but their success is perhaps not surprising or highly relevant for work outside of yeast (as the discussion says). The more substantive claim by the authors for the impact of the study is that they make new observations about the role of expression in phenotype (lines 333-335). The major display item of the manuscript on this theme is Figure 4A, reporting which loci that control growth phenotype (from an earlier paper) also control expression. This is solid but I regret to say that the results strike me as modest. The discussion makes some perhaps fairly big claims that the work has helped "bridge understanding of how genetic variation influences transcriptomic variation" and ultimately cellular phenotype. But with the data as they stand, the authors have missed an opportunity to crystallize exactly how a given variant affects expression (perhaps in waves of regulators affecting targets that affect more regulators) and then phenotype, except for the speculations in the text on lines 305-319. The field started down this road years ago with Bayesian causality inference methods applied to eQTL and phenotype mapping (via e.g. the work of Eric Schadt). The authors could now try Mendelian randomization-type fine-grained detailed models for more firepower toward the same end, and/or experimental tests of the genotype-to-expression-to-phenotype relationship. I would see these directions, motivated by fundamental questions that are relevant to the field at large, as leading to a major advance for this very crowded field. As it stands, I felt their absence in this manuscript especially if the authors are selling principles about linking expression and phenotype as their take-home. I also wonder whether the co-mapping of expression and growth traits in Figure 4A would have been possible with e.g. the bulk RNA-seq from Albert et al., 2018, and I recommend that the authors repeat the Figure 4A-type analyses with the latter to justify their statement that their massive scRNA data set would actually be necessary for them to bear fruit (lines 386-388).

I also read the discussion of the manuscript as bringing to the fore some of the challenges a reader has in judging the current state of the results to be of actionable impact. The discussion, and the manuscript, will be improved if the authors can put the work in context, posing concrete questions from the field and stating how they are addressed here and what's left to do.

https://doi.org/10.7554/eLife.93906.1.sa0