Refining the resolution of the yeast genotype-phenotype map using single-cell RNA-sequencing

  1. Department of Cell and Systems Biology, University of Toronto, Ramsay Wright Laboratories, 25 Harbord St, M5S3G5, Toronto, Ontario, Canada
  2. Department of Biology, University of Toronto at Mississauga, 3359 Mississauga Rd, L5L 1C5, Mississauga, Ontario, Canada
  3. Department of Molecular Genetics, University of Toronto, Medical Science Building, Room 4386, 1 King’s College Cir, M5S1A8, Toronto, Ontario, Canada
  4. Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, 160 College Street, Room 230, M5S3E1, Toronto, Ontario, Canada

Peer review process

Revised: This Reviewed Preprint has been revised by the authors in response to the previous round of peer review; the eLife assessment and the public reviews have been updated where necessary by the editors and peer reviewers.

Editors

  • Reviewing Editor
    Vaughn Cooper
    University of Pittsburgh, Pittsburgh, United States of America
  • Senior Editor
    Detlef Weigel
    Max Planck Institute for Biology Tübingen, Tübingen, Germany

Reviewer #1 (Public review):

In the revision of their paper, N'Guessan et al. have improved the report of their study of expression QTL (eQTL) mapping in yeast using single cells. The authors make use of advances in single cell RNAseq (scRNAseq) in yeast to increase the efficiency with which this type of analysis can be undertaken. Building on prior research led by the senior author that entailed genotyping and fitness profiling of almost 100,000 cells derived from a cross between two yeast strains (BY and RM), they performed scRNAseq on a subset of ~5% (n = 4,489) individual cells. To address the sparsity of genotype data in the expression profiling they used a Hidden Markov Model (HMM) to infer genotypes and then identify the most likely known lineage genotype from the original dataset. To address the relationship between variance in fitness and gene expression the authors partition the variance to investigate the sources of variation. They then perform eQTL mapping and study the relationship between eQTL and fitness QTL identified in the earlier study.
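The HMM step summarized above can be illustrated with a minimal sketch. It is not the authors' model, which this review does not describe: it assumes a two-state chain (BY or RM) along ordered SNPs, a single hypothetical recombination probability between adjacent SNPs (`p_recomb`), a single hypothetical read error rate (`p_error`), and `None` for SNPs with no read coverage. Viterbi decoding then returns the most likely parental origin at every site.

```python
import math

def viterbi_genotypes(obs, p_recomb=0.01, p_error=0.05):
    """Most likely parental origin (BY/RM) at each SNP of one cell.

    obs: list with "BY", "RM", or None (no read) per ordered SNP.
    p_recomb and p_error are illustrative values, not estimates from the study.
    """
    if not obs:
        return []
    states = ("BY", "RM")

    def emit(state, o):
        if o is None:  # an uncovered SNP carries no information
            return 0.0
        return math.log(1 - p_error) if o == state else math.log(p_error)

    def trans(a, b):
        return math.log(1 - p_recomb) if a == b else math.log(p_recomb)

    # Log-probability of the best path ending in each state at the first SNP.
    score = {s: math.log(0.5) + emit(s, obs[0]) for s in states}
    back = []
    for o in obs[1:]:
        new_score, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: score[p] + trans(p, s))
            new_score[s] = score[prev] + trans(prev, s) + emit(s, o)
            pointers[s] = prev
        score = new_score
        back.append(pointers)

    # Trace the best path back from the final SNP.
    path = [max(states, key=lambda s: score[s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path

# Hypothetical sparse calls: gaps are filled and an isolated conflicting read is smoothed out.
print(viterbi_genotypes(["BY", None, "BY", "RM", "BY", None, "BY"]))
```

Matching the decoded path to the closest known lineage genotype, as the review describes, would be a separate step that this sketch does not cover.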

This paper seeks to address the question of how quantitative trait variation and expression variation are related. scRNAseq represents an appealing approach to eQTL mapping as it is possible to simultaneously genotype individual cells and measure expression in the same cell. As eQTL mapping requires large sample sizes to identify statistical relationships, the use of scRNAseq is likely to dramatically increase the statistical power of such studies. However, there are several technical challenges associated with scRNAseq and the authors' study is focused on addressing those challenges. Most of the points raised by my review of the initial version have been addressed. However, one point remains and one additional point should be considered.

(1) Given that the authors overcame many technical and analytical challenges in the course of this research, the study would be greatly strengthened through analysis of at least one, and ideally several, more conditions which would expand the conclusions that could be drawn from the study and demonstrate the power of using scRNAseq to efficiently quantify expression in different environments.

(2) In this version the authors have introduced the use of data imputation using a published algorithm, DISCERN. This has greatly increased the variation explained by their model as presented in figure 3. However, it is possible that the explained variance is now an overestimation as a result of using the imputed expression data. I think that it would be appropriate to present figure 3 using the sparse data presented in the initial version of the paper and the newly presented imputed data so that the reader can draw their own conclusions about the interpretation.

Reviewer #2 (Public review):

The authors now say the main take-home for their work is (1) they have established methods for linkage mapping with scRNA-seq and that these (2) "can help gain insights about the genotype-phenotype map at a broader scale." My opinion in this revision is much the same as it was in the first round: I agree that they have met the first goal, and the second theme has been so well explored by other literature that I'm not convinced the authors' results meet the bar for novelty and impact. To my mind, success for this manuscript would be to support the claim that the scRNA-seq approach helps "reveal hidden components of the yeast genotype-to-phenotype map." I'm not sure the authors have achieved this. I agree that the new Figure 3 is a nice addition, a result that apparently hasn't been reported elsewhere (30% of growth trait variation can't be explained by expression). The caveats are that this is a negative result that needs to be interpreted with caution; and that it would be useful for the authors to clarify whether the ability to do this calculation is a product of the scRNA-seq method per se or whether they could have used any bulk eQTL study for it. Besides this, I regret to say that I still find that the results in the revision recapitulate what the bulk eQTL literature has already found, especially for the authors' focal yeast cross: heritability, expression hotspots, the role of cis- and trans-acting variation, etc.

Likewise, when in the first round of review I recommended that the authors repeat their analyses on previous bulk RNA-seq data from Albert et al., my point was to lead the authors to a means to provide rigorous, compelling justification for the scRNA-seq approach. The response to reviewers and the text (starting on line 413) says the comparison in its current form doesn't serve this purpose because Albert et al. studied fewer segregants. Wouldn't down-sampling the current data set allow a fair comparison? Again, to my mind what the current manuscript needs is concrete evidence that the scRNA-seq method per se affords truly better insights relative to what has come before.

I also recommend that the authors take care to improve the main text for readability and professionalism. It would benefit from further structural revision throughout (especially in the figure captions) to allow high-impact conclusions to be highlighted and low-impact material to be eliminated. Figure 4 and the results text sections from line 319 onward could be edited for concision or perhaps moved to supplementary if they obscure the authors' case for the scRNA-seq approach. The text could also benefit from copy editing (e.g. three clauses starting with "while" in the paragraph starting on line 456; "od ratio" on line 415). I appreciate the authors' work on the discussion, including posing big picture questions for the field (lines 426-429), but I don't see how they have anything to do with the current scRNA-seq method.

Author response:

The following is the authors’ response to the original reviews.

Reviewer #1:

Minor

(MN1) The segregants should be referred to as F2 segregants as they are derived from an F1 cross.

We thank the reviewer for pointing out this important oversight. We indeed analyzed segregants of an F1 cross and have corrected this in the text.

(MN2) The connections to eQTLs in other organisms should be addressed in the introduction and conclusion. For example, in humans, there has been little evidence for trans eQTLs in contrast to what has been found in yeast.

We thank the reviewer for pointing this out and have improved our introduction and conclusion with such connections.

(MN3) The authors state that an advantage of scRNAseq over bulk is that it captures rare cell populations (line 79), but this advantage is not exploited in this study.

While we did not explicitly demonstrate the effect of using scRNA-seq on capturing variation in rare cell populations, the referenced literature (21, 40) provides evidence that pooled scRNA-seq captures important expression heterogeneity (which implicitly contains potentially rare expression states). In our study, this is leveraged on F2 segregants to assess expression variation within the same lineage (genotype). This impacts the partitioning of expression variance from genotype.

Thus, we mentioned this point to further support the choice of using scRNA-seq for this analysis and showed that even a few single cells enable the reconstruction of the genome and expression profile of rare cell types.

(MN4) The authors use ~5% of the lineages from the original study. There is no rationale for why this is an appropriate sample size. Is there an argument for using more cells in eQTL mapping or conversely could the authors ask if fewer cells would provide similar conclusions by downsampling?

Although scRNA-seq is highly scalable, it has limitations in terms of throughput. Indeed, a single library with 10x Genomics generates data on the order of 10^4 well-covered cells. With these limitations, our choice of ~5% of the lineages of the original study stems from the need to recover the same lineage multiple times within these 10^4 cells (in our study, each lineage is recovered on average 4 times).

While it is possible to run multiple libraries and sequencing lanes, budget limitations prevent us from running more libraries, especially since we expect power to scale with the square root of the number of lineages (there are diminishing returns).
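The diminishing-returns argument can be made concrete with a minimal sketch. It assumes only what the response states, namely that power scales with the square root of the number of lineages; the function name and the lineage counts are hypothetical.

```python
import math

def relative_power_gain(n_current, n_new):
    """Return sqrt(n_new / n_current), the gain if power scales with sqrt(lineages)."""
    if n_current <= 0 or n_new <= 0:
        raise ValueError("lineage counts must be positive")
    return math.sqrt(n_new / n_current)

# Hypothetical counts: four times as many lineages (and libraries) buys only a 2x gain.
print(relative_power_gain(1000, 4000))
```

Under this assumption, each additional library costs the same but contributes a smaller increment than the previous one.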

(MN5) I do not agree that the use of UMIs overcomes the challenges of low sequencing depth. UMIs mitigate the possible technical artifacts due to massive PCR amplification.

We thank the reviewer for this comment and will clarify this in the manuscript. Indeed, we intended to refer to the breadth of coverage (instead of the depth), which would usually manifest with massive PCR amplification of few transcripts.
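The reviewer's point about UMIs can be sketched as follows: reads sharing a cell barcode, gene, and UMI are collapsed into one molecule, which removes PCR duplicates but cannot recover transcripts that were never captured. This is a simplified illustration with hypothetical barcodes and gene names; it ignores UMI sequencing errors, which production pipelines typically correct.

```python
from collections import defaultdict

def count_umis(reads):
    """Count distinct UMIs per (cell barcode, gene).

    reads: iterable of (cell_barcode, gene, umi) tuples. Reads identical in all
    three fields are treated as PCR duplicates of a single captured molecule.
    """
    seen = defaultdict(set)
    for cell, gene, umi in reads:
        seen[(cell, gene)].add(umi)
    return {key: len(umis) for key, umis in seen.items()}

# Hypothetical reads: three PCR copies of one molecule plus one distinct molecule.
reads = [
    ("cellA", "GENE1", "AACG"),
    ("cellA", "GENE1", "AACG"),
    ("cellA", "GENE1", "AACG"),
    ("cellA", "GENE1", "TTGC"),
    ("cellB", "GENE1", "AACG"),
]
print(count_umis(reads))
```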

(MN6) There is an inadequate reference to prior work on scRNAseq in yeast that established the methods used by the authors and eQTL mapping in human cells using scRNAseq.

We thank the reviewer for this and have added more context on benchmarks of scRNA-seq methods in yeast (Drop-seq, etc.) and on sc-eQTL mapping in humans. Additionally, we have cited Jariani et al. (2020) in eLife, where similar techniques were employed for scRNA-seq in yeast.

(MN7) The use of empty quotes in Figure 4A is confusing and an alternative presentation method should be used.

We will remove these empty quote characters and replace them with a more meaningful representation, such as “none”.

(MN8) The authors speculate about the use of predicted fitness instead of observed fitness, but this is something they could explicitly address in their current study.

We thank the reviewer for this comment but have decided not to perform a whole new bulk-segregant analysis experiment (X-QTL) to identify QTL that way. However, we do agree that we could in principle use the QTL that were identified in our previous study (Nguyen Ba et al., 2022). Despite this, we do not see the need for this because the predicted fitness is the overlap between genotype and phenotype (within the variance partitioning framework, it is the ‘narrow-sense heritability’ if one ignores epistasis). Thus, the use of predicted fitness when partitioning for expression variation would be constrained to that overlap (as opposed to the real observed fitness). This means that within the variance partitioning framework, the overlap of genotype, expression, and fitness is fully recapitulated by using predicted fitness instead (given that this predicted fitness is accurate to the narrow-sense heritability). In our previous study, we found that the QTL essentially predict all of the narrow-sense heritability. We believe it is therefore evident that the use of predicted fitness would be sufficient if and only if the expression variation independent of genotype is not associated with observed fitness.

We note that our study cannot generalize whether the overlap between genotype and expression fully captures fitness variation explained by expression. Indeed, we believe this is not generalizable to many other contexts (for example, in development). Thus, at present, the use of predicted fitness remains a speculation.

Major:

(MJ1) There is insufficient information provided about the nature of data. At a minimum, the following information should be provided to enable assessment of the study: What is the total library size, how many genes are identified per cell, how many UMIs are found per cell, what is the doublet rate, and how are doublets identified (e.g. on the basis of heterozygous calls at polymorphic loci?), how many times is each genotype observed, and how many polymorphic sites are identified per cell that are the basis of genotype inferences?

We understand that these metrics help the reader gauge the power of our approach, and we have integrated them in the manuscript in Table 1.

(MJ2) The prior study analyzed 18 different conditions, whereas this study only assays expression in a single condition. However, the power of the authors' approach is that its efficiency enables testing eQTLs in multiple conditions. The study would be greatly strengthened through analysis of at least one more condition, and ideally several more conditions. The previous fitness study would be a useful guide for choosing additional conditions as identifying those conditions that result in the greatest contrasts in fitness QTL would be best suited to testing the generalizations that can be drawn from the study.

We agree that a major strength of our approach is that it rapidly allows eQTL mapping in several conditions. While the experiments presented here are likely less expensive than the classical eQTL mapping experiments, the cost of 10x Genomics libraries and sequencing is still an important consideration. The pleiotropy analysis of the prior study was substantially difficult to interpret and put in context, and thus we decided to focus on a proof of concept and leave room for a more thorough analysis of multiple environments for a future study. We acknowledge that this is a main weakness of our manuscript.

(MJ3) Alternatively, the authors could demonstrate the power of their approach by applying it to a cross between two other yeast strains. As the cross between BY and RM has been exhaustively studied, applying this approach to a different cross would increase the likelihood of making novel biological discoveries.

We thank the reviewers for this suggestion, and it is indeed something that our lab is considering. Currently, one of the main points of our manuscript still relies on growth measurements of segregants (the fitness), which we cannot obtain from segregants and scRNA-seq alone.

Unfortunately, in this experimental design, it is difficult to obtain the fitness of cells and the genotype simultaneously because the barcode of the segregant is not expressed and not frequently read during genotyping. Thus, we still need to perform a whole QTL panel for a new cross without substantial re-engineering.

That being said, we are working on this but feel that including a new panel in this study is beyond the scope of our manuscript.

(MJ4) Figure 1 is misleading as A presents the original study from 2022 without important details such as how genotypes were identified. It is unclear what the barcode is in this study and how it is used in the analysis. Is the barcode for each lineage transcribed so that it is identified in the scRNA-seq data? Or, does the barcode in B refer to the cell index barcode? A clearer presentation and explanation of terms are needed to understand the method.

Because F2 segregant lineage barcodes are not expressed, the barcode indicated in Figure 1B refers to cell barcodes from 10x Genomics. Our present study does not make use of the lineage barcode. We clarified this in the figure by stating that panel A refers to the original study from 2022 and by explicitly mentioning ‘cell barcodes’.

(MJ5) The rationale for the analysis reported in Figure 2B is unclear. The fitness data are from the previous study and the goal is to estimate the heritability using the genotyping data from the scRNA-Seq data. What is the explanation for why the data don't agree for only one condition, i.e. 37C? And, what are we to understand from the overall result?

The rationale of Figure 2A/B is to show that cell lineage genotyping with scRNA-seq yields consistent results with previous genotype-phenotype analyses of the same cross. While Figure 2A shows that the single-cell imputed genotypes resemble the reference panel (sequenced in the Nguyen Ba 2022 study), Figure 2B shows that the variance partitioning to associate genotype to phenotype can be performed using the single-cell genotypes themselves (bypassing the reference panel). We believe this is an interesting result given that the reads obtained by scRNA-seq are constrained to a subset of SNP. However, we note that if the imputed single-cell genotypes were perfectly matching with the reference panel, it would not be surprising that one could do genotype-phenotype mapping from the single-cell genotypes.

In Figure 2B, we tested whether the similarity of the single-cell imputed genotypes to the reference panel was enough to estimate heritabilities (another summary statistic).

In the remaining paragraphs of that result section, we further discuss that the single-cell lineage genotypes can be used for QTL mapping as well, recapitulating many of the QTL identified in the reference panel (provided that one controls for power). This result did not make it as a main Figure but is included in Figure S4.

That being said, we decided to update the figure by comparing the estimates in subsamples of the batch 1 scRNA-seq data to subsamples of the batch 1 reference panel and subsamples of the full reference panel. Subsampling was performed to control for power in the variance partitioning. We also noticed that the fitness of several F2 segregants is missing for the phenotypes 33C, 35C and 37C in the original study, so we decided to exclude these environments.
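Subsampling to control for power can be sketched as drawing, without replacement, the same number of lineages from each panel before estimating heritability. This is a minimal illustration, not the authors' procedure: the function name, panel sizes, and seeds are hypothetical.

```python
import random

def subsample_lineages(lineage_ids, target_size, seed=0):
    """Draw target_size lineages without replacement, reproducibly for a given seed."""
    lineage_ids = list(lineage_ids)
    if target_size > len(lineage_ids):
        raise ValueError("target_size exceeds the number of available lineages")
    rng = random.Random(seed)  # local generator, so global random state is untouched
    return rng.sample(lineage_ids, target_size)

# Hypothetical panels: shrink the larger one to the size of the smaller one,
# repeating over several seeds to see how much the estimate varies.
large_panel = range(5000)
small_panel_size = 1000
subsamples = [subsample_lineages(large_panel, small_panel_size, seed=s) for s in range(3)]
print([len(s) for s in subsamples])
```

Repeating the draw with different seeds gives a distribution of estimates at matched sample size, which is what makes the comparison between panels fair.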

(MJ6) Figure 3 presents an analysis of variance partitioning as a Venn diagram. This summarized result is very hard to understand in the absence of any examples of what the underlying raw data look like. For example, what does trait variation look like if only genotype explains the variance or if only gene expression explains the variance? The presented highly summarized data is not intuitive and its presentation is poor - the result that is currently provided would be easier to read in a table format, but the reader needs more information to be able to interpret and understand the result.

The Venn diagram is largely adopted in the context of variance partitioning (see Cohen, Jacob, and Patricia Cohen. 1975. Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences.) but we realize that it has not been used often for displaying heritability estimates. To this end, we have added explanatory labels for the biological meaning of the areas or components of the diagram in the Figure and in the text.
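For readers unfamiliar with this representation, the areas of a two-set diagram follow directly from the R² of three nested models. The sketch below shows the standard arithmetic of this kind of partitioning, assuming one model with genotype only, one with expression only, and one with both; the function name and R² values are hypothetical and are not results from the study.

```python
def partition_variance(r2_genotype, r2_expression, r2_both):
    """Split trait variance into the regions of a two-set Venn diagram.

    r2_genotype   : R^2 of the model using genotype only
    r2_expression : R^2 of the model using expression only
    r2_both       : R^2 of the model using both predictor sets
    """
    return {
        "genotype_only": r2_both - r2_expression,    # explained by genotype alone
        "expression_only": r2_both - r2_genotype,    # explained by expression alone
        "shared": r2_genotype + r2_expression - r2_both,  # overlap of the two circles
        "unexplained": 1.0 - r2_both,                # outside both circles
    }

# Hypothetical R^2 values chosen for illustration only.
print(partition_variance(0.5, 0.25, 0.625))
```

The three explained components sum to the R² of the full model. The shared term can come out negative when one predictor set suppresses the other, which is one reason such diagrams need careful labeling.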

(MJ7) I am concerned about the conclusions that can be drawn about expression heritability. The authors claim that expression heritability is correlated with expression levels. It seems likely that this reflects differing statistical power. How can this possibility be excluded?

We thank the reviewer for highlighting this. We now explicitly acknowledge this potential confounding factor in the manuscript.

(MJ8) Conversely, the authors claim that the genes with the lowest heritability are genes involved in the cell cycle. However, uniquely in scRNA-seq, cell cycle regulated genes appear to have the highest variance in the data as they are only expressed in a subset of cells. Without incorporating this fact one would erroneously conclude that the variation is not heritable. To test the heritability of cell cycle regulation genes the authors should partition the cells into each cell cycle stage based on expression.

The reviewer is right to say that the low heritability of cell cycle control genes could be explained by the fact that these genes are only expressed in a subset of the dataset. Indeed, a high transcriptomic variance does not necessarily imply a low expression heritability: the cell cycle could be the residual of the expression heritability model, i.e. it explains expression variance with low association to genetic mutation.

That being said, our result is consistent with results obtained from yeast bulk RNA-seq (Albert et al. 2018), in which cell cycle is averaged out.

In our study, we also average out the cell-cycle as we use the consensus expression and the consensus genome to estimate the heritability.

(MJ9) I do not understand Figure S5 and how eQTL sites are assigned to these specific classes given that the authors say that causative variation cannot be resolved because of linkage disequilibrium.

The rationale for Figure S5 is to show that the QTL model obtained from single-cell data is consistent with the reference panel QTL mapping experiment. Although there is uncertainty around the exact position of the QTL, we relied on the loci with the highest likelihood and showed that the datasets have consistent features. This is enabled by the fact that the QTL identified using the scRNA-seq genotypes are the ones with largest effect size in the reference panel, and are thus more likely to be mapped accurately.

(MJ10) The paragraph starting at line 305 is very confusing. In particular, the authors state that they identify a hotspot of regulation at the mating type locus. It is not obvious why this would be the case. Moreover, they claim that they find evidence for both MATa and MATalpha gene expression. Information is not provided about how segregants were isolated, but assuming that the authors did not dissect 25,000 tetrads to obtain 100,000 segregants I would infer that random spore using SGA was used. In that case, all cells should be MATa. The authors should clarify and explain this observation.

Although most of the cells have the MATa mating type (as selected by random spore using SGA), it is well known, and discussed in the Nguyen Ba et al. paper, that there are a few lineages with other mating types or diploids (they are leakers in the selection process).

Indeed, we verified that we can detect a small number of MATalpha cells or diploids within this pool.

(MJ11) Ultimately, it is not clear what new biological findings the authors have made. There are no novel findings with respect to causative variation underlying eQTLs and I would encourage the authors to make clearer statements in their abstract, introduction, and conclusion about the key discoveries. E.g. What are the "new associations between phenotypic and transcriptomic variations" mentioned in the abstract?

This paper focuses more on the proof of concept that scRNA-seq can help integrate expression data in genotype-phenotype map (GPM) analysis to reveal broad-scale associations between fitness and expression. Indeed, novel findings include new hotspots of expression regulation in the RM/BY genetic background. We also find that trans-regulation of expression has more impact than cis-regulation on fitness, and we evaluate the strength of the association between the genome, the transcriptome and fitness (in one environment). Additionally, the analysis reveals biological questions that cannot be answered even by increasing the experimental scale of eQTL mapping experiments. For example, we find that most of the missing heritability is not explained by expression. These key points will be clarified in the abstract, introduction and conclusion as suggested by the editors.

Reviewer #2:

(MJ1) Most of the figures center on methods development and validation for the authors' single-cell RNA-seq in the yeast cross […] One potential novelty of the study is the methods per se: that is, showing that scRNA-seq works for concomitant genotyping and gene expression profiling in the natural variation context. The authors' rigor and effort notwithstanding: in my view, this can be described as modest in terms of principles. That is, the authors did a good job putting the scRNA-seq idea into practice, but their success is perhaps not surprising or highly relevant for work outside of yeast (as the discussion says).

Although the scope of the method is limited, we think that it can apply to any large-scale dataset in which transcription variance and genetic diversity are not small. This can help reduce the lack of associations between trait heritability and expression regulation, which is frequent as these two parameters are often not measured within the same dataset.

We can, however, think of some other settings where a similar experiment may be interesting. This includes, for example, pooling cells from different human individuals (with enough genetic diversity) and applying the same scRNA-seq method to back-identify the individuals and matching them to a particular phenotype. We believe our proof of concept is therefore an important contribution as these other experiments might have broad implications.

(MJ2) The more substantive claim by the authors for the impact of the study is that they make new observations about the role of expression in phenotype (lines 333-335). The major display item of the manuscript on this theme is Figure 4A, reporting which loci that control growth phenotype (from an earlier paper) also control expression. This is solid but I regret to say that the results strike me as modest.

This paper focuses more on the proof of concept that scRNA-seq can help integrate expression data in genotype-phenotype map (GPM) analysis to reveal broad-scale associations between fitness and expression. Indeed, novel findings include new hotspots of expression regulation in the RM/BY genetic background. We also find that trans-regulation of expression has more impact than cis-regulation on fitness, and we evaluate the strength of the association between the genome, the transcriptome and fitness (in one environment). Additionally, the analysis reveals biological questions that cannot be answered even by increasing the experimental scale of eQTL mapping experiments. For example, we find that most of the missing heritability is not explained by expression. These key points will be clarified in the abstract, introduction and conclusion as suggested by the editors.

(MJ3) The discussion makes some perhaps fairly big claims that the work has helped "bridge understanding of how genetic variation influences transcriptomic variation" and ultimately cellular phenotype. But with the data as they stand, the authors have missed an opportunity to crystallize exactly how a given variant affects expression (perhaps in waves of regulators affecting targets that affect more regulators) and then phenotype, except for the speculations in the text on lines 305-319. The field started down this road years ago with Bayesian causality inference methods applied to eQTL and phenotype mapping (via e.g. the work of Eric Schadt). The authors could now try Mendelian randomization-type fine-grained detailed models for more firepower toward the same end, and/or experimental tests of the genotype-to-expression-to-phenotype relationship. I would see these directions, motivated by fundamental questions that are relevant to the field at large, as leading to a major advance for this very crowded field. As it stands, I felt their absence in this manuscript especially if the authors are selling principles about linking expression and phenotype as their take-home.

We thank the reviewer for this suggestion and agree that the analysis of the genotype-to-expression-to-phenotype relationship would benefit from a more fine-grained model. While we are interested in exploring this, we decided to limit the scope of this manuscript to the proof of concept that scRNA-seq can help gain insights about the genotype-phenotype map at a broader scale.

(MN1) I also wonder whether the co-mapping of expression and growth traits in Figure 4A would have been possible with e.g. the bulk RNA-seq from Albert et al., 2018, and I recommend that the authors repeat the Figure 4A-type analyses with the latter to justify their statement that their massive scRNA data set would actually be necessary for them to bear fruit (lines 386-388).

By repeating our eQTL hotspot analysis with the Albert et al. (2018) data, we observed a non-significant association between eQTL hotspots and QTL (χ² p = 0.50). That being said, there are some differences in the Albert et al. experiment that preclude us from conclusively saying whether the bulk RNA-seq experiments by Albert et al. would not bear fruit. Indeed, that experiment is only 4 times smaller in scale and so we would not expect dramatic differences. To highlight power differences, the Albert et al. paper identified about 6 eQTL per gene, while our study identified about 21, which is consistent with the power differences.
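A χ² test of this kind can be sketched for a 2x2 table that cross-classifies loci by whether they are eQTL hotspots and whether they are fitness QTL. The response does not describe how the table was built, so the layout, function name, and counts below are hypothetical; the sketch uses the Pearson statistic without continuity correction and the exact survival function for one degree of freedom.

```python
import math

def chi2_2x2(table):
    """Pearson chi-square test of independence for a 2x2 table of counts.

    table: [[a, b], [c, d]], e.g. rows = hotspot / not hotspot,
    columns = fitness QTL / not fitness QTL. No continuity correction.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    if 0 in row_totals or 0 in col_totals:
        raise ValueError("every row and column needs a nonzero total")
    chi2 = 0.0
    for i, r in enumerate(row_totals):
        for j, k in enumerate(col_totals):
            expected = r * k / n  # expected count under independence
            chi2 += (table[i][j] - expected) ** 2 / expected
    # With 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

# Hypothetical counts, for illustration only; not data from either study.
print(chi2_2x2([[20, 10], [10, 20]]))
```

With small expected counts, an exact test would usually be preferred over this approximation.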

This highlights that this scRNA-seq experiment is scalable, so the technique may be useful for further studies. In addition, this pooled scRNA-seq strategy enables analysis of the association of transcription with phenotype.

(MN2) I also read the discussion of the manuscript as bringing to the fore some of the challenges a reader has in judging the current state of the results to be of actionable impact. The discussion, and the manuscript, will be improved if the authors can put the work in context, posing concrete questions from the field and stating how they are addressed here and what's left to do.

We agree with the reviewer and have summarized our answers to some of the questions in the field in the discussion section.

All that being said, we acknowledge the limitations of our study.
