Identifying the genetic basis and molecular mechanisms underlying phenotypic correlation between complex human traits using a gene-based approach

  1. Department of Biochemistry and Biophysics, University of California, San Francisco, San Francisco, United States
  2. Caribou Biosciences, Berkeley, United States
  3. Research Computing Center, University of Chicago, Chicago, United States
  4. Department of Human Genetics, University of Chicago, Chicago, United States

Peer review process

Not revised: This Reviewed Preprint includes the authors’ original preprint (without revision), an eLife assessment, public reviews, and a provisional response from the authors.


Editors

  • Reviewing Editor
    Murim Choi
    Seoul National University, Seoul, Republic of Korea
  • Senior Editor
    Murim Choi
    Seoul National University, Seoul, Republic of Korea

Reviewer #1 (Public review):

The authors tried to quantify the difference between human complex traits by calculating genetic overlap scores between a pair of traits. Sherlock-II was devised to integrate GWAS with eQTL signals. The authors claim that Sherlock-II is superior to the previous version (robustness, accuracy, etc). It appears that their framework provides a reasonable solution to this important question, although the study needs further clarification and improvements.

(1) Sherlock-II incorporates GWAS and eQTL signals to better quantify genetic signals for a given complex trait. However, this approach is based on the hypothesis that "all GWAS signals confer association to complex trait via eQTL", which is not true (PMID: 37857933). This should be acknowledged (through mentioning in the text) and incorporated into the current setup (through differential analysis - for example, with or without eQTL signals, or with strong colocalization only).

(2) When incorporating eQTL, why did the authors use the top p-value tissues for eQTL? This approach seems simpler and probably more robust. But many eQTLs are tissue-specific. Therefore, it would also be important to know if eQTLs from appropriate tissues were incorporated instead.

(3) One of the main examples is the novel association between Alzheimer's disease and breast cancer. Although the authors provided a molecular clue underlying the association, it is still hard to comprehend the association easily, as the two diseases are generally known to be exclusive to each other. This is probably because breast cancer GWAS is performed for germline variants and does not consider the contribution of somatic variants.

(4) It would help readers understand the story better if a summary figure of the entire process were provided. The current Figure 1 does not fulfil that role.

(5) Figure 2 is not very informative. The readers would want to know more quantitative information rather than a heatmap-style display. Is there directionality to the relationship, or is it always unidirectional?

(6) In Figure 3, readers may want to know more specific information. For example, what gene signals are really driving the hypoxia signal in Alzheimer's disease vs breast cancer? And what SNP signals are driving these gene-level signals?

Reviewer #2 (Public review):

Summary:

The authors introduce a gene-level framework to detect shared genetic architecture between complex traits by integrating GWAS summary statistics with eQTL data via a new algorithm, Sherlock-II, which aggregates signals from multiple (cis/trans) eSNPs to produce gene-phenotype p-values. Shared pathways are identified with Partial-Pearson-Correlation Analysis (PPCA).

Strengths:

The authors show the gene-based approach is complementary and often more sensitive than SNP-level methods, and discuss limitations (in terms of no directionality, dependence on eQTL coverage).

Weaknesses:

(1) How do the authors handle data where tissues are missing or eQTL mapping is sparse? Would that bias which genes/traits can be linked, and might it produce false negatives or tissue-specific false positives?

(2) Aggregating SNP-level signals into gene scores can be confounded by LD; for example, a nearby causal variant for a different gene or non-expression mechanism may drive a gene's score, producing spurious gene-trait links. How do the authors prevent this?

(3) How the SNPs are assigned to genes would affect the results, because different choices can change which genes appear shared between traits. The authors could expand on this.

(4) Many reported novel trait links remain speculative without functional or orthogonal validation (e.g., colocalization, perturbation data). Thus, the manuscript's claims are inconclusive and speculative.

(5) It would be best to run LD-aware colocalization and power-matched simulations to check for robustness.

Author response:

Reviewer #1 (Public review):

The authors tried to quantify the difference between human complex traits by calculating genetic overlap scores between a pair of traits. Sherlock-II was devised to integrate GWAS with eQTL signals. The authors claim that Sherlock-II is superior to the previous version (robustness, accuracy, etc). It appears that their framework provides a reasonable solution to this important question, although the study needs further clarification and improvements.

(1) Sherlock-II incorporates GWAS and eQTL signals to better quantify genetic signals for a given complex trait. However, this approach is based on the hypothesis that "all GWAS signals confer association to complex trait via eQTL", which is not true (PMID: 37857933). This should be acknowledged (through mentioning in the text) and incorporated into the current setup (through differential analysis - for example, with or without eQTL signals, or with strong colocalization only).

The reviewer is correct that in this version of the tool we focused on SNPs that affect gene expression, as the majority of SNPs identified by GWASs are non-coding. In future improvements, we should also include coding SNPs that change the amino acid sequence of the encoded proteins. We will discuss this point further in the revised manuscript.

(2) When incorporating eQTL, why did the authors use the top p-value tissues for eQTL? This approach seems simpler and probably more robust. But many eQTLs are tissue-specific. Therefore, it would also be important to know if eQTLs from appropriate tissues were incorporated instead.

This is a simple scheme for incorporating eQTL data from multiple tissues. It assumes that the tissue giving the strongest association is the most relevant one, or the one that mainly mediates the effect of the SNP on the phenotype. This is a reasonable approach given that the tissues of origin for most phenotypes are unknown. In future improvements, we should incorporate eQTL data from the appropriate tissue(s) when they are known.
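
As an illustration of the selection rule described in this response, the following minimal Python sketch keeps, for each SNP-gene pair, the tissue with the smallest eQTL p-value. It is not taken from Sherlock-II: the record layout, the function name `strongest_tissue_per_pair`, and the SNP, gene, and tissue values are all hypothetical placeholders.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical record layout (an assumption for this sketch):
# (snp_id, gene_id, tissue, eqtl_p_value)
EqtlRecord = Tuple[str, str, str, float]


def strongest_tissue_per_pair(
    records: Iterable[EqtlRecord],
) -> Dict[Tuple[str, str], Tuple[str, float]]:
    """For each (SNP, gene) pair, keep the tissue with the smallest p-value."""
    best: Dict[Tuple[str, str], Tuple[str, float]] = {}
    for snp, gene, tissue, p_value in records:
        key = (snp, gene)
        # Replace the stored entry only when this tissue is more significant.
        if key not in best or p_value < best[key][1]:
            best[key] = (tissue, p_value)
    return best


# Toy input with made-up identifiers and p-values.
records = [
    ("rs1", "GENE_A", "liver", 1e-4),
    ("rs1", "GENE_A", "brain", 1e-9),
    ("rs2", "GENE_B", "blood", 3e-6),
]
best = strongest_tissue_per_pair(records)
```

If the relevant tissue for a trait were known, the same function could be applied after first filtering `records` to that tissue, which corresponds to the improvement mentioned above.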

(3) One of the main examples is the novel association between Alzheimer's disease and breast cancer. Although the authors provided a molecular clue underlying the association, it is still hard to comprehend the association easily, as the two diseases are generally known to be exclusive to each other. This is probably because breast cancer GWAS is performed for germline variants and does not consider the contribution of somatic variants.

This is due to one of the limitations of the current algorithm: it does not explicitly predict the direction of association. It could be that increasing the expression of a gene reduces the risk of one disease but increases the risk of another. Currently, we have to analyze the details of the SNPs to infer direction once overlapping genes are found. This needs improvement in the future.

(4) It would help readers understand the story better if a summary figure of the entire process were provided. The current Figure 1 does not fulfil that role.

We plan to incorporate the reviewer's suggestion in the revised manuscript.

(5) Figure 2 is not very informative. The readers would want to know more quantitative information rather than a heatmap-style display. Is there directionality to the relationship, or is it always unidirectional?

We will consider a different presentation in the revised manuscript.

(6) In Figure 3, readers may want to know more specific information. For example, what gene signals are really driving the hypoxia signal in Alzheimer's disease vs breast cancer? And what SNP signals are driving these gene-level signals?

We will add this information in the revised manuscript.

Reviewer #2 (Public review):

Summary:

The authors introduce a gene-level framework to detect shared genetic architecture between complex traits by integrating GWAS summary statistics with eQTL data via a new algorithm, Sherlock-II, which aggregates signals from multiple (cis/trans) eSNPs to produce gene-phenotype p-values. Shared pathways are identified with Partial-Pearson-Correlation Analysis (PPCA).

Strengths:

The authors show the gene-based approach is complementary and often more sensitive than SNP-level methods, and discuss limitations (in terms of no directionality, dependence on eQTL coverage).

Weaknesses:

(1) How do the authors handle data where tissues are missing or eQTL mapping is sparse? Would that bias which genes/traits can be linked, and might it produce false negatives or tissue-specific false positives?

Missing tissues or sparse eQTL data can certainly produce false negatives, as the signals linking the two phenotypes are simply not captured in the data. They are less likely to produce false positives as long as the statistical test is well controlled.

(2) Aggregating SNP-level signals into gene scores can be confounded by LD; for example, a nearby causal variant for a different gene or non-expression mechanism may drive a gene's score, producing spurious gene-trait links. How do the authors prevent this?

When multiple SNPs are in LD with one another and several genes lie nearby, it is generally difficult to identify the causal SNP and the causal gene it affects, so spurious gene-trait links will arise. When calculating the global similarity from the gene-trait association profiles, we tried to control for this by simulating random GWASs that have the same power as the real GWAS and preserve the LD structure. Spurious links will also be present in the simulated data (although they may appear at different loci), and these simulated data are used to calibrate the statistical significance.
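
The calibration idea in this response can be sketched as a standard empirical p-value: the observed similarity score is compared against scores obtained from simulated GWASs. The Python snippet below is only a generic sketch under that assumption. It does not reproduce the authors' simulation procedure, and the function name and the null scores are made-up placeholders.

```python
from typing import Sequence


def empirical_p_value(observed: float, null_scores: Sequence[float]) -> float:
    """Fraction of null scores at least as large as the observed score.

    The +1 in numerator and denominator is the usual correction that
    prevents an empirical p-value of exactly zero.
    """
    exceed = sum(1 for score in null_scores if score >= observed)
    return (exceed + 1) / (len(null_scores) + 1)


# Placeholder similarity scores, standing in for scores computed on
# simulated GWASs that match the real GWAS in power and LD structure.
null_scores = [0.1, 0.4, 0.2, 0.9, 0.3, 0.5, 0.2, 0.6, 0.1]
p = empirical_p_value(0.8, null_scores)
```

Because spurious LD-driven links would inflate the null scores as well as the observed score, a calibration of this kind is expected to absorb part of that inflation, provided the simulations preserve the LD structure as described.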

(3) How the SNPs are assigned to genes would affect the results, because different choices can change which genes appear shared between traits. The authors could expand on this.

We assign SNPs to genes based on their strongest eQTL association in the available data. Improvements can be made if the relevant tissues for a trait are known (see the response to Reviewer #1 above).

(4) Many reported novel trait links remain speculative without functional or orthogonal validation (e.g., colocalization, perturbation data). Thus, the manuscript's claims are inconclusive and speculative.

We agree with the reviewer that the reported trait links are speculative, and they should be treated as hypotheses generated from the computational analyses. To truly validate some of these proposed relationships, deeper functional analyses and experimental tests are needed.

(5) It would be best to run LD-aware colocalization and power-matched simulations to check for robustness.

We agree that tighter control for LD and power-matched simulations will be important for testing the robustness of the predictions.
