A conditional gene-based association framework integrating isoform-level eQTL data reveals new susceptibility genes for schizophrenia

  1. Xiangyi Li
  2. Lin Jiang
  3. Chao Xue
  4. Mulin Jun Li
  5. Miaoxin Li  Is a corresponding author
  1. Program in Bioinformatics, Zhongshan School of Medicine and The Fifth Affiliated Hospital, Sun Yat-sen University, China
  2. Key Laboratory of Tropical Disease Control (Sun Yat-sen University), Ministry of Education, China
  3. Center for Precision Medicine, Sun Yat-sen University, China
  4. Research Center of Medical Sciences, Guangdong Provincial People's Hospital, Guangdong Academy of Medical Sciences, China
  5. The Province and Ministry Co-sponsored Collaborative Innovation Center for Medical Epigenetics, Tianjin Medical University, China
  6. Guangdong Provincial Key Laboratory of Biomedical Imaging and Guangdong Provincial Engineering Research Center of Molecular Imaging, The Fifth Affiliated Hospital, Sun Yat-sen University, China

Abstract

Linkage disequilibrium and disease-associated variants in the non-coding regions make it difficult to distinguish the truly associated genes from the redundantly associated genes for complex diseases. In this study, we proposed a new conditional gene-based framework called eDESE that leveraged an improved effective chi-squared statistic to control the type I error rates and remove the redundant associations. eDESE initially performed the association analysis by mapping variants to genes according to their physical distance. We further demonstrated that the isoform-level eQTLs could be more powerful than the gene-level eQTLs in the association analysis using a simulation study. Then the eQTL-guided strategies, that is, mapping variants to genes according to their gene/isoform-level variant-gene cis-eQTLs associations, were also integrated with eDESE. We then applied eDESE to predict the potential susceptibility genes of schizophrenia and found that the potential susceptibility genes were enriched with many neuronal or synaptic signaling-related terms in the Gene Ontology knowledgebase and antipsychotics-gene interaction terms in the drug-gene interaction database (DGIdb). More importantly, seven potential susceptibility genes identified by eDESE were the target genes of multiple antipsychotics in DrugBank. Comparing the potential susceptibility genes identified by eDESE and other benchmark approaches (i.e., MAGMA and S-PrediXcan) implied that strategy based on the isoform-level eQTLs could be an important supplement for the other two strategies (physical distance and gene-level eQTLs). We have implemented eDESE in our integrative platform KGGSEE (http://pmglab.top/kggsee/#/) and hope that eDESE can facilitate the prediction of candidate susceptibility genes and isoforms for complex diseases in a multi-tissue context.

Editor's evaluation

This manuscript describes a new method of identifying disease-relevant risk genes from genome wide association studies by using a conditional gene-based method (named eDESE). The authors apply this new approach in an analysis of schizophrenia datasets, and identify meaningful biological processes and potential drug repurposing candidates. Thus, this new method could provide improved gene prioritization for fine-mapping and functional studies of specific diseases.

https://doi.org/10.7554/eLife.70779.sa0

Introduction

Genome-wide association studies (GWASs) have been used to identify genotype-phenotype associations for over a decade, and thousands of single-nucleotide polymorphisms (SNPs) have been revealed for their associations with hundreds or thousands of complex human diseases (Visscher et al., 2017; Gallagher and Chen-Plotkin, 2018). Nevertheless, conventional GWASs analyses have limited power to produce a complete set of susceptibility variants for complex diseases (Tam et al., 2019). Because most susceptibility SNPs only have small effects on a complex phenotype, conventional SNP-based association tests are generally underpowered to reveal susceptibility variants after multiple-testing corrections. Moreover, the susceptibility variants scattering randomly throughout the genome are often in strong linkage disequilibrium (LD) with numerous neutral SNPs, which makes the discrimination of truly causal variants from GWAS hits quite difficult (Tam et al., 2019). Finally, more than 90% of the disease-associated variants are in non-coding regions of the genome, and many of them are far from the nearest known gene, and it remains a challenge to link genes and a complex phenotype through the non-coding variants (Schaub et al., 2012; Maurano et al., 2012). Accordingly, corresponding methodological strategies have been proposed to make up, at least partly, for the issues mentioned above.

First, gene-based approaches can reduce the multiple-testing burdens by considering the association between a phenotype and all variants within a gene (Neale and Sham, 2004). Assigning a variant to a gene according to the physical distance of the variant from gene boundary is one of the most popular strategies for gene-based approaches. For example, MAGMA (Multi-marker Analysis of GenoMic Annotation), one of the most popular gene-based approaches, uses a multiple regression approach to incorporate LD between markers and detect multi-markers effects to perform gene-based analysis (de Leeuw et al., 2015). VEGAS, a versatile gene-based test for GWAS, incorporates information from a full set of markers (or a defined subset) within a gene and accounts for LD between markers by simulations from the multivariate normal distribution (Liu et al., 2010). GATES, a rapid gene-based association test that uses an extended Simes procedure to assess the statistical significance of gene-level associations (Li et al., 2011). SuSiE (sum of single effects), a novel and popular approach to variable selection in linear regression, can use summary statistics and LD to produce gene-level evidence of association in terms of Bayes Factor (Wang et al., 2020).

Second, evaluating the gene-phenotype associations at one gene conditioning on other genes can isolate true susceptibility genes from the redundant non-susceptibility genes (Li et al., 2019). Yang et al., 2012 proposed an approximate conditional and joint association analysis method based on linear regression analysis to estimate the individual causal variant with GWAS summary statistics. Our previously proposed conditional gene-based association approach based on effective chi-squared statistics (ECS) could remove redundantly associated genes based on the GWAS p-values of variants. Comparing the conditional gene-based association approach with ECS, MAGMA, and VEGAS suggested that the former might be more powerful to predict biologically sensible susceptibility genes (Li et al., 2019).

Third, the observation that variants in the non-coding regions were enriched in the transcriptional regulatory regions implied that these variants might affect the disease risk by altering the genetic regulation of target genes (Gallagher and Chen-Plotkin, 2018). Integration of expression quantitative trait loci (eQTL) studies and GWAS has been used to investigate the genetic regulatory effects on complex diseases. As many complex diseases manifested themselves in certain tissues, using the eQTLs of potentially phenotype-associated tissues might help identify the true susceptibility genes in the tissue context (Hekselman and Yeger-Lotem, 2020). Based on the framework of MAGMA, a method called eMAGMA integrated genetic and transcriptomic information (e.g. eQTLs) in a tissue-specific analysis to identify risk genes and was applied to identify novel genes underlying the major depression disorder (Gerring et al., 2021; Gerring et al., 2019). S-PrediXcan was developed for imputing the genetically regulated gene expression component based on GWAS summary statistics and transcriptome prediction models built from the eQTL/splicing QTL dataset of the Genotype Tissue Expression (GTEx) project (Barbeira et al., 2018). Researchers have applied S-PrediXcan to study genetic mechanisms of multiple complex traits (Gamazon et al., 2019; Gamazon et al., 2018; Huckins et al., 2019). In contrast to the considerable research focusing on integrating gene-level eQTLs with GWAS summary statistics, little attention has been paid to integrating isoform-level eQTLs with GWAS summary statistics. Michael Gandal et al. estimated the candidate risk genes of three psychiatric disorders based on GWAS summary statistics and isoform-level expression profiles. They emphasized the importance of isoform-level gene regulatory mechanisms in defining cell type and disease specificity (Gandal et al., 2018), and similar analyses and conclusions were generated for Alzheimer’s disease (Fan et al., 2021).

Although much achievement has been attained, identifying independently phenotype-associated genes with high reliability remains challenging, especially for complex diseases. Conventional gene-based approaches mainly focus on variants close (say ±5 kilo base pairs) to genes boundary and omit the distal but important variant-gene associations. Gene-based association approaches using eQTLs to identify candidate susceptibility genes and isoforms might raise the utilization rate of the distal but important variant-gene associations. The present study aimed to build a more powerful conditional gene-based framework based on a new ECS and mainly guided by eQTLs. We also investigated whether isoform-level eQTLs in the phenotype-associated tissues can help predict more significant susceptibility genes than gene-level eQTLs. The main assumption is that isoform-level eQTLs may reflect the more real regulatory relationship than gene-level eQTLs. Thus, using isoform-level eQTLs can help predict novel susceptibility genes and isoforms that the conventional gene-based approaches and gene-level eQTLs strategy cannot find. The following is the formation procedure of the assumption: Gene-level and isoform-level eQTLs are predicted based on the gene-level and isoform-level (or transcript-level) expression profiles, and the gene-level expression profiles are computed by averaging the expression of multiple isoforms produced by the gene. Thus, gene-level expression profiles may omit the expression heterogeneity among these isoforms, neutralize the opposite effects, and lower the power of gene-level eQTLs.

Results

Overview of the present study

We previously proposed an effective chi-squared statistic called ECS for the unconditional and conditional gene-based association analysis (Li et al., 2019). Then we built a unified framework called DESE to estimate the potentially phenotype-associated tissues based on the conditional gene-based association analysis with ECS and gene selective expression analysis (Jiang et al., 2019). However, we found that the previous ECS was hindered by a potential type I error inflation issue and further undermined the accuracy of DESE. Here we proposed a new conditional gene-based association framework, eDESE (eQTL-guided DESE), which could also perform conditional gene-based association analysis and geneselective expression analysis, to systematically explore the susceptibility genes and tissues associated with complex diseases by using the GWAS summary statistics and multiple gene-variant mapping strategies. eDESE inherited the framework of DESE but had two important advantages over DESE. First, eDESE is built based on a new ECS, with which the type I error could be controlled within a proper level. Second, eDESE expands the conditional gene-based association analysis of DESE by not only using physically nearby SNPs but also using the gene-level and isoform-level cis-eQTLs associations (Figure 1).

The advantages of eDESE over ECS and DESE.

First, we proposed a new ECS and chose the best exponent c between 1 and 2 to properly control the type I error. Second, we first adopted three strategies to map SNP to genes to perform the unconditional gene-based association analysis with the improved ECS. Then the unconditional gene-based association analysis results were put into the conditional gene-based association analysis and the following iterative procedure. 5 kb: 5000 base pairs. 1 Mb: 106 base pairs.

To evaluate the performance of the new ECS and eDESE, we performed extensive simulations and a real data application to schizophrenia. Specifically, we organized the present study in several sequential parts that cover the optimizing the exponent of chi-squared statistics to control the type I error rates, applying the new ECS to perform conditional gene-based association analysis in simulation data and real-world schizophrenia GWAS summary statistics data. For simplicity and clarity, the model integrating different mapping strategies, that is, physically nearby SNPs (distance), gene-level and isoform-level variant-gene cis-eQTLs associations, were named eDESE:dist, eDESE:gene and eDESE:isoform, respectively.

Choose the favorable exponent c for the correlation matrix of chi-squared statistics to control the type I error rates

We found that the exponent c in the correlation matrix of chi-squared statistics could determine the deviation of the p-values produced by the ECS tests against the uniform distribution. As shown in Figure 2, the c = 1.0 led to deflated p-values while the c = 2.0 led to inflated p-values in the upper tail of the Q-Q plot against the uniform distribution. This pattern was independent of sample sizes, variant number and phenotype distribution (for binary or continuous traits) (Figure 2). The stable trend determined by the c value also implied that the favorable c, which could properly control the type I error rate, measured by the minimal mean log fold change (MLFC), must be within range 1 and 2. Besides, our theoretical derivation also demonstrated that the c value should be within range 1 and 2. Moreover, it seemed that given the c value, the distributions of p-values were similar at different sample sizes and phenotype distributions. Figure 3 shows the favorable c obtained by the grid search algorithm at 84 different scenarios. As shown in Figure 3, most majority p-values at the sample size 10,000 and 40,000 of binary or continuous phenotypes are overlapped. Again, the favorable c values were approximately independent of trait types, sample sizes and variant number. For the sake of simplicity, we proposed to use the averaged favorable c value, 1.432, and integrated it into the improved ECS.

Q-Q plots of the p-value of the ECS test under null hypothesis based on the two extreme exponents (i.e. 1 and 2).

(a), (b), and (c) represent the variant number of 50, 100, and 500, respectively.

The boxplots of the favorable c values at different simulation scenarios.

(a) Binary and continuous phenotypes; (b) Different sample sizes; (c) Different variant number.

The type I error and power of the conditional gene-based association analysis based on the new effective chi-squared statistics (ECS)

We then investigated the type I error and power of the conditional gene-based association analysis based on the improved ECS with the favorable exponent c value. As shown in Figure 4, in six different scenarios, the conditional p-values of the genes without truly casual loci approximately follow the uniform distribution U[0,1], regardless of the variance explained by its nearby genes. Moreover, the distribution of conditional p-value was similar to that produced by the conventional likelihood ratio test for the nested linear regression models. These results suggested that the conditional gene-based association analysis based on the improved ECS could produce valid p-values for statistical inference. In contrast, the unconditional association test based on the improved ECS produced an inflated p-value due to the indirect associations produced by the nearby causal genes in the LD block. Concerning the statistical power, we found that conditional gene-based association analysis based on the improved ECS produced smaller p-values than the likelihood ratio test (Figure 5), suggesting a higher statistical power of the former. Another advantage of conditional gene-based association analysis based on the improved ECS over the likelihood ratio test was that the former did not require individual genotypes. The reason might be that the degree of freedom in the latter was inflated by the LD among variants.

Q-Q plots of the conditional, unconditional gene-based association test and likelihood-ratio test under the null hypothesis.

(a) and (d) two gene-variant pairs with the similar variant number (SIPA1L2 with 29 variants and LOC729336 with 30 variants). (b) and (e) two gene-variant pairs with the different variant number, and the first is larger than the second (CACHD1 with 41 variants and RAVER2 with eight variants). (c) and (f) Two gene-variant pairs with the different variant number, and the second is larger than the first (LOC647132 with five variants and FAM5C with 48 variants). (a), (b) and (c) The former gene has no QTL, and QTL explained 0.5 % of heritability in the latter gene. (d), (e) and (f) The former gene has no QTL, and QTL explained 1 % of heritability in the latter gene. Ten thousand phenotype datasets were simulated for each scenario. Unconditional Eff. Chi. (the red) represents unconditional association analysis at the former gene by the improved ECS. Conditional Eff. Chi (the blue) represents conditional association analysis at the former gene conditioning on the latter gene by the improved ECS. The likelihood ratio test (the yellow) was conducted based on the nested linear regression models.

Q-Q plots of the conditional gene-based association test and likelihood-ratio test at different representative gene-variant pairs.

The variant number of the two gene-variant pairs involved in (a)-(f) are the same as that in Figure 4 legend. The difference is: in (a)-(c), the QTL in either gene (former and latter) explained 0.25% of heritability. In (d)-(f), the QTL in either gene explained 0.5% of heritability. One thousand phenotype datasets were simulated for each scenario.

Apply eDESE:dist to predict the potential susceptibility genes of schizophrenia

We had demonstrated that the conditional gene-based analysis based on the improved ECS was more powerful than the likelihood ratio test in each simulation scenario. To evaluate the performance of the ECS and eDESE in the real-world data, we used a recent large-scale GWAS summary statistics dataset (Trubetskoy et al., 2022) and gene expression profiles (GTEx v8) of multiple human tissues (Consortium, 2020) to identify the susceptibility genes of schizophrenia. We found that the improved ECS identified 739 significant genes without conditioning on gene-expression profiles. Furthermore, we also found 205 significant genes out of the above 739 genes identified by eDESE:dist based on the improved ECS by conditioning on the gene-level expression profiles (see details in Supplementary file 1a).

We then compared the significant susceptibility genes identified by eDESE:dist with that of MAGMA. We identified 619 significant susceptibility genes based on MAGMA. The significant gene count of MAGMA was about three times larger than that of eDESE:dist, which might partly result from the conditional gene-based analysis’s advantage of removing the redundantly associated genes. Besides, more than half of the significant susceptibility genes identified by eDESE:dist were also identified by MAGMA (Figure 6a).

The comparison of the potential susceptibility genes for schizophrenia identified by MAGMA and eDESE:dist.

(a) The Venn diagram shows the overlapped and unique genes identified by MAGMA and eDESE:dist. (b) The bar plot shows the top GO enrichment terms of the overlapped genes. MF: Molecular Function of GO. BP: Biological Process terms of GO. CC: Cellular Component terms of GO. The x-axis label represents the top ( ≤ 10) significant GO enrichment terms (MF, BP, and CC). The y-axis label represents the negative log10 of the adjusted p-value of each term. See also Figure 6—source data 1, Figure 6—source data 2 and Figure 6—source data 3.

Figure 6—source data 1

Significant genes identified by MAGMA.

https://cdn.elifesciences.org/articles/70779/elife-70779-fig6-data1-v1.xlsx
Figure 6—source data 2

Significant genes identified by eDESE:dist.

https://cdn.elifesciences.org/articles/70779/elife-70779-fig6-data2-v1.xlsx
Figure 6—source data 3

Enrichment results of the common 105 genes identified by MAGMA and eDESE:dist.

https://cdn.elifesciences.org/articles/70779/elife-70779-fig6-data3-v1.xlsx

Further, we performed Gene Ontology (GO) enrichment analysis to study the functional annotations of these significant genes. Interestingly, we found that most GO:BP and GO:CC enrichment terms of the overlapped genes were neuronal-, dendrite- or synaptic signaling-related terms. The GO:MF enrichment terms of the overlapped genes were all about signaling transduction (see examples in Figure 6b and details in Supplementary file 1b). We then found that the unique genes identified by eDESE:dist were enriched with three GO:CC terms, that is, dendrite, dendritic tree and distal axon, which were all dendrite-related terms. However, although the unique genes identified by MAGMA were enriched with thirty-one GO terms, none of these terms were neuronal-, dendrite-, or synaptic signaling-related terms. Moreover, systematic text-mining results in PubMed showed that 67 of the 205 (~32.7%) and 170 of the 619 (~27.5%) potential susceptibility genes had at least two search hits for eDESE:dist and MAGMA, respectively (see details in Supplementary file 1c). The GO enrichment results and the text-mining results both implied the utility of eDESE:dist.

Evaluate the type I error and power of gene-level eQTLs and isoform-level eQTLs in association analysis based on ECS

Next, we mapped variants to genes (or isoforms) according to their variant-gene/isoform eQTLs associations. Since the isoform-level and corresponding gene-level expression profiles were quantified based on the same RNA-sequencing data, we then investigated the type I error and power of the association analysis by ECS based on the gene-level and isoform-level eQTLs.

We considered the multiple different scenarios that variants affected phenotype through regulating the gene expression. We simulated genotype data, gene-level and isoform-level expression data and corresponding phenotype data. In Table 1, AllVar means that all variants are used in the gene-based association analysis based on ECS. IsoeQTL denotes that the eQTL is associated with a susceptibility isoform. GeneQTL denotes that the eQTL is associated with a gene whose expression is averaged from three susceptibility isoforms (homogeneity). Gen3eQTL and Gen6eQTL denote that the eQTL is associated with a gene whose expression is averaged by three and six isoforms, one of which is the susceptibility isoform (heterogeneity), respectively.

Table 1
Type I error and power of different simulation scenarios in association analysis.
ScenariosImportant parametersBinary traitContinuous trait
EgVgpVgeAllVarIsoeQTLGeneQTLGen3eQTLGen6eQTLAllVarIsoeQTLGeneQTLGen3eQTLGen6eQTL
Type I error
1000.0500000.002000.0010.0020.003
2000.1500.0020.001000000.0010
3000.30000.002000.0010.0010.0020.002
Power
400.0050.050.2510.0360.0220.0340.0310.2460.0320.0190.0320.038
500.0050.150.2190.0210.0130.0230.0320.3010.0250.0170.0370.043
600.0050.30.2290.0280.0170.0210.0340.2820.0240.0170.0250.039
70.100.0500.0170.0190.0060.00100.0170.0270.0090.001
80.100.1500.2130.2210.1130.0540.0020.2450.2450.1320.068
90.100.30.0180.7040.6590.5810.3880.0270.720.6860.6070.446
100.10.0050.050.2880.0520.0760.0430.0430.3130.0630.0910.050.041
110.10.0050.150.4030.330.3020.1990.1340.460.3570.3340.2290.136
120.10.0050.30.5690.7780.7380.6770.4850.620.8050.7740.7120.512
  1. Eg denotes the effect size of gene expression on phenotype. Vgp denotes phenotype variance explained by all variants. Vge denotes gene expression variance explained by all variants.

As shown in Table 1, our simulations' type I error rates are controlled within 0% ~ 0.3% (on average <0.1%) according to the p-value threshold 0.001 (scenarios 1–3). As expected, in scenarios 4–6, where gene expression cannot affect the phenotype (Eg = 0 in Table 1), AllVar is much more powerful than eQTLs. Further, in scenarios 7–12, where gene expression can affect the phenotype, our results suggest that the powers of eQTLs roughly increase with the phenotype variance explained by all variants (Vgp in Table 1) and gene expression variance explained by all variants (Vge in Table 1). Moreover, in scenarios 7–12, associations test totally based on the susceptibility isoforms (IsoeQTL and GeneQTL in Table 1) are more powerful than those based on the gene-level eQTLs. While IsoeQTL is computed based on fewer (i.e. one) susceptibility isoforms than GeneQTL (i.e. three), the power of IsoeQTL is the equal of GeneQTL. Thus, our simulation results revealed that isoform-level eQTLs were more powerful than gene-level eQTLs in association analysis in the scenarios that gene expression could affect the phenotype.

Estimate the potentially phenotype-associated tissues for schizophrenia

Like DESE, eDESE can produce phenotype-associated genes and tissues. Therefore, we firstly adopted eDESE:dist to predict the phenotype-associated tissues of schizophrenia and found that all thirteen brain regions were significantly associated with schizophrenia and ranked the top based on the gene-level and isoform-level expression profiles, respectively (see details in Supplementary file 1d).

Since all thirteen brain regions were predicted as the potentially phenotype-associated tissues of schizophrenia by eDESE:dist, removing the possible false positives would be necessary. Then we resorted to the eQTLs and assumed that if a tissue (say T1) is a phenotype-associated tissue, potential susceptibility genes identified based on the eQTLs of T1 will be more likely to be phenotype-associated genes and selectively express in T1 or similar tissues. We then computed the gene-level and isoform-level eQTLs of all thirteen brain regions and predicted the potentially phenotype-associated tissues using eDESE:gene and eDESE:isoform, respectively. Our results showed that the brain (all thirteen brain regions as a whole) was predicted as the schizophrenia-associated tissue based on the gene-level eQTLs of twelve brain regions, respectively (Table 2). However, on the more precise resolution, we found brain was predicted as schizophrenia-associated tissue only based on the isoform-level eQTLs of five brain regions, respectively (Table 2). Thus, the five brain regions were collectively optimized as the potentially phenotype-associated tissues of schizophrenia by eDESE:dist, eDESE:gene, and eDESE:isoform.

Table 2
The result about whether the brain was optimized as the schizophrenia-associated tissue based on each brain region’s gene/isoform-level eQTLs.
Brain regionsGene-level eQTLIsoform-level eQTL
Brain-Anterior cingulate cortex (BA24)YesYes
Brain-CerebellumYesYes
Brain-Frontal Cortex (BA9)YesYes
Brain-HippocampusYesYes
Brain-Spinal cord (cervical c-1)YesYes
Brain-AmygdalaYesNo
Brain-Caudate (basal ganglia)YesNo
Brain-Cerebellar HemisphereYesNo
Brain-CortexYesNo
Brain-HypothalamusYesNo
Brain-Nucleus accumbens (basal ganglia)YesNo
Brain-Putamen (basal ganglia)NoNo
Brain-Substantia nigraYesNo
  1. “Yes” denotes that brain (i.e., all thirteen brain tissues) was estimated as the significantly schizophrenia-associated tissue based on the gene/isoform-level eQTLs of the tissue. “No” denotes the contrary. The font names of the optimized brain regions are bold. See also Table 2—source data 1, Table 2—source data 2, Table 2—source data 3 and Table 2—source data 4.

Table 2—source data 1

Tissue significance estimated by eDESE:dist based on the gene-level expression profiles.

https://cdn.elifesciences.org/articles/70779/elife-70779-table2-data1-v1.xlsx
Table 2—source data 2

Tissue significance estimated by eDESE:dist based on the isoform-level expression profiles.

https://cdn.elifesciences.org/articles/70779/elife-70779-table2-data2-v1.xlsx
Table 2—source data 3

Tissue significance estimated by eDESE:gene based on the gene-level eQTLs of each brain region.

https://cdn.elifesciences.org/articles/70779/elife-70779-table2-data3-v1.xlsx
Table 2—source data 4

Tissue significance estimated by eDESE:isoform based on the isoform-level eQTLs of each brain region.

https://cdn.elifesciences.org/articles/70779/elife-70779-table2-data4-v1.xlsx

In contrast, we predicted schizophrenia-associated tissues based on the gene-level and isoform-level eQTLs of Muscle Skeletal and Skin Sun Exposed Lower Leg, whose sample sizes were bigger than all the other tissues in GTEx(v8). The prediction results of Muscle Skeletal and Skin Sun Exposed Lower Leg showed that neither of them was predicted to be significant tissues of schizophrenia by using their gene-level and isoform-level eQTLs (see details in Supplementary file 1e), which demonstrated, at least partly, our assumption. Thus, our results showed the utility of integrating the three models of eDESE for optimizing the phenotype-associated tissues.

The comparison of eDESE:isoform versus eDESE:gene and S-PrediXcan

eDESE:isoform can identify more potential susceptibility genes

Our simulation study demonstrated that association analysis based on the improved ECS with the isoform-level eQTLs was more powerful than with the gene-level eQTLs. We further tested this finding in the real data and identified the potential susceptibility genes for schizophrenia using the eDESE:isoform and eDESE:gene, respectively. We found the number of potential susceptibility genes identified by eDESE:isoform was larger than that of eDESE:gene and S-PrediXcan under the same adjustment filter cutoff (Figure 7a, see details in Supplementary file 1fg and h). We further combined the potential susceptibility genes of the five optimized brain regions identified by S-PrediXcan, eDESE:gene and eDESE:isoform, respectively. Still, we found that the susceptibility genes exclusively identified by eDESE:isoform were the most among the three models (Figure 7b).

Comparison of the potential susceptibility genes identified by S-PrediXcan, eDESE:gene, and eDESE:isoform.

(a) The bar plot shows the count of potential susceptibility genes in each of the five optimized brain regions. (b) The Venn diagram shows the count of the overlapped and unique genes identified by S-PrediXcan, eDESE:gene and eDESE:isoform in the five optimized brain regions. See also Figure 7—source data 1 and Figure 7—source data 2.

Figure 7—source data 1

The count of potential susceptibility genes in each of the five optimized brain regions.

https://cdn.elifesciences.org/articles/70779/elife-70779-fig7-data1-v1.xlsx
Figure 7—source data 2

The potential susceptibility genes identified by S-PrediXcan, eDESE:gene and eDESE:isoform in the five optimized brain regions.

https://cdn.elifesciences.org/articles/70779/elife-70779-fig7-data2-v1.xlsx

Potential susceptibility genes identified by eDESE:isoform were supported by more published studies

Then we searched the PubMed database with the combined susceptibility gene set (i.e. 391 genes for S-PrediXcan, 633 genes for eDESE:gene, and 854 genes for eDESE:isoform, Figure 7b) of all five optimized brain regions. We found that 135 genes for S-PrediXcan, 170 genes for eDESE:gene and 247 genes for eDESE:isoform had at least one search hit which reported the associations of these genes with schizophrenia in PubMed database, respectively (see details in Supplementary file 1ij and k).

We also found 138 of the 524 (26.3%) potential susceptibility genes exclusively predicted by eDESE:isoform each had a least one search hit. Moreover, we found that 19 of the 524 (3.6%) potential susceptibility genes each had at least 10 supported papers in PubMed (Table 3, see details in ). Interestingly, TCF4 (transcription factor 4), RGS4 (regulator of G protein signaling 4) and RANGAP1 (Ran GTPase activating protein 1) were reported by more than 100 papers. RGS4 is reported to be biased expressed in brain (O’Leary et al., 2016). TCF4 and RANGAP1 are broadly expressed in the brain, and TCF4 may play an important role in nervous system development (O’Leary et al., 2016).

Table 3
The important examples of potential susceptibility genes exclusively predicted by eDESE:isoform.
Gene name# of hits in PubMed
RGS4> 100
TCF4> 100
RANGAP1> 100
GRIA180
GRM376
TSPO39
TPH235
FEZ131
ZDHHC824
VRK223
KCNN320
NCAM120
MIP15
SLC39A814
DLG114
BDNF-AS13
FGA13
ADRA1A12
MAPT10
Table 3—source data 1

The PubMed search hits of the unique potential susceptibility genes of schizophrenia identified by eDESE:isoform (compared with S-PrediXcan and eDESE:gene).

https://cdn.elifesciences.org/articles/70779/elife-70779-table3-data1-v1.xlsx

Furthermore, we applied the Hetionet (v1.0) (Himmelstein et al., 2017), which encodes knowledge from millions of biomedical studies to connect diseases, genes, anatomies and more, to investigate the above top associations. We set the source node as the susceptibility gene and the target node as schizophrenia using the ‘Connectivity search’ function. We found that RGS4 and RANGAP1 both had multiple significant meta path types indicating their potential associations and mechanisms associated with schizophrenia.

Potential susceptibility genes identified by eDESE:isoform were enriched with more biologically sensible GO terms

Next, we performed the GO enrichment analysis and found that the potential susceptibility genes identified by eDESE guided by the eQTLs (eDESE:gene and eDESE:isoform) had more biologically sensible GO enrichment terms than S-PrediXcan (Table 4). In addition, the GO enrichment results also showed that the potential susceptibility genes identified by eDESE:isoform were enriched with more neuronal, dendritic or synaptic signaling-related biological process GO terms than S-PrediXcan and eDESE:gene.

Table 4
The GO enrichment terms of the potential susceptibility genes in each optimized brain region identified by S-PrediXcan, eDESE:gene and eDESE:isoform.
Tissue name*S-PrediXcaneDESE:geneeDESE:isoform
Brain-Anterior cingulate cortex (BA24)-Regulation of gap junction assembly (BP)Potassium ion transmembrane transporter activity (MF); potassium: chloride symporter activity (MF); nervous system development (BP); generation of neurons (BP); neurogenesis (BP)
Brain-CerebellumIon binding (MF); cation binding (MF); metal ion binding (MF); intracellular organelle (CC)Dendrite (CC); dendritic tree (CC); neuron projection (CC); postsynapse (CC); synapse (CC)Voltage-gated calcium channel activity involved in cardiac muscle cell action potential (MF); nervous system development (BP)
Brain-Frontal Cortex (BA9)Intracellular organelle (CC); intracellular membrane-bounded organelle (CC)Somatodendritic compartment (CC); dendrite (CC); dendritic tree (CC); synaptic vesicle membrane (CC); exocytic vesicle membrane (CC)-
Brain-HippocampusIon binding (MF)-Postsynaptic density (CC); asymmetric synapse (CC)
Brain-Spinal cord (cervical c-1)-High voltage-gated calcium channel activity (MF); voltage-gated calcium channel activity involved in AV node cell action potential (MF); regulation of B cell tolerance induction (BP); positive regulation of B cell tolerance induction (BP); L-type voltage-gated calcium channel complex (CC)Nitrogen compound transport (BP); organelle (CC)
  1. *

    MF: Molecular Function terms of GO. BP: Biological Process terms of GO. CC: Cellular Component terms of GO.

We further performed the GO enrichment analysis based on the combined gene lists of all five optimized brain regions for S-PrediXcan, eDESE:gene and eDESE:isoform. However, we found that the 55 genes commonly identified by the three models were enriched with no GO term. Furthermore, the 240 unique genes identified by S-PrediXcan were also enriched with no GO term. On the other hand, the 319 unique genes identified by the eDESE:gene were enriched with ‘integral component of synaptic vesicle membrane’ (CC) and several general GO terms. In comparison, the 524 unique genes identified by eDESE:isoform were enriched with a considerable number of GO terms, in which a few neuronal, dendritic or synaptic signaling-related GO terms were found (Supplementary file 1l).

Potential susceptibility genes identified by eDESE:isoform were significantly enriched in a biologically sensible consensus module in the brain weighted gene co-expression network

We then tested the enrichment of the potential susceptibility genes in the consensus modules of the brain weighted gene co-expression network. We found that the potential susceptibility genes identified by eDESE:gene and eDESE:isoform based on the gene-level and isoform-level eQTLs of Brain-Cerebellum were both enriched with a brain consensus module (colored ‘turquoise’) with statistical significance, that is, adjusted p-value = 0.0046 and 0.0061, respectively. Besides, the potential susceptibility genes identified by eDESE:gene based on the gene-level eQTLs of Brain-Frontal Cortex (BA9) were also enriched (adjusted p-value = 0.024) with the consensus module colored ‘turquoise’. However, no enriched consensus module was found for the potential susceptibility genes identified by S-PrediXcan. Moreover, we found that the ‘turquoise’ consensus module contained 7,726 genes and was enriched with plenty of neuronal and synaptic signaling-related GO terms (Figure 8, see details in Supplementary file 1m).

The GO enrichment terms of the genes in the consensus module (colored ‘turquoise’).

The x-axis label represents the top (≤10) significant GO enrichment terms (in MF, BP, and CC). The y-axis label represents the negative log10 of the adjusted p-value of each term. See also Figure 8—source data 1 and Figure 8—source data 2.

Figure 8—source data 1

Genes in the consensus module (colored "turquoise").

https://cdn.elifesciences.org/articles/70779/elife-70779-fig8-data1-v1.xlsx
Figure 8—source data 2

Enrichment results of the genes in the consensus module (colored ‘turquoise’).

https://cdn.elifesciences.org/articles/70779/elife-70779-fig8-data2-v1.xlsx

eDESE:isoform can predict the potential susceptibility isoforms of corresponding phenotype-associated tissues

As shown in Figure 7b, 55 genes were collectively predicted to be susceptible to schizophrenia by S-PrediXcan, eDESE:gene, and eDESE:isoform. As eDESE:isoform could output susceptibility gene-isoform pairs for corresponding tissues, we further got the corresponding susceptibility isoforms of the 55 genes (see details in Supplementary file 1n). Interestingly, we found that the number or type of susceptibility isoforms for a gene varied greatly in the different optimized brain regions. For example, NAGA and CHRNA2 were reported to be associated with schizophrenia by three and seven research papers in the PubMed database, respectively. ENST00000407991 of CHRNA2 was significantly associated with schizophrenia only in Brain-Cerebellum, while ENST00000396398 of NAGA was significantly associated with schizophrenia in all the five optimized brain regions. We also found that different isoforms of the same gene were predicted to be significantly associated with schizophrenia in the different optimized brain regions. For example, SNX19 was reported to be associated with schizophrenia by seven research papers. ENST00000527116 of SNX19 in Brain-Anterior cingulate cortex (BA24), ENST00000528555 of SNX19 in Brain-Cerebellum, ENST00000265909 of SNX19 in Brain-Frontal Cortex (BA9) and Brain-Hippocampus, and ENST00000526579 of SNX19 in Brain-Spinal cord (cervical c-1) were significantly associated with schizophrenia, respectively.

The above comparisons suggested that incorporating isoform-level eQTLs can help eDESE predict more potential susceptibility genes than gene-level eQTLs and S-PrediXcan in each optimized brain region. Our results further pointed that eDESE:isoform could help find some novel, biologically sensible susceptibility genes which S-PrediXcan and eDESE:gene cannot find. Moreover, we also found that the potential susceptibility genes identified by eDESE were enriched with a consensus module in the brain weighted gene co-expression network, which was significantly enriched with plenty of biologically sensible GO terms. Based on the isoform-level eQTLs of each phenotype-associated brain region, eDESE:isoform can also help gain insight into the potential susceptibility isoforms. Thus, our results revealed the potential advantages of eDESE:isoform, at least partly, over eDESE:gene and S-PrediXcan.

The druggability of the potential susceptibility genes

Since drug target genes with genetic support are twice or as likely to be approved than target genes with no known genetic associations (King et al., 2019; Nelson et al., 2015), we searched the DrugBank database (Wishart et al., 2018) and found that seven potential susceptibility genes identified by eDESE in total (0.976% for eDESE:dist, 0.632% for eDESE:gene and 0.703% for eDESE:isoform) were the target genes of multiple antipsychotics (Table 5). Several popular target genes, such as DRD2 and ADRA1A, were identified by different eDESE models. Besides, we found that three genes (0.485%) of the 619 potential susceptibility genes identified by MAGMA were also the target genes of multiple antipsychotics. One gene (0.256%) of the 391 potential susceptibility genes identified by S-PrediXcan was the target gene of an atypical antipsychotic (i.e. paliperidone). The results suggested that the three models of eDESE could complement each other to identify more susceptibility genes which could be the target genes of the therapeutic drugs.

Table 5
The target genes of the antipsychotics predicted as the potential susceptibility genes by MAGMA, S-PrediXcan, and eDESE.
Target geneModels
DRD2eDESE:dist & eDESE:gene & eDESE:isoform & MAGMA
ADRA1AeDESE:isoform
CHRM3eDESE:gene & eDESE:isoform
CHRM4eDESE:gene & eDESE:isoform& MAGMA
OPRD1eDESE:dist
GABRDeDESE:isoform
CYP2D6eDESE:gene & eDESE:isoform& MAGMA & S-PrediXcan

To further investigate the drug-gene interactions of the potential susceptibility genes, we searched the Drug Gene Interaction database (DGIdb v4.2.0) (Freshour et al., 2021) and kept the drug-gene interaction terms with at least one supported PubMed paper. After the filtration, we kept 30,072 unique drug-gene interaction terms and found 679 unique drug-gene interaction terms for 34 antipsychotics (see details in Supplementary file 1o). Then we put the combined potential susceptibility gene list (identified by MAGMA, S-PrediXcan, eDESE:dist, eDESE:gene and eDESE:isoform, respectively) into DGIdb to investigate if the ‘antipsychotic’-‘potential susceptibility gene’ interaction terms were enriched in the known drug-gene interaction database, that is, DGIdb. As shown in Table 6, we found that ‘antipsychotic’ - ‘potential susceptibility gene’ interaction terms identified by MAGMA and the three models of eDESE were all significantly enriched in DGIdb. We also found 452 drug-gene interaction terms for the susceptibility gene identified by S-PrediXcan. However, only 12 ‘antipsychotic’- ‘potential susceptibility gene’ interaction terms were found (hypergeometric distribution test p-value = 0.33).

Table 6
The enrichment of drug-gene interaction terms in DGIdb for the susceptibility genes identified by MAGMA, S-PrediXcan and eDESE.
Models# of antipsychotics-gene interaction terms# of total drug-gene interaction termsEnrichment p*
MAGMA579371.62e-11
S-PrediXcan124520.33
eDESE:dist342791.56e-15
eDESE:gene569681.65e-10
eDESE:isoform701,1048.74e-15
  1. *

    Enrichment p denotes the p-value of the hypergeometric distribution test. See also Table 6—source data 1, Table 6—source data 2, Table 6—source data 3, Table 6—source data 4, and Table 6—source data 5.

Table 6—source data 1

The drug-gene interaction term results of the potential susceptibility genes of schizophrenia identified by MAGMA in DGIdb.

https://cdn.elifesciences.org/articles/70779/elife-70779-table6-data1-v1.xlsx
Table 6—source data 2

The drug-gene interaction term results of the potential susceptibility genes of schizophrenia identified by S-PrediXcan in DGIdb.

https://cdn.elifesciences.org/articles/70779/elife-70779-table6-data2-v1.xlsx
Table 6—source data 3

The drug-gene interaction term results of the potential susceptibility genes of schizophrenia identified by eDESE:dist in DGIdb.

https://cdn.elifesciences.org/articles/70779/elife-70779-table6-data3-v1.xlsx
Table 6—source data 4

The drug-gene interaction term results of the potential susceptibility genes of schizophrenia identified by eDESE:gene in DGIdb.

https://cdn.elifesciences.org/articles/70779/elife-70779-table6-data4-v1.xlsx
Table 6—source data 5

The drug-gene interaction term results of the potential susceptibility genes of schizophrenia identified by eDESE:isoform in DGIdb.

https://cdn.elifesciences.org/articles/70779/elife-70779-table6-data5-v1.xlsx

Moreover, we investigated the potential druggability of the potential susceptibility genes. Among the 42 potentially druggable gene categories in DGIdb, we found that the top potentially druggable category for the potential susceptibility genes identified by MAGMA, S-PrediXcan and eDESE all were the “DRUGGABLE GENOME”. The number of “DRUGGABLE GENOME” genes were 168 (27.1%) for MAGMA, 86 (22.0%) for S-PrediXcan, 62 (30.2%) for eDESE:dist, 136 (21.5%) for eDESE:gene and 207 (24.2%) for eDESE:isoform.

Taken together, compared with MAGMA and S-PrediXcan, our results showed that the potential susceptibility genes identified by eDESE were more likely to be the target genes of therapeutic drugs. Besides, the application of eQTLs (especially the isoform-level eQTLs) could aid eDESE in identifying more potentially druggable genes.

Discussion

In this study, we proposed a multi-strategy conditional gene-based association framework, eDESE, based on a new effective chi-squared statistic (ECS) to identify the potential susceptibility tissues, genes, and isoforms for complex diseases. Compared with the unconditional association test based on the new ECS and likelihood ratio test, the conditional association test based on the new ECS showed a lower type I error rate and higher statistical power. Except the improved ECS, eDESE has another advantage of using three mapping strategies, that is, mapping based on physical position and the gene-level and isoform-level variant-gene eQTLs associations, to perform the gene-based association analysis. We implemented the three mapping strategies in corresponding three conditional gene-based association models, that is, eDESE:dist, eDESE:gene, and eDESE:isoform.

eDESE:dist and conventional MAGMA both map variants to genes based on their physical distance to perform gene-based association analysis. eDESE:dist differs from MAGMA mainly in that the former is a conditional gene-based association analysis based on the improved ECS. Although eDESE:dist predicted a smaller size of potential susceptibility genes than MAGMA, more than half of the genes identified by eDESE:dist were also identified by MAGMA. The overlapped potential susceptibility genes (accounting for 51.2% in eDESE:dist and 17.0% in MAGMA) were enriched with plenty of neuronal- or synaptic signaling-related GO terms. Similar results of a recent large-scale GWAS were obtained by gene-set analyses (Trubetskoy et al., 2022; Legge et al., 2021). The comparison of MAGMA and eDESE:dist revealed the advantage of conditional association analysis, that is, removing the redundantly associated genes. In addition, our results showed that the unique potential susceptibility genes identified by eDESE:dist were enriched with more biologically sensible GO terms than MAGMA. Moreover, the proportion of the susceptibility genes identified by eDESE:dist supported by research papers were more than that of MAGMA. Besides, eDESE:dist could produce a potential susceptibility tissue list for complex diseases because of the involvement of gene selective expression analysis.

Nevertheless, eDESE:dist might omit some distal but important variant-gene associations. We thus expanded eDESE by using eDESE:gene and eDESE:isoform. Using a simulation study, we firstly demonstrated that isoform-level eQTLs were more powerful than gene-level eQTLs in association analysis based on the improved ECS. Then taking schizophrenia as an example, we predicted the potential susceptibility tissues by integrating the results of eDESE:dist, eDESE:gene and eDESE:isoform based on the principle of ‘wisdom of the crowds’ and finally optimized five brain regions. Interestingly, four of the five optimized brain regions (i.e. Brain-Anterior cingulate cortex (BA24), Brain-Cerebellum, Brain-Frontal Cortex (BA9), and Brain-Hippocampus) were also reported by the Schizophrenia Working Group of the Psychiatric Genomics Consortium, in which the genes with high relative specificity for bulk expression were strongly enriched with schizophrenia associations (Trubetskoy et al., 2022).

Furthermore, we compared the potential susceptibility genes identified by eDESE:isoform with eDESE:gene and S-PrediXcan, based on the eQTLs of the five optimized brain regions, to demonstrate its potential gains and insights in the biology of schizophrenia. We found that eDESE:isoform could identify more susceptibility genes supported by the published papers and enriched with biologically sensible GO biological process terms. Interestingly, we also found susceptibility genes identified by eDESE:isoform in Brain-Cerebellum and eDESE:gene in both Brain-Cerebellum and Brain-Frontal Cortex (BA9), to be enriched with the same consensus module, which was stunningly enriched with plenty of neuronal- or synaptic signaling-related GO terms. Moreover, to our best knowledge, eDESE:isoform is the first conditional gene-based association model to produce a list of phenotype-associated isoforms (or transcripts) for complex diseases. The comparisons among eDESE:isoform, eDESE:gene, and S-PrediXcan revealed that our model, especially the eDESE:isoform, can gain more potential insights for schizophrenia biology.

To further investigate if eDESE can help gain more insights into the gene druggability, we compared the drug-gene interaction terms and potentially druggable category of the susceptibility genes identified by MAGMA, S-PrediXcan and eDESE. Seven susceptibility genes identified by eDESE were found to be the target genes of multiple FDA-approved antipsychotics, and six (85.7%) of the seven target genes were found by eDESE:isoform. On the other hand, only one and three susceptibility genes identified by S-PrediXcan and MAGMA were found to be the target genes of several FDA-approved antipsychotics. In addition, the ‘antipsychotics’-‘susceptibility gene’ interactions terms for MAGMA and eDESE were both significantly enriched in the known drug-gene interaction database (DGIdb). Moreover, we found that the number of potential susceptibility genes identified by eDESE:isoform and denoted as ‘DRUGGABLE GENOME’ in DGIdb, were the most among MAGMA, S-PrediXcan, and eDESE. The comparison revealed that the druggablilty of the potential susceptibility genes identified by eDESE, especially the eDESE:isoform, provided more credible supports for the utility of eDESE.

Our framework might have three potential applications. First, eDESE can be used to predict the potential susceptibility genes and isoforms for other complex diseases. Second, eDESE can help optimize the more biologically sensible phenotype-associated tissues for other complex diseases. Third, the expression profiles used by eDESE can be replaced with other profiles, such as drug-induced gene expression profiles, to investigate the potential drug mechanism of action (MOA).

The present study was also limited by several factors. First, the moderate sample size in GTEx (v8) might lead the gene/isoform-level eQTLs to be underpowered. Future genetic studies based on the increased sample sizes might alleviate this problem. Second, the size of the susceptibility genes identified by the eDESE:gene and eDESE:isoform was a little larger than that of the conventional studies. The main reasons might lie in the gene-level and isoform-level eQTLs selected based on a lenient threshold (p-value < 0.01) to involve more eQTLs in the association analysis, which was taken as the remedy against the underpowered eQTLs. As shown in the present study, conventional MAGMA also identified 619 significant genes. The comparison of MAGMA and eDESE showed the advantages of eDESE in GO enrichment analysis, text-mining analyses and druggability. Although these potential susceptibility genes identified by eDESE lacked systematically experimental validation, we shared these genes in Supplementary file 1ag and h and encouraged the follow-up studies to evaluate their functions and roles in the development of schizophrenia. Future studies based on increased sample sizes with a stricter threshold to select the eQTLs can reduce the potential susceptibility gene count, and this is also a topic for our future exploration. Third, the performance of conditional association analysis for fine-mapping can be greatly influenced by the gene orders of entering the analysis. Although the improved effective chi-squared test with eDESE can work well in the real data of schizophrenia, integrating some non-conditional fine-mapping methods, such as SuSiE, with eDESE is worth trying. Fourth, the optimal c value of the effective chi-squared test is still empirical although we derived its range and relevant factors. The optimal c value can be improved to be better suited for other specific application scenarios.

In conclusion, in this study, we proposed a new statistical framework to predict potential susceptibility genes for complex diseases based on the GWAS summary statistics and three different variant-gene mapping strategies. The application of our framework to schizophrenia revealed many novel susceptible and druggable candidate genes. Besides, our results suggested that the usage of isoform-level eQTLs can be an important supplement for the conventional gene-based approaches. The framework was packaged and implemented in our integrative platform KGGSEE. We hope our framework can facilitate researchers to gain more insights into the susceptibility genes, isoforms and tissues for complex diseases.

Materials and methods

The new effective chi-squared statistics (ECS) for conditional gene-based association analysis

Request a detailed protocol

The effective chi-squared statistics, ECS (Li et al., 2019), was improved by using a new correlation matrix of chi-squared statistics, which had two methodological advances to address the potential inflation issue, that is, a type I error-controlled correlation matrix of the observed chi-squared statistics and a non-negative least square solution for the independent chi-squared statistics. Suppose that gene set A contains n loci, to calculate the association p-value of another physically nearby gene (containing m loci) by conditioning on A, the first step of the conditional analysis was to produce the effective chi-squared statistics for A (n loci) and all the genes (n + m loci in total). Each locus had a p-value for phenotype association in the GWAS. The p-values can be converted to corresponding chi-squared statistics with the degree of freedom 1. According to Li et al., 2019, each locus could be assumed to have a virtually independent chi-squared statistic. An observed marginal chi-squared statistic of a locus was equal to the summation of its virtually independent chi-squared statistic and the weighted virtually independent chi-squared statistic of the nearby loci. The weight was related to the correlation of chi-squared statistics, a key parameter of the analysis. The correlation of chi-squared statistics between two loci was approximated by the absolute value of genotypic correlation to the power of c, that is, |r|c. We derived that exponent c ranged from 1 to 2, corresponding to different noncentrality parameters of a non-central chi-squared distribution (See the derivation in the third section of Materials and methods). The n virtually independent chi-squared statistics of the gene set A could be approximated by a linear transformation of the n observed chi-squared statistics by Formula (1),

(1) [x´12d1......x´n2dn][1...|r1,n|c...1...|rn,1|c...1]1×[x121......xn21]

where x´n2(0) , dn ( > 0), xn2 and |ri,j| denoted a virtually independent chi-squared statistic, degree of freedom of the virtually independent chi-squared statistic, an observed chi-squared statistic and the absolute value of the LD correlation coefficient (approximated by genotypic correlation), respectively. The effective chi-squared statistic S´n with the degree of freedom d´n of the n loci was then obtained by Formula (2):

(2) {S´n=Σi=1nx´i2d´n=Σi=1ndi

The effective chi-squared statistics (S´n+m) and degree of freedom (d´n+m) of the n + m loci were calculated in the same way. The effective chi-squared statistics of the m loci conditioning on the n loci were approximated by Formula (3),

(3) S´m|n=S´n+mS´n

with the degree of freedom d´m|n=d´n+md´n . Because d´m|n was no longer an integer, we used the Gamma distribution to calculate the p-values. Given the above statistics and degree of freedom, the p-value was equal to F(xS´m|n2;d´m|n2,2) , where the F(x) function was the cumulative distribution function of a Gamma distribution.

Because the virtually independent chi-squared statistics and degree of freedom were expected to be larger than 0, we adopted a sequential coordinate-wise algorithm to approximate them (Franc et al., 2005). This algorithm avoided unstable solutions in Formula (1) due to stochastic errors in the correlation matrix and observed chi-squared statistics.

Choose the favorable c value for the correlation matrix of chi-squared statistics

Request a detailed protocol

After multiple approximations, the analytic solution for the exponent c in Formula (1) was still difficult to obtain. We proposed a grid search algorithm to find a favorable value of exponent c to control the type I error rates of ECS. The type I error rate was examined by the departure between the obtained and the theoretical (under the uniform distribution) top 1% p-values given a c value, measured as mean log fold change (MLFC) (Tokheim et al., 2016). In the grid search process, we increased c from 1.00 to 2.00 by an interval of 0.01. The c value leading to the minimal MLFC was defined as the favorable c value. We considered in total 84 parameter settings, i.e., a combination of three different sample sizes (10,000, 20,000 and 40,000) and 14 different variant number (10, 30, 50, 80, 100, 125, 150, 200, 250, 300, 400, 500, 800, and 1000) for both binary and continuous traits, respectively. For a parameter setting, 40,000 datasets were simulated and used to produce p-values to determine the favorable c value. A region on chromosome 2 [chr2: 169,428,016–189,671,923] was randomly drawn for the simulation. The allele frequencies and LD structure of variants in the European panel of the 1000 Genomes Project were used as a reference to simulate genotype data by the HapSim algorithm (Montana, 2005). Each subject was randomly assigned a phenotype value under the null hypothesis according to the Bernoulli or Gaussian distribution. The Wald test, which encoded the major and minor allele as 0 and 1 under either logistic regression or linear regression, was used to produce the association p-value at each variant. The p-values of the variants were then analyzed by the effective chi-squared test for the gene-based association analysis.

Approximate the correlation of chi-square statistics under the alternative Hypothesis

Request a detailed protocol

Let two normal random variables X N(μ1,σ12) and Y N(μ2,σ22) have covariance a. Note that a squared normal random variable has non-central chi-square distribution, and the squared mean of the variable is called noncentrality parameter. The two variables can also be factorized as X=μ1+σ1U, Y=μ2+σ2(ρU+1ρ2V) where U,ViidN(0,1) and ρ=a/(σ1σ2) .

Then we can calculate the correlation of the two non-central chi-square variables X2 and Y2 by the factorized variables, cor(X2,Y2)=2μ1μ2a+a22μ12σ12+σ142μ22σ22+σ24 . Suppose X is the z-score of the true casual variant, and Y is the z-score of a non-functional variant in LD (coefficient r) with the causal variant. One can assume μ2=rμ1 , σ12=1 , σ22=1 and a = r. Therefore, the correlation of X2 and Y2 can be simplified as, c=2r2μ12+r22μ12+12μ12r2+1 .

Under the null hypothesis, μ1=0 then cor(X2,Y2)=r2 . Under the alternative hypothesis, μ12 is the noncentrality parameter (λ). According to Sham and Purcel (Sham and Purcell, 2014), the noncentrality parameter can be approximated by the following Formula for a quantitative phenotype, μ12=λ=N2β2p(1p)Residualvarianceofphenotype, where N, p and β are the sample sizes, allele frequency and the regression coefficient (corresponding to effect size), respectively. This Formula can be extended for qualitative phenotype under a liability threshold model. N is very large for a large sample, and the noncentrality parameter will be large for common variants. Therefore, the correlation of X2 and Y2 converges to |r| as λ,cor(X2,Y2)= 2|r|2+1μ122+1r2μ12+r22μ12+12r2μ12+1= 2|r|2+1λ2+1λr2+r22λ+12λr2+1|r|.

Overall, the correlation between the two (non-central) chi-square ranges can be approximated as |r|c', in which the c' ranges from 1 to 2. In simulation studies, we also showed that as the relative risk ratio increased to 1.5 in a moderate sample of 1000 cases and 1000 controls, the correlation of X2 and Y2 became close to |r| (See Figure 9).

The effect sizes of variants to the correlation of chi-squares.

(a) Under the additive model; (b) Under the multiplicative model. The allele frequencies are assigned randomly.

The conditional gene-based association analysis for genome-wide association study

Request a detailed protocol

In a GWAS, the p-values of all genes in the unconditional gene-based association analysis were firstly calculated using the above effective chi-squared statistics. Then, for a given p-value cutoff, the significant genes were extracted and subjected to the conditional gene-based association analysis. Multiple significant genes in an LD block were conditioned one by one in a pre-defined order. The order of the gene was defined according to the unconditional p-value of the gene. Here the genes within 5 Mb were assigned into the same LD block. The conditional p-value of the first gene was defined as its unconditional p-value. The conditional p-value of the second gene was obtained by conditioning on the first gene, and that of the third gene was obtained by conditioning on the first two genes. The conditional p-values of subsequent genes were calculated according to the same procedure.

Investigate the type I error and power of the conditional gene-based association analysis

Request a detailed protocol

Independent computer simulations based on a different reference population (i.e., EAS) and genomic regions were performed to investigate type I error and the power of the conditional gene-based association analysis. To approach the association redundancy pattern in realistic scenarios, we used the real genotypes and simulated phenotypes. The high-quality genotypes of 2,507 Chinese subjects from a GWAS were used (Kung et al., 2010), and phenotypes of subjects were simulated according to the genotypes under an additive model. Given total variance explained by n independent variants, Vg, the effect of an allele at a bi-allelic variant was calculated by a=Vg/[i=1n2PAi(1PAi)] , where PAi was the frequency of alternative alleles. The total expected effect A of a subject was equal to a*[the number of alternative alleles of all the n variants]. Each subject’s phenotype was simulated by P = A + e, where e was sampled from a normal distribution N(0, 1-Vg). We randomly sampled three pairs of genes, i.e., SIPA1L2 vs. LOC729336, CACHD1 vs. RAVER2, and LOC647132 vs. FAM5C, representing three scenarios where the nearby gene (i.e., the first gene) had similar (SIPA1L2 vs. LOC729336), larger (CACHD1 vs. RAVER2) and smaller (LOC647132 vs. FAM5C) variant number than the target gene (i.e., the second gene) in terms of SNP number, respectively. The target gene had no QTLs in the type I error investigation, while the nearby gene had one or two QTLs. In the investigation of the statistical power, both the target and nearby genes had QTLs.

The likelihood ratio test based on the linear regression was adopted for power comparison to perform the conditional gene-based association analysis with raw genotypes. In the full model, genotypes of all SNPs encoded as 0, 1, or 2 according to the number of alternative variants entered the regression model as explanatory variables. In the subset model, the SNPs of the nearby genes entered the regression model. The calculation of the likelihood ratio test was performed according to the conventional procedure. The R packaged "lmtest" (version 0.9.37) was adopted to perform the likelihood ratio test.

Investigate the power and type I error of gene-level eQTLs and isoform-level eQTLs in gene-based association analysis

Request a detailed protocol

The same region on chromosome 2 [chr2: 169,428,016–189,671,923] was considered for the simulation. In the EUR panel of 1000 Genomes Project (Auton et al., 2015), this region contains 1,600 common variants (MAF >0.05). Genotypes of the variants were simulated given allelic frequencies and LD correlation matrix according to the HapSim algorithm (Montana, 2005). Phenotypes were simulated under a polygenic model of random effect (Bulik-Sullivan et al., 2015). According to severe LD pruning (r2 <0.01), eighty-two independent variants were extracted from the 1,600 variants. The SNPs' genotypes (s) contributing to the phenotypes were then standardized as, g=s-2q/2q1-q , where q was the allele frequency of the alterative allele. Phenotypes were simulated under a polygenic model of random effect (Bulik-Sullivan et al., 2015). We assumed that 40% of the independent causal variants (mX) regulated gene expression (total heritability hX2). The expression of a gene (X) was simulated according to Formula (4):

(4) X=i=1mXgiβX,i+ϵX

Where βX,iN(0,hX2/mX) and ϵXN(0,1hX2).

The gene expression then contributed δ to a phenotype (Y). The phenotype value was simulated according to the Formula (5):

(5) Y=δX+i=1mYgiβY,i+ϵY

where Y was a continuous phenotype, βY,i N(0,hY2/mY) and ϵY N(0,1hY2δ2) . For a binary phenotype, a cutoff t was set according to a given disease prevalence K under a standard normal distribution and the liability threshold model (Gillett et al., 2018). Subjects with simulated Y values ≥ t were set as patients, and others were set as normal controls.

When a gene had multiple isoforms, we assumed that one of the isoforms was associated with phenotype, and we simulated the expression values of the isoform according to Formula (5). The expression values of the remaining isoforms were simulated by the standard normal distribution N(0,1). The expression profile of a gene with multiple isoforms was averaged by all the isoforms belonging to the gene. The gene-level eQTLs and isoform-level eQTLs were examined by the Wald test under the linear regression model. The variant-phenotype association analysis was performed based on the conventional association analysis procedure, and the statistical significance cutoff was p-value < 0.001. For each parameter setting, t (an integer) datasets were simulated, and the power and type I error were estimated by m/t, in which m was the number of datasets with significant p-values for testing δ.

Genome-wide association study of schizophrenia

Request a detailed protocol

The schizophrenia GWAS included 53,386 cases and 77,258 controls of European ancestry samples (hg19 assembly). Genotypes in the CEU panel from the 1000 Genomes Project were used to correct for the relatedness of the summary statistics. The variants in the major histocompatibility complex (MHC) region, i.e., chr6:27,477,797–34,448,354, were excluded in the present study because of the high polymorphism. Detailed descriptions of population cohorts, quality control methods and association analysis methods can be found in reference (Trubetskoy et al., 2022).

The Genotype-Tissue Expression (GTEx) Project

Request a detailed protocol

The GTEx project (release v8, RRID:SCR_013042) created a resource including whole-genome sequence data and RNA sequencing data from ~900 deceased adult donors (Consortium, 2020). Four tissues or cell types (i.e., whole blood, stomach, pancreas and pituitary) were filtered out in the following analyses because of their small sample sizes or weak correlations with most other tissues.

g:Profiler and Hetionet

Request a detailed protocol

All GO enrichment analyses were performed by g:Profiler (Raudvere et al., 2019). g:Profiler (RRID:SCR_006809) is based on Fisher’s one-tailed test. The statistical p-value is multiple testing-corrected. The GO enrichment analysis uses the set of all annotated protein-coding genes for Homo sapiens (Human) as background. Significant GO terms were filtered by the threshold of "Padj" < 0.05. The bar plots of GO enrichment terms were drawn based on R-4.0.3.

Hetionet (Himmelstein et al., 2017) integrates relationships among genes, compounds, diseases, and more from 29 different databases. It can help researchers refine their phenotype-gene associations by considering anatomies, biological processes, side-effects, symptoms and more.

Construct the weighted gene co-expression network for brain and perform the consensus analysis

Request a detailed protocol

The thirteen brain regions' fully processed, filtered, and normalized gene-level expression profiles in GTEx (v8) were used. Consensus network analysis and module identification were performed based on the "WGCNA" (v1.69, RRID:SCR_003302) (Langfelder and Horvath, 2008). For each dataset, WGCNA was performed to build a signed gene co-expression network following the standard procedure, and the soft-threshold was finally set as 12 after testing a series of soft-threshold powers (ranging from 2 to 20). As for constructing the block-wise consensus modules, the hierarchical cluster tree in the co-expression network was cut into gene modules using the dynamic tree cut algorithm with a minimum module size of 30 genes (Langfelder et al., 2008). The parameter of "networkCalibration" was set as "single quantile". The "consensusQuantile" and "calibrationQuantile" were both set as 0.95. The parameter of "deepSplit" was set as 3. Other parameters were used as recommended by WGCNA. The co-expression analysis and consensus clustering analysis produced eighteen consensus modules, in which the module sizes ranged from 41 to 7,726. The significant consensus modules were filtered by the threshold of "Padj" < 0.05.

Drug Gene Interaction Database (DGIdb)

Request a detailed protocol

DGIdb (v4.2.0, RRID:SCR_006608) provides a resource of genes that have the potential to be druggable and contains two main classes of druggable genome information (Freshour et al., 2021). The first class includes genes with known drug interactions, and the other includes genes that are potentially druggable according to their membership in gene categories associated with druggability.

PubMed text-mining analysis

Request a detailed protocol

To find supports from the published research, we performed a text-mining analysis based on PubMed database on August 27th, 2021 using a java script. We put each gene symbol name or its synonyms into the PubMed database with the items of “((schizophrenia[tiab]+ OR + Schizophrenia[tiab]+ OR + SCZ[tiab])+ AND + (genename[tiab])+ AND + (gene[tiab]+ OR + genes[tiab]+ OR + mRNA[tiab]+ OR + protein[tiab]+ OR + proteins[tiab]+ OR + transcription[tiab]+ OR + transcript[tiab]+ OR + transcripts[tiab]+ OR + expressed[tiab]+ OR + expression[tiab]+ OR + expressions[tiab]+ OR + locus[tiab]+ OR + loci[tiab]+ OR + SNP[tiab]))&datetype = edat&retmax = 100”. The java script output a file with the first column representing gene name, the second column representing the synonym of the gene name and the last column representing the PubMed ids of hit papers.

Identify the potentially phenotype-associated tissues of schizophrenia

Request a detailed protocol

To estimate the potentially phenotype-associated tissues, eDESE:dist, eDESE:gene and eDESE:isoform were used, respectively. The Genotypes in the EUR panel from the 1000 Genomes Project (phase 3) were downloaded from IGSR and used as reference genotype data. Three columns, that is., chromosome identifier, base-pair position and p-value in the GWAS summary statistics, were used. SNPs with minor allele frequency (MAF) less than 0.05 were excluded. Genes approved by HGNC (HGNC Database, H.G.N.C.H, 2021) were included in the following analyses. The multiple testing adjustment method was Bonferroni correction, and the cutoff for the adjusted p-value was set as p-value < 0.05. The threshold for filtering the significant phenotype-associated tissues was set as α = 0.05/50 = 1e-3. The detailed commands of eDESE to identify the potential phenotype-associated tissues are described on the KGGSEE website and the original paper (Jiang et al., 2019).

Identify gene-level and isoform-level eQTLs

Request a detailed protocol

The present study focused on the cis-eQTLs. Specifically, two files (expression profiles and corresponding genotype data file from the European ancestry subjects in GTEx v8) were put into KGGSEE to produce the gene/isoform-level eQTLs for each tissue. Two levels (gene-level and isoform-level) expression profiles of 50 tissues were downloaded from the GTEx v8 project and were normalized as GTEx did (https://gtexportal.org/home/documentationPage). Genes/isoforms were selected based on TPM >0.1 and read count ≥6 in at least 20% of all samples. Only variants with MAF ≥0.05 were included in the eQTLs identification. GTEx v8 is based on the human reference genome GRCh38/hg38. Thus, to be consistent with the GWAS results of schizophrenia (hg19 assembly), we converted the GRCh38/hg38 coordinates into hg19 by using the UCSC LiftOver (Hinrichs et al., 2006). Variants with Hardy-Weinberg disequilibrium (HWD) test p-value < 1.0e-3 were filtered out. The mapping window was defined as 1 Mb up- and downstream of the gene boundary. The covariates used in eQTLs identification include donor sex, age and death classification. The threshold for selecting the gene-level/isoform-level eQTLs was p-value < 0.01.

Estimate the potential susceptibility genes and isoforms

Request a detailed protocol

For eDESE:dist, if a variant was within ±5 kb around a gene boundary, the variant will be mapped to the gene according to a gene model, for example, RefSeqGene. For eDESE:gene and eDESE:isoform, a variant was mapped to a gene or isoform if the variant is a gene/isoform-level eQTL of the gene or isoform.

eDESE adopted the iterative procedure from DESE (Jiang et al., 2019). In the first iterative step, genes with smaller p-values generated from the unconditional association analysis (based on the improved ECS) were given higher priority to enter the following conditional gene-based association analysis. We then dealt with the three models of eDESE in different ways. For eDESE:dist and eDESE:gene, the order of a gene entering the conditional gene-based association analysis was determined by its p-value generated from the unconditional association analysis. For eDESE:isoform, assume that gene A has m isoforms. Each isoform could get a p-value generated from the unconditional association analysis, representing the overall statistical significance of all isoform-level eQTLs (simultaneously variants) associated with this isoform. If the isoform with the smallest p-value was isoform a with p-value pa among the m isoforms, we only kept the most significant isoform, that is, isoform a of gene A for the following analyses. The p-value for isoform a was adjusted to m* pa before entering the following conditional gene-based association analysis.

The second step was to compute the robust-regression z-score of each gene/isoform (see details in reference Jiang et al., 2019). The Wilcoxon rank-sum test was then performed by using the robust-regression z-score of the associated and not-associated gene/isoform set (generated by the first step) in each tissue.

In the third step, all genes/isoforms were ranked in descending order based on their tissue-selective expression score, which was computed based on the rank of this gene’s or isoform’s robust-regression z-score and the p-value of the Wilcoxon rank-sum test (generated by the second step).

In the following iteration, genes/isoforms with higher tissue-selective expression scores (generated in the third step) were given higher priority to enter the conditional gene-based association analysis (back to the first step). The above three iterative steps would not stop until the p-values of the Wilcoxon rank-sum test did not change almost. Then corresponding significant genes/isoforms and tissues were deemed to be potentially associated with the phenotype. More details about the iterative procedure can be found in the original papers (Jiang et al., 2019).

The detailed commands, input and output datasets of eDESE can be seen on the KGGSEE website. The bar plot of the comparison of potential susceptibility genes was drawn based on R-4.0.3. The venn diagram was drawn based on a web app Venny 2.1.0.

MAGMA

Request a detailed protocol

MAGMA (RRID:SCR_005757) is a popular tool for gene and generalized gene-set analysis based on the GWAS summary statistics. Here the parameters and options were used as recommended by MAGMA (v 1.09). Annotation analysis was performed based on the SNP and gene location files (hg19, build 37). The SNP location information was extracted from the GWAS summary statistics file of schizophrenia. Both gene location and reference data were downloaded from the MAGMA website. An SNP was mapped to a gene if the SNP was in the window of ±5 kb around the gene boundary (same as eDESE:dist). Then the gene analysis was performed based on the annotation results and reference data file which was created from Phase 3 of 1000 Genomes of the European population in reference to the human genome (build 37). Multiple testing was corrected by using Bonferroni correction. Significant genes were filtered by "Padj" < 0.05.

S-PrediXcan

Request a detailed protocol

S-PrediXcan is command-line based and implemented with python environment and mainly uses summary statistics. To estimate the disease-associated genes of each tissue, we prepared three input files, that is, schizophrenia GWAS summary statistics file, a transcriptome prediction model database file and a file with the covariance matrices of the SNPs within each gene model (Barbeira et al., 2018; Gamazon et al., 2015; Barbeira et al., 2021). Here, GTEx-based tissues and 1000 Genomes covariances precalculated data from the PredictDB repository were downloaded (http://predictdb.org), and the MASHR-based model based on the expression data of GTEx v8 release was used. Other options and parameters were used as recommended. Multiple testing was corrected by using Bonferroni correction. Significant genes in each tissue were filtered by "Padj" < 0.05.

Data availability statement

Request a detailed protocol

All the data used in this study are from public resources. The source data files for the main figures and tables in the manuscript have been provided and are specified in Source Data. The annotations of drug-gene interaction terms are publicly available in Drug Gene Interaction (DGIdb v4.2.0) database in https://dgidb.org. The information on FDA-approved antipsychotics was extracted from DrugBank 5.1.1, which can be freely downloaded from https://go.drugbank.com/releases/5-1-1/downloads/all-full-database with a simple registration for academic users. The functional enrichment analyses were performed by g:Profiler in https://biit.cs.ut.ee/gprofiler. Hetionet v1.0 can be freely accessed at https://het.io/. Venny is in https://bioinfogp.cnb.csic.es/tools/venny/index.html. MAGMA and corresponding reference data were freely downloaded from https://ctg.cncr.nl/software/magma. S-PrediXcan was freely downloaded from https://github.com/hakyimlab/MetaXcan (hakyimlab, 2021) copy archived at swh:1:rev:cfc9e369bbf5630e0c9488993cd877f231c5d02e. The source code of eDESE (including eDESE:dist, eDESE:gene and eDESE:isoform) is implemented in KGGSEE and can be publicly available in http://pmglab.top/kggsee/#/.The custom scripts used in this study can be freely accessed at https://github.com/pmglab/eDESE (pmglab, 2021) copy archived at swh:1:rev:68fbbe429f23011f544cdd34ce09c98a2540f68b (Li and Li, 2021).

Data availability

All the data used in this study are from public resources. The source data files for the main figures and tables in the manuscript have been provided and are specified in source data. The annotations of drug-gene interaction terms are publicly available in Drug Gene Interaction database (DGIdb v4.2.0) in https://dgidb.org/. The information on FDA-approved antipsychotics was extracted from DrugBank 5.1.1, which can be freely downloaded from https://go.drugbank.com/releases/5-1-1/downloads/all-full-database with a simple registration for academic users. The functional enrichment analyses were performed by g:Profiler in https://biit.cs.ut.ee/gprofiler. Hetionet v1.0 can be freely accessed at https://het.io/. Venny is in https://bioinfogp.cnb.csic.es/tools/venny/index.html. MAGMA and corresponding reference data were freely downloaded from https://ctg.cncr.nl/software/magma. S-PrediXcan was freely downloaded from https://github.com/hakyimlab/MetaXcan (copy archived at swh:1:rev:cfc9e369bbf5630e0c9488993cd877f231c5d02e). The source code of eDESE (including eDESE:dist, eDESE:gene and eDESE:isoform) is implemented in KGGSEE and can be publicly available in http://pmglab.top/kggsee/#/.The custom scripts used in this study can be freely accessed at https://github.com/pmglab/eDESE (copy archived at swh:1:rev:68fbbe429f23011f544cdd34ce09c98a2540f68b).

The following data sets were generated
    1. Li X
    2. Jiang L
    3. Xue C
    4. Li M
    5. Mj Li
    (2021) figshare
    Gene-level eQTLs (hg19) based on GTEx v8.
    https://doi.org/10.6084/m9.figshare.16959604
    1. Li X
    2. Jiang L
    3. Xue C
    4. Li M
    5. Mj Li
    (2021) figshare
    Isoform-level eQTLs (hg19) based on GTEx v8.
    https://doi.org/10.6084/m9.figshare.16959616
The following previously published data sets were used
    1. 1000 Genomes Project Consortium
    (2015) IGSR
    ID 1000genomes. Genotype data from the 1000 Genomes Project (phase 3).

References

    1. Auton A
    2. Abecasis GR
    3. Steering committee
    4. Altshuler DM
    5. Durbin RM
    6. Abecasis GR
    7. Bentley DR
    8. Chakravarti A
    9. Clark AG
    10. Donnelly P
    11. Eichler EE
    12. Flicek P
    13. Gabriel SB
    14. Gibbs RA
    15. Green ED
    16. Hurles ME
    17. Knoppers BM
    18. Korbel JO
    19. Lander ES
    20. Lee C
    21. Lehrach H
    22. Mardis ER
    23. Marth GT
    24. McVean GA
    25. Nickerson DA
    26. Schmidt JP
    27. Sherry ST
    28. Wang J
    29. Wilson RK
    30. Production group
    31. Baylor College of Medicine
    32. Gibbs RA
    33. Boerwinkle E
    34. Doddapaneni H
    35. Han Y
    36. Korchina V
    37. Kovar C
    38. Lee S
    39. Muzny D
    40. Reid JG
    41. Zhu Y
    42. BGI-Shenzhen
    43. Wang J
    44. Chang Y
    45. Feng Q
    46. Fang X
    47. Guo X
    48. Jian M
    49. Jiang H
    50. Jin X
    51. Lan T
    52. Li G
    53. Li J
    54. Li Y
    55. Liu S
    56. Liu X
    57. Lu Y
    58. Ma X
    59. Tang M
    60. Wang B
    61. Wang G
    62. Wu H
    63. Wu R
    64. Xu X
    65. Yin Y
    66. Zhang D
    67. Zhang W
    68. Zhao J
    69. Zhao M
    70. Zheng X
    71. Broad Institute of MIT and Harvard
    72. Lander ES
    73. Altshuler DM
    74. Gabriel SB
    75. Gupta N
    76. Coriell Institute for Medical Research
    77. Gharani N
    78. Toji LH
    79. Gerry NP
    80. Resch AM
    81. European Molecular Biology Laboratory, European Bioinformatics Institute
    82. Flicek P
    83. Barker J
    84. Clarke L
    85. Gil L
    86. Hunt SE
    87. Kelman G
    88. Kulesha E
    89. Leinonen R
    90. McLaren WM
    91. Radhakrishnan R
    92. Roa A
    93. Smirnov D
    94. Smith RE
    95. Streeter I
    96. Thormann A
    97. Toneva I
    98. Vaughan B
    99. Zheng-Bradley X
    100. Illumina
    101. Bentley DR
    102. Grocock R
    103. Humphray S
    104. James T
    105. Kingsbury Z
    106. Max Planck Institute for Molecular Genetics
    107. Lehrach H
    108. Sudbrak R
    109. Albrecht MW
    110. Amstislavskiy VS
    111. Borodina TA
    112. Lienhard M
    113. Mertes F
    114. Sultan M
    115. Timmermann B
    116. Yaspo ML
    117. McDonnell Genome Institute at Washington University
    118. Mardis ER
    119. Wilson RK
    120. Fulton L
    121. Fulton R
    122. US National Institutes of Health
    123. Sherry ST
    124. Ananiev V
    125. Belaia Z
    126. Beloslyudtsev D
    127. Bouk N
    128. Chen C
    129. Church D
    130. Cohen R
    131. Cook C
    132. Garner J
    133. Hefferon T
    134. Kimelman M
    135. Liu C
    136. Lopez J
    137. Meric P
    138. O’Sullivan C
    139. Ostapchuk Y
    140. Phan L
    141. Ponomarov S
    142. Schneider V
    143. Shekhtman E
    144. Sirotkin K
    145. Slotta D
    146. Zhang H
    147. University of Oxford
    148. McVean GA
    149. Wellcome Trust Sanger Institute
    150. Durbin RM
    151. Balasubramaniam S
    152. Burton J
    153. Danecek P
    154. Keane TM
    155. Kolb-Kokocinski A
    156. McCarthy S
    157. Stalker J
    158. Quail M
    159. Analysis group
    160. Affymetrix
    161. Schmidt JP
    162. Davies CJ
    163. Gollub J
    164. Webster T
    165. Wong B
    166. Zhan Y
    167. Albert Einstein College of Medicine
    168. Auton A
    169. Campbell CL
    170. Kong Y
    171. Marcketta A
    172. Gibbs RA
    173. Yu F
    174. Antunes L
    175. Bainbridge M
    176. Muzny D
    177. Sabo A
    178. Huang Z
    179. Wang J
    180. Coin LJM
    181. Fang L
    182. Guo X
    183. Jin X
    184. Li G
    185. Li Q
    186. Li Y
    187. Li Z
    188. Lin H
    189. Liu B
    190. Luo R
    191. Shao H
    192. Xie Y
    193. Ye C
    194. Yu C
    195. Zhang F
    196. Zheng H
    197. Zhu H
    198. Bilkent University
    199. Alkan C
    200. Dal E
    201. Kahveci F
    202. Boston College
    203. Marth GT
    204. Garrison EP
    205. Kural D
    206. Lee WP
    207. Fung Leong W
    208. Stromberg M
    209. Ward AN
    210. Wu J
    211. Zhang M
    212. Daly MJ
    213. DePristo MA
    214. Handsaker RE
    215. Altshuler DM
    216. Banks E
    217. Bhatia G
    218. del Angel G
    219. Gabriel SB
    220. Genovese G
    221. Gupta N
    222. Li H
    223. Kashin S
    224. Lander ES
    225. McCarroll SA
    226. Nemesh JC
    227. Poplin RE
    228. Cold Spring Harbor Laboratory
    229. Yoon SC
    230. Lihm J
    231. Makarov V
    232. Cornell University
    233. Clark AG
    234. Gottipati S
    235. Keinan A
    236. Rodriguez-Flores JL
    237. Korbel JO
    238. Rausch T
    239. Fritz MH
    240. Stütz AM
    241. Flicek P
    242. Beal K
    243. Clarke L
    244. Datta A
    245. Herrero J
    246. McLaren WM
    247. Ritchie GRS
    248. Smith RE
    249. Zerbino D
    250. Zheng-Bradley X
    251. Harvard University
    252. Sabeti PC
    253. Shlyakhter I
    254. Schaffner SF
    255. Vitti J
    256. Human Gene Mutation Database
    257. Cooper DN
    258. Ball EV
    259. Stenson PD
    260. Bentley DR
    261. Barnes B
    262. Bauer M
    263. Keira Cheetham R
    264. Cox A
    265. Eberle M
    266. Humphray S
    267. Kahn S
    268. Murray L
    269. Peden J
    270. Shaw R
    271. Icahn School of Medicine at Mount Sinai
    272. Kenny EE
    273. Louisiana State University
    274. Batzer MA
    275. Konkel MK
    276. Walker JA
    277. Massachusetts General Hospital
    278. MacArthur DG
    279. Lek M
    280. Sudbrak R
    281. Amstislavskiy VS
    282. Herwig R
    283. Mardis ER
    284. Ding L
    285. Koboldt DC
    286. Larson D
    287. Ye K
    288. McGill University
    289. Gravel S
    290. National Eye Institute, NIH
    291. Swaroop A
    292. Chew E
    293. New York Genome Center
    294. Lappalainen T
    295. Erlich Y
    296. Gymrek M
    297. Frederick Willems T
    298. Ontario Institute for Cancer Research
    299. Simpson JT
    300. Pennsylvania State University
    301. Shriver MD
    302. Rutgers Cancer Institute of New Jersey
    303. Rosenfeld JA
    304. Stanford University
    305. Bustamante CD
    306. Montgomery SB
    307. De La Vega FM
    308. Byrnes JK
    309. Carroll AW
    310. DeGorter MK
    311. Lacroute P
    312. Maples BK
    313. Martin AR
    314. Moreno-Estrada A
    315. Shringarpure SS
    316. Zakharia F
    317. Tel-Aviv University
    318. Halperin E
    319. Baran Y
    320. The Jackson Laboratory for Genomic Medicine
    321. Lee C
    322. Cerveira E
    323. Hwang J
    324. Malhotra A
    325. Plewczynski D
    326. Radew K
    327. Romanovitch M
    328. Zhang C
    329. Thermo Fisher Scientific
    330. Hyland FCL
    331. Translational Genomics Research Institute
    332. Craig DW
    333. Christoforides A
    334. Homer N
    335. Izatt T
    336. Kurdoglu AA
    337. Sinari SA
    338. Squire K
    339. Sherry ST
    340. Xiao C
    341. University of California, San Diego
    342. Sebat J
    343. Antaki D
    344. Gujral M
    345. Noor A
    346. Ye K
    347. University of California, San Francisco
    348. Burchard EG
    349. Hernandez RD
    350. Gignoux CR
    351. University of California, Santa Cruz
    352. Haussler D
    353. Katzman SJ
    354. James Kent W
    355. University of Chicago
    356. Howie B
    357. University College London
    358. Ruiz-Linares A
    359. University of Geneva
    360. Dermitzakis ET
    361. University of Maryland School of Medicine
    362. Devine SE
    363. University of Michigan
    364. Abecasis GR
    365. Min Kang H
    366. Kidd JM
    367. Blackwell T
    368. Caron S
    369. Chen W
    370. Emery S
    371. Fritsche L
    372. Fuchsberger C
    373. Jun G
    374. Li B
    375. Lyons R
    376. Scheller C
    377. Sidore C
    378. Song S
    379. Sliwerska E
    380. Taliun D
    381. Tan A
    382. Welch R
    383. Kate Wing M
    384. Zhan X
    385. University of Montréal
    386. Awadalla P
    387. Hodgkinson A
    388. University of North Carolina at Chapel Hill
    389. Li Y
    390. University of North Carolina at Charlotte
    391. Shi X
    392. Quitadamo A
    393. Lunter G
    394. McVean GA
    395. Marchini JL
    396. Myers S
    397. Churchhouse C
    398. Delaneau O
    399. Gupta-Hinch A
    400. Kretzschmar W
    401. Iqbal Z
    402. Mathieson I
    403. Menelaou A
    404. Rimmer A
    405. Xifara DK
    406. University of Puerto Rico
    407. Oleksyk TK
    408. University of Texas Health Sciences Center at Houston
    409. Fu Y
    410. Liu X
    411. Xiong M
    412. University of Utah
    413. Jorde L
    414. Witherspoon D
    415. Xing J
    416. University of Washington
    417. Eichler EE
    418. Browning BL
    419. Browning SR
    420. Hormozdiari F
    421. Sudmant PH
    422. Weill Cornell Medical College
    423. Khurana E
    424. Durbin RM
    425. Hurles ME
    426. Tyler-Smith C
    427. Albers CA
    428. Ayub Q
    429. Balasubramaniam S
    430. Chen Y
    431. Colonna V
    432. Danecek P
    433. Jostins L
    434. Keane TM
    435. McCarthy S
    436. Walter K
    437. Xue Y
    438. Yale University
    439. Gerstein MB
    440. Abyzov A
    441. Balasubramanian S
    442. Chen J
    443. Clarke D
    444. Fu Y
    445. Harmanci AO
    446. Jin M
    447. Lee D
    448. Liu J
    449. Jasmine Mu X
    450. Zhang J
    451. Zhang Y
    452. Structural variation group
    453. Li Y
    454. Luo R
    455. Zhu H
    456. Alkan C
    457. Dal E
    458. Kahveci F
    459. Marth GT
    460. Garrison EP
    461. Kural D
    462. Lee WP
    463. Ward AN
    464. Wu J
    465. Zhang M
    466. McCarroll SA
    467. Handsaker RE
    468. Altshuler DM
    469. Banks E
    470. del Angel G
    471. Genovese G
    472. Hartl C
    473. Li H
    474. Kashin S
    475. Nemesh JC
    476. Shakir K
    477. Yoon SC
    478. Lihm J
    479. Makarov V
    480. Degenhardt J
    481. Korbel JO
    482. Fritz MH
    483. Meiers S
    484. Raeder B
    485. Rausch T
    486. Stütz AM
    487. Flicek P
    488. Paolo Casale F
    489. Clarke L
    490. Smith RE
    491. Stegle O
    492. Zheng-Bradley X
    493. Bentley DR
    494. Barnes B
    495. Keira Cheetham R
    496. Eberle M
    497. Humphray S
    498. Kahn S
    499. Murray L
    500. Shaw R
    501. Leiden University Medical Center
    502. Lameijer EW
    503. Batzer MA
    504. Konkel MK
    505. Walker JA
    506. Ding L
    507. Hall I
    508. Ye K
    509. Lacroute P
    510. Lee C
    511. Cerveira E
    512. Malhotra A
    513. Hwang J
    514. Plewczynski D
    515. Radew K
    516. Romanovitch M
    517. Zhang C
    518. Craig DW
    519. Homer N
    520. Church D
    521. Xiao C
    522. Antaki D
    523. Bafna V
    524. Michaelson J
    525. Ye K
    526. Devine SE
    527. Gardner EJ
    528. Abecasis GR
    529. Kidd JM
    530. Mills RE
    531. Dayama G
    532. Emery S
    533. Jun G
    534. Shi X
    535. Quitadamo A
    536. Lunter G
    537. McVean GA
    538. University of Texas MD Anderson Cancer Center
    539. Chen K
    540. Fan X
    541. Chong Z
    542. Chen T
    543. Witherspoon D
    544. Xing J
    545. Eichler EE
    546. Chaisson MJ
    547. Hormozdiari F
    548. Huddleston J
    549. Malig M
    550. Nelson BJ
    551. Sudmant PH
    552. Vanderbilt University School of Medicine
    553. Parrish NF
    554. Khurana E
    555. Hurles ME
    556. Blackburne B
    557. Lindsay SJ
    558. Ning Z
    559. Walter K
    560. Zhang Y
    561. Gerstein MB
    562. Abyzov A
    563. Chen J
    564. Clarke D
    565. Lam H
    566. Jasmine Mu X
    567. Sisu C
    568. Zhang J
    569. Zhang Y
    570. Exome group
    571. Gibbs RA
    572. Yu F
    573. Bainbridge M
    574. Challis D
    575. Evani US
    576. Kovar C
    577. Lu J
    578. Muzny D
    579. Nagaswamy U
    580. Reid JG
    581. Sabo A
    582. Yu J
    583. Guo X
    584. Li W
    585. Li Y
    586. Wu R
    587. Marth GT
    588. Garrison EP
    589. Fung Leong W
    590. Ward AN
    591. del Angel G
    592. DePristo MA
    593. Gabriel SB
    594. Gupta N
    595. Hartl C
    596. Poplin RE
    597. Clark AG
    598. Rodriguez-Flores JL
    599. Flicek P
    600. Clarke L
    601. Smith RE
    602. Zheng-Bradley X
    603. MacArthur DG
    604. Mardis ER
    605. Fulton R
    606. Koboldt DC
    607. Gravel S
    608. Bustamante CD
    609. Craig DW
    610. Christoforides A
    611. Homer N
    612. Izatt T
    613. Sherry ST
    614. Xiao C
    615. Dermitzakis ET
    616. Abecasis GR
    617. Min Kang H
    618. McVean GA
    619. Gerstein MB
    620. Balasubramanian S
    621. Habegger L
    622. Functional interpretation group
    623. Yu H
    624. Flicek P
    625. Clarke L
    626. Cunningham F
    627. Dunham I
    628. Zerbino D
    629. Zheng-Bradley X
    630. Lage K
    631. Berg Jespersen J
    632. Horn H
    633. Montgomery SB
    634. DeGorter MK
    635. Khurana E
    636. Tyler-Smith C
    637. Chen Y
    638. Colonna V
    639. Xue Y
    640. Gerstein MB
    641. Balasubramanian S
    642. Fu Y
    643. Kim D
    644. Chromosome Y group
    645. Auton A
    646. Marcketta A
    647. American Museum of Natural History
    648. Desalle R
    649. Narechania A
    650. Arizona State University
    651. Wilson Sayres MA
    652. Garrison EP
    653. Handsaker RE
    654. Kashin S
    655. McCarroll SA
    656. Rodriguez-Flores JL
    657. Flicek P
    658. Clarke L
    659. Zheng-Bradley X
    660. Erlich Y
    661. Gymrek M
    662. Frederick Willems T
    663. Bustamante CD
    664. Mendez FL
    665. David Poznik G
    666. Underhill PA
    667. Lee C
    668. Cerveira E
    669. Malhotra A
    670. Romanovitch M
    671. Zhang C
    672. Abecasis GR
    673. University of Queensland
    674. Coin L
    675. Shao H
    676. Virginia Bioinformatics Institute
    677. Mittelman D
    678. Tyler-Smith C
    679. Ayub Q
    680. Banerjee R
    681. Cerezo M
    682. Chen Y
    683. Fitzgerald TW
    684. Louzada S
    685. Massaia A
    686. McCarthy S
    687. Ritchie GR
    688. Xue Y
    689. Yang F
    690. Data coordination center group
    691. Gibbs RA
    692. Kovar C
    693. Kalra D
    694. Hale W
    695. Muzny D
    696. Reid JG
    697. Wang J
    698. Dan X
    699. Guo X
    700. Li G
    701. Li Y
    702. Ye C
    703. Zheng X
    704. Altshuler DM
    705. Clarke L
    706. Zheng-Bradley X
    707. Bentley DR
    708. Cox A
    709. Humphray S
    710. Kahn S
    711. Sudbrak R
    712. Albrecht MW
    713. Lienhard M
    714. Larson D
    715. Craig DW
    716. Izatt T
    717. Kurdoglu AA
    718. Sherry ST
    719. Xiao C
    720. Haussler D
    721. Abecasis GR
    722. McVean GA
    723. Durbin RM
    724. Balasubramaniam S
    725. Keane TM
    726. McCarthy S
    727. Stalker J
    728. Samples and ELSI group
    729. Chakravarti A
    730. Knoppers BM
    731. Abecasis GR
    732. Barnes KC
    733. Beiswanger C
    734. Burchard EG
    735. Bustamante CD
    736. Cai H
    737. Cao H
    738. Durbin RM
    739. Gerry NP
    740. Gharani N
    741. Gibbs RA
    742. Gignoux CR
    743. Gravel S
    744. Henn B
    745. Jones D
    746. Jorde L
    747. Kaye JS
    748. Keinan A
    749. Kent A
    750. Kerasidou A
    751. Li Y
    752. Mathias R
    753. McVean GA
    754. Moreno-Estrada A
    755. Ossorio PN
    756. Parker M
    757. Resch AM
    758. Rotimi CN
    759. Royal CD
    760. Sandoval K
    761. Su Y
    762. Sudbrak R
    763. Tian Z
    764. Tishkoff S
    765. Toji LH
    766. Tyler-Smith C
    767. Via M
    768. Wang Y
    769. Yang H
    770. Yang L
    771. Zhu J
    772. Sample collection
    773. British from England and Scotland (GBR)
    774. Bodmer W
    775. Colombians in Medellín, Colombia (CLM)
    776. Bedoya G
    777. Ruiz-Linares A
    778. Han Chinese South (CHS)
    779. Cai Z
    780. Gao Y
    781. Chu J
    782. Finnish in Finland (FIN)
    783. Peltonen L
    784. Iberian Populations in Spain (IBS)
    785. Garcia-Montero A
    786. Orfao A
    787. Puerto Ricans in Puerto Rico (PUR)
    788. Dutil J
    789. Martinez-Cruzado JC
    790. Oleksyk TK
    791. African Caribbean in Barbados (ACB)
    792. Barnes KC
    793. Mathias RA
    794. Hennis A
    795. Watson H
    796. McKenzie C
    797. Bengali in Bangladesh (BEB)
    798. Qadri F
    799. LaRocque R
    800. Sabeti PC
    801. Chinese Dai in Xishuangbanna, China (CDX)
    802. Zhu J
    803. Deng X
    804. Esan in Nigeria (ESN)
    805. Sabeti PC
    806. Asogun D
    807. Folarin O
    808. Happi C
    809. Omoniwa O
    810. Stremlau M
    811. Tariyal R
    812. Gambian in Western Division – Mandinka (GWD)
    813. Jallow M
    814. Sisay Joof F
    815. Corrah T
    816. Rockett K
    817. Kwiatkowski D
    818. Indian Telugu in the UK (ITU) and Sri Lankan Tamil in the UK (STU)
    819. Kooner J
    820. Kinh in Ho Chi Minh City, Vietnam (KHV)
    821. Tịnh Hiê`n T
    822. Dunstan SJ
    823. Thuy Hang N
    824. Mende in Sierra Leone (MSL)
    825. Fonnie R
    826. Garry R
    827. Kanneh L
    828. Moses L
    829. Sabeti PC
    830. Schieffelin J
    831. Grant DS
    832. Peruvian in Lima, Peru (PEL)
    833. Gallo C
    834. Poletti G
    835. Punjabi in Lahore, Pakistan (PJL)
    836. Saleheen D
    837. Rasheed A
    838. Scientific management
    839. Brooks LD
    840. Felsenfeld AL
    841. McEwen JE
    842. Vaydylevich Y
    843. Green ED
    844. Duncanson A
    845. Dunn M
    846. Schloss JA
    847. Wang J
    848. Yang H
    849. Writing group
    850. Auton A
    851. Brooks LD
    852. Durbin RM
    853. Garrison EP
    854. Min Kang H
    855. Korbel JO
    856. Marchini JL
    857. McCarthy S
    858. McVean GA
    859. Abecasis GR
    (2015) A global reference for human genetic variation
    Nature 526:68–74.
    https://doi.org/10.1038/nature15393
    1. Trubetskoy V
    2. Pardiñas AF
    3. Qi T
    4. Panagiotaropoulou G
    5. Awasthi S
    6. Bigdeli TB
    7. Bryois J
    8. Chen C-Y
    9. Dennison CA
    10. Hall LS
    11. Lam M
    12. Watanabe K
    13. Frei O
    14. Ge T
    15. Harwood JC
    16. Koopmans F
    17. Magnusson S
    18. Richards AL
    19. Sidorenko J
    20. Wu Y
    21. Zeng J
    22. Grove J
    23. Kim M
    24. Li Z
    25. Voloudakis G
    26. Zhang W
    27. Adams M
    28. Agartz I
    29. Atkinson EG
    30. Agerbo E
    31. Al Eissa M
    32. Albus M
    33. Alexander M
    34. Alizadeh BZ
    35. Alptekin K
    36. Als TD
    37. Amin F
    38. Arolt V
    39. Arrojo M
    40. Athanasiu L
    41. Azevedo MH
    42. Bacanu SA
    43. Bass NJ
    44. Begemann M
    45. Belliveau RA
    46. Bene J
    47. Benyamin B
    48. Bergen SE
    49. Blasi G
    50. Bobes J
    51. Bonassi S
    52. Braun A
    53. Bressan RA
    54. Bromet EJ
    55. Bruggeman R
    56. Buckley PF
    57. Buckner RL
    58. Bybjerg-Grauholm J
    59. Cahn W
    60. Cairns MJ
    61. Calkins ME
    62. Carr VJ
    63. Castle D
    64. Catts SV
    65. Chambert KD
    66. Chan RCK
    67. Chaumette B
    68. Cheng W
    69. Cheung EFC
    70. Chong SA
    71. Cohen D
    72. Consoli A
    73. Cordeiro Q
    74. Costas J
    75. Curtis C
    76. Davidson M
    77. Davis KL
    78. de Haan L
    79. Degenhardt F
    80. DeLisi LE
    81. Demontis D
    82. Dickerson F
    83. Dikeos D
    84. Dinan T
    85. Djurovic S
    86. Duan J
    87. Ducci G
    88. Dudbridge F
    89. Eriksson JG
    90. Fañanás L
    91. Faraone SV
    92. Fiorentino A
    93. Forstner A
    94. Frank J
    95. Freimer NB
    96. Fromer M
    97. Frustaci A
    98. Gadelha A
    99. Genovese G
    100. Gershon ES
    101. Giannitelli M
    102. Giegling I
    103. Giusti-Rodríguez P
    104. Godard S
    105. Goldstein JI
    106. González Peñas J
    107. González-Pinto A
    108. Gopal S
    109. Gratten J
    110. Green MF
    111. Greenwood TA
    112. Guillin O
    113. Gülöksüz S
    114. Gur RE
    115. Gur RC
    116. Gutiérrez B
    117. Hahn E
    118. Hakonarson H
    119. Haroutunian V
    120. Hartmann AM
    121. Harvey C
    122. Hayward C
    123. Henskens FA
    124. Herms S
    125. Hoffmann P
    126. Howrigan DP
    127. Ikeda M
    128. Iyegbe C
    129. Joa I
    130. Julià A
    131. Kähler AK
    132. Kam-Thong T
    133. Kamatani Y
    134. Karachanak-Yankova S
    135. Kebir O
    136. Keller MC
    137. Kelly BJ
    138. Khrunin A
    139. Kim S-W
    140. Klovins J
    141. Kondratiev N
    142. Konte B
    143. Kraft J
    144. Kubo M
    145. Kučinskas V
    146. Kučinskiene ZA
    147. Kusumawardhani A
    148. Kuzelova-Ptackova H
    149. Landi S
    150. Lazzeroni LC
    151. Lee PH
    152. Legge SE
    153. Lehrer DS
    154. Lencer R
    155. Lerer B
    156. Li M
    157. Lieberman J
    158. Light GA
    159. Limborska S
    160. Liu C-M
    161. Lönnqvist J
    162. Loughland CM
    163. Lubinski J
    164. Luykx JJ
    165. Lynham A
    166. Macek M
    167. Mackinnon A
    168. Magnusson PKE
    169. Maher BS
    170. Maier W
    171. Malaspina D
    172. Mallet J
    173. Marder SR
    174. Marsal S
    175. Martin AR
    176. Martorell L
    177. Mattheisen M
    178. McCarley RW
    179. McDonald C
    180. McGrath JJ
    181. Medeiros H
    182. Meier S
    183. Melegh B
    184. Melle I
    185. Mesholam-Gately RI
    186. Metspalu A
    187. Michie PT
    188. Milani L
    189. Milanova V
    190. Mitjans M
    191. Molden E
    192. Molina E
    193. Molto MD
    194. Mondelli V
    195. Moreno C
    196. Morley CP
    197. Muntané G
    198. Murphy KC
    199. Myin-Germeys I
    200. Nenadić I
    201. Nestadt G
    202. Nikitina-Zake L
    203. Noto C
    204. Nuechterlein KH
    205. O’Brien NL
    206. O’Neill FA
    207. Oh S-Y
    208. Olincy A
    209. Ota VK
    210. Pantelis C
    211. Papadimitriou GN
    212. Parellada M
    213. Paunio T
    214. Pellegrino R
    215. Periyasamy S
    216. Perkins DO
    217. Pfuhlmann B
    218. Pietiläinen O
    219. Pimm J
    220. Porteous D
    221. Powell J
    222. Quattrone D
    223. Quested D
    224. Radant AD
    225. Rampino A
    226. Rapaport MH
    227. Rautanen A
    228. Reichenberg A
    229. Roe C
    230. Roffman JL
    231. Roth J
    232. Rothermundt M
    233. Rutten BPF
    234. Saker-Delye S
    235. Salomaa V
    236. Sanjuan J
    237. Santoro ML
    238. Savitz A
    239. Schall U
    240. Scott RJ
    241. Seidman LJ
    242. Sharp SI
    243. Shi J
    244. Siever LJ
    245. Sigurdsson E
    246. Sim K
    247. Skarabis N
    248. Slominsky P
    249. So H-C
    250. Sobell JL
    251. Söderman E
    252. Stain HJ
    253. Steen NE
    254. Steixner-Kumar AA
    255. Stögmann E
    256. Stone WS
    257. Straub RE
    258. Streit F
    259. Strengman E
    260. Stroup TS
    261. Subramaniam M
    262. Sugar CA
    263. Suvisaari J
    264. Svrakic DM
    265. Swerdlow NR
    266. Szatkiewicz JP
    267. Ta TMT
    268. Takahashi A
    269. Terao C
    270. Thibaut F
    271. Toncheva D
    272. Tooney PA
    273. Torretta S
    274. Tosato S
    275. Tura GB
    276. Turetsky BI
    277. Üçok A
    278. Vaaler A
    279. van Amelsvoort T
    280. van Winkel R
    281. Veijola J
    282. Waddington J
    283. Walter H
    284. Waterreus A
    285. Webb BT
    286. Weiser M
    287. Williams NM
    288. Witt SH
    289. Wormley BK
    290. Wu JQ
    291. Xu Z
    292. Yolken R
    293. Zai CC
    294. Zhou W
    295. Zhu F
    296. Zimprich F
    297. Atbaşoğlu EC
    298. Ayub M
    299. Benner C
    300. Bertolino A
    301. Black DW
    302. Bray NJ
    303. Breen G
    304. Buccola NG
    305. Byerley WF
    306. Chen WJ
    307. Cloninger CR
    308. Crespo-Facorro B
    309. Donohoe G
    310. Freedman R
    311. Galletly C
    312. Gandal MJ
    313. Gennarelli M
    314. Hougaard DM
    315. Hwu H-G
    316. Jablensky AV
    317. McCarroll SA
    318. Moran JL
    319. Mors O
    320. Mortensen PB
    321. Müller-Myhsok B
    322. Neil AL
    323. Nordentoft M
    324. Pato MT
    325. Petryshen TL
    326. Pirinen M
    327. Pulver AE
    328. Schulze TG
    329. Silverman JM
    330. Smoller JW
    331. Stahl EA
    332. Tsuang DW
    333. Vilella E
    334. Wang S-H
    335. Xu S
    336. Indonesia Schizophrenia Consortium
    337. PsychENCODE
    338. Psychosis Endophenotypes International Consortium
    339. SynGO Consortium
    340. Adolfsson R
    341. Arango C
    342. Baune BT
    343. Belangero SI
    344. Børglum AD
    345. Braff D
    346. Bramon E
    347. Buxbaum JD
    348. Campion D
    349. Cervilla JA
    350. Cichon S
    351. Collier DA
    352. Corvin A
    353. Curtis D
    354. Forti MD
    355. Domenici E
    356. Ehrenreich H
    357. Escott-Price V
    358. Esko T
    359. Fanous AH
    360. Gareeva A
    361. Gawlik M
    362. Gejman PV
    363. Gill M
    364. Glatt SJ
    365. Golimbet V
    366. Hong KS
    367. Hultman CM
    368. Hyman SE
    369. Iwata N
    370. Jönsson EG
    371. Kahn RS
    372. Kennedy JL
    373. Khusnutdinova E
    374. Kirov G
    375. Knowles JA
    376. Krebs M-O
    377. Laurent-Levinson C
    378. Lee J
    379. Lencz T
    380. Levinson DF
    381. Li QS
    382. Liu J
    383. Malhotra AK
    384. Malhotra D
    385. McIntosh A
    386. McQuillin A
    387. Menezes PR
    388. Morgan VA
    389. Morris DW
    390. Mowry BJ
    391. Murray RM
    392. Nimgaonkar V
    393. Nöthen MM
    394. Ophoff RA
    395. Paciga SA
    396. Palotie A
    397. Pato CN
    398. Qin S
    399. Rietschel M
    400. Riley BP
    401. Rivera M
    402. Rujescu D
    403. Saka MC
    404. Sanders AR
    405. Schwab SG
    406. Serretti A
    407. Sham PC
    408. Shi Y
    409. St Clair D
    410. Stefánsson H
    411. Stefansson K
    412. Tsuang MT
    413. van Os J
    414. Vawter MP
    415. Weinberger DR
    416. Werge T
    417. Wildenauer DB
    418. Yu X
    419. Yue W
    420. Holmans PA
    421. Pocklington AJ
    422. Roussos P
    423. Vassos E
    424. Verhage M
    425. Visscher PM
    426. Yang J
    427. Posthuma D
    428. Andreassen OA
    429. Kendler KS
    430. Owen MJ
    431. Wray NR
    432. Daly MJ
    433. Huang H
    434. Neale BM
    435. Sullivan PF
    436. Ripke S
    437. Walters JTR
    438. O’Donovan MC
    439. Schizophrenia Working Group of the Psychiatric Genomics Consortium
    (2022) Mapping genomic loci implicates genes and synaptic biology in schizophrenia
    Nature.
    https://doi.org/10.1038/s41586-022-04434-5

Decision letter

  1. Genevieve Konopka
    Reviewing Editor; University of Texas Southwestern Medical Center, United States
  2. Molly Przeworski
    Senior Editor; Columbia University, United States

Our editorial process produces two outputs: i) public reviews designed to be posted alongside the preprint for the benefit of readers; ii) feedback on the manuscript for the authors, including requests for revisions, shown below. We also include an acceptance summary that explains what the editors found interesting or important about the work.

Decision letter after peer review:

Thank you for submitting your article "MCGA: a multi-strategy conditional gene-based association framework integrating with isoform-level expression profiles reveals new susceptible and druggable candidate genes of schizophrenia" for consideration by eLife. Your article has been reviewed by 3 peer reviewers, and the evaluation has been overseen by a Reviewing Editor and Molly Przeworski as the Senior Editor. The reviewers have opted to remain anonymous.

The reviewers have discussed their reviews with one another, and the Reviewing Editor has drafted this to help you prepare a revised submission.

Essential revisions:

1) Systematic comparisons with existing methods are needed to demonstrate the potential gain in biological insight from this method. How do the results from MGCA_Dist, MGCA_eQTL, and MGCA_isoQTL compare with MAGMA? Can you include a direct comparison of MGCA with MAGMA, including the gene-based results, biological pathway enrichments, and drug repositioning analyses? Can the authors compare MGCA_eQTL to other eQTL-based approaches such as S-PrediXcan?

2) When comparing MCGA-eQTL and MCGA-sQTL, only power is considered. The authors should include analyses to demonstrate the performance in control for false positives.

3) When choosing a favorable exponent value c (1.432 chosen in the study), the authors found that the c value is robust to trait type, sample size or variant size, but the authors didn't explain what factors affect the choosing of c. Considering the potential application of MCGA method in other studies, the authors should explain what factor affects c value, and provide the guidance how to choose an optimal c. Can the authors show via simulations the relationship between the coefficient used to adjust the chi square statistic correlations, $c$, and the strength of association signals?

Reviewer #1 (Recommendations for the authors):

– How do the results from MGCA_Dist, MGCA_eQTL, and MGCA_isoQTL compare with MAGMA? Can you include a direct comparison of MGCA with MAGMA, including the gene-based results, biological pathway enrichments, and drug repositioning analyses?

– Similarly, can you compare MGCA_eQTL to other eQTL-based approaches such as S-PrediXcan? I would like to see a systematic comparison with existing methods to see if novel biological insights can be derived from your results.

– Does MCGA provide information on tissue-specific eQTLs/isoQTLs? This information is available in GTEx but is not reported for association results in supplementary tables. Tissue-specific effects for each tissue might improve the interpretation of results.

– Can you provide more information on WGCNA results? Are MCGA susceptibility genes enriched in a particular co-expression module? Are synaptic signalling related biological pathways enriched in a particular module? How did the network modules compare across brain tissues? Is it possible to build a consensus network for the top five brain regions (or indeed all brain regions in GTEx) to simplify the results?

– The drug repositioning analysis could be improved by drawing on comprehensive drug-gene information from hetionet (https://doi.org/10.7554/eLife.26726.001). This resource might further refine some of your top associations by accounting for anatomies, biological processes, side-effects, and symptoms.

Reviewer #2 (Recommendations for the authors):

GTEx v8 data contains samples from multiple tissues, and the significant gene-trait association in some tissue may be non-causal driven by gene expression correlation with causal tissues. It should be interesting to come up a new method to remove those false positive.

Reviewer #3 (Recommendations for the authors):

My major concern is a lack of innovation as a new method. MCGA is based on DES, which integrates expression data with GWAS; and DESE relies on ECS to compute gene-level association evidence. Both DESE and ECS were developed by the authors in the past and published elsewhere. The only innovation in the MCGA methodology seems the improved control of type I error by formally assessing the null distribution of the conditional test statistic for ECS and adjust the test statistic. This improvement is definitely crucial to ECS. But in my view, this is a fix to a long standing issue of ECS rather than a new method in itself. Even with this improvement, the ECS is still not a completely statistically solid method because it still involves heuristic procedures to determine how much the ECS statistics should be adjusted. There is still no methodological guarantee of controlled type I error.

[Editors’ note: further revisions were suggested prior to acceptance, as described below.]

Thank you for resubmitting your work entitled "A conditional gene-based association framework integrating isoform-level eQTL data reveals new susceptibility genes for schizophrenia" for further consideration by eLife. Your revised article has been evaluated by Molly Przeworski (Senior Editor) and a Reviewing Editor.

The manuscript has been improved but there are some remaining issues that need to be addressed, as outlined below:

Reviewer #3 has 2 remaining issues that we would like you to more fully address before we would consider accepting the manuscript. Please read these comments carefully and revise accordingly. Thank you.

Reviewer #1 (Recommendations for the authors):

I would like to thank the authors for their detailed response to my questions and concerns. I have no further comments on the manuscript.

Reviewer #2 (Recommendations for the authors):

The authors have resolved all my concerns. Nice job.

Reviewer #3 (Recommendations for the authors):

The revised manuscript was submitted in PDF format with track changes. Although it clearly shows what's been changed since the original submission, it is therefore very difficult to read the complete article for a second overall assessment. I will therefore focus on comments I raised in my first review.

Two major issues remain.

1. The authors failed to respond properly to my comment on the overall concern that conditional analysis may not be the best approach for this type of problem. Figure 1 of SuSiE paper illustrates the exact scenarios why this could be a bad idea, which motivated my A-B-C genes example. Also incidentally, it is not true that SuSiE cannot work with summary statistics.

2. The response to the question of the choice of c (from another reviewer and myself) is still not satisfactory. It seems the favorable c still was computed in very specific scenarios. In fact, the new Figure 9 exactly demonstrates my concern: optimal choice of c depends greatly on LD and effect size, and for variants with small effects (OR < 1.4 which would be the case for most GWAS signals) the range of optimal c is wide. Given this result, I remain unconvinced that the tests are calibrated under current choice of c. From Figure 9 I would conclude that a conservative c = 1.8 should be used in general, which is different from the recommended choice from the authors.

https://doi.org/10.7554/eLife.70779.sa1

Author response

Essential revisions:

1) Systematic comparisons with existing methods are needed to demonstrate the potential gain in biological insight from this method. How do the results from MGCA_Dist, MGCA_eQTL, and MGCA_isoQTL compare with MAGMA? Can you include a direct comparison of MGCA with MAGMA, including the gene-based results, biological pathway enrichments, and drug repositioning analyses? Can the authors compare MGCA_eQTL to other eQTL-based approaches such as S-PrediXcan?

We thank the reviewers for their suggestions, and we agree that systematic comparisons with existing methods can help demonstrate the potential gain of our framework. Before the comparison with other methods, we recapped the inheritance and the extensions of our framework (eDESE) with our previous methods (i.e., ECS and DESE) in "Overview of the present study" of the revised manuscript, as suggested by Reviewer #3. We emphasized the two important advantages of the eDESE over ECS and DESE. We renamed MCGA_Dist as eDESE:dist and renamed MCGA_eQTL and MCGA_isoQTL as eDESE:gene and eDESE:isoform, respectively (see lines 110-139 in the revised manuscript). To clearly address all the reviewer's questions in point 1, we have organized these questions as follows and will address them in order:

Question A: How do the results from MGCA_Dist, MGCA_eQTL, and MGCA_isoQTL compare with MAGMA?

To discuss the limitations of MCGA in our last manuscript, we mentioned the number of significant genes identified by MAGMA to suggest that though the number of significant genes identified by MCGA (especially MCGA_isoform) was a little big, so was MAGMA. Considering that MCGA is a conditional gene-based association approach, we had no intention of directly comparing MCGA with MAGMA in our last manuscript.

Question B: Can you include a direct comparison of MGCA with MAGMA, including the gene-based results, biological pathway enrichments, and drug repositioning analyses?

Yes. Because conventional MAGMA and eDESE:dist both can perform gene-based association analysis by mapping variants to genes based on the variant-gene physical distance (+/-5kb around the gene boundary), we compared the susceptibility genes identified by eDESE:dist with that identified by MAGMA from the aspect of the significant gene counts, GO enrichment terms, text-mining results and the druggability of the susceptibility genes (see lines 208-245 and 434-478 in the revised manuscript).

Question C: Can the authors compare MGCA_eQTL to other eQTL-based approaches such as S-PrediXcan?

Yes. Since S-PrediXcan and eDESE (eDESE:gene and eDESE:isoform) both can perform the gene-based association analysis by considering the genetic regulatory effects on transcriptome (i.e., eQTLs), we compared the susceptibility genes identified by eDESE with S-PrediXcan.

The precalculated data in the PredictDB repository (http://predictdb.org) is important to S-PrediXcan. However, PredictDB provides gene-level expression and splicing predictors, but no isoform-level expression predictors, based on the MASHR models. Besides, there is no tutorial for the MASHR models in the PredictDB for now (see details in https://groups.google.com/g/predixcanmetaxcan/c/tTYlg8HIBeA/m/QNVskMCUCAAJ). So, we performed S-PrediXcan analysis only based on the gene-level expression predictors based on the MASHR models.

Before comparing S-PrediXcan and eDESE, we firstly optimized five brain regions based on eDESE:dist, eDESE:gene and eDESE:isoform (see lines 281-313 in the revised manuscript). Then, based on the five optimized brain regions, we compared the significant gene counts, GO enrichment terms, enriched consensus modules, text-mining results of the susceptibility genes (see lines 315-432 in the revised manuscript). We also compared the enrichment of drug-gene interactions and druggability of the susceptibility genes identified by MAGMA, S-PrediXcan and eDESE (see lines 434-478 in the revised manuscript).

The comparisons of eDESE with the two benchmark methods (i.e., MAGMA and S-PrediXcan) showed our framework's utility and potential gains.

2) When comparing MCGA-eQTL and MCGA-sQTL, only power is considered. The authors should include analyses to demonstrate the performance in control for false positives.

Thanks for pointing out this important question! In our last manuscript, we reported the power of isoform-level eQTLs over that of gene-level eQTLs in association analysis but neglected the performance in control for the false positives. In the revised manuscript, we reported the powers and false positives of gene/isoform-level eQTLs in association analysis. Our results showed that the false positives were controlled within a reasonable level (on average < 0.1% by a p-value cutoff of 0.001). Besides, the power of isoform-level eQTLs was higher than that of gene-level eQTLs in all scenarios in which gene expression can affect the phenotype (see lines 247-279 in the revised manuscript).

3) When choosing a favorable exponent value c (1.432 chosen in the study), the authors found that the c value is robust to trait type, sample size or variant size, but the authors didn't explain what factors affect the choosing of c. Considering the potential application of MCGA method in other studies, the authors should explain what factor affects c value, and provide the guidance how to choose an optimal c. Can the authors show via simulations the relationship between the coefficient used to adjust the chi square statistic correlations, $c$, and the strength of association signals?

Thanks for pointing out these important questions! To clearly address all the reviewer's questions, we have organized the questions as follows and will address them in order:

Question A: what factors affect the choosing of c?

Motived from the boundary of chi-square correlation, we adopted simulation studies to empirically choose c for controlling the type I error of the effective chi-square test. Besides the correlation of chi-square statistics, the choosing of c for the effective chi-square test may also be affected by the approximated non-negative solutions. However, the correlation of chi-square statistics is the major factor. Our simulation showed that the derived boundary and influence trend of LD on chi-square statistics were also applicable to the effective chi-square test. In the revised manuscript, we showed that the correlation of chi-square statistics is affected by the non-centrality parameter of chi-square statistics (see lines 640-655 in the revised manuscript).

Question B: how to choose an optimal c?

As the optimal c for controlling the type I error of the effective chi-square test would be affected by the non-centrality parameter of chi-square statistics which are generally unknown in practice, we have to resort to a grid search algorithm to explore an empirically optimal c. In our last manuscript, we mixed the methods of choosing optimal c with the introduction of new effective chi-squared statistics. We wrote a new subsection in Materials and methods to describe the procedure of choosing the optimal c in the revised manuscript (see lines 610-628 in the revised manuscript).

Question C: Can the authors show via simulations the relationship between the coefficient used to adjust the chi square statistic correlations, $c$, and the strength of association signals?

Yes. In the revised manuscript, we further investigated how the strength of association signals influenced the chi-square statistic correlations. Consistent with the theoretical derivation, the correlation of chi-square statistics quickly approached the maximal value |r| as the effect size increased. A stronger association signal led to a smaller c, and the c can quickly approach 1 as the association signals increased even in a moderate sample size (1000 cases and 1000 controls) (see Figure 9).

Reviewer #1 (Recommendations for the authors):

– How do the results from MGCA_Dist, MGCA_eQTL, and MGCA_isoQTL compare with MAGMA? Can you include a direct comparison of MGCA with MAGMA, including the gene-based results, biological pathway enrichments, and drug repositioning analyses?

We thank Reviewer #1 overall for the numerous insightful and helpful suggestions and comments. Please refer to Question A and B of Essential Revisions point 1.

– Similarly, can you compare MGCA_eQTL to other eQTL-based approaches such as S-PrediXcan? I would like to see a systematic comparison with existing methods to see if novel biological insights can be derived from your results.

We thank the reviewer for this suggestion and comment. Please refer to Question C of Essential Revisions point 1.

– Does MCGA provide information on tissue-specific eQTLs/isoQTLs? This information is available in GTEx but is not reported for association results in supplementary tables. Tissue-specific effects for each tissue might improve the interpretation of results.

We thank the reviewer for this suggestion. In our last manuscript, we provided the gene-level and isoform-level eQTLs for each tissue but did not provide tissue-specific eQTLs/isoQTLs. Indeed, GTEx v8 provided the tissue-specific gene-level eQTLs and splicing QTLs for each tissue. However, we cannot directly access the original tissue-specific eQTLs because the download is not free.

Fortunately, we found a processed gene-level tissue-specific eQTLs set of GTEx v8 from the E-MAGMA tutorial document (https://github.com/eskederks/eMAGMA-tutorial), in which the tissue-specific SNP-gene associations were provided. We downloaded the E-MAGMA annotation files for Brain-Front Cortex (BA9) and Brain-Cerebellum, i.e., Brain_Frontal_Cortex_BA9.genes.annot and Brain_Cerebellum.genes.annot, from https://github.com/eskederks/eMAGMA-tutorial/blob/gtex_v8/emagma_annot_1.tar.gz.

We then extracted the tissue-specific SNP-gene associations from the above two annotation files and converted the tissue-specific SNP-gene associations to the eQTLs file format required by the eDESE:gene. We then perform the eDESE:gene analysis based on the tissue-specific eQTLs of Brain-Front Cortex (BA9) and Brain-Cerebellum, respectively. Our results showed that no tissue was significantly associated with schizophrenia by using the tissue-specific eQTLs of Brain-Front Cortex (BA9) (cutoff p-value < 0.05/50=1e-3, Author response image 1a), while eight brain regions were predicted to be significantly associated with schizophrenia by using the tissue-specific eQTLs of Brain-Cerebellum (Author response image 1) .

Author response image 1
The tissue importance generated by eDESE:gene based on the tissue-specific eQTLs.

(a) Brain-Front Cortex (BA9); (b) Brain-Cerebellum. The red dotted lines denote the significant threshold.

Interestingly, though the tissue estimation based on the tissue-specific eQTLs of Brain-Front Cortex (BA9) was not satisfying, we found several brain regions ranked high among all the fifty tissues, and Brain-Front Cortex (BA9) ranked the second in the full tissue list (p-value = 2.4E-2). Brain-Cerebellum itself ranked second and was significantly associated with schizophrenia by using the tissue-specific eQTLs of Brain-Cerebellum. Further, we also investigated the potential susceptibility genes identified by eDESE:gene based on the tissue-specific eQTLs of Brain-Front Cortex (BA9) and Brain-Cerebellum, respectively. We found that these potential susceptibility genes were enriched with few GO terms, in which there was no biologically sensible GO term.

As suggested by the reviewer, our result showed that eDESE could benefit from the tissue-specific eQTLs, though the results of phenotype-associated tissue estimation based on the tissue-specific eQTLs were not as good as that based on the single tissue eQTLs in the current condition. This might result from the small sample size and strict threshold in eQTLs identification in GTEx v8. We believe that future genetic studies based on the increased sample sizes might alleviate this problem and increase the power of tissue-specific eQTLs. Integrating the tissue-specific eQTLs with eDESE will also be a topic for our further exploration. We also cited the paper and acknowledged the data in our Acknowledgments.

– Can you provide more information on WGCNA results? Are MCGA susceptibility genes enriched in a particular co-expression module? Are synaptic signalling related biological pathways enriched in a particular module? How did the network modules compare across brain tissues? Is it possible to build a consensus network for the top five brain regions (or indeed all brain regions in GTEx) to simplify the results?

We thank the reviewer for these suggestions and comments and for the opportunity to clarify.

In our last manuscript, we investigated the normalized intra-module connectivity score of the potential susceptibility genes identified by MCGA_eQTL and MCGA_isoQTL in each potential susceptibility brain region. The measurements were the statistics and statistical significance of the Wilcoxon rank-sum test based on the normalized intra-module connectivity score of the susceptibility and non-susceptibility genes in each tissue. In our last manuscript, we used Table 2 to present the tissues in which the intra-module connectivity score of the susceptibility was significantly higher than that of the non-susceptibility genes. We had no intention of comparing the network modules across brain regions.

We appreciate the reviewer's recommendation to build a consensus network to simplify the results, which we were previously unaware of. Following the reviewer's suggestion, we built a consensus network based on all thirteen brain regions' gene-level expression profiles (GTEx v8). We found that the potential susceptibility genes in the Brain-Cerebellum and Brain-Front Cortex (BA9) identified by eDESE:gene, and the potential susceptibility genes in the Brain-Cerebellum identified by eDESE:isoform were all significantly enriched with the same consensus module, which was significantly enriched with plenty of neuronal and synaptic signaling-related GO terms. However, no enriched consensus module was found for the susceptibility genes identified by S-PrediXcan (see lines 384-402 in the revised manuscript). The consensus analysis results supported the utility of our framework.

– The drug repositioning analysis could be improved by drawing on comprehensive drug-gene information from hetionet (https://doi.org/10.7554/eLife.26726.001). This resource might further refine some of your top associations by accounting for anatomies, biological processes, side-effects, and symptoms.

We thank the reviewer for this suggestion and appreciate the reviewer's recommendation to use Hetionet v1.0 to refine our top associations. We were mainly concerned about the potential susceptibility of the identified genes to schizophrenia and considered the drug repositioning evidence as a supplement. One reason is that some of our top associations contain many non-coding genes and pseudogenes that are not druggable. Another reason is that drug repositioning is complex and usually needs the support of multi-omics data. Thus we only investigated the drug-gene interaction terms and potential druggability of the susceptibility genes in our analysis.

We also studied Hetionet v1.0 and found that its function "Connectivity search" is interesting and useful in investigating the potential associations in genes, anatomies, biological processes, side-effects, symptoms and so on. We then used Hetionet v1.0 to support the top phenotype-gene associations for the susceptibility genes exclusively predicted by eDESE:isoform (see lines 349-354 in the revised manuscript).

Reviewer #2 (Recommendations for the authors):

GTEx v8 data contains samples from multiple tissues, and the significant gene-trait association in some tissue may be non-causal driven by gene expression correlation with causal tissues. It should be interesting to come up a new method to remove those false positive.

This is a very good recommendation! We thank the reviewer for this comment, which has allowed us to improve the estimation results of phenotype-associated tissues. The issue about how to remove the false positive in the gene-trait association analysis is complex. We attempted to alleviate the issue by removing those false positives in the phenotype-associated tissues. The gene-trait association in more reliable phenotype-associated tissues might be more likely the truly susceptibility genes.

The significant tissues identified based on eDESE:dist may not all be phenotype-associated due to the high correlations between brain expression profiles. Based on the eDESE:dist, all thirteen brain regions were ranked top by statistical significance and predicted as the significant phenotype-associated tissues for schizophrenia, no matter based on gene-level or isoform-level (transcript-level) expression profiles (see lines 281-286 in the revised manuscript and supplementary file 1d). For schizophrenia, the Brain-Frontal Cortex is the most known potential susceptibility region, while what roles of other twelve brain regions play in the development of schizophrenia remains obscure.

Our framework, eDESE, inherits the framework of DESE and can also predict potential phenotype-associated tissues and susceptibility genes with the guide of cis-eQTLs. What eDESE over DESE is: First, eDESE is built based on a new ECS, with which the type I error could be controlled within a proper level. Second, eDESE expands the conditional gene-based association analysis of DESE by using different SNPs sets, i.e., physically nearby SNPs (DESE did), gene-level and isoform-level variant-gene eQTLs associations.

We then assumed that if a tissue (say T1) is a phenotype-associated tissue, potential susceptibility genes identified by eDESE with the eQTLs of T1 will be more likely to be phenotype-associated genes and be selectively expressed in T1 or similar tissues. Using the isoform-level eQTLs, based on a more precise resolution, we found that the analysis results may be more powerful, as demonstrated in the simulations in the revised manuscript (see lines 247-279 in the revised manuscript). Thus, we integrated the outputs of eDESE:dist, eDESE:gene and eDESE:isoform. Based on the principle of "the wisdom of the crowd", we finally optimized five brain regions, which were all predicted as the potential susceptibility tissues by eDESE:dist, eDESE:gene and eDESE:isoform, respectively.

This is also our first trial to remove those non-direct phenotype-associated tissues. The method might be rough but can be a reference for further explorations.

Reviewer #3 (Recommendations for the authors):

My major concern is a lack of innovation as a new method. MCGA is based on DES, which integrates expression data with GWAS; and DESE relies on ECS to compute gene-level association evidence. Both DESE and ECS were developed by the authors in the past and published elsewhere. The only innovation in the MCGA methodology seems the improved control of type I error by formally assessing the null distribution of the conditional test statistic for ECS and adjust the test statistic. This improvement is definitely crucial to ECS. But in my view, this is a fix to a long standing issue of ECS rather than a new method in itself. Even with this improvement, the ECS is still not a completely statistically solid method because it still involves heuristic procedures to determine how much the ECS statistics should be adjusted. There is still no methodological guarantee of controlled type I error.

We understand the concern of Reviewer #3 about the novelty of our study. Indeed, MCGA (called eDESE in the revised manuscript) is based on the ECS and DESE, which were our previously published work. eDESE had at least two advantages over ECS and DESE. First, as the reviewer pointed out, we made a crucial improvement for the ECS to control the type I error by formally assessing the null distribution of the conditional gene-based association analysis based on the ECS and adjust the test statistic. Second, different from DESE, eDESE can perform conditional gene-based association analysis not only by mapping variants to genes according to their physical distance (like DESE), more importantly, but also based on gene-level and isoform-level gene-variant eQTLs associations. These improvements can help eDESE to use the distal variants to predict more significant susceptibility genes and tissues. We also found that isoform-level eQTLs could help eDESE identify more susceptibility genes than gene-level eQTLs in association analysis and produce the potentially significant phenotype-associated isoforms (or transcripts). Though mainly limited by sample size, the eQTL-guided models can be an important supplement for the gene-based approach. The methodology improvement of eDESE, the simulation studies and the comparisons with other methods demonstrated, at least partly, the potential gains of our framework.

[Editors’ note: further revisions were suggested prior to acceptance, as described below.]

Reviewer #3 (Recommendations for the authors):

The revised manuscript was submitted in PDF format with track changes. Although it clearly shows what's been changed since the original submission, it is therefore very difficult to read the complete article for a second overall assessment. I will therefore focus on comments I raised in my first review.

We apologize for the inconvenience in assessing our revised manuscript in PDF format. To address this issue, we have provided the clean version of the revised manuscript through the "Related Manuscript File" option in the eLife submission system to facilitate the review process. We thank Reviewer #3 overall for the insightful and helpful comments and for the opportunity to clarify the confusion.

However two major issues remain.

1. The authors failed to respond properly to my public comment on the overall concern that conditional analysis may not be the best approach for this type of problem. Figure 1 of SuSiE paper illustrates the exact scenarios why this could be a bad idea, which motivated my A-B-C genes example. Also incidentally, it is not true that SuSiE cannot work with summary statistics.

We thank the reviewer for these comments and for the opportunity to correct and clarify. Yes, SuSiE can work with summary statistics. We apologize for not adequately addressing the difference between SuSiE and our approach in our last response.

The role of our conditional association analysis is to remove redundant association signals per se. We agree with the reviewer that the conditional association analysis may not be the best approach for the fine-mapping analysis because the gene order of entering the conditional association analysis will influence the performance. Gene enters the conditional association analysis in a pre-defined order. In conventional conditional association analysis, the gene order can be determined according to the p-value of the gene's overall association with the phenotype. The conditional p-value of the second gene was obtained by conditioning on the first gene, and that of the third gene was obtained by conditioning on the first two genes. The conditional p-values of subsequent genes were calculated according to the same procedure. Thus, genes with priority to enter the conditional iteration tend to be significant, while genes entering the conditional iteration later tend to be insignificant. In the A-B-C gene example, as the reviewer pointed out, if the non-causal gene C has a stronger marginal signal than either true susceptibility gene A or B, then gene C will enter the conditional association analysis procedure first, which might lead to the true susceptibility genes (A and B) to be insignificant in the conventional conditional association analysis. Also similar to Figure 1b of SuSiE, conventional conditional association analysis may incorrectly identify non-causal gene C as the significant phenotype-associated gene, and gene A and B's association signal will become weaker by conditioning on gene C.

To address this issue, in this study, we proposed to use the gene selective expression (not the conventional association p-value) to guide the gene order of entering the conditional association analysis, which worked well in the real data analysis. The main assumption of gene selective expression is that the true susceptibility genes tend to selectively express in phenotype-associated tissues (e.g., Nat Genet. 2018;50:621–9. Genome Biol. 2019;20(1):233), which might result in a relatively high gene selective expression score for the true susceptibility genes and prioritize the true susceptibility genes to enter the conditional association analysis. For the A-B-C gene example, though gene C has a stronger marginal signal than either gene A or B, according to our assumption, the true susceptibility gene A or B might selectively express in phenotype-associated tissues and tend to have a higher gene selective expression score than gene C, thus can enter the conditional association analysis before gene C. Finally, by conditioning on the true susceptibility gene A or B, the non-causal gene C will be insignificant, which might help alleviate the problems in the A-B-C gene example.

In addition, as the reviewer pointed out, SuSiE is also a solid and efficient fine-mapping approach. Incorporating the advantages of SuSiE with our method will be interesting and helpful, which we will carefully consider in our future research. As SuSiE is an important fine-mapping tool, we also cited the original paper of SuSiE and discussed the relevant issue in the revised manuscript. We have added the following introduction about SuSiE in our revised manuscript (or see lines 62-65 of the clean version manuscript):

"SuSiE (sum of single effects), a novel and popular approach to variable selection in linear regression, can use summary statistics and LD to produce gene-level evidence of association in terms of Bayes Factor10."

We have also added the following discussion in our revised manuscript (see lines 562-566 of the clean version manuscript):

"Third, the performance of conditional association analysis for fine-mapping can be greatly influenced by the gene orders of entering the analysis. Though the improved effective chi-squared test with eDESE can work well in the real data of schizophrenia, integrating some non-conditional fine-mapping methods, such as SuSiE, with eDESE is worth trying."

2. The response to the question of the choice of c (from another reviewer and myself) is still not satisfactory. It seems the favorable c still was computed in very specific scenarios. In fact, the new Figure 9 exactly demonstrates my concern: optimal choice of c depends greatly on LD and effect size, and for variants with small effects (OR < 1.4 which would be the case for most GWAS signals) the range of optimal c is wide. Given this result, I remain unconvinced that the tests are calibrated under current choice of c. From Figure 9 I would conclude that a conservative c = 1.8 should be used in general, which is different from the recommended choice from the authors.

We thank the reviewer for these comments and for the opportunity to clarify. The c in Figure 9 was merely the correlation of chi-square statistics, which increased with OR up to the upper bound (|r|). We initially performed relevant analysis and drew Figure 9 to respond to the comment 3 of Reviewer #3 in our last response. The meaning of c value in Figure 9 is different from the c value used for the association test.

The factors that affect the choosing of favorable c for the association test are complex, including such as the correlation of chi-square statistics, the redundancy of degree of freedom and the approximated non-negative solutions. We generated the favorable c for the association test by extensive simulations. Specifically speaking, we simulated many relatively common scenarios, which included various sample sizes and variant numbers for both binary and continuous traits, respectively. The description of the common scenarios is as follows (or see lines 627-631 in the clean version manuscript).

"We considered in total 84 parameter settings, i.e., a combination of three different sample sizes (10,000, 20,000 and 40,000) and 14 different variant number (10, 30, 50, 80, 100, 125, 150, 200, 250, 300, 400, 500, 800, and 1000) for both binary and continuous traits, respectively. For a parameter setting, 40,000 datasets were simulated and used to produce p-values to determine the favorable c value."

We apologize for the confusion about the c in Figure 9. We have replaced the expression of c in Figure 9 with c' to distinguish it from the c for the association test (or see lines 640-665 in the clean version manuscript).

To further address the reviewer's concern, we also performed simulation analyses (similar to the analyses in Figure 4 in the revised manuscript) with c=1.8. The Q-Q plots in Author response image 2 simply that the c=1.8 led to a higher false-positive rate in the conditional association analysis. It seems that a larger c will make the test more liberal. Therefore, we chose a more conservative c (i.e., 1.432) in this study. We also added the following discussion in our revised manuscript:

"Fourth, the optimal c value of the effective chi-squared test is still empirical although we derived its range and relevant factors. The optimal c value can be improved to be better suited for other specific application scenarios."

Author response image 2
Q-Q plots of the conditional, unconditional gene-based association test and likelihood-ratio test under the null hypothesis (c=1.8).

(a) and (d), two gene-variant pairs with the similar variant number (SIPA1L2 with 29 variants and LOC729336 with 30 variants). (b) and (e): two gene-variant pairs with different variant numbers, and the first is larger than the second (CACHD1 with 41 variants and RAVER2 with eight variants). (c) and (f): two gene-variant pairs with different variant numbers, and the second is larger than the first (LOC647132 with five variants and FAM5C with 48 variants). (a), (b) and (c): the former gene has no QTL, and QTL explained 0.5% of heritability in the latter gene. (d), (e) and (f): the former gene has no QTL, and QTL explained 1% of heritability in the latter gene. Ten thousand phenotype datasets were simulated for each scenario. Unconditional Eff. Chi. (the red) represents unconditional association analysis at the former gene by the improved ECS. Conditional Eff. Chi (the blue) represents conditional association analysis at the former gene conditioning on the latter gene by the improved ECS. The likelihood ratio test (the yellow) was conducted based on the nested linear regression models.

https://doi.org/10.7554/eLife.70779.sa2

Article and author information

Author details

  1. Xiangyi Li

    1. Program in Bioinformatics, Zhongshan School of Medicine and The Fifth Affiliated Hospital, Sun Yat-sen University, Guangzhou, China
    2. Key Laboratory of Tropical Disease Control (Sun Yat-sen University), Ministry of Education, Guangzhou, China
    3. Center for Precision Medicine, Sun Yat-sen University, Guangzhou, China
    Contribution
    Data curation, Formal analysis, Investigation, Methodology, Validation, Writing – original draft, Writing – review and editing
    Contributed equally with
    Lin Jiang
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-9408-3016
  2. Lin Jiang

    Research Center of Medical Sciences, Guangdong Provincial People's Hospital, Guangdong Academy of Medical Sciences, Guangzhou, China
    Contribution
    Data curation, Formal analysis, Investigation, Methodology, Visualization, Writing – original draft, Writing – review and editing
    Contributed equally with
    Xiangyi Li
    Competing interests
    No competing interests declared
  3. Chao Xue

    1. Program in Bioinformatics, Zhongshan School of Medicine and The Fifth Affiliated Hospital, Sun Yat-sen University, Guangzhou, China
    2. Key Laboratory of Tropical Disease Control (Sun Yat-sen University), Ministry of Education, Guangzhou, China
    3. Center for Precision Medicine, Sun Yat-sen University, Guangzhou, China
    Contribution
    Data curation, Methodology, Writing – review and editing
    Competing interests
    No competing interests declared
  4. Mulin Jun Li

    The Province and Ministry Co-sponsored Collaborative Innovation Center for Medical Epigenetics, Tianjin Medical University, Tianjin, China
    Contribution
    Data curation, Formal analysis, The gene/isoform-level eQTLs were calculated by Mulin Jun Li who has the permission to access the GTEx genotype data
    Competing interests
    No competing interests declared
  5. Miaoxin Li

    1. Program in Bioinformatics, Zhongshan School of Medicine and The Fifth Affiliated Hospital, Sun Yat-sen University, Guangzhou, China
    2. Key Laboratory of Tropical Disease Control (Sun Yat-sen University), Ministry of Education, Guangzhou, China
    3. Center for Precision Medicine, Sun Yat-sen University, Guangzhou, China
    4. Guangdong Provincial Key Laboratory of Biomedical Imaging and Guangdong Provincial Engineering Research Center of Molecular Imaging, The Fifth Affiliated Hospital, Sun Yat-sen University, Zhuhai, China
    Contribution
    Conceptualization, Data curation, Methodology, Software, Supervision, Writing – review and editing
    For correspondence
    limiaoxin@mail.sysu.edu.cn
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-4733-0109

Funding

National Natural Science Foundation of China (31771401)

  • Miaoxin Li

National Key Research and Development Program of China (2018YFC0910500 and 2016YFC0904300)

  • Miaoxin Li

Science and Technology Program of Guangzhou (201803010116)

  • Miaoxin Li

Guangdong project (2017GC010644)

  • Miaoxin Li

National Natural Science Foundation of China (31970650)

  • Miaoxin Li

National Natural Science Foundation of China (32170637)

  • Miaoxin Li

Department of Science and Technology of Guangdong Province (2018B030322006)

  • Miaoxin Li

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

We thank the GTEx Consortium and 1000 Genomes Projects for providing access to the data and thank the GTEx Consortium for the code to preprocess and normalize the expression profiles. We also appreciate the authors of the schizophrenia study and the Schizophrenia Working Group of the Psychiatric Genomics Consortium for sharing their GWAS summary statistics. Finally, we also thank the authors of E-MAGMA for providing their preprocessed tissue-specific eQTLs of GTEx (v8). This work was funded by the National Natural Science Foundation of China (31771401, 32170637 and 31970650), National Key R&D Program of China (2018YFC0910500 and 2016YFC0904300), Science and Technology Program of Guangzhou (201803010116), Guangdong project (2017GC010644), Department of Science and Technology of Guangdong Province (2018B030322006).

Senior Editor

  1. Molly Przeworski, Columbia University, United States

Reviewing Editor

  1. Genevieve Konopka, University of Texas Southwestern Medical Center, United States

Publication history

  1. Received: May 28, 2021
  2. Preprint posted: June 9, 2021 (view preprint)
  3. Accepted: November 11, 2021
  4. Version of Record published: April 12, 2022 (version 1)

Copyright

© 2022, Li et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Metrics

  • 486
    Page views
  • 72
    Downloads
  • 0
    Citations

Article citation count generated by polling the highest count across the following sources: Crossref, PubMed Central, Scopus.

Download links

A two-part list of links to download the article, or parts of the article, in various formats.

Downloads (link to download the article as PDF)

Open citations (links to open the citations from this article in various online reference manager services)

Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)

  1. Xiangyi Li
  2. Lin Jiang
  3. Chao Xue
  4. Mulin Jun Li
  5. Miaoxin Li
(2022)
A conditional gene-based association framework integrating isoform-level eQTL data reveals new susceptibility genes for schizophrenia
eLife 11:e70779.
https://doi.org/10.7554/eLife.70779

Further reading

    1. Computational and Systems Biology
    2. Neuroscience
    Kiri Choi, Won Kyu Kim, Changbong Hyeon
    Research Article

    The projection neurons (PNs), reconstructed from electron microscope (EM) images of the Drosophila olfactory system, offer a detailed view of neuronal anatomy, providing glimpses into information flow in the brain. About 150 uPNs constituting 58 glomeruli in the antennal lobe (AL) are bundled together in the axonal extension, routing the olfactory signal received at AL to mushroom body (MB) calyx and lateral horn (LH). Here we quantify the neuronal organization in terms of the inter-PN distances and examine its relationship with the odor types sensed by Drosophila. The homotypic uPNs that constitute glomeruli are tightly bundled and stereotyped in position throughout the neuropils, even though the glomerular PN organization in AL is no longer sustained in the higher brain center. Instead, odor-type dependent clusters consisting of multiple homotypes innervate the MB calyx and LH. Pheromone-encoding and hygro/thermo-sensing homotypes are spatially segregated in MB calyx, whereas two distinct clusters of food-related homotypes are found in LH in addition to the segregation of pheromone-encoding and hygro/thermo-sensing homotypes. We find that there are statistically significant associations between the spatial organization among a group of homotypic uPNs and certain stereotyped olfactory responses. Additionally, the signals from some of the tightly bundled homotypes converge to a specific group of lateral horn neurons (LHNs), which indicates that homotype (or odor type) specific integration of signals occurs at the synaptic interface between PNs and LHNs. Our findings suggest that before neural computation in the inner brain, some of the olfactory information are already encoded in the spatial organization of uPNs, illuminating that a certain degree of labeled-line strategy is at work in the Drosophila olfactory system.

    1. Computational and Systems Biology
    Christian A Pulver, Emine Celiker ... Fernando Montealegre-Z
    Research Article

    Early predator detection is a key component of the predator-prey arms race and has driven the evolution of multiple animal hearing systems. Katydids (Insecta) have sophisticated ears, each consisting of paired tympana on each foreleg that receive sound both externally, through the air, and internally via a narrowing ear canal running through the leg from an acoustic spiracle on the thorax. These ears are pressure-time difference receivers capable of sensitive and accurate directional hearing across a wide frequency range. Many katydid species have cuticular pinnae which form cavities around the outer tympanal surfaces, but their function is unknown. We investigated pinnal function in the katydid Copiphora gorgonensis by combining experimental biophysics and numerical modelling using 3D ear geometries. We found that the pinnae in C. gorgonensis do not assist in directional hearing for conspecific call frequencies, but instead act as ultrasound detectors. Pinnae induced large sound pressure gains (20–30 dB) that enhanced sound detection at high ultrasonic frequencies (>60 kHz), matching the echolocation range of co-occurring insectivorous gleaning bats. These findings were supported by behavioural and neural audiograms and pinnal cavity resonances from live specimens, and comparisons with the pinnal mechanics of sympatric katydid species, which together suggest that katydid pinnae primarily evolved for the enhanced detection of predatory bats.