Haplotype function score improves biological interpretation and cross-ancestry polygenic prediction of human complex traits

  1. Weichen Song (corresponding author)
  2. Yongyong Shi (corresponding author)
  3. Guan Ning Lin (corresponding author)
  1. Shanghai Mental Health Center, Shanghai Jiao Tong University School of Medicine, School of Bioengineering, Shanghai Jiao Tong University, China
  2. Bio-X Institutes, Key Laboratory for the Genetics of Developmental and Neuropsychiatric Disorders (Ministry of Education), Collaborative Innovation Center for Brain Science, Shanghai Jiao Tong University, China
  3. Biomedical Sciences Institute of Qingdao University (Qingdao Branch of SJTU Bio-X12 Institutes), Qingdao University, China


We propose a new framework for human genetic association studies: at each locus, a deep learning model (in this study, Sei) is used to calculate the functional genomic activity score for two haplotypes per individual. This score, defined as the Haplotype Function Score (HFS), replaces the original genotype in association studies. Applying the HFS framework to 14 complex traits in the UK Biobank, we identified 3619 independent HFS–trait associations with a significance of p < 5 × 10−8. Fine-mapping revealed 2699 causal associations, corresponding to a median increase of 63 causal findings per trait compared with single-nucleotide polymorphism (SNP)-based analysis. HFS-based enrichment analysis uncovered 727 pathway–trait associations and 153 tissue–trait associations with strong biological interpretability, including ‘circadian pathway-chronotype’ and ‘arachidonic acid-intelligence’. Lastly, we applied least absolute shrinkage and selection operator (LASSO) regression to integrate HFS prediction score with SNP-based polygenic risk scores, which showed an improvement of 16.1–39.8% in cross-ancestry polygenic prediction. We concluded that HFS is a promising strategy for understanding the genetic basis of human complex traits.

eLife assessment

This valuable paper presents a new approach for association testing, using the output of neural networks that have been trained to predict functional changes from DNA sequences. As such, the approach is an interesting addition to statistical genetics, and the evidence for the presented method being able to identify trait-associations in regions where GWASs are typically underpowered is solid. A limitation is, however, that it is unclear how the quality of these associations compares to those detected using conventional methods. Additional work assessing this method's power and characterizing false positives / false negative regions would be critical to ensure that the method is broadly adopted by the field.


eLife digest

Scattered throughout the human genome are variations in the genetic code that make individuals more or less likely to develop certain traits. To identify these variants, scientists carry out genome-wide association studies (GWAS), which compare the DNA variants of large groups of people with and without the trait of interest.

This method has been able to find the underlying genes for many human diseases, but it has limitations. For instance, some variations are linked together due to where they are positioned within DNA, which can result in GWAS falsely reporting associations between genetic variants and traits. This phenomenon, known as linkage disequilibrium, can be avoided by analyzing functional genomics, which looks at the multiple ways a gene’s activity can be influenced by a variation: for instance, how the gene is copied and decoded into proteins and RNA molecules, and the rate at which these products are generated.

Researchers can now use an artificial intelligence technique called deep learning to generate functional genomic data from a particular DNA sequence. Here, Song et al. used one of these deep learning models to calculate the functional genomics of haplotypes, groups of genetic variants inherited from one parent. The approach was applied to DNA samples from over 350 thousand individuals included in the UK Biobank. An activity score, defined as the haplotype function score (or HFS for short), was calculated for the two haplotypes of each individual, and then compared to various complex traits like height or bone density.

Song et al. found that the HFS framework was better at finding links between genes and specific traits than existing methods. It also provided more information on the biology that may be underpinning these outcomes. Although more work is needed to reduce the computer processing times required to calculate the HFS, Song et al. believe that their new method has the potential to improve the way researchers identify links between genes and human traits.


Genome-wide association studies (GWAS) have witnessed remarkable advancements over recent years, both in terms of sample size and genetic discovery. However, the elucidation of downstream mechanisms and subsequent applications still face certain limitations (Visscher et al., 2017). One caveat is that the statistical power of GWAS on a variant relies on its population frequency (Li et al., 2020; Null et al., 2022; Zhou et al., 2022), whereas most variants with large effect size are rare (Zeng et al., 2021), leading to insufficient discoveries. Moreover, linkage disequilibrium (LD) among neighboring variants can significantly inflate false positive results (Nowbandegani et al., 2022). The variability of LD structure among different populations further compounds the challenges associated with training predictive models and discovering causal genes. Lastly, most trait-relevant variants reside in non-coding regions (Watanabe et al., 2019), which lack direct functional annotations as coding variants. The prevalent approach to addressing this issue is to annotate each variant based on its location within functionally significant regions (Finucane et al., 2015; Grotzinger et al., 2022; Iotchkova et al., 2019; Weissbrod et al., 2020; Zheng et al., 2022), such as transcription factor-binding sites or enhancers. While this strategy has considerably advanced the analysis, it is not optimal, as a variant’s placement within a functionally important region does not inherently signify that the variant has substantial functional impacts.

The central dogma, which proposes that the effects of DNA alterations on phenotype are mediated via RNA and protein changes, offers a novel strategy to address these challenges. More precisely, replacing the original genotypes in association studies with the aggregated impact of variants on transcription or functional genomics should, under the central dogma, preserve the majority of genetic information. This ‘aggregated impact’ offers several benefits for GWAS analysis: it provides direct biological interpretations, bypasses the effects of LD and population genetic history, and amalgamates information from both common and rare variants. One successful implementation of this strategy is Polygenic Transcriptome Risk Scores (PTRS) (Hu et al., 2022; Liang et al., 2022), which employ genetically determined transcription levels rather than genotypes to predict complex traits and have achieved remarkable portability. Nonetheless, the accuracy of imputing transcription levels from genotypes, given the sample size of currently available cohorts such as the Genotype-Tissue Expression project, GTEx (Aguet et al., 2020), remains limited (R2 around 0.1 for most genes) (Barbeira et al., 2018). Thus, the performance of PTRS is yet to reach its optimal potential.

Following the success of PTRS, we took one step further and applied functional genomics in this strategy. Compared with transcription levels, genetically determined functional genomic levels have been predicted with much higher accuracy in multiple recent deep learning (DL) studies (Avsec et al., 2021; Chen et al., 2022; Kelley, 2020; Yan et al., 2021; Zhou et al., 2018). These DL models utilize segments of the human reference genome as training samples, substantially increasing the sample size. Furthermore, functional genomics serve as a mediator between DNA and transcription, thus lessening the influence of non-genic factors such as the environment. Given these advancements, we propose that using the outputs of one of the state-of-the-art DL models, Sei (Chen et al., 2022), as the ‘aggregated impact’ in this novel strategy could effectively address the aforementioned challenges. Sei accepts a DNA sequence and computes multiple sequence class scores that represent different facets of the functional genomic activities of that sequence. Each score integrates impacts from all variants, even those as rare as singletons, into one continuous variable, and is, in theory, unaffected by LD. In line with this notion, a recent similar strategy called cistrome-wide association study integrated variant–chromatin activity and variant–phenotype association to boost the power of genetic studies of cancer (Baca et al., 2022).

In this study, we present an analytical framework founded on this strategy (Figure 1) and implement it on complex traits in the UK Biobank to pinpoint causal loci and genes, decipher biological mechanisms, and devise cross-ancestry prediction models. We segmented the human reference genome into multiple 4096 bp loci, generated DNA sequences for each locus for two haplotypes per individual, and employed Sei to compute the functional genomic activities of these sequences. We designated this activity score as the Haplotype Function Score (HFS) and analyzed the association between the HFS and each trait. Our findings confirm that the HFS framework offers a unique improvement in the biological interpretation and polygenic prediction of complex traits compared to classic SNP-based methods, thereby demonstrating its value in genetic association studies.

Figure 1. Flowchart of the study.

Ind: individual.


Overview of genome-wide HFS

We used the HFS framework to analyze imputed genotype data from the UK Biobank (Figure 1). We segmented the human genome (hg38) into 617,378 discrete, non-overlapping loci, each 4096 base pairs long. Of these, 590,959 loci carried at least one non-reference haplotype in the UKB cohort (see Method and Supplementary file 1a). After quality control, these loci contained approximately 1.2 billion haplotypes, with a median count of 819 per locus (Figure 1—figure supplement 1). We then employed the DL framework, Sei (Chen et al., 2022), to compute sequence class scores for each haplotype. In its sequence mode, Sei accepts DNA sequences in fasta format and produces multiple distinct sequence class scores, 39 of which were included in our study (Method). Our analysis identified significant variation in sequence class scores across different loci. In fact, 49.7% of loci housed haplotypes whose sequence class (as defined by the maximum of the 39 sequence class scores) differed from the reference haplotype sequence class. Using the reference sequence class as a benchmark, we noted that 16.8% of loci showed a difference between the maximum and minimum haplotype scores that surpassed the score of the reference haplotype. Moreover, the correlation between sequence class scores of adjacent loci was low, with a median R2 value of 0.013 (Figure 1—figure supplement 2), effectively reducing the impact of LD in association studies. Further evaluation indicated that this low LD was driven by two factors: integration of rare variant impacts and segmentation. Firstly, excluding rare variants from HFS raised the median LD to 0.14 (Method; Figure 1—figure supplement 2C). Secondly, the median LD of SNPs from adjacent loci was 0.06, which was significantly higher than HFS LD (paired Wilcoxon p = 1.76 × 10−5) but significantly lower than HFS LD without rare variants (paired Wilcoxon p < 2.2 × 10−16).
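The segmentation and haplotype-sequence step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the helper names `segment_loci` and `apply_variants` and the toy reference sequence are hypothetical, a trailing partial window is simply dropped, and only single-nucleotide substitutions are handled (insertions and deletions change the sequence length and would need extra handling before the sequence is passed to a fixed-length model).

```python
LOCUS_LEN = 4096  # locus length used in the study


def segment_loci(chrom_length, locus_len=LOCUS_LEN):
    """Tile a chromosome with non-overlapping, 0-based half-open windows.

    Assumption for this sketch: a trailing partial window is dropped.
    """
    return [(start, start + locus_len)
            for start in range(0, chrom_length - locus_len + 1, locus_len)]


def apply_variants(ref_seq, locus_start, variants):
    """Build one haplotype sequence for a locus from the reference sequence.

    variants: iterable of (position, ref_allele, alt_allele), 0-based,
    single-nucleotide substitutions only (hypothetical input format).
    """
    seq = list(ref_seq)
    for pos, ref, alt in variants:
        offset = pos - locus_start
        if not 0 <= offset < len(seq):
            continue  # variant lies outside this locus
        if seq[offset] != ref:
            raise ValueError(f"reference allele mismatch at position {pos}")
        seq[offset] = alt
    return "".join(seq)


# Toy example: a 10,000 bp "chromosome" and one haplotype carrying one SNV.
loci = segment_loci(10000)
toy_reference = "ACGT" * 1024  # 4096 bp of made-up sequence
haplotype = apply_variants(toy_reference, 0, [(1, "C", "T")])
```

Each individual would contribute two such sequences per locus, one per phased haplotype, and identical sequences across individuals only need to be scored once.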

Expanding on the sequence class scores, we defined HFS for each locus. Specifically, we computed the mean sequence class score of the two haplotypes per individual, reflecting an additive model. We selected the score corresponding to the sequence class of the reference sequence as the HFS of the corresponding locus, and its association with each trait was computed using a generalized linear model. Simulation analysis revealed that when a non-reference sequence class score was associated with the trait, the reference class score could still capture a median of 70% of the HFS–trait association R2. We applied this framework to 14 polygenic traits in the UKB British ancestry training set (n = 350,587; Supplementary file 1b and Method), identifying 16,597 significant HFS–trait associations at a threshold of p < 5 × 10−8 (from n = 15 for insomnia to n = 7573 for height; Supplementary file 1b), equating to roughly 3619 independent associations. The most significant associations were between the ‘promoter’ score of chr7:121327898–121331994 (WNT16) and bone mineral density (BMD; regression beta = −0.02, p < 10−300), and the ‘promoter’ score of chr9:4760952–4765048 (AK3) and platelet count (beta = 3.20, p = 2.79 × 10−262; Supplementary file 1c).
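The HFS definition above (reference sequence class, then the mean of the two haplotype scores) can be illustrated with a small sketch. All function names and numbers here are hypothetical toy values, only three sequence classes are used instead of 39, and the association step is reduced to a single-predictor least-squares slope; the study itself used a generalized linear model, whose covariates are not described in this section.

```python
def reference_class(ref_scores):
    """Index of the sequence class with the highest score on the reference."""
    return max(range(len(ref_scores)), key=lambda k: ref_scores[k])


def haplotype_function_score(hap1_scores, hap2_scores, ref_class):
    """Additive model: mean of the two haplotype scores for the reference class."""
    return 0.5 * (hap1_scores[ref_class] + hap2_scores[ref_class])


def ols_slope(x, y):
    """Slope of a simple least-squares regression of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


# Toy data: three sequence classes, four individuals (made-up numbers).
ref_scores = [0.1, 2.0, 0.5]
ref_cls = reference_class(ref_scores)
individuals = [
    ([0.1, 2.0, 0.5], [0.1, 1.0, 0.5]),
    ([0.1, 2.0, 0.5], [0.1, 2.0, 0.5]),
    ([0.3, 1.0, 0.5], [0.2, 1.0, 0.4]),
    ([0.1, 3.0, 0.5], [0.1, 2.0, 0.5]),
]
hfs = [haplotype_function_score(h1, h2, ref_cls) for h1, h2 in individuals]
trait = [2.0 * value + 1.0 for value in hfs]  # synthetic trait, slope 2
beta = ols_slope(hfs, trait)
```

Using the reference class for every individual keeps the HFS of a locus on one common scale, which is what allows it to stand in for a genotype in a regression.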

When comparing HFS association with the standard SNP-based GWAS on the same data, we found that 98% of significant HFS loci also harbored a significant SNP. There were a few cases (n = 0–5) where significant HFS loci did not harbor even a marginal SNP association (GWAS p > 0.01), which was due to the lack of common SNPs in these loci. The HFS association p-value was higher than the GWAS p-value in 95% of significant loci, suggesting that HFS did not improve power to detect marginal effects. The genomic control inflation factor (λGC) for the HFS association test varied between 0.99 for asthma and 1.50 for height, closely resembling the SNP GWAS (Pearson correlation coefficient [PCC] = 0.91, paired t-test p = 0.16; Method and Figure 1—figure supplement 3). We concluded that HFS-based association tests had adequate power and did not introduce additional p-value inflation.
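For readers unfamiliar with the genomic control inflation factor, a minimal sketch of the standard definition follows: λGC is the median of the observed 1-degree-of-freedom chi-square statistics divided by the expected median under the null (about 0.455). The function name and the toy p-values are illustrative; the exact procedure used in the study is described in its Method section, not here.

```python
import statistics
from statistics import NormalDist

CHI2_1DF_MEDIAN = 0.4549364  # median of a chi-square distribution with 1 df


def lambda_gc_from_pvalues(p_values):
    """Genomic control inflation factor from two-sided p-values (0 < p <= 1)."""
    std_normal = NormalDist()
    # Convert each two-sided p-value to |z|, then to a 1-df chi-square statistic.
    chi2 = [std_normal.inv_cdf(p / 2.0) ** 2 for p in p_values]
    return statistics.median(chi2) / CHI2_1DF_MEDIAN


# Under the null the median p-value is 0.5, which gives lambda close to 1.
lam_null = lambda_gc_from_pvalues([0.2, 0.5, 0.9])
```

Values well above 1 indicate inflation of test statistics, which for highly polygenic traits such as height can reflect true polygenic signal as well as confounding.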

Fine-mapping based on HFS

Based on these data, we fine-mapped the causal loci associated with each of the 14 traits. We divided the hg38 genome into 1361 independent blocks as defined by MacDonald et al., 2022, and applied SUSIE (Wang et al., 2020) to the loci HFS in each of these blocks (number of loci per block = 4–2392). As shown in Figure 2 and Supplementary file 1d, we identified a total of 2699 causal loci–trait associations at the threshold of posterior inclusion probability (PIP) >0.95, hereafter referred to as ‘causal loci’. Compared with the SNP-based functionally aware fine-mapping methods PolyFun (Weissbrod et al., 2020) and SbayesRC (Zheng et al., 2022), HFS-based SUSIE detected −11 to 334 more causal signals (median = 63, Supplementary file 1e) for each trait. We caution that these methods use summary statistics as input and are by nature less sensitive than individual data-based methods. Yet we expect this impact to be mild, since we used an in-sample LD reference (from the UKB European sample).

Figure 2. Fine-mapping result summary.

Gray bar plots indicated the number of loci with posterior inclusion probability (PIP) >0.95 in Haplotype Function Score (HFS) + SUSIE (causal loci). Black bar plots indicated number of SNP with PIP >0.95 in PolyFun or SbayesRC analysis (the larger number was shown). Each grid of heatmap showed the odds ratio of each sequence class loci being causal loci for each trait. ‘All_OR’ indicated odds ratio for pooling all traits together. Enh: enhancer. TF: transcription factor-binding site.

Among these causal loci, only 22% were also lead loci in association analysis (loci with the lowest p-value in a 200 kb region), and 58% had an association p-value >5 × 10−8. In line with previous SNP-based analysis (Weissbrod et al., 2020), this result highlighted the importance of using causal signals instead of lead signals in post-GWAS analysis. We found 67 causal loci showing pleiotropic effects on at least two independent traits, including the ‘CTCF-Cohesin’ score of chr9:89596537–89600633, which was associated with age at menarche, body mass index (BMI), and height (PIP >0.97; Supplementary file 1d). We also found that rare variants played an important role in the good fine-mapping performance of HFS: when variants with MAF <0.01 were removed, 55.3% of the causal signals were missed in the HFS + SUSIE analysis.

When looking at the reference sequence class of loci, those with functional importance were more likely to be causal loci, including ‘Promoter’ (odds ratio [OR] = 2.33, p = 1.41 × 10−14), ‘Bivalent stem cell enhancer’ (OR = 2.22, p = 1.11 × 10−8), and ‘Transcribed region 1’ (OR = 1.71, p = 1.581 × 10−10, Figure 2). Such functional enrichment was even higher for pleiotropic loci (‘Promoter’: OR = 7.20, p = 3.35 × 10−5). We also observed trait-specific patterns of such sequence class enrichment, such as ‘CEBPB-binding site’ (Insomnia: OR = 5.25, p = 0.01) and ‘FOXA1/AR/ESR1-binding site’ (intelligence: OR = 4.69, p = 0.01, Figure 2 and Supplementary file 1f). These results demonstrated the expected functional patterns of causal loci, and indicated that HFS-based fine-mapping was biologically interpretable and reliable.
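The enrichment statistic reported above is an odds ratio from a 2 × 2 table (causal vs non-causal loci, inside vs outside a sequence class). A minimal sketch is shown below with made-up counts; the confidence interval uses Woolf's log-odds approximation, which is an assumption of this illustration and not necessarily the test the authors used.

```python
import math


def odds_ratio(a, b, c, d):
    """Odds ratio for a 2x2 table.

    a: causal loci in the sequence class      b: non-causal loci in the class
    c: causal loci outside the class          d: non-causal loci outside the class
    """
    return (a * d) / (b * c)


def woolf_confidence_interval(a, b, c, d, z=1.96):
    """Approximate 95% confidence interval on the odds ratio (Woolf method)."""
    log_or = math.log(odds_ratio(a, b, c, d))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return math.exp(log_or - z * se), math.exp(log_or + z * se)


# Made-up counts for illustration only.
or_value = odds_ratio(20, 80, 10, 90)
ci_low, ci_high = woolf_confidence_interval(20, 80, 10, 90)
```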

In addition to the functional enrichment, we applied several secondary analyses to verify the reliability of the HFS-based SUSIE result. Firstly, we took causal SNPs fine-mapped by PolyFun (Weissbrod et al., 2020) as positive controls, and found that, compared with genomic region-matched control loci, causal loci were significantly enriched for causal SNPs (OR = 1.33–5.08, Fisher’s test p = 0.12–4.72 × 10−52, Supplementary file 1e). Secondly, we calculated the heritability tagged by causal loci and PolyFun causal SNPs in an independent test set (defined as the R2 of linear regression; Method), and found that causal loci tagged 38–251% more heritability than causal SNPs (median = 151%; Supplementary file 1e). This was not an artifact of the larger number of causal loci, since the Akaike information criterion (AIC) was similar between causal loci and causal SNPs (paired t-test p = 0.36; Supplementary file 1e). Thirdly, for traits with sufficient causal loci coverage, we also applied Linkage Disequilibrium Score regression (LDSC) on independent GWAS summary statistics to evaluate heritability enrichment in causal loci. On average, causal loci showed 124-fold enrichment of heritability, significantly larger than genomic region-matched control loci (124- vs 101-fold; p = 0.0002, Method and Figure 2—figure supplement 1). Lastly, we applied simulation analysis and found that HFS + SUSIE showed similar advantages over SNP-based methods as in real data, with high accuracy and a low false discovery rate (FDR) (Supplementary materials).

We further applied a sliding-window analysis (step = 2048 bp, Method) to test whether the HFS-based result is robust against the choice of sequence interval. Of the causal loci (PIP >0.95) in the original analysis, 29.4% were still causal in the sliding-window analysis. For a further 31.1% and 29.3% of causal loci, the 5′ and 3′ overlapping locus, respectively, had PIP >0.95 in the sliding-window analysis, while the original locus itself was no longer causal. HFS + SUSIE was also robust to the predefined number of causal loci (L = 2–10): the number of detected loci did not change. Lastly, removing insertions and deletions revealed 9% more significant associations (p < 5 × 10−8) but 4.7% fewer causal associations (PIP >0.95), and slightly increased the inflation factor (Wilcoxon p = 0.0001, Figure 2—figure supplement 1). Taken together, HFS-based SUSIE is a powerful and robust strategy for individual data-based genetic fine-mapping.
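To make the sliding-window geometry concrete, the sketch below generates 4096 bp windows advanced by 2048 bp and returns the two half-overlapping neighbours of an original locus, i.e. its 5′ and 3′ overlapping loci. Coordinates are 0-based and half-open, and the function names are hypothetical.

```python
LOCUS_LEN = 4096  # locus length used in the study
STEP = 2048       # sliding-window step reported in the text


def sliding_windows(region_start, region_end, locus_len=LOCUS_LEN, step=STEP):
    """Fixed-length, 0-based half-open windows advanced by `step`."""
    return [(start, start + locus_len)
            for start in range(region_start, region_end - locus_len + 1, step)]


def overlapping_neighbours(locus, step=STEP):
    """Return the 5' and 3' half-overlapping windows of an original locus."""
    start, end = locus
    return (start - step, end - step), (start + step, end + step)


windows = sliding_windows(0, 12288)
five_prime, three_prime = overlapping_neighbours((4096, 8192))
```

With a step of half the locus length, the number of windows roughly doubles relative to the non-overlapping segmentation, and every original locus shares half of its sequence with each neighbour.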

Biological interpretation based on HFS

Pinpointing the causal loci of complex traits provides an opportunity to analyze their biological mechanisms. Thus, based on the HFS-based fine-mapping result, we applied a linear regression model to analyze the underlying pathways, cell types, and tissues of each complex trait. For each locus, we annotated its relevance to a pathway using the combined SNP to Gene (cS2G) strategy (Gazal et al., 2022), and regressed the PIP against this annotation, with a set of baseline annotations included as covariates, similar to the LDSC framework (Finucane et al., 2018) (Method). After p-value correction and recurrent pathway removal (Method), we detected a total of 727 pathway–trait associations (Figure 3A and Supplementary file 1g). The most significant associations were ‘megakaryocyte differentiation’ with platelet count (p = 2.26 × 10−34), ‘Insulin-like growth factor receptor signaling pathway’ and ‘Endochondral ossification’ with height (p = 4.95 × 10−33 and 1.17 × 10−27), ‘PD-1 signaling’ with allergic disease (p = 5.55 × 10−25), and ‘major histocompatibility complex pathway’ with asthma (p = 1.22 × 10−23). In fact, asthma and allergic disease were predominantly associated with more than 80 immune-related pathways. These associations were all in line with existing knowledge of trait mechanisms, and extended the understanding of their genetic basis. For example, PD-1 has recently been suggested as a potential target for allergic diseases like atopic dermatitis (Galván Morales et al., 2021), but this association has not been highlighted by previous genetic association studies.
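The enrichment regression described above (PIP regressed on a pathway annotation with baseline annotations as covariates, reporting a t statistic for the pathway term) can be sketched with ordinary least squares. This is an illustrative sketch only: the function names, the single baseline covariate, and the eight toy loci are all assumptions, and the authors' actual baseline annotations and p-value correction are described in their Method section.

```python
import math


def invert_matrix(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("design matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [vr - factor * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def ols_t_statistics(X, y):
    """Return (coefficients, t statistics) for y ~ X; X rows include the intercept."""
    n, p = len(X), len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
    xtx_inv = invert_matrix(xtx)
    beta = [sum(xtx_inv[i][j] * xty[j] for j in range(p)) for i in range(p)]
    residuals = [yi - sum(b * x for b, x in zip(beta, row)) for row, yi in zip(X, y)]
    sigma2 = sum(r * r for r in residuals) / (n - p)
    t_stats = [beta[i] / math.sqrt(sigma2 * xtx_inv[i][i]) for i in range(p)]
    return beta, t_stats


# Toy loci: columns are intercept, pathway annotation, one baseline annotation.
design = [[1.0, a, b] for a, b in
          [(0, 0), (0, 1), (1, 0), (1, 1), (0, 0), (0, 1), (1, 0), (1, 1)]]
pip = [0.15, 0.25, 0.55, 0.85, 0.05, 0.35, 0.65, 0.75]  # made-up PIP values
coefficients, t_values = ols_t_statistics(design, pip)
pathway_t = t_values[1]  # t statistic of the pathway term
```

Including the baseline annotations as covariates means that a pathway is only reported as enriched if its loci have higher PIP than expected from generic features such as being near a gene or in open chromatin.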

Figure 3. Biological enrichment analysis based on Haplotype Function Score (HFS) fine-mapping.

x-axis indicated t statistics of the analyzed term in a multivariate linear regression (Method). Cell: single-cell ATAC peak for 222 cell types from Zhang et al., 2021a. Tissue: active chromatin regions of 222 tissues from epimap (Boix et al., 2021). For each trait, we showed the most significant term plus one or two terms with high biological interpretation that also passed significance threshold. Full enrichment result is shown in Supplementary file 1g and Supplementary file 1h.

For other traits, the most significant associations also replicated known mechanisms, such as ‘osteoblast differentiation’, ‘Wnt ligand biogenesis and trafficking’ with BMD (p = 4.59 × 10−13 and 2.78 × 10−12); ‘circadian pathway’ with chronotype (p = 4.25 × 10−12); ‘calcium regulated exocytosis of neurotransmitter’, ‘Arachidonic acid metabolism’ with intelligence (p = 5.52 × 10−7 and 2.78 × 10−6); ‘GPCR pathway’ and ‘adipogenesis’ with BMI (p = 4.97 × 10−10 and 2.02 × 10−7) and ‘physiological cardiac muscle hypertrophy’ with systolic blood pressure (p = 6.32 × 10−11). We also highlighted less significant association which provided novel insights, such as ‘synaptic vesicle docking’ and ‘neuron migration’ with chronotype (p = 4.00 × 10−7 and 4.55 × 10−7), ‘Prostaglandins synthesis’ with insomnia (p = 5.30 × 10−9), ‘behavioral response to cocaine’ with alcohol intake (p = 3.39 × 10−8) and ‘roof of mouth development’ and ‘glycoside metabolism’ with forced vital capacity (FVC) (p = 2.19 × 10−12 and 5.73 × 10−11).

For cell type and tissue analysis (Figure 3B and Supplementary file 1h), we applied the same linear model to evaluate whether causal loci were enriched in active chromatin regions of each cell type or tissue (Method). We found 153 biologically interpretable associations with complex traits. For example, fetal megakaryocyte (p = 5.67 × 10−22) and child spleen (p = 2.15 × 10−13) were found to be the key cell type and tissue for platelet count. Systolic blood pressure was significantly associated with multiple heart and artery tissues and fetal cardiomyocyte (p < 1.63 × 10−5), whereas allergic disease was associated with multiple immune cells including natural killer, Treg, and B cells (p < 4.79 × 10−16). For brain-related traits, we found 21 significant associations, 14 of which were from the central nervous system. For example, adult hippocampus and cingulate gyrus were both linked to alcohol intake, smoking, and insomnia (p < 1.11 × 10−5), whereas chronotype was associated with embryonic brain germinal matrix (p < 8.68 × 10−6) and intelligence with embryonic neuron-derived stem cell (p < 6.89 × 10−7).

We also tried other modified strategies for this task but did not obtain satisfactory results. For example, using cS2G to link loci to gene lists specifically expressed in each cell type suffered from scRNA dataset batch effects, whereas a linear mixed model was less sensitive than the standard linear model (Supplementary Materials).

Taken together, our results suggest that fine-mapping based on HFS can pinpoint the causal pathways, cell types, and tissues underlying complex traits, and is valuable for the biological interpretation of genetic association studies.

Highlighted genes for complex traits

Enhanced power of fine-mapping and biological enrichment could reveal novel key genes for trait mechanism study. Below we integrated fine-mapping result and their functional annotation in several case studies to find causal signals and trait-relevant genes in regions not resolved by previous genetic association studies.

In our study, platelet count had a large number of causal loci (Figure 2) which showed significant functional enrichment (Figure 3). To find key loci and genes underlying platelet count, we focused on causal loci that overlapped with active regions in ‘fetal megakaryocyte’ and ‘child spleen tissue’, and applied cS2G (Gazal et al., 2022) to link them to two key pathways (‘megakaryocyte differentiation’ and ‘platelet morphogenesis’, Method and Figure 4A). We chose these annotations based on their p-values in the biological enrichment analysis in Figure 3. A total of 25 loci were highlighted (Figure 4A), which were recurrently linked to well-known platelet-regulating genes like MEF2C, SH2B3, FLI1, RUNX1, THPO, and NFE2. Among them we noticed a less-studied gene, RBBP5, a target of the key transcription factor MEF2C during megakaryopoiesis (Kong et al., 2019). Specifically, in the 1q32.1 region, HFS + SUSIE identified two loci with PIP >0.9 (Figure 4B). SNP-based association also found a significant association in this region, but SNP fine-mapping (Weissbrod et al., 2020) could not resolve this signal and only found seven signals between PIP = 0.1–0.5. This was unlikely to reflect statistical inflation, since the HFS-based association test p-value was actually higher than the SNP-based one (Figure 4—figure supplement 1). One of the causal loci, chr1:47401806–47405902 (PIP = 1), overlapped with spleen active chromatin and harbored a cCRE in megakaryocyte, and was linked to RBBP5 and three other genes. RBBP5 is known to be involved in megakaryocyte differentiation during megakaryopoiesis and to be regulated by MEF2C (Kong et al., 2019), but previous genetic association studies provided little evidence for its association with platelet count.

Figure 4. Haplotype Function Score (HFS) linked trait to causal genes.

(A) Target genes of causal loci identified by HFS + SUSIE for platelet count. Only genes that showed functional convergence are shown. (B) Regional plot for RBBP5. HFS: loci posterior inclusion probability (PIP) calculated by HFS + SUSIE. SNP: SNP PIP calculated by PolyFun. cCRE: candidate cis-regulatory element. (C) Regional plot of the major histocompatibility complex (MHC) region for asthma. Thickened curves link each highlighted causal locus to its target genes predicted by cS2G (Gazal et al., 2022).

The major histocompatibility complex (MHC) region has long been a challenge for genetic association studies due to its long-range LD, and is often excluded by fine-mapping tools. However, many disorders like schizophrenia (Sekar et al., 2016) and immune diseases (Nawijn et al., 2011) are robustly associated with the MHC region. In our HFS-based fine-mapping of asthma, we found that 15 loci within the MHC region had PIP >0.95, 11 of which overlapped with active chromatin regions in Treg or natural killer cells (Figure 4C and Supplementary file 1j). This result showed good discrimination between causal and non-causal loci: apart from these 15 likely causal loci, only six loci had PIP between 0.25 and 0.95. Since the MHC region harbors a large number of genes, these causal loci were linked to as many as 105 potential target genes, which hindered the discovery of true targets. We further filtered them based on their involvement in the ‘TNFR2-NFKB pathway’ and ‘innate lymphocyte [ILC] development’ pathways, since these pathways were most significantly associated with asthma (Figure 3), even after excluding the MHC region (p = 2.57 × 10−13 and 1.39 × 10−17). We found five genes (LTA, LTB, TNF, PSMB8, and PSMB9) that were predicted to be regulated by five causal loci that overlapped with active chromatin regions (Figure 4C), and which could be considered potential key genes for further validation.

Similarly, we fine-mapped MHC region for other allergic diseases (Figure 4—figure supplement 2 and Supplementary file 1j) and found potential key genes including HLA family and AGER. We also highlighted other gene–trait association not previously emphasized by GWAS, including GATA4 and NPPA (cardiac muscle hypertrophy) with SBP, ALOX5 (arachidonic acid metabolism) with intelligence and CRY1 (circadian pathway) with chronotype, as further discussed in Supplementary file 1k, l, m and supplementary information.

On the other hand, HFS performed worse than SNP-based fine-mapping in exonic regions. Taking height as an example, PolyFun detected 125 causal SNPs (PIP >0.95) in exonic regions, but only 16% (20) of the loci that harbored them also reached PIP >0.5 (11 reached PIP >0.95) in the HFS + SUSIE analysis. Among the 105 loci that missed such signals (HFS PIP <0.5), 12 had a nearby locus (within 10 kb) showing HFS PIP >0.95, which likely reflected false positives caused by LD. Thus, SNP-based analysis should be prioritized over HFS in coding regions.

HFS-based polygenic prediction

Lastly, we analyzed the potential of HFS for polygenic prediction. Compared with the state-of-the-art SNP-based polygenic risk score (PRS) algorithm LDAK-BOLT (Zhang et al., 2021b), HFS-based PRS (weighted by SUSIE posterior effect size) reached 47–90% of its R2 in the independent European test set (meta-analyzed proportion = 75.6%, 95% confidence interval = 75.3–75.8%, Figure 5—figure supplement 1). The gap between the performance of HFS- and SNP-based PRS reflected the fact that HFS only captured (the majority of) functional genomic alterations and missed the information of amino acid sequence and post-translational modification. We thus proposed that integrating information from HFS and SNP could provide better performance. Specifically, in the large European training set we trained the SNP PRS model with LDAK. Then, in a small tuning sample of the target ancestry, we calculated a per-block HFS prediction score of height (the sum of HFS within each block, weighted by SUSIE posterior effect size), and used machine learning to integrate these scores with the LDAK PRS into a final polygenic prediction score, hereafter referred to as ‘HFS + LDAK’. To choose a suitable machine-learning tool for this goal, we applied LASSO, ridge regression, and elastic net in the British European test set and compared the results (Figure 5B). They gave comparable results, differing in R2 by only around 0.01, and all of them were substantially better than simple linear regression. We chose LASSO as the algorithm in the formal analysis.
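The integration step can be illustrated with a small LASSO fitted by cyclic coordinate descent, in which one feature column stands for the SNP-based PRS and the remaining columns stand for per-block HFS prediction scores. This is a sketch under stated assumptions rather than the authors' implementation: the software and penalty selection they used are not given in this section, features are assumed to be centered, the objective is (1/2n)·||y − Xβ||² + λ·||β||₁, and the data are made up.

```python
def soft_threshold(value, lam):
    """Soft-thresholding operator used by the LASSO coordinate update."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=100):
    """Minimise (1/2n)*||y - X b||^2 + lam*||b||_1 by cyclic coordinate descent.

    Assumes centered features and outcome, so no intercept is fitted.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0       # covariance of feature j with the partial residual
            sum_sq = 0.0    # sum of squares of feature j
            for row, yi in zip(X, y):
                pred_others = sum(beta[k] * row[k] for k in range(p) if k != j)
                rho += row[j] * (yi - pred_others)
                sum_sq += row[j] ** 2
            if sum_sq == 0.0:
                beta[j] = 0.0  # constant-zero feature carries no information
                continue
            beta[j] = soft_threshold(rho / n, lam) / (sum_sq / n)
    return beta


# Toy tuning sample: column 0 = SNP-based PRS, column 1 = one per-block HFS score.
features = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
phenotype = [2.05, 1.95, -1.95, -2.05]
weights = lasso_coordinate_descent(features, phenotype, lam=0.1)
```

In this toy example the weak second feature is shrunk exactly to zero while the strong first feature is retained with a slightly reduced weight, which is the behaviour that makes LASSO suitable when only a subset of blocks carries predictive signal.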

Figure 5. Haplotype Function Score (HFS)-based polygenic prediction.

(A) Prediction R2 of HFS-based polygenic risk score (PRS) using different threshold of posterior inclusion probability (PIP). allSNP: SNP-based PRS calculated by LDAK-BOLT (Zhang et al., 2021b). n: number of features included in the corresponding PRS. (B) Prediction R2 of per-block HFS score in British European test set by different methods. EN: elastic net. (C) Prediction R2 of different tools in non-British European (NBE), South Asian (SAS), East Asian (EAS), and African (AFR) groups in UK Biobank.

Using height as a representative trait, we first estimated the proportion of variance captured by top loci, and found that the HFS of loci with PIP >0.4 (n = 5101) captured roughly 80% of the variance explained by all genome-wide loci (n = 1,200,024, corresponding to the sliding-window strategy; Figure 5A). We then calculated HFS + LDAK in the non-British European (NBE), South Asian (SAS), East Asian (EAS), and African (AFR) populations in UK Biobank, and observed 17.5%, 16.1%, 17.2%, and 39.8% improvement over LDAK alone (p = 3.21 × 10−16, 0.0001, 0.002, and 0.001, respectively; Figure 5C). As a comparison, we integrated LDAK with PolyFun-pred (Weissbrod et al., 2022) and SbayesRC (Zheng et al., 2022) using the Polypred framework (Weissbrod et al., 2022), but did not observe significant improvement over LDAK alone (difference in R2 < 0.01, p = 0.001–0.07, Figure 5C). Since PolyFun-pred + BOLT-LMM has been shown to significantly outperform BOLT-LMM alone (Weissbrod et al., 2022), we reasoned that the improvement of LDAK over BOLT-LMM might have attenuated the improvement brought about by PolyFun-pred, making it difficult to reach the significance threshold. Taken together, we concluded that HFS could bring about a mild but significant improvement to classic SNP-based PRS in the task of cross-ancestry polygenic prediction.


In this study, we designed the new HFS framework for genetic association analysis and demonstrated that it could improve on classic SNP-based analysis in terms of causal locus and gene identification, biological interpretation, and polygenic prediction. We suggest that HFS is a promising strategy for future genetic studies, although further progress in algorithms, computation, and data resources is still needed.

Compared with SNPs, HFS has several compelling features. First, LD between adjacent HFS is much lower than that between SNPs, which enhances the precision of statistical fine-mapping. False-positive variants caused by LD are expected to have little impact on functional genomics; their HFS would therefore be close to the reference and would not significantly influence downstream analysis. In line with these advantages, we showed that HFS-based fine-mapping had high statistical power, and that downstream enrichment analysis was capable of revealing biologically interpretable mechanisms. As a typical example, our finding that intelligence-associated loci are enriched in the arachidonic acid metabolism pathway is in line with the well-known role of polyunsaturated fatty acids in neurodevelopment (Helland et al., 2003), whereas previous GWAS provided little evidence for this association. Second, HFS could integrate the effects of all variants within a locus, regardless of their population frequency. Thus, HFS could capture information from rare variants overlooked by classic association studies and improve polygenic prediction, as shown by our results. In fact, the HFS framework could be directly extended to whole-genome sequencing data and capture all mutations, including those as rare as singletons, taking one step further toward filling in the ‘missing heritability’.

Despite its potential, the current HFS framework carries several drawbacks and necessitates significant enhancements. A key limitation is the substantial computational cost. In this study, the transformation phase of the genotype–haplotype sequence for UK Biobank SNP data required hundreds of thousands of CPU core hours. This computation cost would increase substantially when analyzing whole-genome sequencing data or employing a sliding-window strategy. A potential solution could involve developing a new algorithm that bypasses the variant calling stage and directly generates DNA sequences per locus from raw sequencing or SNP array data. For the sequence-to-HFS step, Sei (Chen et al., 2022) required about 1.8 GPU hours per one million sequences. Intriguingly, the majority of Sei’s output is unused in the HFS framework, since Sei predicts over 20,000 functional genomic features, while the HFS only represents one of their integrated scores. Future development of novel DL models that predict functional genomics in a manner more fitting to the HFS framework could considerably reduce computation costs. Lastly, it is currently unfeasible to incorporate all genome-wide HFS into a single LASSO model. This limitation forced us to first integrate HFS into per-locus scores, which inevitably sacrificed accuracy.

Another hurdle arises in integrating HFS with other genomic features. Intrinsically, HFS captures only the variant effect mediated by functional genomics, while a genetic variant might also influence amino acids, post-transcriptional modifications (PTMs) (Park et al., 2021), and 3D chromosomal structures (Zhou, 2022). Therefore, HFS alone cannot wholly replace SNP without any loss, as our results demonstrate that the HFS-based prediction model captured approximately 70% of the variance explainable by the SNP-based prediction model. One potential solution is to extend the concept of HFS, applying DL to quantify the genetically determined values of PTMs, protein biochemical properties (Pejaver et al., 2020), and protein and chromosomal structures, potentially employing AlphaFold (Jumper et al., 2021)-derived features (Liu et al., 2022). Analyzing HFS in conjunction with these multi-modal function scores could provide a comprehensive depiction of the genetic architecture of complex traits. However, the colossal computational cost is currently prohibitive. As a compromise, we simply performed joint analysis of HFS with SNP PRS in our prediction model analysis. This approach is far from optimal, as it led to only moderate improvement and did not enhance fine-mapping and biological enrichment analysis.

The challenge of using sequence-based DL models in HFS applications is further compounded by their difficulty in predicting variations between individuals. Recent studies (Huang et al., 2023; Sasse et al., 2023) indicate that DL models, trained on the reference human genome, demonstrate limited accuracy in predicting gene expression levels across different individuals. This limitation is likely due to the models' inability to account for long-range regulatory patterns, which are crucial for understanding the impact of variants on gene expression and vary across genes. In contrast, our study leveraged sequence-determined functional genomic profiles in association studies, which mitigates this issue to an extent. For instance, although Sei cannot identify the specific gene regulated by a given input sequence, it can predict changes in the sequence’s functional activity. Future improvements in DL models' ability to predict interindividual differences could be achieved by incorporating cross-individual data in the training process. An example of such data is the EN-TEX (Rozowsky et al., 2023) dataset, which aligns functional genomic peaks with the specific individuals and haplotypes they correspond to.

In summary, our results demonstrate that incorporating HFS to represent genetically determined functional genomic activities in genetic association studies offers robust improvements in both the biological interpretation and polygenic prediction of complex traits. Thus, the application of the HFS framework in future genetic association studies holds considerable promise.


Sample description

This study analyzed UK Biobank data under application ID 84436 and adhered to the ethics and privacy policy of UK Biobank. We included only participants with array-imputed genotype data in bgen format that passed UKB quality control, and removed related individuals. We randomly selected 350,587 self-identified British ancestry Caucasians as the training sample. The remaining participants were grouped according to their ancestry, and the non-British European, South Asian, East Asian, and African groups served as test samples.

All phenotypes analyzed (Supplementary file 1b) were collected from UKB table browser, which came from self-report or physical measurement. Phenotypes were first adjusted by age, sex, top 10 principal components, Townsend index, and genotype array quality metrics by linear regression. We then applied inverse-normal transformation on the residuals. Binary phenotypes were adjusted in the same way except by generalized linear regression.
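The adjustment of quantitative phenotypes described above can be sketched as follows. This is a minimal illustration rather than the code used in the study: the function name `adjust_and_normalize`, the tie handling, and the Blom-type rank offset (3/8) are assumptions, since the text specifies only linear regression on the covariates followed by inverse-normal transformation of the residuals.

```python
import numpy as np
from statistics import NormalDist


def adjust_and_normalize(trait, covariates):
    """Residualize a quantitative trait on covariates, then apply a
    rank-based inverse-normal transformation to the residuals."""
    trait = np.asarray(trait, dtype=float)
    covariates = np.asarray(covariates, dtype=float)
    # Design matrix with an intercept column.
    X = np.column_stack([np.ones(len(trait)), covariates])
    beta, *_ = np.linalg.lstsq(X, trait, rcond=None)
    residuals = trait - X @ beta
    # Ranks 1..n; ties are broken by position (an assumption of this sketch).
    ranks = np.argsort(np.argsort(residuals)) + 1
    n = len(residuals)
    # Blom-type offset (3/8) is an assumption; the text does not state one.
    quantiles = (ranks - 0.375) / (n + 0.25)
    inv_cdf = NormalDist().inv_cdf
    return np.array([inv_cdf(float(q)) for q in quantiles])
```

Binary phenotypes would need a generalized linear model in place of the least-squares step, which is not shown here.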

Genotype data processing

We first segmented the hg38 genome into 4096 bp loci. To do so, we downloaded chromatin state annotations of 222 human tissues at different developmental stages (embryo, newborn, and adult) from the epimap (Boix et al., 2021) database. For each tissue, all chromosomal regions annotated as ‘transcription start site (TSS), transcription region (TX), enhancer, promoter’ in at least half of the samples were marked as active regions. The union of active regions across all tissues was taken, and regions annotated as genomic gaps (centromere, ambiguous base pairs, etc.) in the hg38 genome were removed. Then, for each active region shorter than 4096 bp, the locus was defined as a 4096 bp area centered on the active region. For each active region longer than 4096 bp, 4096 bp loci were delineated successively from the midpoint outward. Finally, non-overlapping 4096 bp blocks were used to cover the remaining genomic regions. This resulted in 617,378 genomic regions in total. In the sliding-window analysis, all these blocks were shifted 2048 bp toward the 5′ end, generating another 617,378 blocks. We repeated the fine-mapping analysis and applied polygenic analysis on these combined blocks, using height as a representative trait.
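The locus definition for a single active region can be illustrated with the sketch below. It assumes half-open coordinates on the forward strand and ignores chromosome ends and genomic gaps; the function names `loci_for_active_region` and `shift_loci` are hypothetical, and the treatment of a region of exactly 4096 bp is an assumption.

```python
LOCUS_LEN = 4096


def loci_for_active_region(start, end, locus_len=LOCUS_LEN):
    """Return half-open (start, end) loci covering one active region.

    Short regions get a single locus centered on the region; longer
    regions are tiled outward from the midpoint in both directions.
    """
    mid = (start + end) // 2
    if end - start <= locus_len:
        left = mid - locus_len // 2
        return [(left, left + locus_len)]
    loci = []
    pos = mid
    while pos < end:  # tile toward the 3' side
        loci.append((pos, pos + locus_len))
        pos += locus_len
    pos = mid
    while pos > start:  # tile toward the 5' side
        loci.append((pos - locus_len, pos))
        pos -= locus_len
    return sorted(loci)


def shift_loci(loci, offset=2048):
    """Sliding-window variant: shift every locus toward the 5' end."""
    return [(s - offset, e - offset) for s, e in loci]
```

For example, an active region spanning 10,000–11,000 yields the single locus (8452, 12548), and shifting it gives (6404, 10500).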

For each locus, we obtained the IDs of variants within the locus with bedtools (Quinlan and Hall, 2010), extracted genotypes from the UKB .bgen files with bgenix, and then used Plink (Purcell et al., 2007) to remove all variants with INFO <0.8, Hardy–Weinberg p < 10−6, allele count <10, or missing rate >10%, and to remove individuals missing more than 10% of the retained variants in the locus. The output vcf file was lifted over to hg38 with Crossmap (Zhao et al., 2014) and phased with SHAPEIT4 (Delaneau et al., 2019). The phased vcf was transformed to .haps format by Plink, which in turn gave rise to two files: a vcf file containing information on each haplotype, and an n × 2 matrix in plain text that recorded the IDs of the two haplotypes per individual.

HFS calculation

Several DL models have been developed to predict functional genomic profiles from DNA sequence (Avsec et al., 2021; Chen et al., 2022; Kelley, 2020; Yan et al., 2021; Zhou et al., 2018). Among them, we chose Sei (Chen et al., 2022) to calculate HFS for the following reasons: (1) the required input length (4096 bp) is moderate; (2) it represents 21,906 functional genomic tracks, more comprehensive than other models; (3) it integrates information from the entire sequence, not only the few bp at the center. For each haplotype at each locus, we generated its corresponding DNA sequence with the bcftools (Danecek et al., 2021) consensus option. At each locus, the start point of each sequence was matched to the start point of the reference sequence. When insertion variants made the sequence longer than 4096 bp, we discarded base pairs at the 3′ end. Likewise, when deletion variants made it shorter, we added N to the 3′ end. We applied Sei to predict 21,906 functional genomic tracks for each sequence, without normalizing for histone marks (dividing each track score by the sum of histone mark scores) as suggested by the Sei author. We then used the projection matrix provided by Sei to calculate forty sequence class scores, which can be regarded as weighted sums of these tracks and represent different aspects of functional genomic activity. We discarded the last score (heterochromatin 6 [centromere]), since its proportion is very low and it is functionally trivial, leaving 39 scores per haplotype.

For each individual, we took the mean of each sequence class score across the two haplotypes, corresponding to an additive model. For HFS LD calculation, we extracted the mean value of the sequence class score corresponding to the reference sequence class of adjacent loci, and calculated the R2 value between them. The sequence class score of the reference sequence class was defined as the HFS for this locus, and was used for downstream trait association analysis.
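The post-processing around the Sei prediction step can be sketched as below. The Sei model call itself is not shown. The sketch assumes that sequence class scores are already available as an array of shape (individuals, 2 haplotypes, classes), and it takes the reference sequence class to be the class with the highest score on the reference sequence, following the definition given in the fine-mapping section. All function names are hypothetical.

```python
import numpy as np

LOCUS_LEN = 4096


def fix_length(seq, target=LOCUS_LEN):
    """Trim the 3' end after insertions; pad it with N after deletions."""
    if len(seq) >= target:
        return seq[:target]
    return seq + "N" * (target - len(seq))


def locus_hfs(hap_scores, ref_scores):
    """Per-individual HFS at one locus under an additive model.

    hap_scores: array (n_individuals, 2, n_classes) of sequence class scores.
    ref_scores: array (n_classes,) for the reference sequence.
    """
    hap_scores = np.asarray(hap_scores, dtype=float)
    ref_class = int(np.argmax(ref_scores))
    additive = hap_scores.mean(axis=1)  # mean of the two haplotypes
    return additive[:, ref_class], ref_class


def hfs_ld_r2(hfs_a, hfs_b):
    """LD between two adjacent loci as the squared Pearson correlation."""
    r = np.corrcoef(hfs_a, hfs_b)[0, 1]
    return float(r * r)
```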

HFS–trait association

For each locus, we calculated the association between the trait-specific HFS and the adjusted, normalized trait value by linear regression without any covariates (because all selected covariates had already been adjusted for at the normalization step). For uniformity, we set the significance threshold at p < 5 × 10−8, even though it was over-stringent for n = 590,959 loci. Among significant associations, we defined an independent association as the locus with the lowest p-value within a 200 kb region. As a positive control, we applied quantitative and binary GWAS with REGENIE (Mbatchou et al., 2021), using default settings and the same British training sample. The main difference was that we used raw trait values in REGENIE and provided the same covariates. We calculated the genomic control inflation factor, λGC, from the median of χ2 statistics, separately for the HFS association test and for GWAS; for GWAS, only SNPs in the HapMap3 (Altshuler et al., 2010) project were included. We compared λGC between HFS and SNP by Pearson correlation analysis and paired t-test.
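A minimal sketch of the per-locus association test and its summary statistics is given below. Several details are assumptions rather than statements from the text: the p-value uses a normal approximation (reasonable at biobank sample sizes), λGC is scaled by the expected median of a 1-df χ2 distribution (about 0.4549) as is conventional, and independent associations are selected greedily by p-value with a 200 kb exclusion distance. Function names are hypothetical.

```python
import math

import numpy as np


def hfs_assoc(hfs, trait):
    """Simple linear regression of trait on HFS; returns (beta, se, p)."""
    hfs = np.asarray(hfs, dtype=float)
    trait = np.asarray(trait, dtype=float)
    n = len(trait)
    x = hfs - hfs.mean()
    y = trait - trait.mean()
    sxx = float(x @ x)  # zero for a monomorphic locus; not handled here
    beta = float(x @ y) / sxx
    resid = y - beta * x
    se = math.sqrt(float(resid @ resid) / (n - 2) / sxx)
    # Two-sided p-value under a normal approximation (assumption).
    p = math.erfc(abs(beta / se) / math.sqrt(2))
    return beta, se, p


def lambda_gc(chi2):
    """Median chi-square divided by its expectation under the null (1 df)."""
    return float(np.median(chi2)) / 0.4549364


def independent_loci(hits, window=200_000):
    """Greedy selection: keep the lowest-p locus, drop neighbors within window.

    hits: iterable of (chrom, position, p) for significant loci.
    """
    kept = []
    for chrom, pos, p in sorted(hits, key=lambda h: h[2]):
        if all(c != chrom or abs(pos - q) > window for c, q, _ in kept):
            kept.append((chrom, pos, p))
    return kept
```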

Fine-mapping analysis

We divided the hg38 genome into 1361 independent blocks as defined by MacDonald et al., 2022, and applied SUSIE to the HFS of all loci within each block, separately for each trait (parameters: maximum number of causal signals = 10, coverage = 0.95). We subtracted the reference HFS value for each locus prior to analysis, such that a homozygous reference haplotype corresponded to HFS = 0. To avoid the influence of Sei prediction noise, we rounded the HFS value to two decimal places. This is because, even if a variant actually has no impact on functional genomics, Sei would still output a value that is close to, but not equal to, the reference sequence class score. The rounding procedure sets such HFS to zero and removes the random noise from Sei. Loci whose HFS had PIP >0.95 were defined as causal loci, and loci that had causal associations with multiple traits were defined as pleiotropic loci. As a positive control, we applied PolyFun (Weissbrod et al., 2020) and SbayesRC (Zheng et al., 2022) to the GWAS summary statistics generated by REGENIE on the same training set, and extracted the reported PIP to define causal SNPs.
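The preprocessing before fine-mapping and the definitions of causal and pleiotropic loci can be illustrated as follows. The SUSIE fit itself is not shown; the sketch assumes PIPs are supplied as a nested mapping from trait to locus, and the function names are hypothetical.

```python
from collections import Counter

import numpy as np


def center_and_round(hfs, ref_hfs, decimals=2):
    """Subtract the reference HFS, then round so that near-reference
    prediction noise collapses to exactly zero."""
    return np.round(np.asarray(hfs, dtype=float) - ref_hfs, decimals)


def causal_and_pleiotropic(pip_by_trait, threshold=0.95):
    """pip_by_trait: {trait: {locus: PIP}}.

    Returns the causal loci per trait (PIP above threshold) and the set
    of loci that are causal for more than one trait.
    """
    causal = {
        trait: {locus for locus, pip in pips.items() if pip > threshold}
        for trait, pips in pip_by_trait.items()
    }
    counts = Counter(locus for loci in causal.values() for locus in loci)
    pleiotropic = {locus for locus, k in counts.items() if k > 1}
    return causal, pleiotropic
```

With a reference HFS of 1.23, for instance, raw values of 1.234 and 1.2301 both become 0, while 1.30 becomes 0.07.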

To analyze the functional characteristics of causal loci, we first defined the sequence class of each locus by the maximum sequence class score of the reference haplotype. We then tested whether each sequence class contained an excess of causal loci for each trait by Fisher’s test. For each causal locus, we also defined a ‘control’ locus as the nearest locus that matched the p-value of this causal locus, and tested whether causal loci carried more PolyFun causal SNPs than control loci by Fisher’s test. Furthermore, for traits whose causal loci covered >0.1% of genome-wide SNPs, we applied LDSC (Finucane et al., 2015) to quantify the heritability enrichment in causal and control loci, and compared their difference by the jackknife method. To avoid winner’s curse, we used external GWAS summary statistics for this analysis (Mikaelsdottir et al., 2021; Yengo et al., 2022). As an alternative method to quantify the heritability captured by causal loci, we ran multivariate linear regression in the independent British test set, where the HFS of causal loci were the independent variables and the trait value was the dependent variable, and calculated the R2 and AIC. We applied the same analysis to causal SNPs, and compared the AIC between the HFS and SNP multivariate regressions.

Functional enrichment analysis

Similar to the idea of LDSC (Finucane et al., 2015), we first generated a series of baseline annotation of each locus, then tested whether locus PIP was associated with functional annotations after controlling the impact of these baseline annotations. Specifically, we defined the following baseline annotations:

  1. Number of haplotypes, range of the HFS distribution across all haplotypes (scaled by reference HFS), and the 39 sequence class scores of the reference haplotype.

  2. Genomic regions of conserved base, high Phastcons score (Siepel et al., 2005) in mammals, primates and vertebrate, exon, intron, untranslated regions at 3′ and 5′ and 200 bp flanking regions of TSS. We used bedtools intersect -f 0.1 option to annotate each locus by these annotations.

  3. Maximum B statistics (McVicker et al., 2009), minimum allele age, and ASMCavg (Palamara et al., 2018) of all variants within this locus.

Type 2 and 3 annotations were directly obtained from LDSC (Finucane et al., 2015) baseline annotations. We did not include annotations related to functional genomics, since 39 sequence class scores were used to capture functional genomic characteristics. Conditioned on these baseline annotations, we analyzed the enrichment of PIP in the following functional annotations:

  1. Biological pathways: We downloaded all pathways from MsigDB (Subramanian et al., 2005), C2: canonical pathways category (including Reactome (Fabregat et al., 2018), Pathway Interaction Database (PID) (Schaefer et al., 2009), Biocarta and Wikipathway) and C6: Gene ontology (Ashburner et al., 2000) (biological process) category. We retained only pathways with >5 and <500 genes. We generated a gene × pathway binary matrix and applied hierarchical clustering so that similar pathways were placed close to each other. We sequentially compared adjacent pathways, and removed the smaller one if the fraction of overlap was >30%. A total of 3219 pathways were retained. We then linked each locus to these pathways by the cS2G (Gazal et al., 2022) strategy. Specifically, a locus L would be annotated as 1 for pathway P only if L contained a SNP that was linked to P with cS2G score >0.5.

  2. Tissue-specific chromatin activity: We downloaded chromHMM (Ernst and Kellis, 2012) chromatin state annotation for 833 samples from epimap (Boix et al., 2021), and grouped them according to developmental stages and second-level tissue types. For each group, all chromosomal regions annotated as ‘transcription start site (TSS), transcription region (TX), enhancer, promoter’ in at least half of the samples were marked as active regions. We used bedtools intersect -f 0.1 option to annotate whether each locus was active in each tissue.

  3. Cell type-specific open chromatin regions: We downloaded scATAC-seq peak data from Zhang et al., 2021a, and annotated each locus by bedtools intersect -f 0.1 option.

We applied multivariate linear regression of PIP against the baseline annotations plus one of the functional annotations. Regression coefficient >0 and Bonferroni-adjusted regression p-value <0.05 were used as the significance threshold. From the final results, we manually removed those pathways and cell types that reached the significance threshold in more than half of the traits, since these pathways likely reflected unrecognized confounders.
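The enrichment test for a single functional annotation can be sketched as an ordinary least-squares fit, as below. This is an illustration under assumptions: the p-value uses a normal approximation to the t statistic, the Bonferroni factor `n_tests` is supplied by the caller, and the function names are hypothetical.

```python
import math

import numpy as np


def annotation_enrichment(pip, baseline, annotation):
    """Regress locus PIP on baseline annotations plus one functional
    annotation; return the annotation's coefficient and two-sided p-value."""
    pip = np.asarray(pip, dtype=float)
    X = np.column_stack([np.ones(len(pip)), baseline, annotation])
    beta, *_ = np.linalg.lstsq(X, pip, rcond=None)
    resid = pip - X @ beta
    dof = X.shape[0] - X.shape[1]
    sigma2 = float(resid @ resid) / dof
    cov = sigma2 * np.linalg.inv(X.T @ X)
    coef = float(beta[-1])
    se = math.sqrt(float(cov[-1, -1]))
    # Normal approximation to the t distribution (assumption).
    p = math.erfc(abs(coef / se) / math.sqrt(2))
    return coef, p


def is_significant(coef, p, n_tests, alpha=0.05):
    """Positive coefficient and Bonferroni-adjusted p-value below alpha."""
    return bool(coef > 0 and min(1.0, p * n_tests) < alpha)
```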

Polygenic prediction

We used the posterior effect sizes estimated by SUSIE under the sliding-window strategy (doubling the number of loci) as weights, calculated the weighted sum of HFS as the PRS of each trait, and calculated R2 in the independent British test sample with simple linear regression. As a positive control, we applied the LDAK-BOLT (Zhang et al., 2021b) algorithm to the SNP array data (about seven hundred thousand variants) with tenfold cross-validation and max iteration = 200 in the same training sample, and calculated SNP-based PRS with the output SNP weights. Normalized trait values were analyzed, without any covariates provided. Array data were filtered by Plink with the options --geno 0.1 --hwe 1e-6 --mac 100 --maf 0.01 --mind 0.1.
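The HFS-based PRS is a weighted sum, and the R2 of a simple linear regression equals the squared Pearson correlation between score and trait. A minimal sketch follows, with an optional PIP filter of the kind used for Figure 5A; the function names and the array layout (individuals × loci) are assumptions.

```python
import numpy as np


def hfs_prs(hfs, weights, pip=None, pip_threshold=None):
    """Weighted sum of HFS across loci for each individual.

    hfs: array (n_individuals, n_loci); weights: posterior effect sizes.
    If pip and pip_threshold are given, only loci with PIP above the
    threshold contribute.
    """
    hfs = np.asarray(hfs, dtype=float)
    weights = np.asarray(weights, dtype=float)
    if pip is not None and pip_threshold is not None:
        keep = np.asarray(pip, dtype=float) > pip_threshold
        return hfs[:, keep] @ weights[keep]
    return hfs @ weights


def prediction_r2(score, trait):
    """R2 of the simple linear regression of trait on score."""
    r = np.corrcoef(score, trait)[0, 1]
    return float(r * r)
```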

To train the refined model that predicts height, we first calculated the per-block HFS-based prediction score of height as the weighted sum of HFS within each block. Then, within each target ancestry group (non-British European (NBE), South Asian (SAS), East Asian (EAS), and African (AFR) participants in UK Biobank), we randomly selected half of the participants as the tuning sample and half as the test sample. In the tuning sample, we applied LASSO regression that included both the LDAK PRS and the genome-wide per-block HFS scores (1361 in total). The choice of LASSO regression was based on a comparison in the British European test set (Figure 5B), where LASSO, ridge, and elastic net gave similar results and LASSO was slightly better. In the tuning sample of the target ancestry, LASSO estimated the weights to combine the per-block HFS scores and the LDAK PRS. We calculated the final prediction score in the test sample using these weights, and evaluated its prediction by linear regression R2. Since the outcome (height) had already been adjusted and standardized, no covariates were included in this step. Additionally, we applied PolyFun-pred (Weissbrod et al., 2022) and SbayesRC (Zheng et al., 2022) to the summary statistics of height (calculated by REGENIE in the same training sample), and integrated their effect sizes with the LDAK weights in the tuning sample using the Polypred (Weissbrod et al., 2022) method. PRS for LDAK, LDAK + PolyFun, and LDAK + SbayesRC were calculated with the plink score option, excluding variants with INFO <0.8, Hardy–Weinberg p < 10−6, allele count <2, or missing rate >10% in the target test set.
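The integration step can be illustrated with a small coordinate-descent LASSO written in NumPy so that the example is self-contained. This is a sketch, not the software used in the study: the text does not name a LASSO implementation or how the penalty was chosen, so the function `lasso_cd` and the penalty `alpha` are assumptions, and in practice an established package with cross-validated penalty selection would be the natural choice.

```python
import numpy as np


def lasso_cd(X, y, alpha, n_iter=200):
    """Minimize (1/2n)*||y - b0 - X b||^2 + alpha*||b||_1 by cyclic
    coordinate descent. Returns (intercept, coefficients)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    col_sq = (Xc ** 2).sum(axis=0) / n
    b = np.zeros(p)
    resid = yc.copy()  # residual for the current coefficients
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0:
                continue
            # Correlation of feature j with the partial residual.
            rho = Xc[:, j] @ resid / n + col_sq[j] * b[j]
            # Soft-thresholding update.
            new = np.sign(rho) * max(abs(rho) - alpha, 0.0) / col_sq[j]
            if new != b[j]:
                resid += Xc[:, j] * (b[j] - new)
                b[j] = new
    return y_mean - x_mean @ b, b
```

In this setting the feature matrix would hold the LDAK PRS in one column and the per-block HFS scores in the remaining columns; the weights are fitted in the tuning half and applied unchanged to the test half, where R2 is evaluated.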

Simulation analysis

We simulated trait levels using HFS data from chromosome 1 in 50,000 randomly selected samples from the UKB EUR training data. We randomly selected 1% (500) of loci, assigned effect sizes drawn from a standard normal distribution, and calculated the aggregated genetic liability. We then simulated trait levels with h2 = 0.1. We applied HFS + SUSIE as well as REGENIE + PolyFun to the simulated traits and calculated the area under the curve (AUC) and the FDR at PIP >0.95 for HFS + SUSIE. We repeated this procedure 30 times.

On average, HFS + SUSIE showed high accuracy in identifying causal loci (median AUC = 0.92), and the median FDR at PIP >0.95 was 0.059. In line with the real data analysis, the number of causal loci identified by HFS + SUSIE was 1.12-fold more than that identified by PolyFun on average. Furthermore, HFS + SUSIE showed good discrimination between causal and non-causal loci: the number of loci with PIP >0.95 was larger than the number of loci with 0.5 < PIP < 0.95.
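The simulation design can be sketched as follows. The scaling of the genetic liability to a variance of h2 with independent normal noise of variance 1 − h2 is one common way to fix heritability and is an assumption here, as are the function names; the fine-mapping step that produces the PIPs is not shown.

```python
import numpy as np


def simulate_trait(hfs, h2=0.1, causal_frac=0.01, seed=0):
    """Simulate a trait from an HFS matrix (n_individuals, n_loci).

    Returns (trait, genetic_component, sorted causal locus indices).
    """
    rng = np.random.default_rng(seed)
    hfs = np.asarray(hfs, dtype=float)
    n, m = hfs.shape
    k = max(1, round(causal_frac * m))
    causal = rng.choice(m, size=k, replace=False)
    effects = rng.standard_normal(k)
    liability = hfs[:, causal] @ effects
    # Rescale so the genetic component has variance h2.
    genetic = (liability - liability.mean()) / liability.std() * np.sqrt(h2)
    noise = rng.standard_normal(n) * np.sqrt(1.0 - h2)
    return genetic + noise, genetic, np.sort(causal)


def fdr_at_pip(pip, causal_idx, threshold=0.95):
    """Fraction of loci called at PIP > threshold that are not causal."""
    called = np.flatnonzero(np.asarray(pip, dtype=float) > threshold)
    if called.size == 0:
        return 0.0
    return np.setdiff1d(called, causal_idx).size / called.size
```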

Alternative strategy on biological enrichment analysis

In addition to the standard linear regression applied in the main text, we also applied a linear mixed regression that took independent blocks as a random effect. For each regression, we included one biological term plus all baseline annotations. The regression coefficient and p-value of each biological term were estimated with the mgcv R package. After p-value correction, most of the significant terms were those that recurred in more than half of the traits, which were considered artifacts of hidden covariates. After removing these recurrent terms, fewer than five significant terms remained for each trait. We concluded that linear mixed regression was less sensitive than standard linear regression for identifying trait-specific biological associations.

We also tried another strategy for cell type-specific analysis. We first downloaded the C8 category from MsigDB, which contains gene lists specifically expressed in about 800 cell types, derived from multiple single-cell RNA sequencing studies. We then linked each locus to these gene lists by the cS2G method and applied linear regression, similar to the pathway analysis. We found that most traits were predominantly linked to nearly all cell types from a specific study, which indicated a study batch effect rather than biological function. For example, smoking was associated with all neuron subtypes, pericytes, and immune cells from one brain scRNA dataset, but did not show association with immune cells and pericytes from other scRNA studies. We reasoned that the curated cell type-specific gene lists contained batch effects that had not yet been corrected. Thus, in the main text, we reported associations between PIP and single-cell ATAC peaks from one study, which reduced the batch effect.

Highlighted genes for complex traits

For chronotype, we found one circadian gene, CRY1, that was predicted to be the target of locus chr12:107093022–107097118, which had PIP = 0.56. This locus was active in the cingulate gyrus and belonged to the sequence class ‘enhancer-multi tissue’. CRY1 is known to participate in the circadian pathway but was not highlighted by previous GWAS. SNP-based fine-mapping also found no SNP with PIP >0.1 that was predicted to link to CRY1. We suggest that it is a promising novel target gene for understanding the mechanisms of chronotype.

For systolic blood pressure, we found chr8:11726583–11730679 (PIP = 0.999), which resides in the gene GATA4. This locus was active in both adult heart ventricle and fetal cardiomyocytes. GATA4 takes part in physiological myocardial hypertrophy. SNP fine-mapping gave PIP <0.34 for all SNPs linked to GATA4. Previous GWAS found its homolog GATA2 to be a key gene in blood pressure, and our new result supports GATA4 as another key gene.

For intelligence, we found chr10:45559452–45563548, which was active in the caudate nucleus and was associated with intelligence at PIP >0.5. It was predicted to regulate ALOX5, a key enzyme in arachidonic acid metabolism. Arachidonic acid supplementation is known to be beneficial for child intelligence development, and arachidonic acid takes part in neurodevelopment. However, few genes related to arachidonic acid have been associated with intelligence.

Statistical analysis

All p-values were two-sided and adjusted by Bonferroni unless otherwise specified. For group comparison, we used Fisher’s test for count data and paired t-test for continuous data. For R2 of PRS comparison, we applied r2redux (Momin et al., 2023) R package to estimate 95% confidence interval and its p-value for the difference of R2.

Data availability

The current manuscript is a computational study, so no data have been generated for this manuscript. Data from the UK Biobank (project 84436; Bycroft et al., 2018) are available, pending application approval from: https://www.ukbiobank.ac.uk/. Modeling code is available at https://github.com/WeiCSong/HFS (copy archived at Song, 2024).


    1. Aguet F
    2. Barbeira AN
    3. Bonazzola R
    4. Brown A
    5. Castel SE
    6. Jo B
    7. Kasela S
    8. Kim-Hellmuth S
    9. Liang Y
    10. Oliva M
    11. Flynn ED
    12. Parsana P
    13. Fresard L
    14. Gamazon ER
    15. Hamel AR
    16. He Y
    17. Hormozdiari F
    18. Mohammadi P
    19. Muñoz-Aguirre M
    20. Park YS
    21. Saha A
    22. Strober BJ
    23. Wen X
    24. Wucher V
    25. Ardlie KG
    26. Battle A
    27. Brown CD
    28. Cox N
    29. Das S
    30. Dermitzakis ET
    31. Engelhardt BE
    32. Garrido-Martín D
    33. Gay NR
    34. Getz GA
    35. Guigó R
    36. Handsaker RE
    37. Hoffman PJ
    38. Im HK
    39. Kashin S
    40. Kwong A
    41. Lappalainen T
    42. Xiao L
    43. MacArthur DG
    44. Montgomery SB
    45. Rouhana JM
    46. Stephens M
    47. Stranger BE
    48. Todres E
    49. Viñuela A
    50. Wang G
    51. Zou Y
    52. Anand S
    53. Gabriel S
    54. Graubert A
    55. Hadley K
    56. Huang KH
    57. Nguyen JL
    58. Balliu DT
    59. Conrad B
    60. Cotter DF
    61. Einson J
    62. Eskin E
    63. Eulalio TY
    64. Ferraro NM
    65. Gloudemans MJ
    66. Hou L
    67. Kellis M
    68. Xin L
    69. Mangul S
    70. Nachun DC
    71. Nobel AB
    72. Park Y
    73. Rao AS
    74. Reverter F
    75. Sabatti C
    76. Skol AD
    77. Teran NA
    78. Wright F
    79. Ferreira PG
    80. Li G
    81. Melé M
    82. Yeger-Lotem E
    83. Barcus ME
    84. Bradbury D
    85. Krubit T
    86. McLean JA
    87. Qi L
    88. Robinson K
    89. Smith AM
    90. Sobin L
    91. Tabor DE
    92. Undale A
    93. Bridge J
    94. Brigham LE
    95. Foster BA
    96. Gillard BM
    97. Hasz R
    98. Hunter M
    99. Johns C
    100. Johnson M
    101. Karasik E
    102. Kopen G
    103. Leinweber WF
    104. McDonald A
    105. Moser MT
    106. Myer K
    107. Ramsey KD
    108. Roe B
    109. Shad S
    110. Thomas JA
    111. Walters G
    112. Washington M
    113. Wheeler J
    114. Jewell SD
    115. Rohrer DC
    116. Valley DR
    117. Davis DA
    118. Mash DC
    119. Branton PA
    120. Sobin L
    121. Barker LK
    122. Gardiner HM
    123. Mosavel M
    124. Siminoff LA
    125. Flicek P
    126. Haeussler M
    127. Juettemann T
    128. Kent WJ
    129. Lee CM
    130. Powell CC
    131. Rosenbloom KR
    132. Ruffier M
    133. Sheppard D
    134. Taylor K
    135. Trevanion SJ
    136. Zerbino DR
    137. Abell NS
    138. Akey J
    139. Chen L
    140. Demanelis K
    141. Doherty JA
    142. Feinberg AP
    143. Hansen KD
    144. Hickey PF
    145. Hou L
    146. Jasmine F
    147. Jiang L
    148. Kaul R
    149. Kellis M
    150. Kibriya MG
    151. Li JB
    152. Li Q
    153. Lin S
    154. Linder SE
    155. Pierce BL
    156. Rizzardi LF
    157. Smith KS
    158. Snyder M
    159. Stamatoyannopoulos J
    160. Tang H
    161. Wang M
    162. Branton PA
    163. Carithers LJ
    164. Guan P
    165. Koester SE
    166. Little AR
    167. Moore HM
    168. Nierras CR
    169. Rao AK
    170. Vaught JB
    171. Volpi S
    (2020) The GTEx Consortium atlas of genetic regulatory effects across human tissues
    Science 369:1318–1330.
    1. Li X
    2. Li Z
    3. Zhou H
    4. Gaynor SM
    5. Liu Y
    6. Chen H
    7. Sun R
    8. Dey R
    9. Arnett DK
    10. Aslibekyan S
    11. Ballantyne CM
    12. Bielak LF
    13. Blangero J
    14. Boerwinkle E
    15. Bowden DW
    16. Broome JG
    17. Conomos MP
    18. Correa A
    19. Cupples LA
    20. Curran JE
    21. Freedman BI
    22. Guo X
    23. Hindy G
    24. Irvin MR
    25. Kardia SLR
    26. Kathiresan S
    27. Khan AT
    28. Kooperberg CL
    29. Laurie CC
    30. Liu XS
    31. Mahaney MC
    32. Manichaikul AW
    33. Martin LW
    34. Mathias RA
    35. McGarvey ST
    36. Mitchell BD
    37. Montasser ME
    38. Moore JE
    39. Morrison AC
    40. O’Connell JR
    41. Palmer ND
    42. Pampana A
    43. Peralta JM
    44. Peyser PA
    45. Psaty BM
    46. Redline S
    47. Rice KM
    48. Rich SS
    49. Smith JA
    50. Tiwari HK
    51. Tsai MY
    52. Vasan RS
    53. Wang FF
    54. Weeks DE
    55. Weng Z
    56. Wilson JG
    57. Yanek LR
    58. Abe N
    59. Abecasis GR
    60. Aguet F
    61. Albert C
    62. Almasy L
    63. Alonso A
    64. Ament S
    65. Anderson P
    66. Anugu P
    67. Applebaum-Bowden D
    68. Ardlie K
    69. Arking D
    70. Arnett DK
    71. Ashley-Koch A
    72. Aslibekyan S
    73. Assimes T
    74. Auer P
    75. Avramopoulos D
    76. Barnard J
    77. Barnes K
    78. Barr RG
    79. Barron-Casella E
    80. Barwick L
    81. Beaty T
    82. Beck G
    83. Becker D
    84. Becker L
    85. Beer R
    86. Beitelshees A
    87. Benjamin E
    88. Benos T
    89. Bezerra M
    90. Bielak LF
    91. Bis J
    92. Blackwell T
    93. Blangero J
    94. Boerwinkle E
    95. Bowden DW
    96. Bowler R
    97. Brody J
    98. Broeckel U
    99. Broome JG
    100. Bunting K
    101. Burchard E
    102. Bustamante C
    103. Buth E
    104. Cade B
    105. Cardwell J
    106. Carey V
    107. Carty C
    108. Casaburi R
    109. Casella J
    110. Castaldi P
    111. Chaffin M
    112. Chang C
    113. Chang YC
    114. Chasman D
    115. Chavan S
    116. Chen BJ
    117. Chen WM
    118. Chen YDI
    119. Cho M
    120. Choi SH
    121. Chuang LM
    122. Chung M
    123. Chung RH
    124. Clish C
    125. Comhair S
    126. Conomos MP
    127. Cornell E
    128. Correa A
    129. Crandall C
    130. Crapo J
    131. Cupples LA
    132. Curran JE
    133. Curtis J
    134. Custer B
    135. Damcott C
    136. Darbar D
    137. Das S
    138. David S
    139. Davis C
    140. Daya M
    141. de Andrade M
    142. de las Fuentes L
    143. DeBaun M
    144. Deka R
    145. DeMeo D
    146. Devine S
    147. Duan Q
    148. Duggirala R
    149. Durda JP
    150. Dutcher S
    151. Eaton C
    152. Ekunwe L
    153. El Boueiz A
    154. Ellinor P
    155. Emery L
    156. Erzurum S
    157. Farber C
    158. Fingerlin T
    159. Flickinger M
    160. Fornage M
    161. Franceschini N
    162. Frazar C
    163. Fu M
    164. Fullerton SM
    165. Fulton L
    166. Gabriel S
    167. Gan W
    168. Gao S
    169. Gao Y
    170. Gass M
    171. Gelb B
    172. Geng X
    173. Geraci M
    174. Germer S
    175. Gerszten R
    176. Ghosh A
    177. Gibbs R
    178. Gignoux C
    179. Gladwin M
    180. Glahn D
    181. Gogarten S
    182. Gong DW
    183. Goring H
    184. Graw S
    185. Grine D
    186. Gu CC
    187. Guan Y
    188. Guo X
    189. Gupta N
    190. Haessler J
    191. Hall M
    192. Harris D
    193. Hawley NL
    194. He J
    195. Heckbert S
    196. Hernandez R
    197. Herrington D
    198. Hersh C
    199. Hidalgo B
    200. Hixson J
    201. Hobbs B
    202. Hokanson J
    203. Hong E
    204. Hoth K
    205. Hsiung C
    206. Hung YJ
    207. Huston H
    208. Hwu CM
    209. Irvin MR
    210. Jackson R
    211. Jain D
    212. Jaquish C
    213. Jhun MA
    214. Johnsen J
    215. Johnson A
    216. Johnson C
    217. Johnston R
    218. Jones K
    219. Kang HM
    220. Kaplan R
    221. Kardia SLR
    222. Kathiresan S
    223. Kelly S
    224. Kenny E
    225. Kessler M
    226. Khan AT
    227. Kim W
    228. Kinney G
    229. Konkle B
    230. Kooperberg CL
    231. Kramer H
    232. Lange C
    233. Lange E
    234. Lange L
    235. Laurie CC
    236. Laurie C
    237. LeBoff M
    238. Lee J
    239. Lee SS
    240. Lee WJ
    241. LeFaive J
    242. Levine D
    243. Levy D
    244. Lewis J
    245. Li X
    246. Li Y
    247. Lin H
    248. Lin H
    249. Lin KH
    250. Lin X
    251. Liu S
    252. Liu Y
    253. Liu Y
    254. Loos RJF
    255. Lubitz S
    256. Lunetta K
    257. Luo J
    258. Mahaney MC
    259. Make B
    260. Manichaikul AW
    261. Manson J
    262. Margolin L
    263. Martin LW
    264. Mathai S
    265. Mathias RA
    266. May S
    267. McArdle P
    268. McDonald ML
    269. McFarland S
    270. McGarvey ST
    271. McGoldrick D
    272. McHugh C
    273. Mei H
    274. Mestroni L
    275. Meyers DA
    276. Mikulla J
    277. Min N
    278. Minear M
    279. Minster RL
    280. Mitchell BD
    281. Moll M
    282. Montasser ME
    283. Montgomery C
    284. Moscati A
    285. Musani S
    286. Mwasongwe S
    287. Mychaleckyj JC
    288. Nadkarni G
    289. Naik R
    290. Naseri T
    291. Natarajan P
    292. Nekhai S
    293. Nelson SC
    294. Neltner B
    295. Nickerson D
    296. North K
    297. O’Connell JR
    298. O’Connor T
    299. Ochs-Balcom H
    300. Paik D
    301. Palmer ND
    302. Pankow J
    303. Papanicolaou G
    304. Parsa A
    305. Peralta JM
    306. Perez M
    307. Perry J
    308. Peters U
    309. Peyser PA
    310. Phillips LS
    311. Pollin T
    312. Post W
    313. Becker JP
    314. Boorgula MP
    315. Preuss M
    316. Psaty BM
    317. Qasba P
    318. Qiao D
    319. Qin Z
    320. Rafaels N
    321. Raffield L
    322. Vasan RS
    323. Rao DC
    324. Rasmussen-Torvik L
    325. Ratan A
    326. Redline S
    327. Reed R
    328. Regan E
    329. Reiner A
    330. Reupena MS
    331. Rice KM
    332. Rich SS
    333. Roden D
    334. Roselli C
    335. Rotter JI
    336. Ruczinski I
    337. Russell P
    338. Ruuska S
    339. Ryan K
    340. Sabino EC
    341. Saleheen D
    342. Salimi S
    343. Salzberg S
    344. Sandow K
    345. Sankaran VG
    346. Scheller C
    347. Schmidt E
    348. Schwander K
    349. Schwartz D
    350. Sciurba F
    351. Seidman C
    352. Seidman J
    353. Sheehan V
    354. Sherman SL
    355. Shetty A
    356. Shetty A
    357. Sheu WHH
    358. Shoemaker MB
    359. Silver B
    360. Silverman E
    361. Smith JA
    362. Smith J
    363. Smith N
    364. Smith T
    365. Smoller S
    366. Snively B
    367. Snyder M
    368. Sofer T
    369. Sotoodehnia N
    370. Stilp AM
    371. Storm G
    372. Streeten E
    373. Su JL
    374. Sung YJ
    375. Sylvia J
    376. Szpiro A
    377. Sztalryd C
    378. Taliun D
    379. Tang H
    380. Taub M
    381. Taylor KD
    382. Taylor M
    383. Taylor S
    384. Telen M
    385. Thornton TA
    386. Threlkeld M
    387. Tinker L
    388. Tirschwell D
    389. Tishkoff S
    390. Tiwari HK
    391. Tong C
    392. Tracy R
    393. Tsai MY
    394. Vaidya D
    395. Van Den Berg D
    396. VandeHaar P
    397. Vrieze S
    398. Walker T
    399. Wallace R
    400. Walts A
    401. Wang FF
    402. Wang H
    403. Watson K
    404. Weeks DE
    405. Weir B
    406. Weiss S
    407. Weng LC
    408. Wessel J
    409. Willer CJ
    410. Williams K
    411. Williams LK
    412. Wilson C
    413. Wilson JG
    414. Wong Q
    415. Wu J
    416. Xu H
    417. Yanek LR
    418. Yang I
    419. Yang R
    420. Zaghloul N
    421. Zekavat M
    422. Zhang Y
    423. Zhao SX
    424. Zhao W
    425. Zhi D
    426. Zhou X
    427. Zhu X
    428. Zody M
    429. Zoellner S
    430. Abdalla M
    431. Abecasis GR
    432. Arnett DK
    433. Aslibekyan S
    434. Assimes T
    435. Atkinson E
    436. Ballantyne CM
    437. Beitelshees A
    438. Bielak LF
    439. Bis J
    440. Bodea C
    441. Boerwinkle E
    442. Bowden DW
    443. Brody J
    444. Cade B
    445. Carlson J
    446. Chang IS
    447. Chen YDI
    448. Chun S
    449. Chung RH
    450. Conomos MP
    451. Correa A
    452. Cupples LA
    453. Damcott C
    454. de Vries P
    455. Do R
    456. Elliott A
    457. Fu M
    458. Ganna A
    459. Gong DW
    460. Graham S
    461. Haas M
    462. Haring B
    463. He J
    464. Heckbert S
    465. Himes B
    466. Hixson J
    467. Irvin MR
    468. Jain D
    469. Jarvik G
    470. Jhun MA
    471. Jiang J
    472. Jun G
    473. Kalyani R
    474. Kardia SLR
    475. Kathiresan S
    476. Khera A
    477. Klarin D
    478. Kooperberg CL
    479. Kral B
    480. Lange L
    481. Laurie CC
    482. Laurie C
    483. Lemaitre R
    484. Li Z
    485. Li X
    486. Lin X
    487. Mahaney MC
    488. Manichaikul AW
    489. Martin LW
    490. Mathias RA
    491. Mathur R
    492. McGarvey ST
    493. McHugh C
    494. McLenithan J
    495. Mikulla J
    496. Mitchell BD
    497. Montasser ME
    498. Moran A
    499. Morrison AC
    500. Nakao T
    501. Natarajan P
    502. Nickerson D
    503. North K
    504. O’Connell JR
    505. O’Donnell C
    506. Palmer ND
    507. Pampana A
    508. Patel A
    509. Peloso GM
    510. Perry J
    511. Peters U
    512. Peyser PA
    513. Pirruccello J
    514. Pollin T
    515. Preuss M
    516. Psaty BM
    517. Rao DC
    518. Redline S
    519. Reed R
    520. Reiner A
    521. Rich SS
    522. Rosenthal S
    523. Rotter JI
    524. Schoenberg J
    525. Selvaraj MS
    526. Sheu WHH
    527. Smith JA
    528. Sofer T
    529. Stilp AM
    530. Sunyaev SR
    531. Surakka I
    532. Sztalryd C
    533. Tang H
    534. Taylor KD
    535. Tsai MY
    536. Uddin MM
    537. Urbut S
    538. Verbanck M
    539. Von Holle A
    540. Wang H
    541. Wang FF
    542. Wiggins K
    543. Willer CJ
    544. Wilson JG
    545. Wolford B
    546. Xu H
    547. Yanek LR
    548. Zaghloul N
    549. Zekavat M
    550. Zhang J
    551. Neale BM
    552. Sunyaev SR
    553. Abecasis GR
    554. Rotter JI
    555. Willer CJ
    556. Peloso GM
    557. Natarajan P
    558. Lin X
    (2020) Dynamic incorporation of multiple in silico functional annotations empowers rare variant association analysis of large whole-genome sequencing studies at scale
    Nature Genetics 52:969–983.
    1. Yengo L
    2. Vedantam S
    3. Marouli E
    4. Sidorenko J
    5. Bartell E
    6. Sakaue S
    7. Graff M
    8. Eliasen AU
    9. Jiang Y
    10. Raghavan S
    11. Miao J
    12. Arias JD
    13. Graham SE
    14. Mukamel RE
    15. Spracklen CN
    16. Yin X
    17. Chen SH
    18. Ferreira T
    19. Highland HH
    20. Ji Y
    21. Karaderi T
    22. Lin K
    23. Lüll K
    24. Malden DE
    25. Medina-Gomez C
    26. Machado M
    27. Moore A
    28. Rüeger S
    29. Sim X
    30. Vrieze S
    31. Ahluwalia TS
    32. Akiyama M
    33. Allison MA
    34. Alvarez M
    35. Andersen MK
    36. Ani A
    37. Appadurai V
    38. Arbeeva L
    39. Bhaskar S
    40. Bielak LF
    41. Bollepalli S
    42. Bonnycastle LL
    43. Bork-Jensen J
    44. Bradfield JP
    45. Bradford Y
    46. Braund PS
    47. Brody JA
    48. Burgdorf KS
    49. Cade BE
    50. Cai H
    51. Cai Q
    52. Campbell A
    53. Cañadas-Garre M
    54. Catamo E
    55. Chai JF
    56. Chai X
    57. Chang LC
    58. Chang YC
    59. Chen CH
    60. Chesi A
    61. Choi SH
    62. Chung RH
    63. Cocca M
    64. Concas MP
    65. Couture C
    66. Cuellar-Partida G
    67. Danning R
    68. Daw EW
    69. Degenhard F
    70. Delgado GE
    71. Delitala A
    72. Demirkan A
    73. Deng X
    74. Devineni P
    75. Dietl A
    76. Dimitriou M
    77. Dimitrov L
    78. Dorajoo R
    79. Ekici AB
    80. Engmann JE
    81. Fairhurst-Hunter Z
    82. Farmaki AE
    83. Faul JD
    84. Fernandez-Lopez JC
    85. Forer L
    86. Francescatto M
    87. Freitag-Wolf S
    88. Fuchsberger C
    89. Galesloot TE
    90. Gao Y
    91. Gao Z
    92. Geller F
    93. Giannakopoulou O
    94. Giulianini F
    95. Gjesing AP
    96. Goel A
    97. Gordon SD
    98. Gorski M
    99. Grove J
    100. Guo X
    101. Gustafsson S
    102. Haessler J
    103. Hansen TF
    104. Havulinna AS
    105. Haworth SJ
    106. He J
    107. Heard-Costa N
    108. Hebbar P
    109. Hindy G
    110. Ho YLA
    111. Hofer E
    112. Holliday E
    113. Horn K
    114. Hornsby WE
    115. Hottenga JJ
    116. Huang H
    117. Huang J
    118. Huerta-Chagoya A
    119. Huffman JE
    120. Hung YJ
    121. Huo S
    122. Hwang MY
    123. Iha H
    124. Ikeda DD
    125. Isono M
    126. Jackson AU
    127. Jäger S
    128. Jansen IE
    129. Johansson I
    130. Jonas JB
    131. Jonsson A
    132. Jørgensen T
    133. Kalafati IP
    134. Kanai M
    135. Kanoni S
    136. Kårhus LL
    137. Kasturiratne A
    138. Katsuya T
    139. Kawaguchi T
    140. Kember RL
    141. Kentistou KA
    142. Kim HN
    143. Kim YJ
    144. Kleber ME
    145. Knol MJ
    146. Kurbasic A
    147. Lauzon M
    148. Le P
    149. Lea R
    150. Lee JY
    151. Leonard HL
    152. Li SA
    153. Li X
    154. Li X
    155. Liang J
    156. Lin H
    157. Lin SY
    158. Liu J
    159. Liu X
    160. Lo KS
    161. Long J
    162. Lores-Motta L
    163. Luan J
    164. Lyssenko V
    165. Lyytikäinen LP
    166. Mahajan A
    167. Mamakou V
    168. Mangino M
    169. Manichaikul A
    170. Marten J
    171. Mattheisen M
    172. Mavarani L
    173. McDaid AF
    174. Meidtner K
    175. Melendez TL
    176. Mercader JM
    177. Milaneschi Y
    178. Miller JE
    179. Millwood IY
    180. Mishra PP
    181. Mitchell RE
    182. Møllehave LT
    183. Morgan A
    184. Mucha S
    185. Munz M
    186. Nakatochi M
    187. Nelson CP
    188. Nethander M
    189. Nho CW
    190. Nielsen AA
    191. Nolte IM
    192. Nongmaithem SS
    193. Noordam R
    194. Ntalla I
    195. Nutile T
    196. Pandit A
    197. Christofidou P
    198. Pärna K
    199. Pauper M
    200. Petersen ERB
    201. Petersen LV
    202. Pitkänen N
    203. Polašek O
    204. Poveda A
    205. Preuss MH
    206. Pyarajan S
    207. Raffield LM
    208. Rakugi H
    209. Ramirez J
    210. Rasheed A
    211. Raven D
    212. Rayner NW
    213. Riveros C
    214. Rohde R
    215. Ruggiero D
    216. Ruotsalainen SE
    217. Ryan KA
    218. Sabater-Lleal M
    219. Saxena R
    220. Scholz M
    221. Sendamarai A
    222. Shen B
    223. Shi J
    224. Shin JH
    225. Sidore C
    226. Sitlani CM
    227. Slieker RC
    228. Smit RAJ
    229. Smith AV
    230. Smith JA
    231. Smyth LJ
    232. Southam L
    233. Steinthorsdottir V
    234. Sun L
    235. Takeuchi F
    236. Tallapragada DSP
    237. Taylor KD
    238. Tayo BO
    239. Tcheandjieu C
    240. Terzikhan N
    241. Tesolin P
    242. Teumer A
    243. Theusch E
    244. Thompson DJ
    245. Thorleifsson G
    246. Timmers P
    247. Trompet S
    248. Turman C
    249. Vaccargiu S
    250. van der Laan SW
    251. van der Most PJ
    252. van Klinken JB
    253. van Setten J
    254. Verma SS
    255. Verweij N
    256. Veturi Y
    257. Wang CA
    258. Wang C
    259. Wang L
    260. Wang Z
    261. Warren HR
    262. Bin Wei W
    263. Wickremasinghe AR
    264. Wielscher M
    265. Wiggins KL
    266. Winsvold BS
    267. Wong A
    268. Wu Y
    269. Wuttke M
    270. Xia R
    271. Xie T
    272. Yamamoto K
    273. Yang J
    274. Yao J
    275. Young H
    276. Yousri NA
    277. Yu L
    278. Zeng L
    279. Zhang W
    280. Zhang X
    281. Zhao JH
    282. Zhao W
    283. Zhou W
    284. Zimmermann ME
    285. Zoledziewska M
    286. Adair LS
    287. Adams HHH
    288. Aguilar-Salinas CA
    289. Al-Mulla F
    290. Arnett DK
    291. Asselbergs FW
    292. Åsvold BO
    293. Attia J
    294. Banas B
    295. Bandinelli S
    296. Bennett DA
    297. Bergler T
    298. Bharadwaj D
    299. Biino G
    300. Bisgaard H
    301. Boerwinkle E
    302. Böger CA
    303. Bønnelykke K
    304. Boomsma DI
    305. Børglum AD
    306. Borja JB
    307. Bouchard C
    308. Bowden DW
    309. Brandslund I
    310. Brumpton B
    311. Buring JE
    312. Caulfield MJ
    313. Chambers JC
    314. Chandak GR
    315. Chanock SJ
    316. Chaturvedi N
    317. Chen YDI
    318. Chen Z
    319. Cheng CY
    320. Christophersen IE
    321. Ciullo M
    322. Cole JW
    323. Collins FS
    324. Cooper RS
    325. Cruz M
    326. Cucca F
    327. Cupples LA
    328. Cutler MJ
    329. Damrauer SM
    330. Dantoft TM
    331. de Borst GJ
    332. de Groot L
    333. De Jager PL
    334. de Kleijn DPV
    335. Janaka de Silva H
    336. Dedoussis GV
    337. den Hollander AI
    338. Du S
    339. Easton DF
    340. Elders PJM
    341. Eliassen AH
    342. Ellinor PT
    343. Elmståhl S
    344. Erdmann J
    345. Evans MK
    346. Fatkin D
    347. Feenstra B
    348. Feitosa MF
    349. Ferrucci L
    350. Ford I
    351. Fornage M
    352. Franke A
    353. Franks PW
    354. Freedman BI
    355. Gasparini P
    356. Gieger C
    357. Girotto G
    358. Goddard ME
    359. Golightly YM
    360. Gonzalez-Villalpando C
    361. Gordon-Larsen P
    362. Grallert H
    363. Grant SFA
    364. Grarup N
    365. Griffiths L
    366. Gudnason V
    367. Haiman C
    368. Hakonarson H
    369. Hansen T
    370. Hartman CA
    371. Hattersley AT
    372. Hayward C
    373. Heckbert SR
    374. Heng CK
    375. Hengstenberg C
    376. Hewitt AW
    377. Hishigaki H
    378. Hoyng CB
    379. Huang PL
    380. Huang W
    381. Hunt SC
    382. Hveem K
    383. Hyppönen E
    384. Iacono WG
    385. Ichihara S
    386. Ikram MA
    387. Isasi CR
    388. Jackson RD
    389. Jarvelin MR
    390. Jin ZB
    391. Jöckel KH
    392. Joshi PK
    393. Jousilahti P
    394. Jukema JW
    395. Kähönen M
    396. Kamatani Y
    397. Kang KD
    398. Kaprio J
    399. Kardia SLR
    400. Karpe F
    401. Kato N
    402. Kee F
    403. Kessler T
    404. Khera AV
    405. Khor CC
    406. Kiemeney L
    407. Kim BJ
    408. Kim EK
    409. Kim HL
    410. Kirchhof P
    411. Kivimaki M
    412. Koh WP
    413. Koistinen HA
    414. Kolovou GD
    415. Kooner JS
    416. Kooperberg C
    417. Köttgen A
    418. Kovacs P
    419. Kraaijeveld A
    420. Kraft P
    421. Krauss RM
    422. Kumari M
    423. Kutalik Z
    424. Laakso M
    425. Lange LA
    426. Langenberg C
    427. Launer LJ
    428. Le Marchand L
    429. Lee H
    430. Lee NR
    431. Lehtimäki T
    432. Li H
    433. Li L
    434. Lieb W
    435. Lin X
    436. Lind L
    437. Linneberg A
    438. Liu CT
    439. Liu J
    440. Loeffler M
    441. London B
    442. Lubitz SA
    443. Lye SJ
    444. Mackey DA
    445. Mägi R
    446. Magnusson PKE
    447. Marcus GM
    448. Vidal PM
    449. Martin NG
    450. März W
    451. Matsuda F
    452. McGarrah RW
    453. McGue M
    454. McKnight AJ
    455. Medland SE
    456. Mellström D
    457. Metspalu A
    458. Mitchell BD
    459. Mitchell P
    460. Mook-Kanamori DO
    461. Morris AD
    462. Mucci LA
    463. Munroe PB
    464. Nalls MA
    465. Nazarian S
    466. Nelson AE
    467. Neville MJ
    468. Newton-Cheh C
    469. Nielsen CS
    470. Nöthen MM
    471. Ohlsson C
    472. Oldehinkel AJ
    473. Orozco L
    474. Pahkala K
    475. Pajukanta P
    476. Palmer CNA
    477. Parra EJ
    478. Pattaro C
    479. Pedersen O
    480. Pennell CE
    481. Penninx B
    482. Perusse L
    483. Peters A
    484. Peyser PA
    485. Porteous DJ
    486. Posthuma D
    487. Power C
    488. Pramstaller PP
    489. Province MA
    490. Qi Q
    491. Qu J
    492. Rader DJ
    493. Raitakari OT
    494. Ralhan S
    495. Rallidis LS
    496. Rao DC
    497. Redline S
    498. Reilly DF
    499. Reiner AP
    500. Rhee SY
    501. Ridker PM
    502. Rienstra M
    503. Ripatti S
    504. Ritchie MD
    505. Roden DM
    506. Rosendaal FR
    507. Rotter JI
    508. Rudan I
    509. Rutters F
    510. Sabanayagam C
    511. Saleheen D
    512. Salomaa V
    513. Samani NJ
    514. Sanghera DK
    515. Sattar N
    516. Schmidt B
    517. Schmidt H
    518. Schmidt R
    519. Schulze MB
    520. Schunkert H
    521. Scott LJ
    522. Scott RJ
    523. Sever P
    524. Shiroma EJ
    525. Shoemaker MB
    526. Shu XO
    527. Simonsick EM
    528. Sims M
    529. Singh JR
    530. Singleton AB
    531. Sinner MF
    532. Smith JG
    533. Snieder H
    534. Spector TD
    535. Stampfer MJ
    536. Stark KJ
    537. Strachan DP
    538. ’t Hart LM
    539. Tabara Y
    540. Tang H
    541. Tardif JC
    542. Thanaraj TA
    543. Timpson NJ
    544. Tönjes A
    545. Tremblay A
    546. Tuomi T
    547. Tuomilehto J
    548. Tusié-Luna MT
    549. Uitterlinden AG
    550. van Dam RM
    551. van der Harst P
    552. Van der Velde N
    553. van Duijn CM
    554. van Schoor NM
    555. Vitart V
    556. Völker U
    557. Vollenweider P
    558. Völzke H
    559. Wacher-Rodarte NH
    560. Walker M
    561. Wang YX
    562. Wareham NJ
    563. Watanabe RM
    564. Watkins H
    565. Weir DR
    566. Werge TM
    567. Widen E
    568. Wilkens LR
    569. Willemsen G
    570. Willett WC
    571. Wilson JF
    572. Wong TY
    573. Woo JT
    574. Wright AF
    575. Wu JY
    576. Xu H
    577. Yajnik CS
    578. Yokota M
    579. Yuan JM
    580. Zeggini E
    581. Zemel BS
    582. Zheng W
    583. Zhu X
    584. Zmuda JM
    585. Zonderman AB
    586. Zwart JA
    587. Ng MCY
    588. Rivadeneira F
    589. Thorsteinsdottir U
    590. Sun YV
    591. Tai ES
    592. Boehnke M
    593. Deloukas P
    594. Justice AE
    595. Lindgren CM
    596. Loos RJF
    597. Mohlke KL
    598. North KE
    599. Stefansson K
    600. Walters RG
    601. Winkler TW
    602. Young KL
    603. Loh PR
    604. Yang J
    605. Esko T
    606. Assimes TL
    607. Auton A
    608. Abecasis GR
    609. Willer CJ
    610. Locke AE
    611. Berndt SI
    612. Lettre G
    613. Frayling TM
    614. Okada Y
    615. Wood AR
    616. Visscher PM
    617. Hirschhorn JN
    (2022) A saturated map of common genetic variants associated with human height
    Nature 610:704–712.

Article and author information

Author details

  1. Weichen Song

    1. Shanghai Mental Health Center, Shanghai Jiao Tong University School of Medicine, School of Bioengineering, Shanghai Jiao Tong University, Shanghai, China
    2. Bio-X Institutes, Key Laboratory for the Genetics of Developmental and Neuropsychiatric Disorders (Ministry of Education), Collaborative Innovation Center for Brain Science, Shanghai Jiao Tong University, Shanghai, China
    Conceptualization, Data curation, Software, Formal analysis, Investigation, Visualization, Methodology, Writing – original draft, Project administration, Writing – review and editing
    For correspondence
    Competing interests: No competing interests declared
    ORCID: 0000-0003-3197-6236
  2. Yongyong Shi

    1. Bio-X Institutes, Key Laboratory for the Genetics of Developmental and Neuropsychiatric Disorders (Ministry of Education), Collaborative Innovation Center for Brain Science, Shanghai Jiao Tong University, Shanghai, China
    2. Biomedical Sciences Institute of Qingdao University (Qingdao Branch of SJTU Bio-X12 Institutes), Qingdao University, Qingdao, China
    Conceptualization, Resources, Supervision, Investigation, Writing – review and editing
    For correspondence
    Competing interests: No competing interests declared
  3. Guan Ning Lin

    Shanghai Mental Health Center, Shanghai Jiao Tong University School of Medicine, School of Bioengineering, Shanghai Jiao Tong University, Shanghai, China
    Conceptualization, Supervision, Funding acquisition, Validation, Investigation, Project administration, Writing – review and editing
    For correspondence
    Competing interests: No competing interests declared


Funding

Ministry of Science and Technology (2030 Science and Technology Innovation Key Program 2022ZD020910001)

  • Guan Ning Lin

National Natural Science Foundation of China (81971292)

  • Guan Ning Lin

National Natural Science Foundation of China (82150610506)

  • Guan Ning Lin

Natural Science Foundation of Shanghai (21ZR1428600)

  • Guan Ning Lin

Medical-Engineering Cross Foundation of Shanghai Jiao Tong University (YG2022ZD026)

  • Guan Ning Lin

The funders had no role in study design, data collection, and interpretation, or the decision to submit the work for publication.


This work was supported by grants from the 2030 Science and Technology Innovation Key Program of the Ministry of Science and Technology of China (No. 2022ZD020910001), the National Natural Science Foundation of China (Nos. 81971292 and 82150610506), the Natural Science Foundation of Shanghai (No. 21ZR1428600), and the Medical-Engineering Cross Foundation of Shanghai Jiao Tong University (No. YG2022ZD026).

Version history

  1. Preprint posted: August 9, 2023
  2. Sent for peer review: September 28, 2023
  3. Preprint posted: December 5, 2023
  4. Preprint posted: March 21, 2024
  5. Version of Record published: April 19, 2024 (version 1)
  6. Version of Record updated: April 22, 2024 (version 2)

Cite all versions

You can cite all versions using the DOI https://doi.org/10.7554/eLife.92574. This DOI represents all versions, and will always resolve to the latest one.


© 2023, Song et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.



Cite this article

  1. Weichen Song
  2. Yongyong Shi
  3. Guan Ning Lin
Haplotype function score improves biological interpretation and cross-ancestry polygenic prediction of human complex traits
eLife 12:RP92574.
