Discovery of runs-of-homozygosity diplotype clusters and their associations with diseases in UK Biobank

  1. Ardalan Naseri
  2. Degui Zhi (corresponding author)
  3. Shaojie Zhang (corresponding author)
  1. Department of Computer Science, University of Central Florida, United States
  2. Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, United States

Abstract

Runs-of-homozygosity (ROH) segments, contiguous homozygous regions in a genome, were traditionally linked to families and inbred populations. However, a growing literature suggests that ROHs are ubiquitous in outbred populations. Still, most existing genetic studies of ROH in populations are limited to aggregated ROH content across the genome, which does not offer the resolution for mapping causal loci. This limitation is mainly due to a lack of methods for the efficient identification of shared ROH diplotypes. Here, we present a new method, ROH-DICE (runs-of-homozygous diplotype cluster enumerator), to find large ROH diplotype clusters, that is, sufficiently long ROHs shared by a sufficient number of individuals, in large cohorts. ROH-DICE identified over 1 million ROH diplotypes that span over 100 single nucleotide polymorphisms (SNPs) and are shared by more than 100 UK Biobank participants. Moreover, we found significant associations of clustered ROH diplotypes across the genome with various self-reported diseases, with the strongest associations found between the extended human leukocyte antigen (HLA) region and autoimmune disorders. We found an association between a diplotype covering the homeostatic iron regulator (HFE) gene and hemochromatosis, even though the well-known causal SNP was not directly genotyped or imputed. Using a genome-wide scan, we identified a putative association between carriers of an ROH diplotype in chromosome 4 and an increase in mortality among COVID-19 patients (p-value = 1.82 × 10−11). In summary, our ROH-DICE method, by calling out large ROH diplotypes in a large outbred population, enables further population genetic studies of the demographic history of large populations. More importantly, our method enables a new genome-wide mapping approach for finding disease-causing loci with multi-marker recessive effects at a population scale.

Editor's evaluation

This important study presents a new method for homozygosity mapping in population-scale datasets, based on an innovative computational algorithm that efficiently identifies runs-of-homozygosity (ROH) segments shared by many individuals. Simulation results provided convincing evidence for good accuracy and power of the new algorithm. Application of this new method to the UK Biobank dataset largely recapitulated previously known associations but also revealed a small number of novel discoveries that were missed by existing genome-wide association study methods, highlighting the utility of this new approach. This study will be of substantial interest to readers in human genetics and quantitative genetics.

https://doi.org/10.7554/eLife.81698.sa0

Introduction

Runs-of-homozygosity (ROH) regions are regions of diploid chromosomes where identical-by-descent (IBD) haplotypes are inherited from each parent (Ceballos et al., 2018). Traditionally, ROH was thought to be relevant only to inbred populations, and ROH may be linked to consanguinity and population isolation (Kirin et al., 2010). However, a growing number of studies of large cohorts and biobanks have found that ROH may be ubiquitously present (Clark et al., 2019; Joshi et al., 2015). Still, our understanding of the genetic impacts of ROH is limited.

Most existing studies used individuals’ global ROH content (the sum of lengths or the count of ROHs) as a surrogate for the degree of inbreeding and associated it with phenotypes. It has long been known that inbreeding is harmful to the health of offspring (Morton et al., 1956), and several studies have suggested that the global ROH content is associated with higher risks of recessive disorders (Lencz et al., 2007; Keller et al., 2012; Christofidou et al., 2015). ROHs can also be related to complex traits such as height (Yang et al., 2010). With the growing trend of multi-cohort collaboration through meta-analysis, the effect of global ROH content has been studied over very large sample sizes (Clark et al., 2019; Joshi et al., 2015). A recent study (Yengo et al., 2019) revealed that people with extremely long ROH can be found even in outbred populations.

However, collapsing the individual’s rich ROH content into a single number summarizing their global content is a drastic oversimplification. In doing so, the opportunities for mapping causal loci of phenotypes are lost. Ideally, one might wish to identify chromosomal regions with a certain ROH diplotype (Luo et al., 2006) (pairs of identical haplotypes) and associate the ROH diplotype with the phenotypes of interest. Indeed, homozygosity mapping in pedigree or inbred populations has achieved success in identifying recessive loci (Keller et al., 2012; Lander and Botstein, 1987; Leutenegger et al., 2006; Pourreza et al., 2020; Tischfield et al., 2005; Gandin et al., 2015). However, for general outbred populations, the total number of possible ROH diplotypes at a locus is too enormous to be enumerated efficiently, and ROH mapping of outbred populations has remained only a theoretical possibility.

Here, we propose an approach, ROH-DICE (runs-of-homozygous diplotype cluster enumerator), that bypasses this impossibly large search space of diplotypes. Instead of enumerating all ROH diplotypes, we focus on those that are sufficiently long and frequent. Such ROH diplotypes are of interest because they lie at the extreme of the distribution: the chance of an ROH is determined by the chance that a pair of mates share an IBD segment, and both this chance and the length of IBD segments decay quickly in outbred populations, as supported by population genetics theory (Thompson, 2013; Donnelly, 1983) and real-world data (Ralph and Coop, 2013; Naseri et al., 2019b). However, little is known about such ROH diplotypes because no existing methods can efficiently find them.

We present an efficient positional Burrows–Wheeler transform (PBWT)-based (Durbin, 2014) method to find clusters of identical matches. We apply our method to find clusters of ROH diplotypes in UK Biobank data. Each cluster of ROH diplotypes is defined as a set of at least 100 consecutive homozygous sites shared among at least 100 individuals. We investigate the association between the detected ROH diplotype clusters and self-reported non-cancerous diseases and present the results for the diseases having the strongest associations with the detected ROH diplotype clusters.

Results

Methods overview

An ROH diplotype is a pair of homozygous haplotypes of an individual. A frequent ROH diplotype is one shared by several individuals at the same location and with the same consensus sequence. We refer to frequent ROH diplotypes above a certain frequency (number of individuals) and a certain length as ROH clusters. Although long and frequent ROH diplotypes are not very common, enumerating all ROH diplotypes above a given length and frequency is difficult. As a compromise, ROH regions are traditionally aggregated into a single number per individual, and the association of that number with phenotypes is investigated. As a result, the locus-specific or allele-specific association signals of the ROHs are likely to be lost (see Figure 1).

Figure 1 (with 3 supplements)
Runs-of-homozygosity (ROH)-DICE enables the discovery of loci-specific association signals of ROH diplotypes.

The actual ROH contents (a) including the locations and sequence identities of ROH (indicated by different colors) were lost in traditional ROH analysis pipelines (b) which aggregate the ROH contents per individual and lose the chances for identifying associating loci. ROH-DICE (c) reveals ROH diplotype clusters that are long and wide enough, thus enabling mapping loci associated with phenotypes.

To solve this problem, we first processed the biallelic genotype panel (with three possible values 0, 1, and 2 at each position) by randomly converting each heterozygous site into a homozygous site carrying either the reference or the alternative allele. The reasons for such processing are twofold. First, the true ROH diplotype clusters, which consist mostly of homozygous sites, remain relatively intact under this conversion and retain a good portion of their haplotype with high probability. Some post hoc processing may nevertheless be needed to merge ROH diplotype clusters whose consensus sequences deviate slightly. Notably, this conversion should introduce very few false positives: when the length and width cut-offs are large, there is little chance that a non-ROH diplotype cluster will emerge. Second, the conversion effectively turns the panel into a haplotype panel (with two possible values 0 and 2 at each position), for which efficient algorithms for identifying haplotype matching blocks are available. A haplotype matching block is a sequence of consecutive variant sites over which at least a predefined minimum number of haplotypes are identical. An extra benefit of this conversion is that no phasing of haplotypes is needed.
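The conversion step can be illustrated with a minimal Python sketch. The function name `binarize_genotypes`, the list-of-lists panel layout, and the fixed seed are assumptions made for illustration only; they are not taken from the ROH-DICE implementation.

```python
import random


def binarize_genotypes(panel, seed=0):
    """Return a copy of the panel in which every heterozygous call (1)
    is replaced by a randomly chosen homozygous call (0 or 2).

    `panel` is a list of rows, one per individual, with values 0, 1, or 2.
    Homozygous calls are left untouched.
    """
    rng = random.Random(seed)  # fixed seed only to make the example repeatable
    return [[g if g != 1 else rng.choice((0, 2)) for g in row] for row in panel]


# Toy panel: three individuals, four variant sites.
panel = [
    [0, 2, 1, 0],
    [0, 2, 2, 0],
    [1, 2, 0, 0],
]
binary = binarize_genotypes(panel)
```

After this step each row can be treated as a single haplotype-like sequence over the alphabet {0, 2}, which is why no phasing is required.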

Haplotype matching blocks can be identified by leveraging the efficient sorting of haplotypes in the PBWT data structure. For a haplotype panel, PBWT sorts the haplotype sequences at each variant site according to their reversed prefixes (the alleles preceding the site, read backwards), and thus a set of haplotypes sharing the same sequence before a variant site will be adjacent in the sorted order and form a ‘matching block’. We use auxiliary PBWT data structures to keep track of the length (the number of variant sites) and the width (the number of haplotype sequences) of the matching block, and we trigger the output report by monitoring these data structures. Figure 2 summarizes the overall ROH-DICE method. More details about the algorithms for finding blocks of matches and searching for ROH diplotypes are presented in the Methods section.
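The idea of reading matching blocks off the PBWT sort order can be sketched as follows. This is a simplified illustration, not the cPBWT algorithm used by ROH-DICE: it maintains the standard PBWT prefix and divergence arrays (Durbin, 2014) and, at each site, reports every group of adjacent rows that agree over at least `min_len` sites ending there. The function name `pbwt_blocks` and its parameters are hypothetical, and no width-maximal or length-maximal objective is applied.

```python
def pbwt_blocks(haps, min_len, min_width):
    """Return a list of (start, end, members) blocks.

    Each block is a group of at least `min_width` rows that are identical
    over the half-open site interval [start, end), with end - start >= min_len.
    `haps` is a binarized panel: rows of alleles where 0 is one allele and any
    non-zero value (for example 2) is the other.
    """
    n, m = len(haps), len(haps[0])
    order = list(range(n))  # rows sorted by reversed prefix
    div = [0] * n           # div[i]: first site of the match between order[i] and order[i-1]
    blocks = []
    for k in range(m):
        # Update the prefix and divergence arrays for site k.
        a, b, d, e = [], [], [], []
        p = q = k + 1
        for idx, start in zip(order, div):
            p = max(p, start)
            q = max(q, start)
            if haps[idx][k] == 0:
                a.append(idx)
                d.append(p)
                p = 0
            else:
                b.append(idx)
                e.append(q)
                q = 0
        order, div = a + b, d + e

        # Scan the sorted order for runs of rows matching over >= min_len sites.
        end = k + 1
        i = 0
        while i < n:
            j = i + 1
            block_start = 0
            while j < n and end - div[j] >= min_len:
                block_start = max(block_start, div[j])
                j += 1
            if j - i >= min_width:
                blocks.append((block_start, end, sorted(order[i:j])))
            i = j
    return blocks


# Toy binarized panel: four individuals, five sites.
haps = [
    [0, 2, 2, 0, 2],
    [0, 2, 2, 0, 0],
    [2, 2, 2, 0, 2],
    [0, 0, 2, 0, 2],
]
blocks = pbwt_blocks(haps, min_len=3, min_width=2)
```

Because each site is processed with a single pass over the sorted rows, the work per site is proportional to the number of rows, which is the property that makes PBWT-based scans attractive for biobank-scale panels.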

Figure 2
A simple schematic of searching for runs-of-homozygosity (ROH) diplotype clusters in a genotype panel.

The input is a genotype panel where each line represents an individual. The heterozygous sites are depicted in violet in the genotype panel. Input genotype data are converted into a binarized genotype panel where homozygous sites are preserved. The matching blocks (clusters) are searched using consensus PBWT (cPBWT). A matching block is defined by a minimum number of sites, individuals, and also an objective function. The objective can be either maximizing the number of individuals or maximizing the number of sites. The clusters of matches are highlighted in different colors. Red represents a cluster with the maximized number of individuals and blue represents a cluster with the maximized number of sites.

Evaluation of ROH clusters in simulated data

Accuracy and power of ROH clusters

To evaluate the detection power and accuracy of ROH clusters, we simulated 200 individuals of European ancestry using msprime (Baumdicker et al., 2022). The IBD segments were computed using tskit (Kelleher et al., 2018). The tskit package extracts IBD segments between any two individuals, that is, genomic loci where the alleles have been inherited along the same genealogical path. Variant sites with a minor allele frequency of less than 1% were filtered out. We then created an artificial variant call file in which the number of sites corresponds to the number of sites in the original file. We iterated over each pairwise IBD segment and assigned identical, randomly selected alleles to all sites covered by the IBD segment. Finally, we ran a modified version of cPBWT on the interim panel, in which only homozygous sites are included in each matching block.

For each reported cluster, we extracted the ground-truth cluster with the maximum overlap and computed the overlap ratio. The accuracy is then defined as the average of the overlap ratios. The accuracy therefore measures, on average, what fraction of each reported cluster belongs to a single ground-truth cluster. The power is defined as the average cumulative overlap ratio between the reported clusters and the clusters from the ground truth. Large clusters may be reported as two or several smaller clusters due to the strict cut-off values for L (the minimum number of sites) and W (the minimum number of individuals), and the power determines what percentage of the clusters could be recovered based on the cut-off values. We computed accuracy and power for the reported ROH clusters using haplotypes with 0% and 0.1% genotyping error rates for different L and W cut-off values (Figure 1—figure supplement 1). The results show that our approach is robust against genotyping errors up to 0.1%. The detection power for W = 5 and L = 50 without any error was 79.6%, whereas the power for the data with the genotyping error was 79.1%. The accuracy increases with increasing target lengths and widths. For example, for W = 5 and L = 50, the accuracy was 55%, whereas the accuracy for W = 20 and L = 100 was 63%. Figure 1—figure supplement 2 shows the detection power for clusters with L = 100 and W = 20, where different cut-off values were used. The figure shows that the detection power increases with smaller cut-off values. The power for the target values W = 20 and L = 100 was 34% if the cut-offs were set the same; however, the power increased to 84% with smaller cut-offs (W = 5 and L = 50). To estimate the power and accuracy for W = 100 and L = 100, we simulated another dataset containing 1000 individuals of European ancestry with a genotyping error rate of 0.1%. The simulation parameters for this dataset were the same as for the 200 samples (except the number of samples), and ground truth clusters were extracted similarly. The power was 55.96% and the accuracy 52.84%, while 58.97% of the reported clusters overlapped 50% or more with a ground truth cluster.
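One possible reading of these two metrics is sketched below. The exact overlap definition is not spelled out in this section, so the sketch assumes that a cluster is a rectangle of (individual, site) cells, that the overlap ratio of a reported cluster is the fraction of its cells lying inside the best-matching ground-truth cluster, and that power is the average fraction of each ground-truth cluster covered by the union of reported clusters. The function names and the tuple layout are hypothetical.

```python
def cells(cluster):
    """Expand a cluster (start, end, members) into its set of (individual, site) cells."""
    start, end, members = cluster
    return {(ind, site) for ind in members for site in range(start, end)}


def accuracy(reported, truth):
    """Average, over reported clusters, of the best overlap ratio with any truth cluster."""
    truth_cells = [cells(t) for t in truth]
    ratios = []
    for r in reported:
        rc = cells(r)
        ratios.append(max(len(rc & tc) for tc in truth_cells) / len(rc))
    return sum(ratios) / len(ratios)


def power(reported, truth):
    """Average, over truth clusters, of the fraction covered by all reported clusters."""
    covered = set().union(*(cells(r) for r in reported))
    ratios = [len(cells(t) & covered) / len(cells(t)) for t in truth]
    return sum(ratios) / len(ratios)


# One true cluster of 4 individuals over 100 sites, recovered as two pieces.
truth = [(0, 100, {1, 2, 3, 4})]
reported = [(0, 50, {1, 2}), (50, 100, {1, 2, 3, 4, 5})]
acc = accuracy(reported, truth)
pw = power(reported, truth)
```

In this toy case the first reported piece lies entirely inside the true cluster and the second contains one extra individual, so the accuracy is the mean of 1.0 and 0.8, while three quarters of the true cluster's cells are recovered.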

Power of ROH-DICE in association studies

To evaluate the effectiveness of ROH-DICE in association studies, we used the ROH clusters obtained from a sample of 200 genomes of 10 Mbp. We set the minimum number of variant sites to 100 and the minimum number of samples to 5 (L = 100 and W = 5). For each effect size ranging from 0 to 0.3, we generated 100 phenotypes associated with an ROH cluster using the model Y_i = X_i β + ε_i, where ε_i ~ N(0, σ^2) and σ^2 = 0.1. We chose large effect sizes so that the power can be evaluated even with small sample sizes. Here, X_i equals 1 if sample i belongs to the ROH cluster and 0 otherwise.

The total number of variant sites was 23,566, and we extracted 1263 ROH clusters. We calculated the p-values for both ROH clusters and all variant sites. For each phenotype, we declared an association when the p-value fell below 0.05 divided by the number of tests. For genome-wide association studies (GWAS), only one variant site within the ROH cluster contributing to the phenotype was required to pass the threshold. We tested additive, dominant, and recessive effects (Figure 1—figure supplement 3). The figure demonstrates that ROH-DICE outperforms GWAS when a phenotype is associated with a set of consecutive homozygous sites. At the maximum effect size of 0.3, ROH clusters achieved a power of 100%, whereas the additive model achieved only 11%, and the dominant and recessive models achieved 52% and 70%, respectively. Among the GWAS tests, the recessive model yields the best results; however, its power is still lower than that of ROH clusters.
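The simulation design can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the analysis code: the cluster is assumed to contain 40 carriers (a hypothetical value), the association test is a simple linear regression with a normal approximation to the t statistic, and the significance threshold is 0.05 divided by the 1263 ROH cluster tests.

```python
import math
import random
from statistics import NormalDist, mean


def simulate_phenotype(x, beta, sigma2, rng):
    """Draw Y_i = X_i * beta + N(0, sigma2) for every sample."""
    sd = math.sqrt(sigma2)
    return [xi * beta + rng.gauss(0.0, sd) for xi in x]


def association_p_value(x, y):
    """Two-sided p-value for the regression slope of y on x (normal approximation)."""
    n = len(x)
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    se = math.sqrt(rss / (n - 2) / sxx)
    return 2.0 * (1.0 - NormalDist().cdf(abs(slope / se)))


def estimate_power(beta, n_samples=200, n_carriers=40, n_tests=1263,
                   n_reps=100, sigma2=0.1, seed=1):
    """Fraction of simulated phenotypes whose p-value passes 0.05 / n_tests."""
    rng = random.Random(seed)
    x = [1] * n_carriers + [0] * (n_samples - n_carriers)  # cluster membership
    threshold = 0.05 / n_tests
    hits = 0
    for _ in range(n_reps):
        y = simulate_phenotype(x, beta, sigma2, rng)
        if association_p_value(x, y) < threshold:
            hits += 1
    return hits / n_reps


power_null = estimate_power(0.0)
power_large = estimate_power(0.3)
```

Varying `beta` between 0 and 0.3 traces out a power curve of the kind summarized in Figure 1—figure supplement 3; the numbers produced by this sketch depend on the assumed carrier count and should not be compared directly with the reported values.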

ROH diplotypes in UK Biobank

Here, we searched for clusters of ROH regions in the UK Biobank data (Bycroft et al., 2018). All autosomal chromosomes of all UK Biobank participants (487,409) were searched for ROH regions that are shared among at least 100 individuals and comprise at least 100 consecutive sites. We filtered out 56,972 people with self-reported non-British ethnicity. We chose a minimum number of markers that is large enough to avoid an excessive number of clusters. Moreover, the longer the ROH segment, the more likely it is due to shared ancestry rather than statistical noise. Our objective is also to select clusters with a sufficiently large number of individuals to correlate them with phenotypes. It is worth noting that a minimum cut-off of 100 individuals was commonly used in previous studies (Lencz et al., 2007; Christofidou et al., 2015; Moreno-Grau et al., 2021). On average, ~18% of sites are heterozygous, and thus for a pair of 100-site genotype sequences there is a very small probability that they will be mapped to the same compressed haplotype by chance. Thus, the rate of false positives should be low. To increase statistical power for downstream association tasks, the width-maximal blocks were reported. Running the ROH-DICE program took a wall clock time of 18 hr and 54 min, with all chromosomes processed in parallel (total CPU time of ~242.5 hr). The maximum resident set size for each chromosome was approximately 180 MB. After running the ROH-DICE program, further post-processing steps were conducted. Each individual with more than 1% heterozygous sites within the block was removed from the cluster. Any two clusters with the same consensus and identical starting and ending positions were merged.
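The two post-processing steps can be sketched as follows. The dictionary layout of a cluster (`start`, `end`, `consensus`, `members`) and the function names are assumptions made for this example; `genotypes` maps each individual to the original 0/1/2 genotype calls before binarization.

```python
def filter_members(cluster, genotypes, max_het_rate=0.01):
    """Drop individuals whose original genotypes contain more than
    `max_het_rate` heterozygous calls (value 1) within the cluster's sites."""
    start, end = cluster["start"], cluster["end"]
    n_sites = end - start
    kept = []
    for ind in cluster["members"]:
        het = sum(1 for g in genotypes[ind][start:end] if g == 1)
        if het / n_sites <= max_het_rate:
            kept.append(ind)
    return {**cluster, "members": kept}


def merge_identical(clusters):
    """Merge clusters that share the same start, end, and consensus sequence."""
    merged = {}
    for c in clusters:
        key = (c["start"], c["end"], tuple(c["consensus"]))
        merged.setdefault(key, set()).update(c["members"])
    return [
        {"start": s, "end": e, "consensus": list(cons), "members": sorted(members)}
        for (s, e, cons), members in merged.items()
    ]


# Toy data: 100 sites, individual "c" has 2% heterozygous calls and is removed.
genotypes = {
    "a": [0] * 100,
    "b": [0] * 99 + [1],
    "c": [0] * 98 + [1, 1],
}
cluster = {"start": 0, "end": 100, "consensus": [0] * 100, "members": ["a", "b", "c"]}
filtered = filter_members(cluster, genotypes)

duplicates = [
    {"start": 0, "end": 100, "consensus": [0] * 100, "members": ["a"]},
    {"start": 0, "end": 100, "consensus": [0] * 100, "members": ["b"]},
]
merged = merge_identical(duplicates)
```

Filtering against the original genotypes matters because the random binarization can place an individual with scattered heterozygous calls into a block by chance.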

A total of 1,880,826 ROH clusters (shared among at least 100 individuals and extending over at least 100 consecutive sites) were identified in all 22 autosomal chromosomes (Supplementary file 1). The average length of these ROHs is 553,095 bp (~0.55 cM). The distribution of ROH clusters is very uneven (Figure 3a). Interestingly, the number of ROH clusters in chromosome 6 is the highest. This is mainly due to the excessive number of ROH clusters in the MHC region (65,458). Figure 4 illustrates the genome-wide coverage of the ROH clusters, with visible peaks at chromosomes 2, 6, and 8. A peak region in chromosome 2 (chr2:135755899–136827560) has been reported to harbor a high selection signal (Browning and Browning, 2020). This region contains the lactase (LCT) gene, which includes a variant selected for lactose tolerance in the European population (Itan et al., 2009), though the current understanding of the selection pressure is more nuanced (Mathieson and Terhorst, 2022; Evershed et al., 2022; Le et al., 2022). The most prominent peak in chromosome 6 is located in the MHC region (chr6:28477797–33448354), whose details are shown in Figure 3b. The peak in chromosome 8 (chr8:42531565–42629520) contains two known genes, CHRNB3 and CHRNA6. Previous studies have demonstrated the significant role of the CHRNB3–CHRNA6 gene cluster on chromosome 8 in nicotine dependence (Wen et al., 2016). Additionally, an earlier study has identified strong evidence for selection in the CHRNB3–CHRNA6 region (Sadler et al., 2015). Surprisingly, some clusters comprise more than a hundred thousand individuals sharing the same ROH consensus. The high rate of ROH clusters in the MHC region may be attributed to the high density of markers and low recombination rates (Traherne, 2008; Lam et al., 2013).

We also repeated the analysis after filtering out all ROH clusters shorter than 0.1 cM (Figure 3—figure supplement 1). With this genetic-length filter, chromosome 6 no longer shows the excessive number of ROH clusters seen when clusters are defined only by a minimum number of variant sites. The number of samples in ROH clusters within the MHC region is also reduced significantly. Although there is still a peak, it is comparable to those in other chromosomes such as chromosome 10 or 12 (Figure 4—figure supplement 1). In all subsequent results, we have included clusters with more than 100 sites; however, all the corresponding tables report the genetic length of the clusters. Regions with low recombination rates may contain excessive ROH clusters, but we prefer not to discard them, since doing so would artificially ignore some ROH clusters driven by selection. ROH clusters are abundant in regions with low recombination rates, and their distribution is expected to be population specific. Moreover, the ‘hotspots’ and ‘coldspots’ may vary in different populations (Pemberton et al., 2012). ‘ROH hotspots’ in the study by Pemberton et al., 2012 refer to locations where the single nucleotide polymorphism (SNP)-wise ROH frequency is at or above the 99.5th percentile of all frequencies, where a frequency was defined for each variant site. To enable a comparison with the ROH frequencies from the Pemberton et al., 2012 study, we also calculated a score for each variant site using the ROH clusters intersecting that site. We extracted ROH clusters with more individuals than the 99.5th percentile and fewer than the 0.5th percentile (see Methods section). The top-ranked ROH ‘coldspot’ in the European population is located in chromosome 18 (Pemberton et al., 2012) and is also identified as below the 0.5th percentile using our method. The top-ranked ‘hotspot’ was reported in chromosome 15 for Europeans (Pemberton et al., 2012); it overlaps with a peak for British people in chromosome 15 (72100881–72681976) in our study, where the number of samples in the detected ROH cluster exceeds the 99.5th percentile.
The common hotspots and coldspots are listed in Supplementary files 2 and 3, respectively. However, further investigation may be required to confirm ‘hotspots’ as other factors such as marker density may contribute to excessive clusters in certain regions. We also calculated Spearman’s rank correlation coefficient (ρ) between the two datasets. The correlation coefficient between combined ROH classes in the European population (Pemberton et al., 2012) and the ROH clusters in UKBB was 0.54. Of note, Pemberton et al., 2012 defined three types of ROH clusters (short or class A, intermediate or class B, and long or class C). Our reported ROH regions are based on shared diplotypes with at least 100 SNPs. These regions may not necessarily align with all ROH classes, as variations in length and consensus may lead to differences in ROH regions.

Figure 3 (with 1 supplement)
Total number of detected runs-of-homozygosity (ROH) diplotype clusters in each autosomal chromosome (a) and the detected ROH clusters in the major histocompatibility complex (MHC) region (chr6:28477797–33448354) (b) in hg19.

Some regions may contain multiple overlapping clusters comprising different sets of individuals. The minimum length of the ROH regions was set to 100 sites and the minimum number of individuals to 100.

Figure 4 (with 1 supplement)
Detected runs-of-homozygosity (ROH) diplotype clusters with at least 100 individuals sharing the same consensus with a minimum number of 100 SNPs.

Chromosome 18 has the lowest peak for individuals sharing an ROH diplotype. Chromosomes 2, 6, and 8 contain diplotypes shared with more than 100,000 individuals.

ROH clusters and disease association

We conducted a phenotypic association analysis of the detected ROH diplotype clusters with 445 self-reported non-cancerous diseases, as these phenotypes are conveniently available in the UK Biobank. We first conducted a quick chi-squared test associating membership in each of the 1,880,826 ROH diplotype clusters with each of the 445 phenotypes (see Methods section). The p-values for the 100 regions with the lowest p-values were then re-computed with PHESANT (Millard et al., 2018), adjusting for age, sex, genetic principal components, and genotype measurement batch (for details, see Methods). This identified 61 associations passing the Bonferroni-corrected p-value threshold of 10−12. Table 1 shows the p-values computed by PHESANT for diseases associated with the HLA region (chr6). The p-values for some diseases are very low in both the chi-squared test and the regression analysis using PHESANT. Table 1 also includes, for each cluster, the SNP with the lowest p-value for the corresponding disease, extracted from Neale’s lab results [http://www.nealelab.is/blog/2017/9/15/heritability-of-2000-traits-and-disorders-in-the-uk-biobank]. Most of the clusters with low p-values contain at least one SNP with a very low p-value that is associated with the corresponding disease. The top 100 diplotypes with the lowest p-values using chi-squared tests and PHESANT are included in Supplementary file 4.
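The initial screening step can be illustrated with a plain Pearson chi-squared test on a 2×2 table of cluster membership against disease status. The sketch below is a generic implementation without continuity correction; whether the original analysis applied a correction is not stated here, so that choice, the function name, and the toy counts are assumptions. The last lines compute what a plain 0.05 Bonferroni threshold would be for the number of cluster-phenotype pairs named in the text.

```python
import math


def chi2_2x2(a, b, c, d):
    """Pearson chi-squared test (1 degree of freedom) for the table

                     cases   controls
        carriers       a        b
        non-carriers   c        d

    Returns (statistic, p_value). No continuity correction is applied.
    """
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Survival function of the chi-squared distribution with 1 df.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Toy counts: 20 of 100 carriers affected versus 10 of 100 non-carriers.
stat, p_value = chi2_2x2(20, 80, 10, 90)

# Number of cluster-by-phenotype tests and the corresponding 0.05 Bonferroni level.
n_tests = 1_880_826 * 445
bonferroni = 0.05 / n_tests
```

A closed-form test of this kind is cheap enough to apply to every cluster-phenotype pair, which is why it is suited to a first pass before a covariate-adjusted regression on the top candidates.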

Table 1
Clusters of the runs-of-homozygosity (ROH) diplotypes with the lowest p-values in the HLA region for self-reported diseases using the British population in UK Biobank.

Detailed diplotype consensus sequences are available in Supplementary file 5. The p-values were calculated using PHESANT. Only the region with the lowest p-value has been included for each disease. Beta represents the effect size reported by PHESANT and D′ describes the non-random association of an ROH cluster and the overlapping SNP.

Disease (binary trait) | Diplotype ID | Position (on chr6) | p-value | Beta | Carrier frequency (%) | Odds ratio | Genetic length (cM) | GWAS p-value* | GWAS beta* | GWAS lead SNP* | D′
Ankylosing spondylitis | 1 | 31431031–31464050 | 4.62 × 10−34 | 0.121 | 0.29 | 8.66 | 0.071198 | 0 | 1.45 × 10−2 | rs113340460 | 0.61
Hemochromatosis | 2 | 25969631–26108168 | 8.02 × 10−120 | 0.417 | 0.09 | 24.51 | 0.011597 | - | - | - | -
Malabsorption/coeliac disease | 3 | 32564985–32629755 | 3.41 × 10−259 | 0.315 | 4.12 | 1.64 | 0.005408 | 0 | 7.74 × 10−3 | rs9271352 | 1
Multiple sclerosis | 4 | 32410215–32554129 | 4.36 × 10−45 | 0.192 | 0.37 | 3.79 | 0.012736 | 1.05 × 10−107 | 4.58 × 10−3 | rs9268925 | 0.99
Polymyalgia rheumatica | 5 | 31710968–31794592 | 7.31 × 10−09 | 0.080 | 0.23 | 5.90 | 0.006808 | 1.59 × 10−08 | 6.80 × 10−3 | rs1150748 | 1
Prostate problem (not cancer) | 6 | 34607958–35163974 | 2.84 × 10−08 | 0.082 | 0.18 | 6.94 | 0.034889 | 9.81 × 10−04 | 9.81 × 10−4 | rs76117834 | 0.03
Psoriasis | 7 | 31254263–31263216 | 1.20 × 10−122 | 0.214 | 1.21 | 2.73 | 3.07 × 10−05 | 0 | 1.93 × 10−2 | rs13214872 | 1
Psoriatic arthropathy | 8 | 33072522–33115762 | 8.54 × 10−12 | 0.122 | 0.20 | 3.97 | 0.008708 | 4.76 × 10−10 | 1.01 × 10−3 | rs17221401 | 1
Rheumatoid arthritis | 9 | 32412539–32573760 | 8.15 × 10−122 | 0.208 | 0.23 | 2.34 | 0.01293 | 6.96 × 10−124 | 8.24 × 10−3 | rs188575117 | 0.98

Not surprisingly, the most prominent associations we found are ROH diplotypes in the HLA region with autoimmune diseases. We found that malabsorption/coeliac disease, psoriasis, rheumatoid arthritis, and multiple sclerosis have the strongest association with loci in the HLA region. These results are largely consistent with known literature (Dieli-Crimi et al., 2015; Gutierrez-Achury et al., 2015; Kurkó et al., 2013; Bhalerao and Bowcock, 1998; Baranzini and Oksenberg, 2017; Canela-Xandri et al., 2018). One of the most significant associations we identified is the association between the ROH diplotype at chr6:25988167–26122453 and hemochromatosis (p-value = 9.16 × 10−120). The frequency of the ROH diplotype is only 0.02% and the odds ratio of having the disease for the carrier is 102.21. Interestingly, several other ROH diplotypes at this locus also have a strong association with hemochromatosis (Table 1). This locus is in the extended HLA region and has a low recombination rate. Hemochromatosis is an inherited disorder in which iron levels in the body slowly build up over several years. The gene HFE (chr6:26087509–26095469) is a well-known recessive locus for this disease (Pietrangelo, 2010). The C282Y polymorphism (rs1800562, chr6:26092913) in HFE is the most penetrant but other polymorphisms with lesser penetrance are also known. Interestingly, the minor allele frequency of the SNP rs1800562 is 6% in the European population but it is not genotyped (and is also not available in the imputed panel) in the UK Biobank data. As a result, this association signal has been completely missing in the Neale Lab results. In another study, the SNP has been imputed and a specific association study for the recessive effect between the homozygous alleles of rs1800562 and hemochromatosis has been reported (Tamosauskaite et al., 2019). 
Our approach found this recessive association signal without direct genotyping of any SNP with high linkage disequilibrium (LD) to the causal SNP, demonstrating the power of our approach beyond regular additive effect GWAS. However, we did not verify that this SNP is indeed part of the ROH diplotype as we do not have access to the WGS data.

We also found some loci outside of the HLA region that are presumably associated with non-cancerous diseases (Table 2). The most prominent one is an ROH diplotype at chr1:151515188–151902494 with eczema/dermatitis. This signal overlaps with the GWAS finding of rs4845604 at chr1:151829204 (Johansson et al., 2019).

Table 2
Clusters of the runs-of-homozygosity (ROH) diplotypes with the lowest p-values outside of the HLA region for self-reported diseases using the British population in UK Biobank.

The p-values were calculated using PHESANT.

Disease (binary trait) | Diplotype ID | Position | p-value | Beta | Carrier frequency (%) | Odds ratio | Genetic length (cM) | GWAS p-value* | GWAS beta* | GWAS lead SNP* | D′
Deep venous thrombosis (dvt) | 10 | chr1:169075589–169528830 | 3.10 × 10−21 | 0.039 | 2.08 | 10.49 | 0.56 | 7.41 × 10−166 | −3.13 × 10−2 | rs6025 | 1
Eczema/dermatitis | 11 | chr1:151515188–151902494 | 1.52 × 10−27 | 0.044 | 2.85 | 7.31 | 0.36 | 3.45 × 10−36 | 1.43 × 10−2 | rs55875222 | 1
Eczema/dermatitis | 12 | chr1:151940401–152280032 | 9.46 × 10−24 | 0.053 | 11.76 | 2.07 | 0.12 | 1.35 × 10−64 | 1.84 × 10−2 | rs61815559 | 1
Eczema/dermatitis | 13 | chr1:152493154–152964479 | 1.53 × 10−21 | 0.039 | 2.85 | 7.35 | 0.36 | 1.01 × 10−42 | 1.62 × 10−2 | rs61813875 | 1
Hypothyroidism/myxoedema | 14 | chr12:111910219–112874179 | 4.51 × 10−21 | 0.062 | 5.06 | 1.25 | 0.04 | 1.88 × 10−80 | 9.87 × 10−3 | rs7137828 | 0.99

The beta values for effect size are included in all reported tables. The beta values for the ROH clusters are all positive, indicating that carriers of these ROH diplotypes may have an increased risk of certain non-cancerous diseases. We also used D′ as a measure of linkage between the reported GWAS results and ROH clusters (see Methods section). We found that most of the GWAS results and ROH clusters are strongly correlated. However, in a few cases, D′ is small or close to zero. In such cases, the reported p-value from GWAS was also insignificant, while the ROH cluster indicated a significant association (see Table 1 and Supplementary file 4). The SNP IDs and consensus alleles for all ROH clusters in Tables 1 and 2 are reported in Supplementary file 5.

ROH clusters and COVID-19 association

We computed p-values using the chi-square test for the association between mortality of COVID-19 and the detected ROH regions. We considered only the clusters that had at least 10 cases (individuals who tested positive and passed away in 2020). Figure 5 shows the Manhattan plot for ROH regions and mortality of COVID-19. The most significant ROH region is located at chr4:106318456–106483898 (0.114 cM), with a p-value of 1.63 × 10−10. A total of 4389 individuals share the diplotype, and 76 of them tested positive for COVID-19. Eleven individuals who carried the same ROH consensus and had tested positive died in 2020. In other words, carriers of this diplotype have a fivefold mortality compared to non-carriers among COVID-19 patients. We used the GMMAT (Hoare, 1961) mixed model regression to validate the association of this diplotype while adjusting for age, gender, and genetic similarity (see Methods section). The reported p-value was 1.82 × 10−11, which is even smaller than the p-value from the chi-square test. The region includes the PPA2 gene. The gene product is an inorganic pyrophosphatase located in the mitochondrion (Curbo et al., 2006). Missense mutations in this gene are reported to cause sudden unexpected cardiac arrest in infancy (Guimier et al., 2016). The PPA2 gene has also been recently implicated in COVID-19 through an integrated analysis of GWAS of European patients and lung expression quantitative trait loci (eQTL) data using the summary-data-based Mendelian randomization (SMR) method (Zong and Li, 2021). The identified region linked to COVID-19 mortality also overlaps with the ARHGEF38 gene. A genetic variant within the gene (rs72670002) has been reported to be significantly associated with severe illness from COVID-19 in a recent study that used 24,202 cases of critical COVID-19 (Pairo-Castineira et al., 2023). Other nearby genes within a 200-kb range include TET2, INTS12, and GSTCD.

Figure 5. Associations between runs-of-homozygosity (ROH) diplotypes and mortality of COVID-19.

(a) Manhattan plot of ROH diplotypes across all chromosomes and mortality of COVID-19. Diplotypes with fewer than 10 cases were discarded. (b) UCSC genome browser (https://genome.ucsc.edu) view of the region containing the diplotype with a significant p-value in chromosome 4.

Discussion

In this work, we introduced an efficient algorithm, ROH-DICE, for finding clusters of ROH regions in very large cohorts. The algorithm can find all clusters of ROH regions based on the given parameters: the minimum number of individuals, the minimum length of the ROH regions, and the objective function. The running time of the algorithm is linear in the size of the genotype panel, which enables fast processing of millions of individuals without requiring extensive computing resources.

Using ROH-DICE, we conducted a systematic investigation of ROH diplotype clusters in a large population cohort, the UK Biobank. To the best of our knowledge, no such investigation of the genomic distribution of ROH diplotypes has been conducted previously. We found over 1.8 million ROH diplotype clusters spanning over 100 SNPs and shared by over 100 individuals. While we reported this single data point, the interpretation of the genome-wide ROH diplotype distribution is difficult. First, the expected distribution of ROH diplotype clusters is not known. For populations with an idealized infinite population size, ROH diplotype distribution can be estimated from the allele frequency spectrum. However, for any finite population, when we are looking at haplotypes spanning 100 sites, only a small fraction of possible allele combinations is observed and the distribution will be heavily dependent on the population history. Large ROH clusters can be used to identify signatures of selection in humans or other species. Positive selection reduces haplotype diversity, increasing homozygosity around the target locus, resulting in higher frequencies of ROH in regions containing selection loci (Pemberton et al., 2012; Sabeti et al., 2002). Therefore, excessive ROH regions can be linked to selective sweeps and have been found to coincide with positive selection in humans (Pemberton et al., 2012) and other species (Hewett et al., 2023). Three large ROH clusters in chromosomes 2, 6, and 8 of UKBB overlap with known hotspots for selection signals. It should be noted that although the thresholds of 100 individuals and 100 sites have been used in other studies, they are somewhat arbitrary. While we believe that small variations in these values would not affect the results, using different values such as 200 or 1000 may lead to different ROH clusters. Our preliminary analysis indicates that increasing the length and width of the clusters improves accuracy but reduces power. Future work may investigate the effect of different parameters on the distribution of ROH clusters and downstream analysis.

We found a strong association between non-cancerous diseases and some ROH diplotypes. The majority of ROH regions harboring strong associations with non-cancerous diseases were located in the extended HLA region in chromosome 6. As expected, most of the related diseases were also autoimmune disorders. While the association signals we found mostly overlap with existing GWAS hits, we are testing different genetic effects. The existing GWAS mainly test the additive effects of single SNPs, while we test the recessive effects of relatively long haplotypes. In a sense, our analysis is similar to traditional family-based homozygosity mapping (Lander and Botstein, 1987), but at a population scale. Future work is warranted to fully develop this potential new gene mapping approach. We want to clarify that we are not claiming ROH-DICE to be superior to regular GWAS in all scenarios. Our simulation only demonstrates that ROH-DICE performs better under certain conditions. Specifically, when the causal variant is located in a long ROH diplotype shared by many individuals (ROH diplotype clusters), ROH-DICE outperforms regular GWAS. It is important to note that ROH-DICE is not meant to replace regular GWAS, but to complement it.

The disease associations presented in this work largely do not represent novel discoveries. The significant associations could have been identified in the first place if a recessive mode of inheritance were assumed or a more powerful imputation panel were used. However, there is no guarantee that the sites are well imputed if the LD between the genotyped sites is low. We also showed in our simulation that the ROH clusters would outperform GWAS with an additive or even recessive model in terms of power if a phenotype is associated with a set of consecutive homozygous sites.

We used age, gender, and genetic principal components as confounding variables in the association analysis. Genetic principal components can reduce the confounding effect brought on by population structure, but they may be insufficient to completely eliminate the effects of recent demographic structure and the local environment (Zaidi and Mathieson, 2020). For example, individuals sharing excessive ROH diplotypes may share similar environments since they are closely related and reside close to one another. Since we did not exclude related individuals, some of the reported GWAS signals may not be attributable to ROH.

Our association analysis is a proof of concept and opens up many future opportunities. With our methods, it is possible to extend this analysis to non-disease complex traits. For example, one can investigate whether individuals who share more ROH diplotype clusters have similar phenotypes. Such an analysis may reveal the contribution of dominance variance to the heritability of traits of interest. It will also be interesting to compare the findings with previous research based on genome-wide aggregate ROH content.

Methods

Identification of haplotype clusters in PBWT

The PBWT proposed by Durbin, 2014 facilitates an efficient approach to search for all pairs of long matches in haplotype or genotype panels. The basic idea behind the PBWT search is to sort the sequences in the panel at each site by their reversed prefix order. As a result, matching sequences are placed adjacent to each other. However, at the time we started this project, all existing PBWT algorithms (Durbin, 2014; Naseri et al., 2019c; Naseri et al., 2019a; Sanaullah et al., 2020) were aimed at identifying pairwise matches. In this work, we propose to employ the PBWT data structures to search for clusters of multi-way matches instead of individual pairs of matches. Independent of our work, a couple of algorithms have been proposed to find haplotype blocks in a PBWT panel (Cunha et al., 2018; Alanko et al., 2020). The algorithm by Cunha et al., 2018 may not be feasible for biobank-scale data. The more recently proposed algorithm by Alanko et al., 2020, however, scales well to large data, but it aims to enumerate all maximal haplotype blocks. For datasets comprising hundreds of thousands or millions of individuals, the number of reported clusters of any length may be excessive. Moreover, minimum thresholds on both the number of sites and the number of individuals would be more meaningful for downstream analysis, especially association analysis, where a minimum number of cases is required. Hence, after detecting all possible clusters, filtering has to be applied to remove spurious clusters. Here, we formulate the haplotype blocks problem with two distinct objective functions, which reduces the complexity of filtering the detected clusters afterward.

Block maximal match problems

Based on the different formulations of the problem, we may have different objective functions: the first problem is to find all clusters with at least L sites that are shared among at least t sequences while maximizing the number of sequences for each cluster. Using proper data structures, we can keep track of the starting position of the matches and report them efficiently. The second problem is to find clusters with at least L sites among at least t sequences while maximizing the number of sites for each cluster. Again, the sequences that share a consensus are put in the same block.

PBWT sorting at the site k places sequences with identical reverse prefixes into clusters of matches that are adjacent to each other. We refer to these clusters as blocks, where the number of sequences W is the width of the block in terms of the number of haplotypes, and the length of matches L is the length of the block in terms of the number of sites. Recall the concept of the set maximal match of Durbin, 2014 as the pairwise haplotype match that cannot be extended at either end. We extend the concept of set maximal match to block maximal match, that is, the haplotype match block that cannot be extended. As the block is a 2D object, the extension can be defined either lengthwise or widthwise. Therefore, we can define the lengthwise block maximal match as the matching block that cannot be extended lengthwise. Similarly, the widthwise block maximal match is that which cannot be extended widthwise.

For the PBWT block match problem, the goal is to identify all block maximal matches that have a minimal sequence length L and a minimal width W. Note that for an identified PBWT block, the boundary of the block may not be exactly defined (see Figure 6 for an example). We can either report the block boundary that maximizes the length – length-maximal PBWT block, or the block boundary that maximizes the width – width-maximal PBWT block. We developed exact algorithms for identifying and reporting block maximal matches. This is achieved by using proper data structures tracking the starting position of the matches and the upper and lower boundaries of each matching block. A detailed description of the algorithms is provided in the cPBWT algorithms subsection.

Figure 6. Consensuses of haplotype matches with a minimum length (L) of 3 and a minimum width (W) of 3.

(a) Clusters of haplotypes with two different objectives: maximizing the number of sites and maximizing the number of individuals. The green rectangle ending at site 4 highlights a cluster that meets the requirement of W ≥ 3 and L ≥ 3 while maximizing the number of individuals (width-maximal). The blue rectangles ending at site 4 maximize the number of sites (length-maximal). The blue rectangles ending at site 8 show a cluster with W ≥ 3 and L ≥ 3 maximizing both the number of sites and the number of individuals. This cluster is length-maximal because adding either column 5 or 9 will introduce a mismatch; it is also width-maximal because adding the third haplotype will introduce a mismatch. (b) Two clusters (clusters A and B) with the same starting and ending positions but different consensuses. Therefore, these two clusters are not merged and are considered separate clusters. Each line represents one individual; 0/0 alleles are highlighted in gray and 1/1 alleles in black.

cPBWT algorithms

Maximizing the number of haplotypes

Given a haplotype or genotype panel, the objective is to find all matches greater than a given length L that are shared among at least c haplotypes (or individuals). By sorting the panel at each site, the matches are placed in the same block. The divergence value for each sequence contains the starting position of the match to its preceding sequence in the reversed prefix order. The matches are separated by a sequence with a divergence value greater than k − L. To maximize the number of sequences, the maximum of the divergence values in each block is taken as the starting position of the block. The size of the block should also be at least c to be reported. Algorithm 1 (Supplementary file 6) illustrates the procedure for finding long matches while maximizing the number of haplotypes or sequences in detail. Algorithm 2 (Supplementary file 6) illustrates the procedure for updating the intermediate variables V and Q to compute d_{k+1} and a_{k+1} based on d_k and a_k. The time complexity of this algorithm is O(NM), where N denotes the number of variant sites and M denotes the number of individuals. Divergence values and prefix arrays are computed in linear time for each variant site, and the maximal number of matching blocks at each site is bounded by O(M).
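The procedure can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' Algorithm 1 or Algorithm 2 from Supplementary file 6. It assumes a 0-based binary panel stored as a list of rows, treats a match as long enough when it spans at least L sites, and reports a block ending at site k only when its members do not all carry the same allele at site k + 1 (or when k is the last site). The function names `pbwt_update` and `width_maximal_blocks` are hypothetical.

```python
def pbwt_update(column, a, d, k):
    """One PBWT step: sort by reversed prefix through site k.

    a: sequence indices in the previous sorted order
    d: for each position, start of the match with the sequence above it
    Returns the new (a, d); d == k + 1 means no match ending at site k.
    """
    a0, a1, d0, d1 = [], [], [], []
    p = q = k + 1
    for idx, start in zip(a, d):
        p = max(p, start)
        q = max(q, start)
        if column[idx] == 0:
            a0.append(idx)
            d0.append(p)
            p = 0
        else:
            a1.append(idx)
            d1.append(q)
            q = 0
    return a0 + a1, d0 + d1


def width_maximal_blocks(panel, L, c):
    """Return (start, end, members) for blocks of >= c rows matching over >= L sites."""
    assert c >= 2 and L >= 1
    m = len(panel)
    n = len(panel[0]) if m else 0
    a, d = list(range(m)), [0] * m
    blocks = []
    for k in range(n):
        column = [row[k] for row in panel]
        a, d = pbwt_update(column, a, d, k)
        limit = k + 1 - L  # a match ending at k has length >= L iff start <= limit
        i = 0
        while i < m:
            j = i + 1
            # Grow the run while adjacent sequences still share a long match.
            while j < m and d[j] <= limit:
                j += 1
            if j - i >= c:
                members = a[i:j]
                start = max(d[i + 1:j])  # latest start keeps every member in the block
                is_last = k == n - 1
                if is_last or len({panel[s][k + 1] for s in members}) > 1:
                    blocks.append((start, k, sorted(members)))
            i = j
    return blocks


panel = [
    [0, 1, 0, 1, 1],
    [0, 1, 0, 1, 0],
    [1, 1, 0, 1, 1],
    [0, 1, 0, 1, 1],
]
blocks = width_maximal_blocks(panel, L=3, c=3)
```

In this toy panel, all four rows agree over sites 1 to 3, so the widest block ending at site 3 starts at site 1 rather than at site 0, where only three rows agree. Taking the maximum divergence value within the run is what trades block length for block width.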

Maximizing the length of the match

The objective is to find the longest matches greater than a given length L shared among at least c sequences. A match is not reported if the block of matches can be extended with at least c sequences still matching. For a block to be reported, two conditions should hold: first, at least c sequences carrying one allele should be present in the block, and second, the cth lowest divergence value in the block should be greater than or equal to the cth lowest divergence value of the matches ending with the allele that has at least c occurrences. To find the cth lowest divergence value, the Quickselect algorithm, a modified one-sided version of Quicksort (Hoare, 1961), is used. Quickselect has an average time complexity of O(n), where n denotes the size of the given list. Algorithm 3 (Supplementary file 6) illustrates the procedure of finding long matches while maximizing the length of the match in detail.
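As an illustration of the selection step only, the following is a generic Quickselect sketch in Python. It is not taken from Supplementary file 6. The function name `quickselect`, the random pivot, and the three-way partition are choices made for this example.

```python
import random


def quickselect(values, c):
    """Return the c-th smallest element of values (c is 1-based).

    Runs in O(n) time on average; the input list is not modified.
    """
    if not 1 <= c <= len(values):
        raise ValueError("c must be between 1 and len(values)")
    items = list(values)
    lo, hi = 0, len(items) - 1
    target = c - 1
    while True:
        if lo == hi:
            return items[lo]
        pivot = items[random.randint(lo, hi)]
        # Three-way partition of items[lo..hi] into < pivot, == pivot, > pivot.
        lt, i, gt = lo, lo, hi
        while i <= gt:
            if items[i] < pivot:
                items[lt], items[i] = items[i], items[lt]
                lt += 1
                i += 1
            elif items[i] > pivot:
                items[i], items[gt] = items[gt], items[i]
                gt -= 1
            else:
                i += 1
        # Recurse (iteratively) only into the side that contains the target.
        if target < lt:
            hi = lt - 1
        elif target > gt:
            lo = gt + 1
        else:
            return pivot


divergence_values = [7, 2, 9, 4, 4, 1]
third_lowest = quickselect(divergence_values, 3)
```

Only one side of each partition is visited, which is why the average cost is linear rather than the O(n log n) of a full sort.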

ROH-DICE algorithm

Either of the two cPBWT algorithms can be applied to search for ROH diplotype clusters from genotype data. Maximizing the number of haplotypes would guarantee the inclusion of all samples that may share specific ROH diplotypes. Hence, for association analysis between ROH diplotypes and phenotypes, this optimization would be preferred. On the other hand, maximizing the number of sites would ensure the inclusion of all variant sites between the individuals contributing to the matches, which may be more appropriate for other applications such as studying population structures or imputation.

ROH-DICE maps the genotype sequence x, defined over the alphabet {0,1,2}, into a compressed haplotype sequence y, defined over the alphabet {0,1}. For homozygous sites, the mapping is straightforward: for x_i = 0, y_i = 0; for x_i = 2, y_i = 1. For heterozygous sites (x_i = 1), a random value of 0 or 1 is assigned, each with a probability of ½. The identified maximal matching blocks in the PBWT panel comprising all compressed haplotype sequences {y_i} correspond to the approximate ROH clusters in the original genotype sequences {x_i}. After finding all ROH clusters for a given cut-off, the clusters with identical start and end positions and identical consensus (determined by majority alleles) were merged.
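The mapping can be written in a few lines of Python. This is a sketch of the rule as described in the text, not the ROH-DICE source code. The function name `compress_genotypes` and the optional `rng` argument are hypothetical.

```python
import random


def compress_genotypes(genotypes, rng=None):
    """Map a genotype sequence over {0, 1, 2} to a binary sequence.

    0 -> 0 and 2 -> 1 (homozygous sites); 1 -> 0 or 1 with equal
    probability (heterozygous sites).
    """
    rng = rng or random.Random()
    homozygous = {0: 0, 2: 1}
    compressed = []
    for g in genotypes:
        if g == 1:
            compressed.append(rng.randint(0, 1))  # coin flip at heterozygous sites
        else:
            compressed.append(homozygous[g])  # KeyError for values outside {0, 1, 2}
    return compressed


x = [0, 2, 1, 2, 0]
y = compress_genotypes(x, rng=random.Random(0))
```

Because each heterozygous site is resolved independently per individual, two compressed sequences agree at a shared heterozygous site only half of the time. A match block in the compressed panel is therefore unlikely, though not guaranteed, to run through many heterozygous sites, which is presumably why the resulting clusters are described as approximate.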

Identification of ROH hotspots and coldspots

The frequency of ROH calculated over all three size classes at each SNP in the combined data set from the Pemberton study was downloaded (Pemberton et al., 2012). The genomic locations were lifted over to hg19 using the liftOver tool (Hinrichs et al., 2006). The overlapping ROH cluster from ROH-DICE results with the maximum number of individuals (samples) was assigned for each SNP. ROH hotspots were considered locations where the number of samples in ROH clusters exceeded the 99.5th percentile. ROH coldspots were considered locations where the number of samples in ROH clusters was lower than the 0.5th percentile (equal to 0).

UK Biobank dataset

The phased haplotype data of the UKBB data (version 2) comprising 658,720 sites were extracted. The Data-Field 20002 contains self-reported non-cancer illnesses comprising 445 categories (diseases). For the association analysis, 430,437 individuals of British ethnicity were selected. The ethnic backgrounds were extracted using the Data-Field 21000.

Genetic association analysis

We computed the p-values for each disease in all detected ROH clusters that were present in at least 10 individuals. The p-values were computed using the chi-squared test considering the following numbers: D1: number of individuals in the detected ROH cluster who have the disease. N1: number of individuals in the detected ROH cluster who do not have the disease. D2: total number of individuals with the disease minus D1. N2: M − D1 − N1 − D2, where M denotes the total number of individuals. The 100 regions with the lowest p-values (for any disease) were selected and further investigated using PHESANT (downloaded on August 22, 2018).
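For a concrete view of this test, the sketch below computes the Pearson chi-squared statistic for the 2 × 2 table with rows (D1, N1) and (D2, N2), and its p-value with one degree of freedom. It is an illustration, not the authors' code. It assumes no continuity correction, which the text does not specify. The function names are hypothetical, and the counts in the example are made up.

```python
import math


def chi2_sf_1df(stat):
    """Upper-tail probability of a chi-squared variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))


def roh_cluster_chi2(d1, n1, total_disease, m):
    """Chi-squared test for carriers vs. non-carriers of an ROH cluster.

    d1: carriers with the disease, n1: carriers without the disease,
    total_disease: all individuals with the disease, m: all individuals.
    Returns (statistic, p_value); no continuity correction is applied.
    """
    d2 = total_disease - d1
    n2 = m - d1 - n1 - d2
    row1, row2 = d1 + n1, d2 + n2
    col1, col2 = d1 + d2, n1 + n2
    if min(row1, row2, col1, col2) == 0:
        return 0.0, 1.0  # degenerate table: the test is not informative
    stat = m * (d1 * n2 - n1 * d2) ** 2 / (row1 * row2 * col1 * col2)
    return stat, chi2_sf_1df(stat)


# Hypothetical counts: 100 carriers, 10 of whom have the disease,
# in a cohort of 10,000 with 110 affected individuals overall.
stat, p_value = roh_cluster_chi2(d1=10, n1=90, total_disease=110, m=10000)
```

With one degree of freedom, the p-value equals erfc(√(χ²/2)), so the example needs only the standard library.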

For PHESANT analysis, age was calculated manually using the date of attending the assessment center (53), year of birth (34), and month of birth (52). Sex (31), genetic principal components (22009), number of self-reported non-cancer illnesses (135), genotype measurement batch (22000), and non-cancer illness (20002) fields were also maintained. PHESANT tests the associations of a trait of interest with a set of other phenotypes, and we considered all diplotypes in the 100 regions as traits of interest. Most of the regions include multiple clusters with the same starting and ending positions but different consensus. We considered all of the clusters in the same region as traits of interest (660 traits of interest in total). Regressions were performed on each diplotype cluster separately, so more than one cluster may have been tested in the same region.

Retrieval and annotation using the genetic association result from Neale Lab

Each of the associations (computed by PHESANT) was validated against the GWAS results published by Neale’s lab [http://www.nealelab.is/blog/2017/9/15/heritability-of-2000-traits-and-disorders-in-the-uk-biobank, accessed July 27, 2018]. For each disease in each cluster (according to PHESANT), all reported SNPs within the genomic region of the cluster that were reported to be associated with the disease (according to Neale’s lab results) were searched and the SNP with the lowest p-value was reported.

Linkage pattern analysis between GWAS and ROH-DICE results

In linkage disequilibrium analysis, D and D′ are commonly used measures to quantify the degree of non-random association between alleles at different loci. D measures the difference between the observed frequency of a haplotype and the frequency expected if the alleles at the two loci were independent, while D′ is D normalized by its maximum attainable magnitude given the allele frequencies at each locus. In this study, we adapted these measures from a pair of loci to a pair consisting of a single locus and an ROH cluster.

D′ between an ROH cluster and an SNP overlapping the cluster was calculated by normalizing the D between the ROH cluster membership and the alternate allele of the SNP, similar to linkage analysis between variant sites. Assume pR is the frequency of samples that belong to the cluster, pS is the frequency of the alternate allele, and pRS is the frequency of samples belonging to the cluster and having the alternate allele. We calculate pr as 1 − pR and ps as 1 − pS. Finally, D′ can be calculated by using the following formula:

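$$
D' =
\begin{cases}
\dfrac{D}{\max(-p_R p_S,\ -p_r p_s)} & \text{if } D < 0 \\[2ex]
\dfrac{D}{\min(p_R p_s,\ p_r p_S)} & \text{otherwise}
\end{cases}
$$

where $D = p_{RS} - p_R p_S$.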
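A small Python sketch of this calculation follows. It assumes the standard normalization of D by its extreme attainable value, that is, D / max(−pR·pS, −pr·ps) when D < 0 and D / min(pR·ps, pr·pS) otherwise, which yields a value between 0 and 1. The function name `d_prime` and the example frequencies are hypothetical.

```python
def d_prime(p_R, p_S, p_RS):
    """Normalized disequilibrium between ROH-cluster membership and an SNP allele.

    p_R: frequency of samples in the cluster
    p_S: frequency of the alternate allele
    p_RS: frequency of samples in the cluster that carry the alternate allele
    """
    p_r, p_s = 1.0 - p_R, 1.0 - p_S
    D = p_RS - p_R * p_S
    if D < 0:
        d_max = max(-p_R * p_S, -p_r * p_s)  # most negative value D can take
    else:
        d_max = min(p_R * p_s, p_r * p_S)  # most positive value D can take
    if d_max == 0:
        return 0.0  # a monomorphic locus or an empty cluster carries no signal
    return D / d_max


# Every cluster member carries the alternate allele: complete association.
complete = d_prime(p_R=0.1, p_S=0.2, p_RS=0.1)
# Membership and allele are independent: D is zero up to rounding.
independent = d_prime(p_R=0.1, p_S=0.2, p_RS=0.02)
```

Because the denominator is negative whenever D is negative, the ratio is non-negative in both branches. A value near 1 means that cluster membership and the SNP allele almost never occur in one of the four possible combinations.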

COVID-19 mortality and ROH diplotypes

Two tables, ‘covid19_result.txt’ and ‘death.txt’, provided by the UK Biobank were downloaded on July 24, 2020. The table ‘covid19_result.txt’ contains the test results indicating whether each sample was reported as positive or negative for COVID-19. The table ‘death.txt’ includes the date of death for samples. In the July 24, 2020 release of the table in UK Biobank, 201 British individuals were reported as COVID-19 positive and died in 2020. Those individuals were considered as cases for mortality analysis. A total of 8120 British individuals had been tested for COVID-19. The controls comprised the individuals who had been tested but for whom no death information was provided. We tested all detected ROH diplotypes for COVID-19 mortality association (with at least 10 cases) using the chi-square test. For the chi-square test, the total number of individuals M corresponds to the number of individuals tested for COVID-19 (8120). GMMAT (Chen et al., 2016) was used to recalculate the p-value for the diplotype with the lowest p-value from the chi-square test (chr4:106318456–106483898) while adjusting for age, gender, and the genomic relationship matrix (GRM). The GRM was computed using the kinship coefficients calculated from KING (Manichaikul et al., 2010).

Data availability

This research has been conducted using the UK Biobank Resource (Bycroft et al., 2018) under Application Number 24247. The source code is available at https://github.com/ZhiGroup/ROH-DICE (copy archived at Naseri, 2024).

References

    1. Clark DW
    2. Okada Y
    3. Moore KHS
    4. Mason D
    5. Pirastu N
    6. Gandin I
    7. Mattsson H
    8. Barnes CLK
    9. Lin K
    10. Zhao JH
    11. Deelen P
    12. Rohde R
    13. Schurmann C
    14. Guo X
    15. Giulianini F
    16. Zhang W
    17. Medina-Gomez C
    18. Karlsson R
    19. Bao Y
    20. Bartz TM
    21. Baumbach C
    22. Biino G
    23. Bixley MJ
    24. Brumat M
    25. Chai JF
    26. Corre T
    27. Cousminer DL
    28. Dekker AM
    29. Eccles DA
    30. van Eijk KR
    31. Fuchsberger C
    32. Gao H
    33. Germain M
    34. Gordon SD
    35. de Haan HG
    36. Harris SE
    37. Hofer E
    38. Huerta-Chagoya A
    39. Igartua C
    40. Jansen IE
    41. Jia Y
    42. Kacprowski T
    43. Karlsson T
    44. Kleber ME
    45. Li SA
    46. Li-Gao R
    47. Mahajan A
    48. Matsuda K
    49. Meidtner K
    50. Meng W
    51. Montasser ME
    52. van der Most PJ
    53. Munz M
    54. Nutile T
    55. Palviainen T
    56. Prasad G
    57. Prasad RB
    58. Priyanka TDS
    59. Rizzi F
    60. Salvi E
    61. Sapkota BR
    62. Shriner D
    63. Skotte L
    64. Smart MC
    65. Smith AV
    66. van der Spek A
    67. Spracklen CN
    68. Strawbridge RJ
    69. Tajuddin SM
    70. Trompet S
    71. Turman C
    72. Verweij N
    73. Viberti C
    74. Wang L
    75. Warren HR
    76. Wootton RE
    77. Yanek LR
    78. Yao J
    79. Yousri NA
    80. Zhao W
    81. Adeyemo AA
    82. Afaq S
    83. Aguilar-Salinas CA
    84. Akiyama M
    85. Albert ML
    86. Allison MA
    87. Alver M
    88. Aung T
    89. Azizi F
    90. Bentley AR
    91. Boeing H
    92. Boerwinkle E
    93. Borja JB
    94. de Borst GJ
    95. Bottinger EP
    96. Broer L
    97. Campbell H
    98. Chanock S
    99. Chee ML
    100. Chen G
    101. Chen YDI
    102. Chen Z
    103. Chiu YF
    104. Cocca M
    105. Collins FS
    106. Concas MP
    107. Corley J
    108. Cugliari G
    109. van Dam RM
    110. Damulina A
    111. Daneshpour MS
    112. Day FR
    113. Delgado GE
    114. Dhana K
    115. Doney ASF
    116. Dörr M
    117. Doumatey AP
    118. Dzimiri N
    119. Ebenesersdóttir SS
    120. Elliott J
    121. Elliott P
    122. Ewert R
    123. Felix JF
    124. Fischer K
    125. Freedman BI
    126. Girotto G
    127. Goel A
    128. Gögele M
    129. Goodarzi MO
    130. Graff M
    131. Granot-Hershkovitz E
    132. Grodstein F
    133. Guarrera S
    134. Gudbjartsson DF
    135. Guity K
    136. Gunnarsson B
    137. Guo Y
    138. Hagenaars SP
    139. Haiman CA
    140. Halevy A
    141. Harris TB
    142. Hedayati M
    143. van Heel DA
    144. Hirata M
    145. Höfer I
    146. Hsiung CA
    147. Huang J
    148. Hung YJ
    149. Ikram MA
    150. Jagadeesan A
    151. Jousilahti P
    152. Kamatani Y
    153. Kanai M
    154. Kerrison ND
    155. Kessler T
    156. Khaw KT
    157. Khor CC
    158. de Kleijn DPV
    159. Koh WP
    160. Kolcic I
    161. Kraft P
    162. Krämer BK
    163. Kutalik Z
    164. Kuusisto J
    165. Langenberg C
    166. Launer LJ
    167. Lawlor DA
    168. Lee IT
    169. Lee WJ
    170. Lerch MM
    171. Li L
    172. Liu J
    173. Loh M
    174. London SJ
    175. Loomis S
    176. Lu Y
    177. Luan J
    178. Mägi R
    179. Manichaikul AW
    180. Manunta P
    181. Másson G
    182. Matoba N
    183. Mei XW
    184. Meisinger C
    185. Meitinger T
    186. Mezzavilla M
    187. Milani L
    188. Millwood IY
    189. Momozawa Y
    190. Moore A
    191. Morange PE
    192. Moreno-Macías H
    193. Mori TA
    194. Morrison AC
    195. Muka T
    196. Murakami Y
    197. Murray AD
    198. de Mutsert R
    199. Mychaleckyj JC
    200. Nalls MA
    201. Nauck M
    202. Neville MJ
    203. Nolte IM
    204. Ong KK
    205. Orozco L
    206. Padmanabhan S
    207. Pálsson G
    208. Pankow JS
    209. Pattaro C
    210. Pattie A
    211. Polasek O
    212. Poulter N
    213. Pramstaller PP
    214. Quintana-Murci L
    215. Räikkönen K
    216. Ralhan S
    217. Rao DC
    218. van Rheenen W
    219. Rich SS
    220. Ridker PM
    221. Rietveld CA
    222. Robino A
    223. van Rooij FJA
    224. Ruggiero D
    225. Saba Y
    226. Sabanayagam C
    227. Sabater-Lleal M
    228. Sala CF
    229. Salomaa V
    230. Sandow K
    231. Schmidt H
    232. Scott LJ
    233. Scott WR
    234. Sedaghati-Khayat B
    235. Sennblad B
    236. van Setten J
    237. Sever PJ
    238. Sheu WHH
    239. Shi Y
    240. Shrestha S
    241. Shukla SR
    242. Sigurdsson JK
    243. Sikka TT
    244. Singh JR
    245. Smith BH
    246. Stančáková A
    247. Stanton A
    248. Starr JM
    249. Stefansdottir L
    250. Straker L
    251. Sulem P
    252. Sveinbjornsson G
    253. Swertz MA
    254. Taylor AM
    255. Taylor KD
    256. Terzikhan N
    257. Tham YC
    258. Thorleifsson G
    259. Thorsteinsdottir U
    260. Tillander A
    261. Tracy RP
    262. Tusié-Luna T
    263. Tzoulaki I
    264. Vaccargiu S
    265. Vangipurapu J
    266. Veldink JH
    267. Vitart V
    268. Völker U
    269. Vuoksimaa E
    270. Wakil SM
    271. Waldenberger M
    272. Wander GS
    273. Wang YX
    274. Wareham NJ
    275. Wild S
    276. Yajnik CS
    277. Yuan JM
    278. Zeng L
    279. Zhang L
    280. Zhou J
    281. Amin N
    282. Asselbergs FW
    283. Bakker SJL
    284. Becker DM
    285. Lehne B
    286. Bennett DA
    287. van den Berg LH
    288. Berndt SI
    289. Bharadwaj D
    290. Bielak LF
    291. Bochud M
    292. Boehnke M
    293. Bouchard C
    294. Bradfield JP
    295. Brody JA
    296. Campbell A
    297. Carmi S
    298. Caulfield MJ
    299. Cesarini D
    300. Chambers JC
    301. Chandak GR
    302. Cheng CY
    303. Ciullo M
    304. Cornelis M
    305. Cusi D
    306. Smith GD
    307. Deary IJ
    308. Dorajoo R
    309. van Duijn CM
    310. Ellinghaus D
    311. Erdmann J
    312. Eriksson JG
    313. Evangelou E
    314. Evans MK
    315. Faul JD
    316. Feenstra B
    317. Feitosa M
    318. Foisy S
    319. Franke A
    320. Friedlander Y
    321. Gasparini P
    322. Gieger C
    323. Gonzalez C
    324. Goyette P
    325. Grant SFA
    326. Griffiths LR
    327. Groop L
    328. Gudnason V
    329. Gyllensten U
    330. Hakonarson H
    331. Hamsten A
    332. van der Harst P
    333. Heng CK
    334. Hicks AA
    335. Hochner H
    336. Huikuri H
    337. Hunt SC
    338. Jaddoe VWV
    339. De Jager PL
    340. Johannesson M
    341. Johansson Å
    342. Jonas JB
    343. Jukema JW
    344. Junttila J
    345. Kaprio J
    346. Kardia SLR
    347. Karpe F
    348. Kumari M
    349. Laakso M
    350. van der Laan SW
    351. Lahti J
    352. Laudes M
    353. Lea RA
    354. Lieb W
    355. Lumley T
    356. Martin NG
    357. März W
    358. Matullo G
    359. McCarthy MI
    360. Medland SE
    361. Merriman TR
    362. Metspalu A
    363. Meyer BF
    364. Mohlke KL
    365. Montgomery GW
    366. Mook-Kanamori D
    367. Munroe PB
    368. North KE
    369. Nyholt DR
    370. O’connell JR
    371. Ober C
    372. Oldehinkel AJ
    373. Palmas W
    374. Palmer C
    375. Pasterkamp GG
    376. Patin E
    377. Pennell CE
    378. Perusse L
    379. Peyser PA
    380. Pirastu M
    381. Polderman TJC
    382. Porteous DJ
    383. Posthuma D
    384. Psaty BM
    385. Rioux JD
    386. Rivadeneira F
    387. Rotimi C
    388. Rotter JI
    389. Rudan I
    390. Den Ruijter HM
    391. Sanghera DK
    392. Sattar N
    393. Schmidt R
    394. Schulze MB
    395. Schunkert H
    396. Scott RA
    397. Shuldiner AR
    398. Sim X
    399. Small N
    400. Smith JA
    401. Sotoodehnia N
    402. Tai ES
    403. Teumer A
    404. Timpson NJ
    405. Toniolo D
    406. Tregouet DA
    407. Tuomi T
    408. Vollenweider P
    409. Wang CA
    410. Weir DR
    411. Whitfield JB
    412. Wijmenga C
    413. Wong TY
    414. Wright J
    415. Yang J
    416. Yu L
    417. Zemel BS
    418. Zonderman AB
    419. Perola M
    420. Magnusson PKE
    421. Uitterlinden AG
    422. Kooner JS
    423. Chasman DI
    424. Loos RJF
    425. Franceschini N
    426. Franke L
    427. Haley CS
    428. Hayward C
    429. Walters RG
    430. Perry JRB
    431. Esko T
    432. Helgason A
    433. Stefansson K
    434. Joshi PK
    435. Kubo M
    436. Wilson JF
    (2019) Associations of autozygosity with a broad range of human phenotypes
    Nature Communications 10:4957.
    https://doi.org/10.1038/s41467-019-12283-6
    1. Cunha L
    2. Diekmann Y
    3. Kowada L
    4. Stoye J
    (2018) Identifying maximal perfect haplotype blocks
    In: Alves R, editor. Advances in Bioinformatics and Computational Biology. Springer International Publishing. pp. 26–37.
    https://doi.org/10.1007/978-3-030-01722-4
    1. Joshi PK
    2. Esko T
    3. Mattsson H
    4. Eklund N
    5. Gandin I
    6. Nutile T
    7. Jackson AU
    8. Schurmann C
    9. Smith AV
    10. Zhang W
    11. Okada Y
    12. Stančáková A
    13. Faul JD
    14. Zhao W
    15. Bartz TM
    16. Concas MP
    17. Franceschini N
    18. Enroth S
    19. Vitart V
    20. Trompet S
    21. Guo X
    22. Chasman DI
    23. O’Connel JR
    24. Corre T
    25. Nongmaithem SS
    26. Chen Y
    27. Mangino M
    28. Ruggiero D
    29. Traglia M
    30. Farmaki A-E
    31. Kacprowski T
    32. Bjonnes A
    33. van der Spek A
    34. Wu Y
    35. Giri AK
    36. Yanek LR
    37. Wang L
    38. Hofer E
    39. Rietveld CA
    40. McLeod O
    41. Cornelis MC
    42. Pattaro C
    43. Verweij N
    44. Baumbach C
    45. Abdellaoui A
    46. Warren HR
    47. Vuckovic D
    48. Mei H
    49. Bouchard C
    50. Perry JRB
    51. Cappellani S
    52. Mirza SS
    53. Benton MC
    54. Broeckel U
    55. Medland SE
    56. Lind PA
    57. Malerba G
    58. Drong A
    59. Yengo L
    60. Bielak LF
    61. Zhi D
    62. van der Most PJ
    63. Shriner D
    64. Mägi R
    65. Hemani G
    66. Karaderi T
    67. Wang Z
    68. Liu T
    69. Demuth I
    70. Zhao JH
    71. Meng W
    72. Lataniotis L
    73. van der Laan SW
    74. Bradfield JP
    75. Wood AR
    76. Bonnefond A
    77. Ahluwalia TS
    78. Hall LM
    79. Salvi E
    80. Yazar S
    81. Carstensen L
    82. de Haan HG
    83. Abney M
    84. Afzal U
    85. Allison MA
    86. Amin N
    87. Asselbergs FW
    88. Bakker SJL
    89. Barr RG
    90. Baumeister SE
    91. Benjamin DJ
    92. Bergmann S
    93. Boerwinkle E
    94. Bottinger EP
    95. Campbell A
    96. Chakravarti A
    97. Chan Y
    98. Chanock SJ
    99. Chen C
    100. Chen Y-DI
    101. Collins FS
    102. Connell J
    103. Correa A
    104. Cupples LA
    105. Smith GD
    106. Davies G
    107. Dörr M
    108. Ehret G
    109. Ellis SB
    110. Feenstra B
    111. Feitosa MF
    112. Ford I
    113. Fox CS
    114. Frayling TM
    115. Friedrich N
    116. Geller F
    117. Scotland G
    118. Gillham-Nasenya I
    119. Gottesman O
    120. Graff M
    121. Grodstein F
    122. Gu C
    123. Haley C
    124. Hammond CJ
    125. Harris SE
    126. Harris TB
    127. Hastie ND
    128. Heard-Costa NL
    129. Heikkilä K
    130. Hocking LJ
    131. Homuth G
    132. Hottenga J-J
    133. Huang J
    134. Huffman JE
    135. Hysi PG
    136. Ikram MA
    137. Ingelsson E
    138. Joensuu A
    139. Johansson Å
    140. Jousilahti P
    141. Jukema JW
    142. Kähönen M
    143. Kamatani Y
    144. Kanoni S
    145. Kerr SM
    146. Khan NM
    147. Koellinger P
    148. Koistinen HA
    149. Kooner MK
    150. Kubo M
    151. Kuusisto J
    152. Lahti J
    153. Launer LJ
    154. Lea RA
    155. Lehne B
    156. Lehtimäki T
    157. Liewald DCM
    158. Lind L
    159. Loh M
    160. Lokki M-L
    161. London SJ
    162. Loomis SJ
    163. Loukola A
    164. Lu Y
    165. Lumley T
    166. Lundqvist A
    167. Männistö S
    168. Marques-Vidal P
    169. Masciullo C
    170. Matchan A
    171. Mathias RA
    172. Matsuda K
    173. Meigs JB
    174. Meisinger C
    175. Meitinger T
    176. Menni C
    177. Mentch FD
    178. Mihailov E
    179. Milani L
    180. Montasser ME
    181. Montgomery GW
    182. Morrison A
    183. Myers RH
    184. Nadukuru R
    185. Navarro P
    186. Nelis M
    187. Nieminen MS
    188. Nolte IM
    189. O’Connor GT
    190. Ogunniyi A
    191. Padmanabhan S
    192. Palmas WR
    193. Pankow JS
    194. Patarcic I
    195. Pavani F
    196. Peyser PA
    197. Pietilainen K
    198. Poulter N
    199. Prokopenko I
    200. Ralhan S
    201. Redmond P
    202. Rich SS
    203. Rissanen H
    204. Robino A
    205. Rose LM
    206. Rose R
    207. Sala C
    208. Salako B
    209. Salomaa V
    210. Sarin A-P
    211. Saxena R
    212. Schmidt H
    213. Scott LJ
    214. Scott WR
    215. Sennblad B
    216. Seshadri S
    217. Sever P
    218. Shrestha S
    219. Smith BH
    220. Smith JA
    221. Soranzo N
    222. Sotoodehnia N
    223. Southam L
    224. Stanton AV
    225. Stathopoulou MG
    226. Strauch K
    227. Strawbridge RJ
    228. Suderman MJ
    229. Tandon N
    230. Tang S-T
    231. Taylor KD
    232. Tayo BO
    233. Töglhofer AM
    234. Tomaszewski M
    235. Tšernikova N
    236. Tuomilehto J
    237. Uitterlinden AG
    238. Vaidya D
    239. van Hylckama Vlieg A
    240. van Setten J
    241. Vasankari T
    242. Vedantam S
    243. Vlachopoulou E
    244. Vozzi D
    245. Vuoksimaa E
    246. Waldenberger M
    247. Ware EB
    248. Wentworth-Shields W
    249. Whitfield JB
    250. Wild S
    251. Willemsen G
    252. Yajnik CS
    253. Yao J
    254. Zaza G
    255. Zhu X
    256. The BioBank Japan Project
    257. Salem RM
    258. Melbye M
    259. Bisgaard H
    260. Samani NJ
    261. Cusi D
    262. Mackey DA
    263. Cooper RS
    264. Froguel P
    265. Pasterkamp G
    266. Grant SFA
    267. Hakonarson H
    268. Ferrucci L
    269. Scott RA
    270. Morris AD
    271. Palmer CNA
    272. Dedoussis G
    273. Deloukas P
    274. Bertram L
    275. Lindenberger U
    276. Berndt SI
    277. Lindgren CM
    278. Timpson NJ
    279. Tönjes A
    280. Munroe PB
    281. Sørensen TIA
    282. Rotimi CN
    283. Arnett DK
    284. Oldehinkel AJ
    285. Kardia SLR
    286. Balkau B
    287. Gambaro G
    288. Morris AP
    289. Eriksson JG
    290. Wright MJ
    291. Martin NG
    292. Hunt SC
    293. Starr JM
    294. Deary IJ
    295. Griffiths LR
    296. Tiemeier H
    297. Pirastu N
    298. Kaprio J
    299. Wareham NJ
    300. Pérusse L
    301. Wilson JG
    302. Girotto G
    303. Caulfield MJ
    304. Raitakari O
    305. Boomsma DI
    306. Gieger C
    307. van der Harst P
    308. Hicks AA
    309. Kraft P
    310. Sinisalo J
    311. Knekt P
    312. Johannesson M
    313. Magnusson PKE
    314. Hamsten A
    315. Schmidt R
    316. Borecki IB
    317. Vartiainen E
    318. Becker DM
    319. Bharadwaj D
    320. Mohlke KL
    321. Boehnke M
    322. van Duijn CM
    323. Sanghera DK
    324. Teumer A
    325. Zeggini E
    326. Metspalu A
    327. Gasparini P
    328. Ulivi S
    329. Ober C
    330. Toniolo D
    331. Rudan I
    332. Porteous DJ
    333. Ciullo M
    334. Spector TD
    335. Hayward C
    336. Dupuis J
    337. Loos RJF
    338. Wright AF
    339. Chandak GR
    340. Vollenweider P
    341. Shuldiner A
    342. Ridker PM
    343. Rotter JI
    344. Sattar N
    345. Gyllensten U
    346. North KE
    347. Pirastu M
    348. Psaty BM
    349. Weir DR
    350. Laakso M
    351. Gudnason V
    352. Takahashi A
    353. Chambers JC
    354. Kooner JS
    355. Strachan DP
    356. Campbell H
    357. Hirschhorn JN
    358. Perola M
    359. Polašek O
    360. Wilson JF
    (2015) Directional dominance on stature and cognition in diverse human populations
    Nature 523:459–462.
    https://doi.org/10.1038/nature14618

Article and author information

Author details

  1. Ardalan Naseri

    Department of Computer Science, University of Central Florida, Orlando, United States
    Present address
    School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, United States
    Contribution
    Conceptualization, Data curation, Software, Validation, Investigation, Visualization, Methodology, Writing – original draft
    Competing interests
    No competing interests declared
    ORCID iD: 0000-0002-2747-2193
  2. Degui Zhi

    Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, United States
    Contribution
    Conceptualization, Resources, Supervision, Funding acquisition, Methodology, Writing – original draft
    For correspondence
    degui.zhi@uth.tmc.edu
    Competing interests
    No competing interests declared
    ORCID iD: 0000-0001-7754-1890
  3. Shaojie Zhang

    Department of Computer Science, University of Central Florida, Orlando, United States
    Contribution
    Conceptualization, Resources, Supervision, Funding acquisition, Methodology, Writing – original draft
    For correspondence
    shzhang@cs.ucf.edu
    Competing interests
    No competing interests declared
    ORCID iD: 0000-0002-4051-5549

Funding

National Institutes of Health (R01 HG010086)

  • Ardalan Naseri
  • Degui Zhi
  • Shaojie Zhang

National Institutes of Health (R56 HG011509)

  • Ardalan Naseri
  • Degui Zhi
  • Shaojie Zhang

National Institutes of Health (OT2 OD002751)

  • Ardalan Naseri
  • Degui Zhi

The funders had no role in study design, data collection, and interpretation, or the decision to submit the work for publication.

Acknowledgements

AN, SZ, and DZ were supported by the National Institutes of Health grants R01 HG010086 and R56 HG011509. AN and DZ were also supported by the National Institutes of Health grant OT2 OD002751. We thank Dr. Irmgard Willcockson for proofreading.

Ethics

Our analysis was approved by The University of Texas Health Science Center at Houston committee for the protection of human subjects under No. HSC-SBMI-23-0583. UK Biobank (UKBB) has secured informed consent from the participants for the use of their data in approved research projects. UKBB data were accessed via approved project 24247.

Copyright

© 2024, Naseri et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Cite this article

  1. Ardalan Naseri
  2. Degui Zhi
  3. Shaojie Zhang
(2024)
Discovery of runs-of-homozygosity diplotype clusters and their associations with diseases in UK Biobank
eLife 13:e81698.
https://doi.org/10.7554/eLife.81698
