Efficient estimation for large-scale linkage disequilibrium patterns of the human genome

  1. Xin Huang
  2. Tian-Neng Zhu
  3. Ying-Chao Liu
  4. Guo-An Qi
  5. Jian-Nan Zhang
  6. Guo-Bo Chen  Is a corresponding author
  1. Institute of Bioinformatics, Zhejiang University, China
  2. Center for General Practice Medicine, Department of General Practice Medicine, Zhejiang Provincial People’s Hospital, People’s Hospital of Hangzhou Medical College, China
  3. Center for Reproductive Medicine, Department of Genetic and Genomic Medicine, and Clinical Research Institute, Zhejiang Provincial People’s Hospital, People’s Hospital of Hangzhou Medical College, China
  4. Hainan Institute of Zhejiang University, China
  5. Alibaba Group, China
  6. Key Laboratory of Endocrine Gland Diseases of Zhejiang Province, China

Abstract

In this study, we proposed an efficient algorithm (X-LD) for estimating linkage disequilibrium (LD) patterns for a genomic grid, which can be of inter-chromosomal scale or of small segments. Compared with conventional methods, the proposed method was significantly faster, dropped from O(nm2) to O(n2m)n the sample size and m the number of SNPs, and consequently we were permitted to explore in depth unknown or reveal long-anticipated LD features of the human genome. Having applied the algorithm for 1000 Genome Project (1KG), we found (1) the extended LD, driven by population structure, universally existed, and the strength of inter-chromosomal LD was about 10% of their respective intra-chromosomal LD in relatively homogeneous cohorts, such as FIN, and to nearly 56% in admixed cohort, such as ASW. (2) After splitting each chromosome into upmost of more than a half million grids, we elucidated the LD of the HLA region was nearly 42 folders higher than chromosome 6 in CEU and 11.58 in ASW; on chromosome 11, we observed that the LD of its centromere was nearly 94.05 folders higher than chromosome 11 in YRI and 42.73 in ASW. (3) We uncovered the long-anticipated inversely proportional linear relationship between the length of a chromosome and the strength of chromosomal LD, and their Pearson’s correlation was on average over 0.80 for 26 1KG cohorts. However, this linear norm was so far perturbed by chromosome 11 given its more completely sequenced centromere region. Uniquely chromosome 8 of ASW was found most deviated from the linear norm than any other autosomes. The proposed algorithm has been realized in C++ (called X-LD) and is available at https://github.com/gc5k/gear2, and can be applied to explore LD features in any sequenced populations.

eLife assessment

This study presents a useful new approach for efficient computation of statistics on correlations between genetic variants (linkage disequilibrium, or LD), which the authors apply to quantify the extent of LD across chromosomes. The method and its derivation are solid. The authors document that cross-chromosome LD can be substantial, which has implications for geneticists who are interested in population structure and its impact on genetic association studies.

https://doi.org/10.7554/eLife.90636.3.sa0

Introduction

Linkage disequilibrium (LD) is the association for a pair of loci and the metric of LD serves as the basis for developing genetic applications in agriculture, evolutionary biology, and biomedical research (Weir, 2008; Hill and Robertson, 1966). The structure of LD of the human genome is shaped by many factors, mutation, recombination, population demography, epistatic fitness, and completeness of genomic data itself (Myers et al., 2005; Nei and Li, 1973; Ardlie et al., 2002). Due to its overwhelming cost, LD structure investigation is often compromised to a small genomic region (Chang et al., 2015; Theodoris et al., 2021), and their typical LD structure is as illustrated for a small segment (Barrett et al., 2005). Now, given the availability of large-scale genomic data, such as millions of single-nucleotide polymorphisms (SNPs), the large-scale LD patterns of the human genome play crucial roles in determining genomics studies, and many theories and useful algorithms upon large-scale LD structure, from genome-wide association studies, polygenic risk prediction for complex diseases, and choice for reference panels for genotype imputation (Vilhjálmsson et al., 2015; Yang and Zhou, 2020; Bulik-Sullivan et al., 2015; Yang et al., 2011; Das et al., 2016).

However, there are impediments, largely due to intensified computational cost, in both investigating large-scale LD and providing high-resolution illustrations for their details. If we consider a genomic grid that consists of m2 SNP pairs, given a sample of n individuals and m SNPs (nm)—typically as observed in 1000 Genomes Project (1KG) (Lowy et al., 2019), its benchmark computational time cost for calculating all pairwise LD is O(nm2), a burden that quickly drains computational resources given the volume of the genomic data. In practice, it is of interest to know the mean LD of the mi2 SNP pairs for a genomic grid, which covers mi×mj SNP pairs. Upon how a genomic grid is defined, a genomic grid consequently can consist of (1) the whole genome-wide m2 SNP pairs, and we denote their mean LD as lg ; (2) the intra-chromosomal mean LD for the ith chromosome of mi2 SNP pairs, and denote as li ; and (3) the inter-chromosomal mean LD ith and jth chromosomal mimj SNP pairs, and denoted as lij .

In this study, we propose an efficient algorithm that can estimate lg , li , and lij , the computational time of which can be reduced from O(nmi2) to O(n2mi) for li and O(nmimj) to O(n2mi+n2mj) for lij . The rationale of the proposed method relies on the connection between the genetic relationship matrix (GRM) and LD (Chen, 2014; Goddard, 2009), and in this study a more general transformation from GRM to LD can be established via Isserlis’s theorem (Isserlis, 1918; Zhou, 2017). The statistical properties, such as sampling variance, of the estimated LD have been derived too.

The proposed method can be analogously considered a more powerful realization for Haploview (Barrett et al., 2005), but additional utility can be derived to bring out an unprecedented survey of LD patterns of the human genome. As demonstrated in 1KG, we consequently investigate how biological factors such as population structure, admixture, or variable local recombination rates can shape large-scale LD patterns of the human genomes.

  1. The proposed method provides statistically unbiased estimates for large-scale LD patterns and shows computational merits compared with the conventional methods (Figure 2).

  2. We estimated ℓ and 22 autosomal ℓ and 231 inter-autosomal ℓ⋅ for the 1KG cohorts. There was a ubiquitous existence of extended LD, which was associated with population structure or admixture (Figure 3).

  3. We provided high-resolution illustration that decomposed a chromosome into upmost nearly a million grids, each of which was consisted of 250 × 250 SNP pairs, the highest resolution that has been realized so far at autosomal level (Figure 4); tremendous variable recombination rates led to regional strong LD as highlighted for the HLA region of chromosomes 6 and the centromere region of chromosome 11.

  4. Furthermore, a consequently linear regression constructed could quantify LD decay score genome-widely, and in contrast LD decay was previously surrogated in a computationally expensive method. There was a strong ethnicity effect that was associated with extended LD (Figure 5).

  5. We demonstrate that the strength of autosomal li was inversely proportional to the SNP number, an anticipated relationship that is consistent with genome-wide spread of recombination hotspots. However, chromosome 8 of ASW showed substantial deviation from the fitted linear relationship (Figure 6).

The proposed algorithm has been realized in C++ and is available at https://github.com/gc5k/gear2, (copy archived at Chen, 2023). As tested, the software could handle sample sizes as large as 10,000 individuals.

Methods

The overall rationale for large-scale LD analysis

We assume LD for a pair of biallelic loci is measured by squared Pearson’s correlation, ρl1l22=Dl1l22pl1ql1pl2ql2 , in which Dl1l2 the LD of loci l1 and l2 , p. and q. the reference and the alternative allele frequencies. If we consider the averaged LD for a genomic grid over mi2 SNP pairs, the conventional estimator is l^i=1mi2l1,l2miρl1l22 , and, if we consider the averaged LD for mi and mj SNP pairs between two genomic segments, then l^ij=1mimjl1,l2mi,mjρl1l22 . Now let us consider the 22 human autosomes (Figure 1A). We naturally partition the genome into C=22 blocks, and its genomic LD, denoted as lg, can be expressed as

(1) lg=1m2l1,l2mρl1l22=1m2(iC(l1,l2miρl1l22)+ijC(l1mil2mjρl1l22))=iCmi2m2li+ijCmimjm2lij
Schematic illustration for large-scale linkage disequilibrium (LD) analysis as exampled for CONVERGE cohort.

(A) The 22 human autosomes have consequently 22 l^i and 231 l^ij , without (left) and with (right) scaling transformation; Scaling transformation is given in Equation 8. (B) If zoom into chromosome 2 of 420,946 single-nucleotide polymorphisms (SNPs), a chromosome of relative neutrality is expected to have self-similarity structure that harbors many approximately strong l^u along the diagonal, and relatively weak l^uv off-diagonally. Here chromosome 2 of CONVERGE has been split into 1000 blocks and yielded 1000 l^u LD grids, and 499,500 l^uv LD grids. (C) An illustration of the construction process for the LD-decay regression model.

So we can decompose lg into Cli and C(C1)2 unique lij . Obviously, Equation 1 can be also expressed in the context of a single chromosome li=1βi2(uβilu+uvβiluv), in which βi=mim the number of SNP segments, each of which has m SNPs. Geometrically it leads to βi diagonal grids and βiβi-12 unique off-diagonal grids (Figure 1B).

LD-decay regression

As human genome can be boiled down to small LD blocks by genome-widely spread recombination hotspots (Hinch et al., 2019; Li et al., 2022), mechanically there is self-similarity for each chromosome that the relatively strong li for juxtaposed grids along the diagonal but weak lij for grids slightly off-diagonal. So, for a chromosomal li , we can further express it as

(2) li=1βi2(uβilu+uvβiluv)=E(lu)1βi+E(luv)(11βi)=1βi[E(lu)E(luv)]+E(luv)

in which lu is the mean LD for a diagonal grid, luv the mean LD for off-diagonal grids, and mi the number of SNPs on the ith chromosome. Consider a linear model below (see Figure 1C for its illustration),

(3) l=b0+b1x+e

in which l represents a vector composed of Cli,x , x represents a vector composed of Cxi, and xi=1mi the inversion of the SNP number of the ith chromosome. The regression coefficient and intercept can be estimated as below:

b1=cov(x,l)var(x)=E(xl)E(x)E(l)var(x)

and

b0=E(l)b1E(x)

There are some technical details in order to find the interpretation for b0 and b1 . We itemize them briefly. For the mean and variance of x:

{E(x)=1CiC1mivar(x)=1CiC1mi2(1CiC1mi)2

For E(xl):

E(xl)=iC1mi{E(lummi)+E(luv)(1mmi)}C=iC{E(lummi2)+E(luv)(1mmi)}C=[(E(lu)E(luv))m](1CiC1mi2)+E(luv)(1CiC1mi)

For E(x)E(l):

E(x)E(l)={1CiC1mi}{(1CiC1mi)(E(lu)m)+[1m(1CiC1mi)]E(luv)}=[(E(lu)E(luv))m](1CiC1mi)2+E(luv)(1CiC1mi)

Then we integrate these items to have the expectation for b1:

E(b1)=E(xl)E(x)E(l)var(x)={[(E(lu)E(luv))m](1CiC1mi2)+E(luv)(1CiC1mi)}{[(E(lu)E(luv))m](1CiC1mi)2+E(luv)(1CiC1mi)}1CiC1mi2(1CiC1mi)2=[E(lu)E(luv)]m

Similarly, we plug in E(b1) so as to derive b0 :

E(b0)=E(l)E(b1)E(x)={(1CiC1mi)(E(lu)m)+[1m(1CiC1mi)]E(luv)}{(E(lu)E(luv))m(1CiC1mi)}=E(luv)

After some algebra, if E(lu)E(luv)—say if the former is far greater than the latter, the interpretation of b1 and b0 can be

(4) {E(b1)=E(luluv)mE(lu)mE(b0)=E(luv)

It should be noticed that E(b1)E(lu)m quantifies the averaged LD decay of the genome. Conventional LD decay is analyzed via the well-known LD decay analysis, but Equation 4 provides a direct estimate of both LD decay and possible existence of extended LD. We will see the application of the model in Figure 5 that the strength of the long-distance LD is associated with population structure. Of note, the underlying assumption of Equations 3 and 4 is genome-wide spread of recombination hotspots, an established result that has been revealed and confirmed (Hinch et al., 2019).

Efficient estimation for lg, li, and lij

For the aforementioned analyses, the bottleneck obviously lies in the computational cost in estimating li and lij . li and lij are used to be estimated via the current benchmark algorithm as implemented in PLINK (Chang et al., 2015), and the computational time complex is proportional to O(nm2). We present a novel approach to estimate li and lij . Given a genotypic matrix X, a n×m matrix, if we assume that there are mi and mj SNPs on chromosomes i and j, respectively, we can construct n×n genetic relatedness matrices as below:

(5) {Ki=1miX~iX~iTKj=1mjXj~X~jT

in which X~i is the standardized Xi and x~kl=xkl-2pl2(1+F)plql , where xkl is the genotype for the kth individual at the lth biallelic locus, F is the inbreeding coefficient having the value of 0 for random mating population and 1 for an inbred population, and pl and ql are the frequencies of the reference and the alternative alleles (pl+ql=1), respectively. When GRM is given, we can obtain some statistical characters of Ki . We extract two vectors kio , which stacks the off-diagonal elements of Ki , and kid , which takes the diagonal elements of Ki . The mathematical expectation of kio2 , in which E(kio2)=1n(n1)k1k2nkk1,k22 , can be established according to Isserlis’s theorem in terms of the four-order moment (Isserlis, 1918),

(6) E(kio2)=1mi2n(n1)k1k2nl1,l2mi[(1+θk1k22)ρl1l22+θk1k22]

in which E(θk1k2)=(12)r is the expected relatedness score and r indicates the rth-degree relatives. r=0 for the same individual, and r=1 for the first-degree relatives. Similarly, we can derive for Ekiokjo.Equation 6 establishes the connection between GRM and the aggregated LD estimation that li=E(kio2) . According to Delta method as exampled in Appendix I of Lynch and Walsh, 1998, the means and the sampling variances for li and lij are

(7) {E(kio2)=li=1mi2l1,l2miρl1,l22var(li)=4[var^(kio)]2n(n1)E(kiokko)=lij=1mimjl1,l2=1mi,mjρl12,l2var(kio)=E(kio2)[E(kio)]2=li1(n1)2

in which var(kio)=E(kio2)[E(kio)]2=li1(n1)2 and cov(kio,kjo)=E(kiokjo)E(kio)E(kjo)=lij1(n1)2 , respectively. Of note, the properties of lg can be derived similarly if we replace li with lg in Equation 7. We can develop lij , a scaled version of lij , as below:

(8) lij=lijlilj

in which li=mili1mi1 , a modification that removed the LD with itself. According to Delta method, the sampling variance of lij is

(9) var(lij~)=2(lij~^)2n(n1)[var^(kio)var^(kjo)(cov^(kio,kjo))2+(cov^(kio,kjo))2var^(kio)var^(kjo)2]

Of note, when there is no LD between a pair of loci, l yields zero and its counterpart PLINK estimate yields 1n , a difference that can be reconciled in practice (see Figure 2).

Figure 2 with 2 supplements see all
Reconciliation for linkage disequilibrium (LD) estimators in the 26 cohorts of 1KG.

(A) Consistency examination for the 26 1KG cohorts for their l^i and l^ij estimated by X-LD and PLINK (--r2). In each figure, the 22 l^i fitting line is in purple, whereas the 231 l^ij fitting line is in green. The gray solid line, y=1n+x, in which n the sample size of each cohort, represents the expected fit between PLINK and X-LD estimates, and the two estimated regression models at the top-right corner of each plot show this consistency. The sample size of each cohort is in parentheses. (B) Distribution of R2 of l^i and l^ij fitting lines is based on X-LD and PLINK algorithms in the 26 cohorts; R2 represents variation explained by the fitted model. 26 1KG cohorts: MSL (Mende in Sierra Leone), GWD (Gambian in Western Division, The Gambia), YRI (Yoruba in Ibadan, Nigeria), ESN (Esan in Nigeria), ACB (African Caribbean in Barbados), LWK (Luhya in Webuye, Kenya), ASW (African Ancestry in Southwest US), CHS (Han Chinese South), CDX (Chinese Dai in Xishuangbanna, China), KHV (Kinh in Ho Chi Minh City, Vietnam), CHB (Han Chinese in Beijing, China), JPT (Japanese in Tokyo, Japan), BEB (Bengali in Bangladesh), ITU (Indian Telugu in the UK), STU (Sri Lankan Tamil in the UK), PJL (Punjabi in Lahore, Pakistan), GIH (Gujarati Indian in Houston, TX), TSI (Toscani in Italia), IBS (Iberian populations in Spain), CEU (Utah residents [CEPH] with Northern and Western European ancestry), GBR (British in England and Scotland), FIN (Finnish in Finland); MXL (Mexican Ancestry in Los Angeles, CA), PUR (Puerto Rican in Puerto Rico), CLM (Colombian in Medellin, Colombia), and PEL (Peruvian in Lima, Peru).

Raise of LD due to population structure

In this study, the connection between LD and population structure is bridged via two pathways below, in terms of a pair of loci and of the aggregated LD for all pair of loci. For a pair of loci, their LD is often simplified as ρl1l22=Dl1l22pl1ql1pl2ql2 , but will be inflated if there are subgroups (Nei and Li, 1973). In addition, it is well established the connection between population structure and eigenvalues, and in particular the largest eigenvalue is associated with divergence of subgroups (Patterson et al., 2006). In this study, the existence of subgroups of cohort is surrogated by the largest eigenvalue λ1 or F-stλ1n .

Data description and quality control

The 1KG (Auton et al., 2015), which was launched to produce a deep catalog of human genomic variation by whole-genome sequencing (WGS) or whole-exome sequencing (WES), and 2503 strategically selected individuals of global diversity were included (containing 26 cohorts). We used the following criteria for SNP inclusion for each of the 26 1KG cohorts: (1) autosomal SNPs only; (2) SNPs with missing genotype rates higher than 0.2 were removed, and missing genotypes were imputed; and (3) only SNPs with minor allele frequencies higher than 0.05 were retained. Then 2,997,635 consensus SNPs that were present in each of the 26 cohorts were retained. According to their origins, the 26 cohorts are grouped as African (AFR: MSL, GWD, YRI, ESN, ACB, LWK, and ASW), European (EUR: TSI, IBS, CEU, GBR, and FIN), East Asian (EA: CHS, CDX, KHV, CHB, and JPT), South Asian (SA: BEB, ITU, STU, PJL, and GIH), and American (AMR: MXL, PUR, CLM, and PEL), respectively.

In addition, to test the capacity of the developed software (X-LD), we also included CONVERGE cohort (n=10,640), which was used to investigate major depressive disorder (MDD) in the Han Chinese population (Cai et al., 2015). We performed the same criteria for SNP inclusion as that of the 1KG cohorts, and m=5,215,820 SNPs remained for analyses.

X-LD software implementation

The proposed algorithm has been realized in our X-LD software, which is written in C++ and reads in binary genotype data as often used in PLINK. As multi-thread programming is adopted, the efficiency of X-LD can be improved upon the availability of computational resources. We have tested X-LD in various independent datasets for its reliability and robustness. Certain data management options, such as flexible inclusion or exclusion of chromosomes, have been built into the commands of X-LD. In X-LD, missing genotypes are naively imputed according to Hardy–Weinberg proportions; however, when the missing rate is high, we suggest the genotype matrix should be imputed by other advanced imputation tools.

The most time-consuming part of X-LD was the construction of GRM K=1mX~X~T , and the established computational time complex was O(n2m). However, if X~ is decomposed into X~=[X~t1,X~t2,X~tz,], in which X~[ti,] has dimension of n×B, using Mailman algorithm the computational time complex for building K can be reduced to O(n2mlog3m) (Liberty and Zucker, 2009). This idea of embedding Mailman algorithm into certain high-throughput genomic studies has been successful, and our X-LD software is also leveraged by absorbing its recent practice in genetic application (Wu and Sankararaman, 2018).

Results

Statistical properties of the proposed method

Table 1 introduces the symbols frequently cited in this study. As schematically illustrated in Figure 1, lg could be decomposed into Cli and C(C1)2 unique lij components. We compared the estimated li and lij in X-LD with those being estimated in PLINK (known as ‘--r2,’ and the estimated squared Pearson’s correlation LD is denoted as r2). Considering the substantial computational cost of PLINK, only 100,000 randomly selected autosome SNPs were used for each 1KG cohort, and 22 l^i and 231 l^ij were estimated. After regressing 22 l^i against those of PLINK, we found that the regression slope was close to unity and bore an anticipated intercept a quantity of approximately 1n (Figure 2A and B). In other words, PLINK gave 1n even for SNPs of no LD. However, when regressing 231 l^ij estimates against those of PLINK, it was found that largely because of the tiny quantity of l^ij it was slightly smaller than 1 but statistically insignificant from 1 in these 26 1KG cohorts (mean of 0.86 and SD of 0.10, and its 95% CI was (0.664, 1.056)); when the entire 1KG samples were used, its much larger LD due to subgroups, nearly no estimation bias was found (Figure 2A and B). In contrast, because of their much larger values, l^i components were always consistent with their corresponding estimates from PLINK (mean of 1.03 and SD of 0.012, 95% CI was (1.006, 1.053), bearing an ignorable bias). Furthermore, we also combined the African cohorts together (MSL, GWD, YRI, ESN, LWK, totaling 599 individuals), the East Asian cohorts together (CHS, CDX, KHV, CHB, and JPT, totaling 504 individuals), and the European cohorts together (TSI, IBS, CEU, GBR, and FIN, totaling 503 individuals), and the resemblance pattern between X-LD and PLINK was similar as observed in each cohort alone (Figure 2—figure supplement 1). The empirical data in 1KG verified that the proposed method was sufficiently accurate.

Table 1
Notation definitions.
NotationDefinition
CThe number of chromosomes.
i and jSubscripts index chromosome i and j.
βiThe number of SNP segments of chromosome i, each of which has m SNPs.
Dl1l2The difference between the observed and expected haplotype frequencies, with Dl1l2=pl1l2-pl1pl2 .
FThe inbreeding coefficient.
KiGenetic relatedness matrix for chromosome i, and two vectors, kio and kid , from Ki , where kio stacks the off-diagonal elements and kid stacks the diagonal elements.
kSubscript indexes individual.
l1 and l2Subscripts index a pair of SNPs.
mThe number of SNPs; mi the number of SNPs on chromosome i.
nThe number of samples; ni , the number of samples in subpopulation i.
pl and qlFrequency of the lth reference allele and alternative allele in the population.
θk1k2The relatedness score between individual k1 and k2 .
xklThe genotype for the kth individual at the lth biallelic locus.
Xi and X~iGenotype and standardized genotype matrixes for chromosome i.
ρl1l22Squared Pearson’s correlation coefficient for any pair of SNPs, including an SNP to itself when l1=l2 .
r2Squared Pearson’s correlation metric for LD but estimated from PLINK (--r2) or PopLDdecay.
lgThe mean LD of the whole genome-wide m2 SNP pairs.
liThe intra-chromosomal mean LD for the ith chromosome of mi2 SNP pairs.
lijThe inter-chromosomal mean LD ith and jth chromosomal mimj SNP pairs, a scaled version is ij .
luThe mean LD for a diagonal grid.
luvThe mean LD for off-diagonal grids.
  1. LD, linkage disequilibrium; SNP, single-nucleotide polymorphism.

To fairly evaluate the computational efficiency of the proposed method, the benchmark comparison was conducted on the first chromosome of the entire 1KG dataset (n=2,503 and m=225,967), and 10 CPUs were used for multi-thread computing. Compared with PLINK, the calculation efficiency of X-LD was nearly 30–40 times faster for the tested chromosome, and its computational time of X-LD was proportional to O(n2mlog3m) (Figure 2—figure supplement 2). So, X-LD provided a feasible and reliable estimation of large-scale complex LD patterns. More detailed computational time of the tested tasks is reported in their corresponding sections below; since each 1KG cohort had a sample size of around 100, otherwise specified the computational time was reported for CHB (n=103) as a reference (Table 2). In order to test the capability of the software, the largest dataset tested was CONVERGE (n=10,640, and m=5,215,820), and it took 77,508.00 s, about 22 hr, to estimate 22 autosomal l^i and 231 l^ij (Figure 1A); when zooming into chromosome 2 of CONVERGE, on which 420,949 SNPs had been evenly split into 1000 blocks and yielded 1000 l^u grids, and 499,500 l^uv LD grids, it took 45,125.00 s, about 12.6 hr, to finish the task (Figure 1B).

Table 2
Computational time for the demonstrated estimation tasks.
CohortTask descriptionTime costComputational time complex
CHB (n=103, m=2,997,655)Estimation for 22 autosomal i , and 231 inter-chromosomal ij . For results, see Figure 3 and Table 3.101,34 sO(n2m)
1KG (n=2,503, m=2,997,655)Same as above.3008.29 sSame as above
CONVERGE (n=10,640, m=5,215,820)Same as above. For results, see Figure 1A.77,508.00 sSame as above
Estimation for high-resolution LD interaction given bin size of 250 SNPs
CHB (n=103, m2=241,241)Chromosome 2, estimation for 965 li, and 465,130 lij . For results, see Figure 4.66.86 sO(n2(mi+(mi250)2))
CHB (n=103, m22=40,378)Chromosome 22, estimation for 162 li, and 13,041 lij . For results, see Figure 4.3.22 sSame as above
CONVERGE (n=10,640, m22=71,407)Chromosome 22, estimation for 286 li, and 40,755 lij .8,736.29 sSame as above
CONVERGE (n=10,640, m2=420,949)Chromosome 2, estimation for 1000 li, and 499,500 lij . For results, see Figure 1B.45,125.00 sChromosome 2 was split into 1000 blocks, each of which had about 420 SNPs
  1. For the sake of fair comparison, 10 CPUs were used for multi-thread computing.

  2. LD, linkage disequilibrium; SNP, single-nucleotide polymorphism.

Ubiquitously extended LD and population structure/admixture

We partitioned the 2,997,635 SNPs into 22 autosomes ( Figure 3A , Figure 3—figure supplement 1), and the general LD patterns were as illustrated for CEU, CHB, YRI, ASW, and 1KG. As expected, l^ij<l^g<l^i for each cohort (Figure 3B). As observed in these 1KG cohorts, all three LD measures were associated with population structure, which was surrogated by F-stλ1n , and their squared correlation R2 was greater than 0.8. ACB, ASW, PEL, and MXL, which all showed certain admixture, tended to have much greater l^g , l^i , and l^ij (Table 3 and Figure 3B). In contrast, East Asian (EA) and European (EUR)-orientated cohorts, which showed little within-cohort genetic differentiation—as their largest eigenvalues were slightly greater than 1—had their aggregated LD relatively low and resembled each other (Table 3). Furthermore, for several European (TSI, IBS, and FIN) and East Asian (JPT) cohorts, the ratio between l^ij and l^i components could be smaller than 0.1, and the smallest ratio was found to be about 0.091 in FIN. The largest ratio was found in 1KG that l^ij=5.7e3 and l^i=6.5e3, and the ratio was 0.877 because of the inflated LD due to population structure. A more concise statistic to describe the ratio between lij and li was lij (Equation 8), and the corresponding value for 231 scaled lij for FIN was l^ij=0.10 (SD of 0.027) and for 1KG was l^ij=0.88 (SD of 0.028).

Figure 3 with 1 supplement see all
Various linkage disequilibrium (LD) components for the 26 1KG cohorts.

(A) Chromosomal scale LD components for five representative cohorts (CEU, CHB, YRI, ASW, and 1KG). The upper parts of each figure represent l^i (along the diagonal) and l^ij (off-diagonal), and the lower part l^ij as in Equation 8. For visualization purposes, the quantity of LD before scaling is transformed to a -log10 scale, with smaller values (red hues) representing larger LD, and a value of 0 representing that all single-nucleotide polymorphisms (SNPs) are in LD. (B) The relationship between the degree of population structure (approximated by F-st) and l^i , l^g , and l^ij in the 26 1KG cohorts.

Table 3
X-LD estimation for complex LD components (2,997,635 SNPs).
Cohort (n)Ancestryλ1(Fst)*l^g (SE)l¯i^ (SD) l¯ij^ (SD) l^ij (SD) Lower bound of LD §
MSL (85)AFR1.10 (0.013)1.9e-4 (1.21e-6)6.9e-4 (2.0e-4)1.7e-4 (1.7e-5)0.26 (0.053)0.161971831
GWD (113)AFR1.07 (0.009)1.1e-4 (5.61e-7)6.0e-4 (2.0e-4)8.7e-5 (8.1e-6)0.16 (0.037)0.247218789
YRI (107)AFR1.05 (0.010)1.1e-4 (4.23e-7)5.9e-4 (2.0e-4)8.8e-5 (6.9e-6)0.16 (0.04)0.242001641
ESN (99)AFR1.09 (0.011)1.4e-4 (7.67e-7)7.0e-4 (2.2e-4)1.2e-4 (1.2e-5)0.19 (0.043)0.217391304
ACB (96)AFR2.01 (0.021)2.9e-4 (3.78e-6)9.1e-4 (2.5e-4)2.5e-4 (3.6e-5)0.29 (0.070)0.147727273
LWK (99)AFR1.35 (0.014)2.2e-4 (2.38e-6)8.4e-4 (2.5e-4)1.9e-4 (3.2e-5)0.24 (0.052)0.173913043
ASW (61)AFR1.90 (0.031)1.1e-3 (2.73e-5)2.0e-3 (3.2e-4)1.1e-3 (6.2e-5)0.57 (0.059)0.079681275
CHS (105)EA1.08 (0.010)1.4e-4 (9.39e-7)9.5e-4 (3.4e-4)1.0e-4 (1.3e-5)0.12 (0.030)0.31147541
CDX (93)EA1.11 (0.012)1.8e-4 (1.38e-6)1.1e-3 (3.6e-4)1.4e-4 (2.0e-5)0.14 (0.040)0.272277228
KHV (99)EA1.07 (0.011)1.4e-4 (7.67e-7)9.5e-4 (3.5e-4)1.0e-4 (1.2e-5)0.12 (0.031)0.31147541
CHB (103)EA1.07 (0.010)1.3e-4 (6.94e-7)9.3e-4 (3.4e-4)9.5e-5 (1.1e-5)0.11 (0.030)0.317948718
JPT (104)EA1.06 (0.010)1.3e-4 (7.22e-7)1.0e-3 (3.8e-4)9.3e-5 (1.2e-5)0.10 (0.028)0.338638673
BEB (86)SA1.07 (0.012)1.7e-4 (8.09e-7)9.1e-4 (3.1e-4)1.4e-4 (1.5e-5)0.17 (0.042)0.236363636
ITU (102)SA1.61 (0.016)1.9e-4 (1.84e-6)9.5e-4 (3.1e-4)1.5e-4 (1.7e-5)0.18 (0.044)0.231707317
STU (102)SA1.56 (0.015)2.6e-4 (3.21e-6)1.0e-3 (3.3e-4)2.3e-4 (3.1e-5)0.23 (0.047)0.171526587
PJL (96)SA1.67 (0.017)2.4e-4 (2.74e-6)1.1e-3 (3.4e-4)2.0e-4 (2.2e-5)0.21 (0.048)0.20754717
GIH (103)SA1.73 (0.017)2.7e-4 (3.41e-6)1.1e-3 (3.4e-4)2.4e-4 (1.9e-5)0.23 (0.049)0.179153094
TSI (107)EUR1.07 (0.010)1.2e-4 (6.10e-7)9.1e-4 (3.3e-4)9.0e-5 (1.1e-5)0.11 (0.029)0.325
IBS (107)EUR1.07 (0.010)1.2e-4 (6.10e-7)9.1e-4 (3.3e-4)8.8e-5 (1.1e-5)0.11 (0.028)0.329949239
CEU (99)EUR1.07 (0.011)1.4e-4 (7.67e-7)9.6e-4 (3.4e-4)1.1e-4 (1.3e-5)0.12 (0.030)0.293577982
GBR (91)EUR1.11 (0.012)1.7e-4 (1.08e-6)1.0e-3 (3.6e-4)1.4e-4 (1.8e-5)0.15 (0.036)0.253807107
FIN (99)EUR1.09 (0.011)1.5e-4 (9.69e-7)1.1e-3 (3.8e-4)1.0e-4 (1.5e-5)0.10 (0.027)0.34375
MXL (64)AMR2.29 (0.036)7.2e-4 (1.49e-5)2.1e-3 (4.1e-4)6.3e-4 (9.6e-5)0.32 (0.072)0.136986301
PUR (104)AMR1.43 (0.014)1.6e-4 (1.30e-6)1.2e-3 (4.2e-4)1.2e-4 (1.7e-5)0.11 (0.026)0.322580645
CLM (94)AMR1.58 (0.017)2.3e-4 (2.49e-6)1.4e-3 (4.5e-4)1.7e-4 (2.6e-5)0.13 (0.035)0.281690141
PEL (85)AMR2.38 (0.028)4.5e-4 (7.33e-6)1.9e-3 (5.1e-4)3.7e-4 (8.5e-5)0.21 (0.062)0.196483971
1KG (2503)MIX164.20 (0.066)5.8e-3 (4.63e-6)6.5e-3 (4.1e-4)5.7e-3 (2.4e-4)0.88 (0.028)0.051505547
  1. LD, linkage disequilibrium; SNPs, single-nucleotide polymorphisms.

  2. *

    Eigenvalue was estimated. In parentheses is the ratio between the listed largest eigenvalue and the sample size. Since there exists an approximation that F-stλ1n , the ratio can be taken as an approximation of population structure.

  3. Standard error was calculated as 2n(n1)[l^g1(n1)2], as Equation 7.

  4. Estimated empirically from C chromosomal l^i ; Estimated empirically from C(C1)2 inter-chromosomal l^ij .

  5. §

    It is estimated by 22¯^i22¯^i+231¯^ij , indicating lower bound of true LD.

In terms of computational time, for 103 CHB samples, it took about 101.34 s to estimate 22 autosomal l^i and 231 l^ij ; for all 1KG 2503 samples, X-LD took about 3008.29 s (Table 1). Conventional methods took too long to complete the analyses in this section, so no comparable computational time was provided. For detailed 22 l^i and 231 l^ij estimates for each 1KG cohort, please refer to Supplementary file 1 (Excel sheet 1–27).

Detecting exceedingly high LD grids shaped by variable recombination rates

We further explored each autosome with high-resolution grid LD visualization. We set m=250, so each grid had the luv for 250 × 250 SNP pairs. The computational time complex was O(n2(mi+βi24)), in which βi=mi250 , and with our proposed method in CHB it cost 66.86 s for chromosome 2, which had 241,241 SNPs and totaled 466,095 unique grids, and 3.22 s only for chromosome 22, which had 40,378 SNPs and totaled 13,203 unique grids (Table 1). In contrast, under conventional methods those LD grids were not very likely to be exhaustively surveyed because its computational cost was O(nmi2): for CHB chromosome 2, it would have taken about 40 hr as estimated. As the result was very similar for m=500 (Figure 4—figure supplement 1), we only report the results under m=250 below.

As expected, chromosome 6 (206,165 SNPs, totaling 340,725 unique grids) had its HLA cluster showing much higher LD than the rest of chromosome 6. In addition, we found a very dramatic variation of the HLA cluster LD l^HLA (28,477,797–33,448,354 bp, totaling 3160 unique grids) across ethnicities. For CEU, CHB, YRI, and ASW, their l^6=0.0010, 0.00090, 0.00064, and 0.0019, respectively, but their corresponding HLA cluster grids had l^HLA=0.042, 0.029, 0.025, and 0.022, respectively (Figure 4). Consequently, the largest ratio for l^HLAl^6 was 42.00 in CEU, 39.06 in YRI, and 32.22 in CHB, but was reduced to 11.58 in ASW. Before the release of CHM13 (Hoyt et al., 2022), chromosome 11 had the most completely sequenced centromere region, which had much rarer recombination events, and all four cohorts showed a strong LD l^11.c around the centromere (46,061,947–59,413,484 bp, totaling 1035 unique grids) regardless of their ethnicities (Figure 4). l^11=0.0012, 0.0012, 0.00084, and 0.0022, respectively, and l^11.c=0.098, 0.10, 0.079, and 0.094, respectively; the ratio for l^11.cl^11= 81.67, 83,33, and 94,05, for CEU, CHB, and YRI, respectively; the lowest ratio was found in ASW of 42.73. In addition, removing the HLA region of chromosome 6 or the centromere region of chromosome 11 would significantly reduce l^6 or l^11 in comparison with the random removal of other regions (Figure 4—figure supplement 2).

Figure 4 with 2 supplements see all
High-resolution illustration for linkage disequilibrium (LD) grids for CEU, CHB, YRI, and ASW (m=250).

For each cohort, we partition chromosomes 6 and 11 into high-resolution LD grids (each LD grid contains 250 ×250 single-nucleotide polymorphism [SNP] pairs). The bottom half of each figure shows the LD grids for the entire chromosome. Further zooming into HLA on chromosome 6 and the centromere region on chromosome 11, and their detailed LD in the relevant regions are also provided in the upper half of each figure. For visualization purposes, LD is transformed to a -log10-scale, with smaller values (red hues) representing larger LD, and a value of 0 representing that all SNPs are in LD.

Model-based LD decay regression revealed LD composition

The real LD block size was not exact of m=250 or m=500, but an unknown parameter that should be inferred in computational intensive ‘LD decay’ analysis (Zhang et al., 2019; Chang et al., 2015). We conducted the conventional LD decay for the 26 1KG cohorts (Figure 5A), and the time cost was 1491.94 s for CHB. For each cohort, we took the area under the LD decay curve in the LD decay plot, and it quantified approximately the LD decay score for each cohort. The smallest score was 0.0421 for MSL, and the largest was 0.0598 for PEL (Table 5). However, this estimation did not take into account the real extent of LD, so it was not precise enough to reflect the LD decay score. For example, for admixture population, such as the American cohorts, the extent of LD would be longer.

Figure 5 with 1 supplement see all
Linkage disequilibrium (LD) decay analysis for 26 1KG cohorts.

(A) Conventional LD decay analysis in PLINK for 26 cohorts. To eliminate the influence of sample size, the inverse of sample size has been subtracted from the original LD values. The YRI cohort, represented by the orange dotted line, is chosen as the reference cohort in each plot. The top-down arrow shows the order of LDdecay values according to Table 5. (B) Model-based LD decay analysis for the 26 1KG cohorts. We regressed each autosomal l^i against its corresponding inversion of the single-nucleotide polymorphism (SNP) number for each cohort. Regression coefficient b1 quantifies the averaged LD decay of the genome and intercept b0 provides a direct estimate of the possible existence of long-distance LD. The R values in the first three plots indicate the correlation between b^1 and LD decay score in three different physical distance and the correlation between b^1 (left-side vertical axis) and LD decay score (right-side vertical axis) and the correlation between b^0 (left-side vertical axis) and F-st (right-side vertical axis), respectively. The last plot assessed the impact of centromere region of chromosome 11 on the linear relationship between chromosomal LD and the inverse of the SNP number. The dark and light gray dashed lines represent the mean of the R with and without the presence of centromere region of chromosome 11.

In contrast, we proposed a model-based method, as given in Equation 3, which could estimate LD decay score (regression coefficient b1) and long-distance LD score (intercept b0) jointly. Given the estimated 22 l^i (Supplementary file 1; see Table 4 for four representative cohorts and Supplementary R code), we regressed each autosomal l^i against its corresponding inversion of SNP number, and all yielded positive slopes (Pearson’s correlation R>0.80, Table 5 and Figure 5B), an observation that was consistent with genome-wide spread of recombination hotspots. This linear relationship could consequently be considered the norm for a relatively homogeneous population as observed in most 1KG cohorts (Figure 5—figure supplement 1), while for all the 2503 1KG samples R=0.55 only (Table 5), indicating that the population structure and possible differentiated recombination hotspots across ethnicities disturbed the assumption underlying Equation 3 and smeared the linearity. We extracted b^0 and b^1 for the 26 1KG cohorts for further analysis. The rates of LD decay score, as indicated by b^1 , within the African cohorts (AFR) were significantly faster than the other continents, consistent with previous observation that the African population had relatively shorter LD Gabriel et al., 2002; while subgroups within the American continent (AMR) tended to have extended LD range due to their admixed genetic composition (Table 4 and Figure 5). Notably, the correlation between b^1 and the approximated LD decay score was R=0.88. The estimated F-st was highly correlated with b^0 (R=0.94).

Table 4
Estimates for 22 autosomal l^i in CEU, CHB, YRI, and ASW, respectively.
ChromosomeSNP numberl^i
CEUCHBYRIASW
1225,9675.0e-4 (8.2e-6)0.00049 (7.8e-6)0.00032 (4.3e-6)0.0015 (4e-05)
2241,2415.0e-4 (8.1e-6)5.0e-4 (7.9e-6)3.0e-4 (4.1e-6)0.0015 (4e-05)
3212,6706.0e-04 (1.0e-5)0.00058 (9.5e-6)0.00039 (5.7e-6)0.0018 (5.1e-5)
4222,2410.00062 (1.0e-5)0.00061 (1.0e-5)0.00038 (5.4e-6)0.0018 (5.0e-5)
5193,6320.00069 (1.2e-5)7.0e-04 (1.2e-5)0.00043 (6.5e-6)0.0018 (4.9e-5)
6206,1650.0010 (1.9e-5)9.0e-04 (1.6e-5)0.00064 (1.0e-5)0.0019 (5.4e-5)
7177,4140.00073 (1.3e-5)0.00071 (1.2e-5)0.00045 (6.8e-6)0.0016 (4.3e-5)
8163,4360.00075 (1.3e-5)0.00069 (1.2e-5)0.00043 (6.5e-6)0.0022 (6.4e-5)
9129,4400.00074 (1.3e-5)0.00074 (1.3e-5)0.00047 (7.2e-6)0.0018 (5.0e-5)
10152,2510.00078 (1.4e-5)8.0e-04 (1.4e-5)0.00058 (9.3e-6)0.0019 (5.6e-5)
11151,7510.0012 (2.3e-5)0.0012 (2.2e-5)0.00084 (1.4e-5)0.0022 (6.2e-5)
12139,6848.0e-4 (1.4e-5)0.00073 (1.2e-5)0.00049 (7.5e-6)0.0017 (4.8e-5)
13113,3900.0010 (1.8e-5)0.00094 (1.6e-5)0.00061 (9.8e-6)0.0018 (4.9e-5)
1497,3350.0011 (2.0e-5)0.0010 (1.8e-5)0.00065 (1.1e-5)0.0020 (5.6e-5)
1585,3070.0010 (1.8e-5)0.00098 (1.7e-5)6.0e-4 (9.6e-6)0.0020 (5.8e-5)
1692,0070.00088 (1.6e-5)0.00084 (1.5e-5)0.00054 (8.4e-6)0.0021 (6.2e-5)
1779,4780.0012 (2.3e-5)0.0011 (2.0e-5)0.00069 (1.1e-5)0.0021 (6.0e-5)
1887,1050.0010 (1.8e-5)0.00095 (1.7e-5)0.00058 (9.2e-6)0.0023 (6.8e-5)
1972,7940.0012 (2.3e-05)0.0012 (2.1e-5)0.00082 (1.4e-5)0.0022 (6.2e-5)
2068,8810.0014 (2.6e-5)0.0015 (2.7e-5)0.00078 (1.3e-5)0.0024 (7.0e-5)
2145,0680.0018 (3.4e-5)0.0017 (3.2e-5)0.00098 (1.7e-5)0.0024 (7.1e-5)
2240,3780.0016 (3.1e-5)0.0016 (2.9e-5)0.0010 (1.8e-5)0.0027 (8.1e-5)
  1. Each l^i and its standard error are in parentheses, as estimated in Equation 7.

  2. SNP, single-nucleotide polymorphism.

Table 5
LD decay regression analysis for 26 cohorts.
Cohort (n)LD-decay regression*Population parameters
b^0b^1RLD decay scoreFst¯ (%)AncestryTrue LD
MSL (85)0.0004129.970.840.04210.013AFR0.62727273
GWD (113)0.0003130.170.830.04390.009AFR0.65934066
YRI (107)0.0003030.640.850.04360.010AFR0.66292135
ESN (99)0.0003734.820.870.04360.011AFR0.65420561
ACB (96)0.0005339.620.880.04510.021AFR0.63194444
LWK (99)0.0004640.520.920.04470.014AFR0.64615385
ASW (61)0.001546.880.830.04720.031AFR0.57142857
CHS (105)0.0004652.360.870.05550.010EA0.67375887
CDX (93)0.0005553.770.830.05570.012EA0.66666667
KHV (99)0.0004453.790.870.05600.011EA0.68345324
CHB (103)0.0004154.900.900.05580.010EA0.69402985
JPT (104)0.0004557.750.850.05680.010EA0.68965517
BEB (86)0.0004548.840.880.05560.012SA0.66911765
ITU (102)0.0004849.580.890.05460.016SA0.66433566
STU (102)0.0005552.840.890.05460.015SA0.64516129
PJL (96)0.0005454.000.900.05460.017SA0.67073171
GIH (103)0.0005755.810.910.05620.017SA0.65868263
TSI (107)0.0004153.170.910.05580.010EUR0.68939394
IBS (107)0.0003954.220.920.05550.010EUR0.7
CEU (99)0.0004554.230.890.05590.011EUR0.68085106
GBR (91)0.0004758.230.910.05550.012EUR0.68027211
FIN (99)0.0005459.240.860.05790.011EUR0.67073171
MXL (64)0.001466.130.890.05580.036AMR0.6
PUR (104)0.0005967.200.890.05710.014AMR0.67039106
CLM (94)0.0006975.970.950.05720.017AMR0.66985646
PEL (85)0.001278.150.850.05980.028AMR0.61290323
1KG (2503)0.006140.650.550.066Mixed0.51587302
  1. LD, linkage disequilibrium; SNP, single-nucleotide polymorphism.

  2. *

    The regression intercept b^0 and the coefficients b^1 are as represented in Equation 3.

  3. The column for LD decay score was taken as the mean of the estimated r2-1n from PopLDdecay in a physical distance of 1500 kb, which was approximated to the area under the curve in Figure 5A for each cohort; Fst was approximated by λ1n , in which λ1 the largest eigenvalue for the cohort. r2 was the estimated LD statistic from PLINK (--r2).

  4. True LD is defined as l¯^ijl¯^ij+b^0 .

A common feature was universally relative high LD of chromosome 6 and 11 in the 26 1KG cohorts (Figure 5—figure supplement 1). We quantified the impact of chromosome 6 and 11 by leave-one-chromosome-out test in CEU, CHB, YRI, and ASW for details (Figure 6A and B) and found that dropping chromosome 6 off could lift R on average by 0.017 and chromosome 11 by 0.046. One possible explanation was that the centromere regions of chromosomes 6 and 11 have been assembled more completely than other chromosomes before the completion of CHM13 (Hoyt et al., 2022), whereas meiotic recombination tended to be reduced around the centromeres (Hinch et al., 2019). We estimated li after having knocked out the centromere region (46,061,947–59,413,484 bp, chr 11) in CEU, CHB, YRI, and ASW, and chromosome 11 then did not deviate much from their respective fitted lines (Figure 6C). A notable exceptional pattern was found in ASW, chromosome 8 of which had even more deviation than chromosome 11 (R was 0.83 and 0.87 with and without chromosome 8 in leave-one-chromosome out test) (Figure 6B). The deviation of chromosome 8 of ASW was consistent even more SNPs were added (Figure 6—figure supplement 1). We also provided high-resolution LD grids illustration for chromosome 8 (163,436 SNPs, totaling 214,185 grids) of the four representative cohorts for more detailed virtualization (Figure 6D). ASW had l^8= 0.0022, but 0.00075, 0.00069, and 0.00043 for CEU, CHB, and YRI, respectively.

Figure 6 with 1 supplement see all
The correlation between the inversion of the single-nucleotide polymorphism (SNP) number and l^i.

(A) The correlation between the inversion of the SNP number and l^i in CEU, CHB, YRI, and ASW. (B) Leave-one-chromosome-out strategy is adopted to evaluate the contribution of a certain chromosome on the correlation between the inverse of the SNP number and l^i . (C) The correlation between the inversion of the SNP number and chromosomal linkage disequilibrium (LD) in CEU, CHB, YRI, and ASW after removing the centromere region of chromosome 11. (D) High-resolution illustration for LD grids for chromosome 8 in CEU, CHB, YRI, and ASW. For each cohort, we partition chromosome 8 into consecutive LD grids (each LD grid contains 250 ×250 SNP pairs). For visualization purposes, LD is transformed to a -log10-scale, with smaller values (red hues) representing larger LD, and a value of 0 representing that all SNPs are in LD.

Discussion

In this study, we present a computationally efficient method to estimate the mean LD of genomic grids of many SNP pairs. Our LD analysis framework is based on GRM, which has been embedded in variance component analysis for complex traits and genomic selection (Goddard, 2009; Visscher et al., 2014; Chen, 2014). The key connection from GRM to LD is bridged via the transformation between n×n matrix and m×m matrix, in particular here via Isserlis’s theorem under the fourth-order moment (Isserlis, 1918). With this connection, the computational cost for estimating the mean LD of m×m SNP pairs is reduced from O(nm2) to O(n2m), and the statistical properties of the proposed method are derived in theory and validated in 1KG datasets. In addition, as the genotype matrix X is of limited entries {0, 1, 2}, assuming missing genotypes are imputed first, using Mailman algorithm the computational cost of GRM can be further reduced to O(n2mlog3m) (Liberty and Zucker, 2009). The largest data tested so far for the proposed method has a sample size of 10,640 and more than 5 million SNPs, so it can complete genomic LD analysis in 77,508.00 s (Table 1). The weakness of the proposed method is obvious that the algorithm remains slow when the sample size is large or the grid resolution is increased. With the availability of such a UK Biobank data (Bycroft et al., 2018), the proposed method may not be adequate, and much advanced methods, such as randomized implementation for the proposed methods, are needed.

We also applied the proposed method into 1KG and revealed certain characteristics of the human genomes. Firstly, we found the ubiquitous existence of extended LD, which likely emerged because of population structure, even very slightly, and admixture history. We quantified the l^i and l^ij in 1KG, and as indicated by lij we found that the inter-chromosomal LD was nearly an order lower than intra-chromosomal LD; for admixed cohorts, the ratio was much higher, even very close to each other such as in all 1KG samples. Secondly, variable recombination rates shaped peak of local LD. For example, the HLA region showed high LD in the European and East Asian cohorts, but relatively low LD in such as YRI, consistent with their much longer population history. Thirdly, there existed a general linear correlation between li and the inversion of the SNP number, a long-anticipated result that is as predicted with genome-wide spread of recombination hotspots (Hinch et al., 2019). One outlier of this linear norm was chromosome 11, which had so far the most completely genotyped centromere and consequently had more elevated LD compared with other autosomes. We anticipate that with the release of CHM13 the linear correlation should be much closer to unity (Hoyt et al., 2022). Of note, under the variance component analysis for complex traits, it is often a positive correlation between the length of a chromosome (as surrogated by the number of SNPs) and the proportion of heritability explained (Chen et al., 2014).

In contrast, throughout the study recurrent outstanding observations were found in ASW. For example, in ASW the ratio of l^HLA/l^6 substantially dropped compared with that of CEU, CHB, or YRI as illustrated in Figure 4. Furthermore, chromosome 8 in ASW fluctuated upward most from the linear correlation (Figure 6) even after various analyses, such as expanding SNP numbers. One possible explanation may lie under the complex demographic history of ASW, which can be investigated and tested in additional African American samples or possible existence for epistatic fitness (Ni et al., 2020).

Data availability

Public genetic datasets used in this study can be freely downloaded from the following URLs.1000 Genomes Project: https://www.ebi.ac.uk/eva/?eva-study=PRJEB30460. CONVERGE: https://www.ebi.ac.uk/eva/?eva-study=PRJNA289433. All data generated or analysed during this study are included in the manuscript and supporting file. The CONVERGE dataset was used to generate Figure 1 and the other figures were generated from 1000 Genomes Project data.

The following previously published data sets were used

References

    1. Hill WG
    2. Robertson A
    (1966)
    The effect of linkage on limits to artificial selection
    Genetical Research 8:269–294.
  1. Book
    1. Lynch M
    2. Walsh B
    (1998)
    Genetics and Analysis of Quantitative Traits
    Sinauer.

Peer review

Joint Public Review:

Summary:

In this paper, the authors point out that the standard approach of estimating LD is inefficient for datasets with large numbers of SNPs, with a computational cost of O(nm^2), where n is the number of individuals and m is the number of SNPs. Using the known relationship between the LD matrix and the genomic-relatedness matrix, they can calculate the mean level of LD within the genome or across genomic segments with a computational cost of O(n^2m). Since in most datasets, n<

Strengths:

Generally, for computational papers like this, the proof is in the pudding, and the authors have been successful at their aim of producing an efficient computational tool. The most compelling evidence of this in the paper are Figure 2 and Supplementary Figure S2. In Figure 2, they report how well their X-LD estimates of LD compare to estimates based on the standard approach using PLINK. They appear to have very good agreement. In Figure S2, they report the computational runtime of X-LD vs PLINK, and as expected X-LD is faster than PLINK as long as it is evaluating LD for more than 8000 SNPs.

Weakness:

This method seems to be limited to calculating average levels of LD in broad regions of the genome. While it would be possible to make the regions more fine-grained, doing so appears to make this approach much less efficient. As such, applications of this method may be limited to those proposed in the paper, for questions where average LD of large chromosomal segments is informative.

Impact:

This approach seems to produce real gains for settings where broad average levels of LD are useful to know, but it will likely have less of an impact in settings where fine-grained levels are LD are necessary (e.g., accounting for LD in GWAS summary statistics).

https://doi.org/10.7554/eLife.90636.3.sa1

Author response

The following is the authors’ response to the original reviews.

Reviewer #1 (Public Review)

Summary:

Huang and colleagues present a method for approximation of linkage disequilibrium (LD) matrices. The problem of computing LD matrices is the problem of computing a correlation matrix. In the cases considered by the authors, the number of rows (n), corresponding to individuals, is small compared to the number of columns (m), corresponding to the number of variants. Computing the correlation matrix has cubic time complexity [O(nm2)], which is prohibitive for large samples. The authors approach this using three main strategies:

1. they compute a coarsened approximation of the LD matrix by dividing the genome into variant-wise blocks which statistics are effectively averaged over;

2. they use a trick to get the coarsened LD matrix from a coarsened genomic relatedness matrix (GRM), which, with O(n2m) time complexity, is faster when n << m;

3. they use the Mailman algorithm to improve the speed of basic linear algebra operations by a factor of log(max(m,n)). The authors apply this approach to several datasets.

Strengths:

The authors demonstrate that their proposed method performs in line with theoretical explanations.

The coarsened LD matrix is useful for describing global patterns of LD, which do not necessarily require variant-level resolution.

They provide an open-source implementation of their software.

Weaknesses:

The coarsened LD matrix is of limited utility outside of analyzing macroscale LD characteristics.The method still essentially has cubic complexity--albeit the factors are smaller and Mailman reduces this appreciably. It would be interesting if the authors were able to apply randomized or iterative approaches to achieve more fundamental gains. The algorithm remains slow when n is large and/or the grid resolution is increased.

Thanks for your positive and accurate evaluation! We acknowledge the weakness and include some sentences in Discussion.

“The weakness of the proposed method is obvious that the algorithm remains slow when the sample size is large or the grid resolution is increased. With the availability of such as UK Biobank data (Bycroft et al., 2018), the proposed method may not be adequate, and much advanced methods, such as randomized implementation for the proposed methods, are needed.”

Reviewer #2 (Public Review)

Summary:

In this paper, the authors point out that the standard approach of estimating LD is inefficient for datasets with large numbers of SNPs, with a computational cost of [O(nm2)], where n is the number of individuals and m is the number of SNPs. Using the known relationship between the LD matrix and the genomic- relatedness matrix, they can calculate the mean level of LD within the genome or across genomic segments with a computational cost of O(n2m). Since in most datasets, n<<m, this can lead to major computational improvements. They have produced software written in C++ to implement this algorithm, which they call X-LD. Using the output of their method, they estimate the LD decay and the mean extended LD for various subpopulations from the 1000 Genomes Project data.

Strengths:

Generally, for computational papers like this, the proof is in the pudding, and the authors appear to have been successful at their aim of producing an efficient computational tool. The most compelling evidence of this in the paper is Figure 2 and Supplementary Figure S2. In Figure 2, they report how well their X- LD estimates of LD compare to estimates based on the standard approach using PLINK. They appear to have very good agreement. In Figure S2, they report the computational runtime of X-LD vs PLINK, and as expected X-LD is faster than PLINK as long as it is evaluating LD for more than 8000 SNPs.

Weakness:

While the X-LD software appears to work well, I had a hard time following the manuscript enough to make a very good assessment of the work. This is partly because many parameters used are not defined clearly or at all in some cases. My best effort to intuit what the parameters meant often led me to find what appeared to be errors in their derivation. As a result, I am left worrying if the performance of X-LD is due to errors cancelling out in the particular setting they consider, making it potentially prone to errors when taken to different contexts.

Thanks for you critical reading and evaluation. We do feel apologize for typos, which have been corrected and clearly defined now (see Eq 1 and Table 1). In addition, we include more detailed mathematical steps, which explain how LD decay regression is constructed and consequently finds its interpretation (see the detailed derivation steps between Eq 3 and Eq 4).

Impact:

I feel like there is value in the work that has been done here if there were more clarity in the writing. Currently, LD calculations are a costly step in tools like LD score regression and Bayesian prediction algorithms, so a more efficient way to conduct these calculations would be useful broadly. However, given the difficulty I had following the manuscript, I was not able to assess when the authors’ approach would be appropriate for an extension such as that.

See our replies below in responding to your more detailed questions.

Reviewer #1 (Recommendations For The Authors)

There are numerous linguistic errors throughout, making it challenging to read.

It is unclear how the intercepts were chosen in Figure S2. Since theory only gives you the slopes, it seems like it would make more sense to choose the intercept such that it aligns with the empirical results in some way.

Thanks for your critical evaluation. We do feel apologize some typos, and we have read it through and clarify the text as much as possible. In addition, we included Table 1, which introduces mathematical symbols of the paper.

In Figure S2, the two algorithms being compared have different software implementations, PLINK vs X-LD. Their real performance not only depended on the time complexity of the algorithms (right-side y-axis), but also how the software was coded. PLINK is known for its excellent programming. If we could have programmed as well as Chris Chang, the performance of X-LD should have been even better and approach the ratio m/n. However, even under less skilled programming, X-LD outperformed plink.

Reviewer #2 (Recommendations For The Authors):

Thank you for the chance to review your manuscript. It looks like compelling work that could be improved by greater detail. Providing the level of detail necessary may require creating a Supplementary Note that does a lot of hand-holding for readers like me who are mathematically literate but who don’t have the background that you do. Then you can refer readers to the Supplement if they can’t follow your work.

We fix the problems and style issues as possible as we can.

Regarding the weakness section in the public review, here are a few examples of where I got confused, though this list is not exhaustive.

1. Consider Equation 1 (line 100), which I believe must be incorrect. Imagine that g consists of two SNPs on different chromosomes with correlation rho. Then ell_g (which is defined as the average squared elements of the correlation matrix) would be

ell_g = 1/4 (1 + 1 + rho^2 + rho^2) = (1+rho^2)/2.

But ell_1=1 and ell_2=1 and ell_12=rho^2 (The average squared elements of the chromosome-specific correlation matrices and the cross-chromosome correlation matrix, respectively). So

sum(ell_i)+sum(ell_ij) = 1 + 1 + rho^2 + rho^2 = (1+rho^2)*2.

I believe your formulas would hold if you defined your LD values as the sum of squared correlations instead of the mean, but then I don’t know if the math in the subsequent sections holds. I think this problem also holds for Eq 2 and therefore makes Eqs 3 and 4 difficult to interpret.

Thanks for your attentive review and invaluable suggestions. We acknowledge the typo in calculating the mean in Eq 1, resulting in difficulties in understanding the equations. We sincerely apologize for this oversight. To address this issue and ensure clarity in the interpretation of Eq 3 and Eq 4, we have provided more detailed explanations (see the derivation between Eq 3 and Eq 4).

2. I didn’t know what the parameters are in Equation 3. The vector ell needs to be defined. Is it the vector of ell_i for each chromosomal segment i? I’m also confused by the definition of m_i, which is defined on line 113 as the “SNP number of the i-th chromosome.” Do the authors mean the number of SNPs on the i-th chromosomal segment? If so, it wasn’t clear to me how Eq 2 and Eq 3 imply Eq 4. Further, it wasn’t clear to me why E(b1) quantifies the average LD decay of the genome. I’m used to seeing plots of average LD as a function of distance between SNPs to calculate this, though I’m admittedly not a population geneticist, so maybe this is standard. Standard or not, readers deserve to have their hands held a bit more through this either in the text or in a Supplementary Note.

Thanks for your insightful feedback. When we were writing this paper, our actually focus was Eq 3 and to establish the relationship between chromosomal LD and the reciprocal of the length of chromosome (Fig 6A) – which was surrogated by the number of SNPs, the correlation between ell_i and 1/m_i.

We asked around our friends who are population geneticists, who anticipated the correlation between chromosomal LD (ell) and 1/m. The rationale simple if one knows the very basis of population genetics. A long chromosome experiences more recombination, which weakens LD for a pair of loci. In particular, for a pair of loci D_t=D_0 (1-c)^t. D_t the LD at the t generation, D_0 at the 0 generation, and c the recombination fraction. As recombination hotspots are nearly even distributed along the genome, such as reported by Science 2019;363:eaau8861, the chromosome will be broken into the shape in Author response image 1 (Fig 1C, newly added). Along the diagonal you see tight LD block, which will be vanished in the further as predicted by D_t equation, and any loci far away from each other will not be in LD otherwise raised by such as population structure. Ideally, we assume the diagonal block of aveage size of m×m and average LD of a SNP with other SNPs inside the diagonal block (red) is l_u; and, in contrast, off-diagonal average LD (light red) to be l_uv. This logic is hidden but employed in such as ld score regression and prs refinement using LD structure.

Author response image 1

But, how to estimate chromosomal LD (ell), which is overwhelming [O(nm2)] as our friends said!So, the Figure 6A is logically anticipated by a seasoned population geneticist, but has never been realized because of [O(nm2)] is nightmare. Often, those signature patterns should have been employed as showcases in releasing new reference data, such as HapMap. However, to our knowledge, this signature linear relationship has never been illustrated in those reference data.

If you further test a population geneticist, if any chromosome will deviate from this line (Fig 6A)? The answer most likely will be chromosome 6 because of the LD tight HLA region. However, it is chromosome 11 because of its most completed sequenced centromere. Chr 11 is a surprise! With T2T sequenced population, Chr 11 will not deviate much. We predict!

However, we suspect whether people appreciate this point, we shift our focus to efficient computation of LD—which is more likely understood. We acknowledge the lack of clarity in notation definitions and the absence of the derivation for the interpretation of b1 and b0 for LD decay regression. So, we have added a table to provide an explanation of the notation (see the Table 1) and provided additional derivations, which explained how LD decay regression was derived (see the derivation between Eq 3 and Eq 4). Figure 1C provides illustration for the underlying assumption under LD.

The technique to bridge Eq 2~3 to Eq 4 is called “building interpretation”. It once was one of the kernel tasks for population genetics or statistical genetics, and a classical example is Haseman-Elston regression (Behavior Genetics, 1972, 2:3-19). When it is moving towards a data-driven style, the culture becomes “shut up, calculate”. Finding interpretation for a regression is a vanishing craftmanship, and people often end up with unclear results!

3. In line 135, it’s not clear to me what is meant by Gio2. If it is GioGio, then wouldn’t the resulting matrix be a matrix of zeros since Gio is zero everywhere except the lower off-diagonal? So maybe it is GioTGio? But then later in that line, you say that the square of this matrix is the sum of several terms of the form Gk1k2. Are these the scalar elements of the G matrix? But then the sum is a scalar, which can’t be true since E(Gio2) is a matrix.

Thanks for your attentive review. We indeed confused the definition of matrices and their elements, and Gio should refer to the stacked off-diagonal elements of matrix Gi. So, Gio is a vector for variable gij – the relationship between sample i and j. We assume the reviewer use R software, then E(Gio2) corresponds to mean (G[row(G)<col(G)]2).

See the text between Eq 5 and Eq 6.

“We extract two vectors kio, which stacks the off-diagonal elements of Ki, and kid, which takes the diagonal elements of Ki.”

In addition, E(Gdiag )×n+E(Goff-diag )n(n1)=0, so the ground truth is that E(Gio)=1n1, but not zero.

To clarify these math symbols, we replace G with K, so as to be consistent with our other works (see Table 1).

To derive the means and the sampling variances for i and ij, the Eq 7 can be established by some modifications on the Delta method as exampled in Appendix I of Lynch and Walsh’s book (Lynch and Walsh, 1998). We added this sentence near Eq 7 in the main text.

https://doi.org/10.7554/eLife.90636.3.sa2

Article and author information

Author details

  1. Xin Huang

    1. Institute of Bioinformatics, Zhejiang University, Hangzhou, China
    2. Center for General Practice Medicine, Department of General Practice Medicine, Zhejiang Provincial People’s Hospital, People’s Hospital of Hangzhou Medical College, Hangzhou, China
    3. Center for Reproductive Medicine, Department of Genetic and Genomic Medicine, and Clinical Research Institute, Zhejiang Provincial People’s Hospital, People’s Hospital of Hangzhou Medical College, Zhejiang, China
    Contribution
    Data curation, Software, Formal analysis, Validation, Visualization, Methodology, Writing – original draft
    Contributed equally with
    Tian-Neng Zhu
    Competing interests
    No competing interests declared
  2. Tian-Neng Zhu

    Institute of Bioinformatics, Zhejiang University, Hangzhou, China
    Contribution
    Data curation, Software, Formal analysis, Validation, Visualization, Methodology, Writing – review and editing
    Contributed equally with
    Xin Huang
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0009-0007-7507-4521
  3. Ying-Chao Liu

    Institute of Bioinformatics, Zhejiang University, Hangzhou, China
    Contribution
    Data curation, Formal analysis, Validation, Visualization
    Competing interests
    No competing interests declared
  4. Guo-An Qi

    1. Institute of Bioinformatics, Zhejiang University, Hangzhou, China
    2. Hainan Institute of Zhejiang University, Hainan, China
    Contribution
    Formal analysis, Validation
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-2412-3932
  5. Jian-Nan Zhang

    Alibaba Group, Hangzhou, China
    Contribution
    Software
    Competing interests
    No competing interests declared
  6. Guo-Bo Chen

    1. Center for General Practice Medicine, Department of General Practice Medicine, Zhejiang Provincial People’s Hospital, People’s Hospital of Hangzhou Medical College, Hangzhou, China
    2. Center for Reproductive Medicine, Department of Genetic and Genomic Medicine, and Clinical Research Institute, Zhejiang Provincial People’s Hospital, People’s Hospital of Hangzhou Medical College, Zhejiang, China
    3. Key Laboratory of Endocrine Gland Diseases of Zhejiang Province, Hangzhou, China
    Contribution
    Conceptualization, Resources, Software, Formal analysis, Supervision, Validation, Methodology, Writing – original draft, Project administration, Writing – review and editing
    For correspondence
    chenguobo@gmail.com
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0001-5475-8237

Funding

National Natural Science Foundation of China (31771392)

  • Guo-Bo Chen

China National Tobacco Corporation (110202101032(JY-09))

  • Guo-Bo Chen

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

GBC conceived and initiated the study. XH, TNZ, and YL conducted simulation and analyzed data. GBC, XH, TNZ, GAQ, and JZ developed the software. GBC wrote the first draft of the article. All authors contributed to the writing, discussion of the paper, and validation of the results. This work was supported by the National Natural Science Foundation of China (31771392), 110202101032 (JY-09), and GZY-ZJ-KJ-23001. The funders played no role in the design, preparation, and submission of the article.

Ethics

Human subjects: This study investigated variation in previously published anonymized genome data from the 1000 Genomes Project.

Senior Editor

  1. Detlef Weigel, Max Planck Institute for Biology Tübingen, Germany

Reviewing Editor

  1. Alexander Young, University of California, Los Angeles, United States

Version history

  1. Preprint posted: June 20, 2023 (view preprint)
  2. Sent for peer review: August 7, 2023
  3. Preprint posted: October 5, 2023 (view preprint)
  4. Preprint posted: November 10, 2023 (view preprint)
  5. Version of Record published: December 27, 2023 (version 1)
  6. Version of Record updated: January 2, 2024 (version 2)

Cite all versions

You can cite all versions using the DOI https://doi.org/10.7554/eLife.90636. This DOI represents all versions, and will always resolve to the latest one.

Copyright

© 2023, Huang, Zhu et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Metrics

  • 184
    Page views
  • 41
    Downloads
  • 1
    Citations

Article citation count generated by polling the highest count across the following sources: Crossref, PubMed Central, Scopus.

Download links

A two-part list of links to download the article, or parts of the article, in various formats.

Downloads (link to download the article as PDF)

Open citations (links to open the citations from this article in various online reference manager services)

Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)

  1. Xin Huang
  2. Tian-Neng Zhu
  3. Ying-Chao Liu
  4. Guo-An Qi
  5. Jian-Nan Zhang
  6. Guo-Bo Chen
(2023)
Efficient estimation for large-scale linkage disequilibrium patterns of the human genome
eLife 12:RP90636.
https://doi.org/10.7554/eLife.90636.3

Share this article

https://doi.org/10.7554/eLife.90636

Further reading

    1. Evolutionary Biology
    2. Genetics and Genomics
    Thomas A Sasani, Aaron R Quinlan, Kelley Harris
    Research Article

    Maintaining germline genome integrity is essential and enormously complex. Although many proteins are involved in DNA replication, proofreading, and repair, mutator alleles have largely eluded detection in mammals. DNA replication and repair proteins often recognize sequence motifs or excise lesions at specific nucleotides. Thus, we might expect that the spectrum of de novo mutations – the frequencies of C>T, A>G, etc. – will differ between genomes that harbor either a mutator or wild-type allele. Previously, we used quantitative trait locus mapping to discover candidate mutator alleles in the DNA repair gene Mutyh that increased the C>A germline mutation rate in a family of inbred mice known as the BXDs (Sasani et al., 2022, Ashbrook et al., 2021). In this study we developed a new method to detect alleles associated with mutation spectrum variation and applied it to mutation data from the BXDs. We discovered an additional C>A mutator locus on chromosome 6 that overlaps Ogg1, a DNA glycosylase involved in the same base-excision repair network as Mutyh (David et al., 2007). Its effect depends on the presence of a mutator allele near Mutyh, and BXDs with mutator alleles at both loci have greater numbers of C>A mutations than those with mutator alleles at either locus alone. Our new methods for analyzing mutation spectra reveal evidence of epistasis between germline mutator alleles and may be applicable to mutation data from humans and other model organisms.

    1. Chromosomes and Gene Expression
    2. Evolutionary Biology
    Katherine Rickelton, Trisha M Zintel ... Courtney C Babbitt
    Research Article Updated

    Primate evolution has led to a remarkable diversity of behavioral specializations and pronounced brain size variation among species (Barton, 2012; DeCasien and Higham, 2019; Powell et al., 2017). Gene expression provides a promising opportunity for studying the molecular basis of brain evolution, but it has been explored in very few primate species to date (e.g. Khaitovich et al., 2005; Khrameeva et al., 2020; Ma et al., 2022; Somel et al., 2009). To understand the landscape of gene expression evolution across the primate lineage, we generated and analyzed RNA-seq data from four brain regions in an unprecedented eighteen species. Here, we show a remarkable level of variation in gene expression among hominid species, including humans and chimpanzees, despite their relatively recent divergence time from other primates. We found that individual genes display a wide range of expression dynamics across evolutionary time reflective of the diverse selection pressures acting on genes within primate brain tissue. Using our samples that represent a 190-fold difference in primate brain size, we identified genes with variation in expression most correlated with brain size. Our study extensively broadens the phylogenetic context of what is known about the molecular evolution of the brain across primates and identifies novel candidate genes for the study of genetic regulation of brain evolution.