Efficient estimation for largescale linkage disequilibrium patterns of the human genome
Abstract
In this study, we proposed an efficient algorithm (XLD) for estimating linkage disequilibrium (LD) patterns for a genomic grid, which can be of interchromosomal scale or of small segments. Compared with conventional methods, the proposed method was significantly faster, dropped from O(nm^{2}) to O(n^{2}m)—n the sample size and m the number of SNPs, and consequently we were permitted to explore in depth unknown or reveal longanticipated LD features of the human genome. Having applied the algorithm for 1000 Genome Project (1KG), we found (1) the extended LD, driven by population structure, universally existed, and the strength of interchromosomal LD was about 10% of their respective intrachromosomal LD in relatively homogeneous cohorts, such as FIN, and to nearly 56% in admixed cohort, such as ASW. (2) After splitting each chromosome into upmost of more than a half million grids, we elucidated the LD of the HLA region was nearly 42 folders higher than chromosome 6 in CEU and 11.58 in ASW; on chromosome 11, we observed that the LD of its centromere was nearly 94.05 folders higher than chromosome 11 in YRI and 42.73 in ASW. (3) We uncovered the longanticipated inversely proportional linear relationship between the length of a chromosome and the strength of chromosomal LD, and their Pearson’s correlation was on average over 0.80 for 26 1KG cohorts. However, this linear norm was so far perturbed by chromosome 11 given its more completely sequenced centromere region. Uniquely chromosome 8 of ASW was found most deviated from the linear norm than any other autosomes. The proposed algorithm has been realized in C++ (called XLD) and is available at https://github.com/gc5k/gear2, and can be applied to explore LD features in any sequenced populations.
eLife assessment
This study presents a useful new approach for efficient computation of statistics on correlations between genetic variants (linkage disequilibrium, or LD), which the authors apply to quantify the extent of LD across chromosomes. The method and its derivation are solid. The authors document that crosschromosome LD can be substantial, which has implications for geneticists who are interested in population structure and its impact on genetic association studies.
https://doi.org/10.7554/eLife.90636.3.sa0Introduction
Linkage disequilibrium (LD) is the association for a pair of loci and the metric of LD serves as the basis for developing genetic applications in agriculture, evolutionary biology, and biomedical research (Weir, 2008; Hill and Robertson, 1966). The structure of LD of the human genome is shaped by many factors, mutation, recombination, population demography, epistatic fitness, and completeness of genomic data itself (Myers et al., 2005; Nei and Li, 1973; Ardlie et al., 2002). Due to its overwhelming cost, LD structure investigation is often compromised to a small genomic region (Chang et al., 2015; Theodoris et al., 2021), and their typical LD structure is as illustrated for a small segment (Barrett et al., 2005). Now, given the availability of largescale genomic data, such as millions of singlenucleotide polymorphisms (SNPs), the largescale LD patterns of the human genome play crucial roles in determining genomics studies, and many theories and useful algorithms upon largescale LD structure, from genomewide association studies, polygenic risk prediction for complex diseases, and choice for reference panels for genotype imputation (Vilhjálmsson et al., 2015; Yang and Zhou, 2020; BulikSullivan et al., 2015; Yang et al., 2011; Das et al., 2016).
However, there are impediments, largely due to intensified computational cost, in both investigating largescale LD and providing highresolution illustrations for their details. If we consider a genomic grid that consists of ${m}^{2}$ SNP pairs, given a sample of $n$ individuals and $m$ SNPs ($n\ll m$)—typically as observed in 1000 Genomes Project (1KG) (Lowy et al., 2019), its benchmark computational time cost for calculating all pairwise LD is $\mathcal{O}\left(n{m}^{2}\right)$, a burden that quickly drains computational resources given the volume of the genomic data. In practice, it is of interest to know the mean LD of the ${m}_{i}^{2}$ SNP pairs for a genomic grid, which covers ${m}_{i}\times {m}_{j}$ SNP pairs. Upon how a genomic grid is defined, a genomic grid consequently can consist of (1) the whole genomewide ${m}^{2}$ SNP pairs, and we denote their mean LD as $\mathcal{l}}_{g$ ; (2) the intrachromosomal mean LD for the ith chromosome of ${m}_{i}^{2}$ SNP pairs, and denote as $\mathcal{l}}_{i$ ; and (3) the interchromosomal mean LD ith and jth chromosomal ${m}_{i}{m}_{j}$ SNP pairs, and denoted as $\mathcal{l}}_{i\cdot j$ .
In this study, we propose an efficient algorithm that can estimate $\mathcal{l}}_{g$ , $\mathcal{l}}_{i$ , and $\mathcal{l}}_{i\cdot j$ , the computational time of which can be reduced from $\mathcal{O}\left(n{m}_{i}^{2}\right)$ to $\mathcal{O}\left({n}^{2}{m}_{i}\right)$ for $\mathcal{l}}_{i$ and $\mathcal{O}\left(n{m}_{i}{m}_{j}\right)$ to $\mathcal{O}\left({n}^{2}{m}_{i}+{n}^{2}{m}_{j}\right)$ for $\mathcal{l}}_{i\cdot j$ . The rationale of the proposed method relies on the connection between the genetic relationship matrix (GRM) and LD (Chen, 2014; Goddard, 2009), and in this study a more general transformation from GRM to LD can be established via Isserlis’s theorem (Isserlis, 1918; Zhou, 2017). The statistical properties, such as sampling variance, of the estimated LD have been derived too.
The proposed method can be analogously considered a more powerful realization for Haploview (Barrett et al., 2005), but additional utility can be derived to bring out an unprecedented survey of LD patterns of the human genome. As demonstrated in 1KG, we consequently investigate how biological factors such as population structure, admixture, or variable local recombination rates can shape largescale LD patterns of the human genomes.
The proposed method provides statistically unbiased estimates for largescale LD patterns and shows computational merits compared with the conventional methods (Figure 2).
We estimated ℓ and 22 autosomal ℓ and 231 interautosomal ℓ⋅ for the 1KG cohorts. There was a ubiquitous existence of extended LD, which was associated with population structure or admixture (Figure 3).
We provided highresolution illustration that decomposed a chromosome into upmost nearly a million grids, each of which was consisted of 250 × 250 SNP pairs, the highest resolution that has been realized so far at autosomal level (Figure 4); tremendous variable recombination rates led to regional strong LD as highlighted for the HLA region of chromosomes 6 and the centromere region of chromosome 11.
Furthermore, a consequently linear regression constructed could quantify LD decay score genomewidely, and in contrast LD decay was previously surrogated in a computationally expensive method. There was a strong ethnicity effect that was associated with extended LD (Figure 5).
We demonstrate that the strength of autosomal $\mathcal{l}}_{i$ was inversely proportional to the SNP number, an anticipated relationship that is consistent with genomewide spread of recombination hotspots. However, chromosome 8 of ASW showed substantial deviation from the fitted linear relationship (Figure 6).
The proposed algorithm has been realized in C++ and is available at https://github.com/gc5k/gear2, (copy archived at Chen, 2023). As tested, the software could handle sample sizes as large as 10,000 individuals.
Methods
The overall rationale for largescale LD analysis
We assume LD for a pair of biallelic loci is measured by squared Pearson’s correlation, ${\rho}_{{l}_{1}{l}_{2}}^{2}=\frac{{D}_{{l}_{1}{l}_{2}}^{2}}{{p}_{{l}_{1}}{q}_{{l}_{1}}{p}_{{l}_{2}}{q}_{{l}_{2}}}$ , in which ${D}_{{l}_{1}{l}_{2}}$ the LD of loci ${l}_{1}$ and ${l}_{2}$ , ${p}_{.}$ and ${q}_{.}$ the reference and the alternative allele frequencies. If we consider the averaged LD for a genomic grid over ${m}_{i}^{2}$ SNP pairs, the conventional estimator is $\hat{\mathcal{l}}}_{i}=\frac{1}{{m}_{i}^{2}}\sum _{{l}_{1},{l}_{2}}^{{m}_{i}}{\rho}_{{l}_{1}{l}_{2}}^{2$ , and, if we consider the averaged LD for ${m}_{i}$ and ${m}_{j}$ SNP pairs between two genomic segments, then $\hat{\mathcal{l}}}_{i\cdot j}=\frac{1}{{m}_{i}{m}_{j}}\sum _{{l}_{1},{l}_{2}}^{{m}_{i},{m}_{j}}{\rho}_{{l}_{1}{l}_{2}}^{2$ . Now let us consider the 22 human autosomes (Figure 1A). We naturally partition the genome into $\mathcal{C}=22$ blocks, and its genomic LD, denoted as ${\mathcal{l}}_{g},$ can be expressed as
So we can decompose $\mathcal{l}}_{g$ into $\mathcal{C}$$\mathcal{l}}_{i$ and $\frac{\mathcal{C}\left(\mathcal{C}1\right)}{2}$ unique $\mathcal{l}}_{i\cdot j$ . Obviously, Equation 1 can be also expressed in the context of a single chromosome ${\mathcal{l}}_{i}=\frac{1}{{\beta}_{i}^{2}}\left(\sum _{u}^{{\beta}_{i}}{\mathcal{l}}_{u}+\sum _{u\ne v}^{{\beta}_{i}}{\mathcal{l}}_{uv}\right)$, in which $\beta}_{i}=\frac{{\mathcal{m}}_{i}}{\mathcal{m}$ the number of SNP segments, each of which has $\mathcal{m}$ SNPs. Geometrically it leads to ${\beta}_{i}$ diagonal grids and $\frac{{\beta}_{i}\left({\beta}_{i}1\right)}{2}$ unique offdiagonal grids (Figure 1B).
LDdecay regression
As human genome can be boiled down to small LD blocks by genomewidely spread recombination hotspots (Hinch et al., 2019; Li et al., 2022), mechanically there is selfsimilarity for each chromosome that the relatively strong $\mathcal{l}}_{i$ for juxtaposed grids along the diagonal but weak $\mathcal{l}}_{i\cdot j$ for grids slightly offdiagonal. So, for a chromosomal $\mathcal{l}}_{i$ , we can further express it as
in which $\mathcal{l}}_{u$ is the mean LD for a diagonal grid, $\mathcal{l}}_{uv$ the mean LD for offdiagonal grids, and ${m}_{i}$ the number of SNPs on the ith chromosome. Consider a linear model below (see Figure 1C for its illustration),
in which $\mathcal{l}$ represents a vector composed of $\mathcal{C}\phantom{\rule{thinmathspace}{0ex}}\phantom{\rule{thinmathspace}{0ex}}{\mathcal{l}}_{i},x$ , $\mathit{x}$ represents a vector composed of $\mathcal{C}$${x}_{i}$, and ${x}_{i}=\frac{1}{{m}_{i}}$ the inversion of the SNP number of the ith chromosome. The regression coefficient and intercept can be estimated as below:
and
There are some technical details in order to find the interpretation for ${b}_{0}$ and ${b}_{1}$ . We itemize them briefly. For the mean and variance of $x$:
For $E\left(x\mathcal{l}\right)$:
For $E\left(x\right)E\left(\mathcal{l}\right)$:
Then we integrate these items to have the expectation for ${b}_{1}:$
Similarly, we plug in $E\left({b}_{1}\right)$ so as to derive ${b}_{0}$ :
After some algebra, if $E\left({\mathcal{l}}_{u}\right)\gg E\left({\mathcal{l}}_{uv}\right)$—say if the former is far greater than the latter, the interpretation of ${b}_{1}$ and ${b}_{0}$ can be
It should be noticed that $E\left({b}_{1}\right)\approx E\left({\mathcal{l}}_{u}\right)\mathcal{m}$ quantifies the averaged LD decay of the genome. Conventional LD decay is analyzed via the wellknown LD decay analysis, but Equation 4 provides a direct estimate of both LD decay and possible existence of extended LD. We will see the application of the model in Figure 5 that the strength of the longdistance LD is associated with population structure. Of note, the underlying assumption of Equations 3 and 4 is genomewide spread of recombination hotspots, an established result that has been revealed and confirmed (Hinch et al., 2019).
Efficient estimation for $\mathcal{l}}_{g$, $\mathcal{l}}_{i$, and $\mathcal{l}}_{i\cdot j$
For the aforementioned analyses, the bottleneck obviously lies in the computational cost in estimating $\mathcal{l}}_{i$ and $\mathcal{l}}_{i\cdot j$ . $\mathcal{l}}_{i$ and $\mathcal{l}}_{i\cdot j$ are used to be estimated via the current benchmark algorithm as implemented in PLINK (Chang et al., 2015), and the computational time complex is proportional to $\mathcal{O}\left(n{m}^{2}\right)$. We present a novel approach to estimate $\mathcal{l}}_{i$ and $\mathcal{l}}_{i\cdot j$ . Given a genotypic matrix $\mathbf{X}$, a $n\times m$ matrix, if we assume that there are ${m}_{i}$ and ${m}_{j}$ SNPs on chromosomes $i$ and $j$, respectively, we can construct $n\times n$ genetic relatedness matrices as below:
in which ${\stackrel{~}{\mathbf{X}}}_{i}$ is the standardized ${\mathbf{X}}_{i}$ and ${\stackrel{~}{x}}_{kl}=\frac{{x}_{kl}2{p}_{l}}{\sqrt{2(1+F){p}_{l}{q}_{l}}}$ , where ${x}_{kl}$ is the genotype for the kth individual at the lth biallelic locus, $F$ is the inbreeding coefficient having the value of 0 for random mating population and 1 for an inbred population, and ${p}_{l}$ and ${q}_{l}$ are the frequencies of the reference and the alternative alleles (${p}_{l}+{q}_{l}=1$), respectively. When GRM is given, we can obtain some statistical characters of ${\mathbf{K}}_{i}$ . We extract two vectors ${\mathit{k}}_{{i}_{o}}$ , which stacks the offdiagonal elements of ${\mathbf{K}}_{i}$ , and ${\mathit{k}}_{{i}_{d}}$ , which takes the diagonal elements of ${\mathbf{K}}_{i}$ . The mathematical expectation of ${\mathit{k}}_{{i}_{o}}^{2}$ , in which $E({\mathit{k}}_{{i}_{o}}^{2})=\frac{1}{n(n1)}\sum _{{k}_{1}\ne {k}_{2}}^{n}{k}_{{k}_{1},{k}_{2}}^{2}$ , can be established according to Isserlis’s theorem in terms of the fourorder moment (Isserlis, 1918),
in which $E({\theta}_{{k}_{1}{k}_{2}})={\left(\frac{1}{2}\right)}^{r}$ is the expected relatedness score and $r$ indicates the rthdegree relatives. $r=0$ for the same individual, and $r=1$ for the firstdegree relatives. Similarly, we can derive for $E\left({{\mathit{k}}_{{i}_{o}}\mathit{k}}_{{j}_{o}}\right)$.Equation 6 establishes the connection between GRM and the aggregated LD estimation that ${\mathcal{l}}_{i}=E\left({\mathit{k}}_{{i}_{o}}^{2}\right)$ . According to Delta method as exampled in Appendix I of Lynch and Walsh, 1998, the means and the sampling variances for $\mathcal{l}}_{i$ and $\mathcal{l}}_{i\cdot j$ are
in which $var({\mathit{k}}_{{i}_{o}})=E({\mathit{k}}_{{i}_{o}}^{2}){\left[E\left({\mathit{k}}_{{i}_{o}}\right)\right]}^{2}={\mathcal{l}}_{i}\frac{1}{{\left(n1\right)}^{2}}$ and $cov\left({\mathit{k}}_{{i}_{o}},{\mathit{k}}_{{j}_{o}}\right)=E\left({\mathit{k}}_{{i}_{o}}{\mathit{k}}_{{j}_{o}}\right)E\left({\mathit{k}}_{{i}_{o}}\right)E\left({\mathit{k}}_{{j}_{o}}\right)={\mathcal{l}}_{i\cdot j}\frac{1}{{\left(n1\right)}^{2}}$ , respectively. Of note, the properties of $\mathcal{l}}_{g$ can be derived similarly if we replace $\mathcal{l}}_{i$ with $\mathcal{l}}_{g$ in Equation 7. We can develop $\stackrel{\sim}{\mathcal{l}}}_{i\cdot j$ , a scaled version of $\mathcal{l}}_{i\cdot j$ , as below:
in which $\stackrel{\sim}{\mathcal{l}}}_{i}=\frac{{m}_{i}{\mathcal{l}}_{i}1}{{m}_{i}1$ , a modification that removed the LD with itself. According to Delta method, the sampling variance of $\stackrel{\sim}{\mathcal{l}}}_{i\cdot j$ is
Of note, when there is no LD between a pair of loci, $\mathcal{l}$ yields zero and its counterpart PLINK estimate yields $\frac{1}{n}$ , a difference that can be reconciled in practice (see Figure 2).
Raise of LD due to population structure
In this study, the connection between LD and population structure is bridged via two pathways below, in terms of a pair of loci and of the aggregated LD for all pair of loci. For a pair of loci, their LD is often simplified as ${\rho}_{{l}_{1}{l}_{2}}^{2}=\frac{{D}_{{l}_{1}{l}_{2}}^{2}}{{p}_{{l}_{1}}{q}_{{l}_{1}}{p}_{{l}_{2}}{q}_{{l}_{2}}}$ , but will be inflated if there are subgroups (Nei and Li, 1973). In addition, it is well established the connection between population structure and eigenvalues, and in particular the largest eigenvalue is associated with divergence of subgroups (Patterson et al., 2006). In this study, the existence of subgroups of cohort is surrogated by the largest eigenvalue ${\lambda}_{1}$ or ${\stackrel{}{F}}_{st}\approx \frac{{\lambda}_{1}}{n}$ .
Data description and quality control
The 1KG (Auton et al., 2015), which was launched to produce a deep catalog of human genomic variation by wholegenome sequencing (WGS) or wholeexome sequencing (WES), and 2503 strategically selected individuals of global diversity were included (containing 26 cohorts). We used the following criteria for SNP inclusion for each of the 26 1KG cohorts: (1) autosomal SNPs only; (2) SNPs with missing genotype rates higher than 0.2 were removed, and missing genotypes were imputed; and (3) only SNPs with minor allele frequencies higher than 0.05 were retained. Then 2,997,635 consensus SNPs that were present in each of the 26 cohorts were retained. According to their origins, the 26 cohorts are grouped as African (AFR: MSL, GWD, YRI, ESN, ACB, LWK, and ASW), European (EUR: TSI, IBS, CEU, GBR, and FIN), East Asian (EA: CHS, CDX, KHV, CHB, and JPT), South Asian (SA: BEB, ITU, STU, PJL, and GIH), and American (AMR: MXL, PUR, CLM, and PEL), respectively.
In addition, to test the capacity of the developed software (XLD), we also included CONVERGE cohort ($n=\mathrm{10,640}$), which was used to investigate major depressive disorder (MDD) in the Han Chinese population (Cai et al., 2015). We performed the same criteria for SNP inclusion as that of the 1KG cohorts, and $m=\mathrm{5,215,820}$ SNPs remained for analyses.
XLD software implementation
The proposed algorithm has been realized in our XLD software, which is written in C++ and reads in binary genotype data as often used in PLINK. As multithread programming is adopted, the efficiency of XLD can be improved upon the availability of computational resources. We have tested XLD in various independent datasets for its reliability and robustness. Certain data management options, such as flexible inclusion or exclusion of chromosomes, have been built into the commands of XLD. In XLD, missing genotypes are naively imputed according to Hardy–Weinberg proportions; however, when the missing rate is high, we suggest the genotype matrix should be imputed by other advanced imputation tools.
The most timeconsuming part of XLD was the construction of GRM $\mathbf{K}=\frac{1}{m}\stackrel{~}{\mathbf{X}}{\stackrel{~}{\mathbf{X}}}^{T}$ , and the established computational time complex was $\mathcal{O}\left({n}^{2}m\right)$. However, if $\stackrel{~}{\mathbf{X}}$ is decomposed into $\stackrel{~}{\mathbf{X}}=[{\stackrel{~}{\mathbf{X}}}_{\left[{t}_{1},\right]}\vdots {\stackrel{~}{\mathbf{X}}}_{\left[{t}_{2},\right]}\vdots \cdots \vdots {\stackrel{~}{\mathbf{X}}}_{\left[{t}_{z},\right]}]$, in which ${\stackrel{~}{\mathit{X}}}_{[{t}_{i},]}$ has dimension of $n\times B$, using Mailman algorithm the computational time complex for building $\mathbf{K}$ can be reduced to $\mathcal{O}\left(\frac{{n}^{2}m}{{\mathrm{log}}_{3}m}\right)$ (Liberty and Zucker, 2009). This idea of embedding Mailman algorithm into certain highthroughput genomic studies has been successful, and our XLD software is also leveraged by absorbing its recent practice in genetic application (Wu and Sankararaman, 2018).
Results
Statistical properties of the proposed method
Table 1 introduces the symbols frequently cited in this study. As schematically illustrated in Figure 1, $\mathcal{l}}_{g$ could be decomposed into $\mathcal{C}\phantom{\rule{thinmathspace}{0ex}}\phantom{\rule{thinmathspace}{0ex}}{\mathcal{l}}_{i}$ and $\frac{\mathcal{C}\left(\mathcal{C}1\right)}{2}$ unique $\mathcal{l}}_{i\cdot j$ components. We compared the estimated $\mathcal{l}}_{i$ and $\mathcal{l}}_{i\cdot j$ in XLD with those being estimated in PLINK (known as ‘r2,’ and the estimated squared Pearson’s correlation LD is denoted as ${r}^{2}$). Considering the substantial computational cost of PLINK, only 100,000 randomly selected autosome SNPs were used for each 1KG cohort, and 22 $\hat{\mathcal{l}}}_{i$ and 231 $\hat{\mathcal{l}}}_{i\cdot j$ were estimated. After regressing 22 $\hat{\mathcal{l}}}_{i$ against those of PLINK, we found that the regression slope was close to unity and bore an anticipated intercept a quantity of approximately $\frac{1}{n}$ (Figure 2A and B). In other words, PLINK gave $\frac{1}{n}$ even for SNPs of no LD. However, when regressing 231 $\hat{\mathcal{l}}}_{i\cdot j$ estimates against those of PLINK, it was found that largely because of the tiny quantity of $\hat{\mathcal{l}}}_{i\cdot j$ it was slightly smaller than 1 but statistically insignificant from 1 in these 26 1KG cohorts (mean of 0.86 and SD of 0.10, and its 95% CI was (0.664, 1.056)); when the entire 1KG samples were used, its much larger LD due to subgroups, nearly no estimation bias was found (Figure 2A and B). In contrast, because of their much larger values, $\hat{\mathcal{l}}}_{i$ components were always consistent with their corresponding estimates from PLINK (mean of 1.03 and SD of 0.012, 95% CI was (1.006, 1.053), bearing an ignorable bias). Furthermore, we also combined the African cohorts together (MSL, GWD, YRI, ESN, LWK, totaling 599 individuals), the East Asian cohorts together (CHS, CDX, KHV, CHB, and JPT, totaling 504 individuals), and the European cohorts together (TSI, IBS, CEU, GBR, and FIN, totaling 503 individuals), and the resemblance pattern between XLD and PLINK was similar as observed in each cohort alone (Figure 2—figure supplement 1). The empirical data in 1KG verified that the proposed method was sufficiently accurate.
To fairly evaluate the computational efficiency of the proposed method, the benchmark comparison was conducted on the first chromosome of the entire 1KG dataset ($n=\mathrm{2,503}$ and $m=\mathrm{225,967}$), and 10 CPUs were used for multithread computing. Compared with PLINK, the calculation efficiency of XLD was nearly 30–40 times faster for the tested chromosome, and its computational time of XLD was proportional to $\mathcal{O}\left(\frac{{n}^{2}m}{{\mathrm{log}}_{3}m}\right)$ (Figure 2—figure supplement 2). So, XLD provided a feasible and reliable estimation of largescale complex LD patterns. More detailed computational time of the tested tasks is reported in their corresponding sections below; since each 1KG cohort had a sample size of around 100, otherwise specified the computational time was reported for CHB ($n=103$) as a reference (Table 2). In order to test the capability of the software, the largest dataset tested was CONVERGE ($n=\mathrm{10,640}$, and $m=\mathrm{5,215,820}$), and it took 77,508.00 s, about 22 hr, to estimate 22 autosomal $\hat{\mathcal{l}}}_{i$ and 231 $\hat{\mathcal{l}}}_{i\cdot j$ (Figure 1A); when zooming into chromosome 2 of CONVERGE, on which 420,949 SNPs had been evenly split into 1000 blocks and yielded 1000 $\hat{\mathcal{l}}}_{u$ grids, and 499,500 $\hat{\mathcal{l}}}_{uv$ LD grids, it took 45,125.00 s, about 12.6 hr, to finish the task (Figure 1B).
Ubiquitously extended LD and population structure/admixture
We partitioned the 2,997,635 SNPs into 22 autosomes ( Figure 3A , Figure 3—figure supplement 1), and the general LD patterns were as illustrated for CEU, CHB, YRI, ASW, and 1KG. As expected, $\hat{\mathcal{l}}}_{i\cdot j}<{\hat{\mathcal{l}}}_{g}<{\hat{\mathcal{l}}}_{i$ for each cohort (Figure 3B). As observed in these 1KG cohorts, all three LD measures were associated with population structure, which was surrogated by ${\stackrel{}{F}}_{st}\approx \frac{{\lambda}_{1}}{n}$ , and their squared correlation ${R}^{2}$ was greater than 0.8. ACB, ASW, PEL, and MXL, which all showed certain admixture, tended to have much greater $\hat{\mathcal{l}}}_{g$ , $\hat{\mathcal{l}}}_{i$ , and $\hat{\mathcal{l}}}_{i\cdot j$ (Table 3 and Figure 3B). In contrast, East Asian (EA) and European (EUR)orientated cohorts, which showed little withincohort genetic differentiation—as their largest eigenvalues were slightly greater than 1—had their aggregated LD relatively low and resembled each other (Table 3). Furthermore, for several European (TSI, IBS, and FIN) and East Asian (JPT) cohorts, the ratio between $\hat{\mathcal{l}}}_{i\cdot j$ and $\hat{\mathcal{l}}}_{i$ components could be smaller than 0.1, and the smallest ratio was found to be about 0.091 in FIN. The largest ratio was found in 1KG that ${\hat{\mathcal{l}}}_{i\cdot j}=5.7e3$ and ${\hat{\mathcal{l}}}_{i}=6.5e3$, and the ratio was 0.877 because of the inflated LD due to population structure. A more concise statistic to describe the ratio between $\mathcal{l}}_{i\cdot j$ and $\mathcal{l}}_{i$ was $\stackrel{\sim}{\mathcal{l}}}_{i\cdot j$ (Equation 8), and the corresponding value for 231 scaled $\stackrel{\sim}{\mathcal{l}}}_{i\cdot j$ for FIN was ${\hat{\stackrel{\sim}{\mathcal{l}}}}_{i\cdot j}=0.10$ (SD of 0.027) and for 1KG was ${\hat{\stackrel{\sim}{\mathcal{l}}}}_{i\cdot j}=0.88$ (SD of 0.028).
In terms of computational time, for 103 CHB samples, it took about 101.34 s to estimate 22 autosomal $\hat{\mathcal{l}}}_{i$ and 231 $\hat{\mathcal{l}}}_{i\cdot j$ ; for all 1KG 2503 samples, XLD took about 3008.29 s (Table 1). Conventional methods took too long to complete the analyses in this section, so no comparable computational time was provided. For detailed 22 $\hat{\mathcal{l}}}_{i$ and 231 $\hat{\mathcal{l}}}_{i\cdot j$ estimates for each 1KG cohort, please refer to Supplementary file 1 (Excel sheet 1–27).
Detecting exceedingly high LD grids shaped by variable recombination rates
We further explored each autosome with highresolution grid LD visualization. We set $\mathcal{m}=250$, so each grid had the $\mathcal{l}}_{uv$ for 250 × 250 SNP pairs. The computational time complex was $\mathcal{O}\left({n}^{2}\left({m}_{i}+\frac{{\beta}_{i}^{2}}{4}\right)\right)$, in which ${\beta}_{i}=\frac{{m}_{i}}{250}$ , and with our proposed method in CHB it cost 66.86 s for chromosome 2, which had 241,241 SNPs and totaled 466,095 unique grids, and 3.22 s only for chromosome 22, which had 40,378 SNPs and totaled 13,203 unique grids (Table 1). In contrast, under conventional methods those LD grids were not very likely to be exhaustively surveyed because its computational cost was $\mathcal{O}\left(n{m}_{i}^{2}\right)$: for CHB chromosome 2, it would have taken about 40 hr as estimated. As the result was very similar for $\mathcal{m}=500$ (Figure 4—figure supplement 1), we only report the results under $\mathcal{m}=250$ below.
As expected, chromosome 6 (206,165 SNPs, totaling 340,725 unique grids) had its HLA cluster showing much higher LD than the rest of chromosome 6. In addition, we found a very dramatic variation of the HLA cluster LD $\hat{\mathcal{l}}}_{HLA$ (28,477,797–33,448,354 bp, totaling 3160 unique grids) across ethnicities. For CEU, CHB, YRI, and ASW, their ${\hat{\mathcal{l}}}_{6}=0.0010$, 0.00090, 0.00064, and 0.0019, respectively, but their corresponding HLA cluster grids had ${\hat{\mathcal{l}}}_{HLA}=0.042$, 0.029, 0.025, and 0.022, respectively (Figure 4). Consequently, the largest ratio for $\frac{{\hat{\mathcal{l}}}_{HLA}}{{\hat{\mathcal{l}}}_{6}}$ was 42.00 in CEU, 39.06 in YRI, and 32.22 in CHB, but was reduced to 11.58 in ASW. Before the release of CHM13 (Hoyt et al., 2022), chromosome 11 had the most completely sequenced centromere region, which had much rarer recombination events, and all four cohorts showed a strong LD $\hat{\mathcal{l}}}_{11.c$ around the centromere (46,061,947–59,413,484 bp, totaling 1035 unique grids) regardless of their ethnicities (Figure 4). ${\hat{\mathcal{l}}}_{11}=0.0012$, 0.0012, 0.00084, and 0.0022, respectively, and ${\hat{\mathcal{l}}}_{11.c}=0.098$, 0.10, 0.079, and 0.094, respectively; the ratio for $\frac{{\hat{\mathcal{l}}}_{11.c}}{{\hat{\mathcal{l}}}_{11}}=$ 81.67, 83,33, and 94,05, for CEU, CHB, and YRI, respectively; the lowest ratio was found in ASW of 42.73. In addition, removing the HLA region of chromosome 6 or the centromere region of chromosome 11 would significantly reduce $\hat{\mathcal{l}}}_{6$ or $\hat{\mathcal{l}}}_{11$ in comparison with the random removal of other regions (Figure 4—figure supplement 2).
Modelbased LD decay regression revealed LD composition
The real LD block size was not exact of $\mathcal{m}=250$ or $\mathcal{m}=500$, but an unknown parameter that should be inferred in computational intensive ‘LD decay’ analysis (Zhang et al., 2019; Chang et al., 2015). We conducted the conventional LD decay for the 26 1KG cohorts (Figure 5A), and the time cost was 1491.94 s for CHB. For each cohort, we took the area under the LD decay curve in the LD decay plot, and it quantified approximately the LD decay score for each cohort. The smallest score was 0.0421 for MSL, and the largest was 0.0598 for PEL (Table 5). However, this estimation did not take into account the real extent of LD, so it was not precise enough to reflect the LD decay score. For example, for admixture population, such as the American cohorts, the extent of LD would be longer.
In contrast, we proposed a modelbased method, as given in Equation 3, which could estimate LD decay score (regression coefficient ${b}_{1}$) and longdistance LD score (intercept ${b}_{0}$) jointly. Given the estimated 22 $\hat{\mathcal{l}}}_{i$ (Supplementary file 1; see Table 4 for four representative cohorts and Supplementary R code), we regressed each autosomal $\hat{\mathcal{l}}}_{i$ against its corresponding inversion of SNP number, and all yielded positive slopes (Pearson’s correlation $\mathcal{R}>0.80$, Table 5 and Figure 5B), an observation that was consistent with genomewide spread of recombination hotspots. This linear relationship could consequently be considered the norm for a relatively homogeneous population as observed in most 1KG cohorts (Figure 5—figure supplement 1), while for all the 2503 1KG samples $\mathcal{R}=0.55$ only (Table 5), indicating that the population structure and possible differentiated recombination hotspots across ethnicities disturbed the assumption underlying Equation 3 and smeared the linearity. We extracted ${\widehat{b}}_{0}$ and ${\widehat{b}}_{1}$ for the 26 1KG cohorts for further analysis. The rates of LD decay score, as indicated by ${\widehat{b}}_{1}$ , within the African cohorts (AFR) were significantly faster than the other continents, consistent with previous observation that the African population had relatively shorter LD Gabriel et al., 2002; while subgroups within the American continent (AMR) tended to have extended LD range due to their admixed genetic composition (Table 4 and Figure 5). Notably, the correlation between ${\widehat{b}}_{1}$ and the approximated LD decay score was $\mathcal{R}=0.88$. The estimated ${\stackrel{}{F}}_{st}$ was highly correlated with ${\widehat{b}}_{0}$ ($\mathcal{R}=0.94$).
A common feature was universally relative high LD of chromosome 6 and 11 in the 26 1KG cohorts (Figure 5—figure supplement 1). We quantified the impact of chromosome 6 and 11 by leaveonechromosomeout test in CEU, CHB, YRI, and ASW for details (Figure 6A and B) and found that dropping chromosome 6 off could lift $\mathcal{R}$ on average by 0.017 and chromosome 11 by 0.046. One possible explanation was that the centromere regions of chromosomes 6 and 11 have been assembled more completely than other chromosomes before the completion of CHM13 (Hoyt et al., 2022), whereas meiotic recombination tended to be reduced around the centromeres (Hinch et al., 2019). We estimated $\mathcal{l}}_{i$ after having knocked out the centromere region (46,061,947–59,413,484 bp, chr 11) in CEU, CHB, YRI, and ASW, and chromosome 11 then did not deviate much from their respective fitted lines (Figure 6C). A notable exceptional pattern was found in ASW, chromosome 8 of which had even more deviation than chromosome 11 ($\mathcal{R}$ was 0.83 and 0.87 with and without chromosome 8 in leaveonechromosome out test) (Figure 6B). The deviation of chromosome 8 of ASW was consistent even more SNPs were added (Figure 6—figure supplement 1). We also provided highresolution LD grids illustration for chromosome 8 (163,436 SNPs, totaling 214,185 grids) of the four representative cohorts for more detailed virtualization (Figure 6D). ASW had ${\hat{\mathcal{l}}}_{8}=$ 0.0022, but 0.00075, 0.00069, and 0.00043 for CEU, CHB, and YRI, respectively.
Discussion
In this study, we present a computationally efficient method to estimate the mean LD of genomic grids of many SNP pairs. Our LD analysis framework is based on GRM, which has been embedded in variance component analysis for complex traits and genomic selection (Goddard, 2009; Visscher et al., 2014; Chen, 2014). The key connection from GRM to LD is bridged via the transformation between $n\times n$ matrix and $m\times m$ matrix, in particular here via Isserlis’s theorem under the fourthorder moment (Isserlis, 1918). With this connection, the computational cost for estimating the mean LD of $m\times m$ SNP pairs is reduced from $\mathcal{O}(n{m}^{2})$ to $\mathcal{O}({n}^{2}m)$, and the statistical properties of the proposed method are derived in theory and validated in 1KG datasets. In addition, as the genotype matrix $\mathbf{X}$ is of limited entries {0, 1, 2}, assuming missing genotypes are imputed first, using Mailman algorithm the computational cost of GRM can be further reduced to $\mathcal{O}\left(\frac{{n}^{2}m}{{\mathrm{log}}_{3}m}\right)$ (Liberty and Zucker, 2009). The largest data tested so far for the proposed method has a sample size of 10,640 and more than 5 million SNPs, so it can complete genomic LD analysis in 77,508.00 s (Table 1). The weakness of the proposed method is obvious that the algorithm remains slow when the sample size is large or the grid resolution is increased. With the availability of such a UK Biobank data (Bycroft et al., 2018), the proposed method may not be adequate, and much advanced methods, such as randomized implementation for the proposed methods, are needed.
We also applied the proposed method into 1KG and revealed certain characteristics of the human genomes. Firstly, we found the ubiquitous existence of extended LD, which likely emerged because of population structure, even very slightly, and admixture history. We quantified the $\hat{\mathcal{l}}}_{i$ and $\hat{\mathcal{l}}}_{i\cdot j$ in 1KG, and as indicated by $\stackrel{\sim}{\mathcal{l}}}_{i\cdot j$ we found that the interchromosomal LD was nearly an order lower than intrachromosomal LD; for admixed cohorts, the ratio was much higher, even very close to each other such as in all 1KG samples. Secondly, variable recombination rates shaped peak of local LD. For example, the HLA region showed high LD in the European and East Asian cohorts, but relatively low LD in such as YRI, consistent with their much longer population history. Thirdly, there existed a general linear correlation between $\mathcal{l}}_{i$ and the inversion of the SNP number, a longanticipated result that is as predicted with genomewide spread of recombination hotspots (Hinch et al., 2019). One outlier of this linear norm was chromosome 11, which had so far the most completely genotyped centromere and consequently had more elevated LD compared with other autosomes. We anticipate that with the release of CHM13 the linear correlation should be much closer to unity (Hoyt et al., 2022). Of note, under the variance component analysis for complex traits, it is often a positive correlation between the length of a chromosome (as surrogated by the number of SNPs) and the proportion of heritability explained (Chen et al., 2014).
In contrast, throughout the study recurrent outstanding observations were found in ASW. For example, in ASW the ratio of $\hat{\mathcal{l}}}_{HLA}/{\hat{\mathcal{l}}}_{6$ substantially dropped compared with that of CEU, CHB, or YRI as illustrated in Figure 4. Furthermore, chromosome 8 in ASW fluctuated upward most from the linear correlation (Figure 6) even after various analyses, such as expanding SNP numbers. One possible explanation may lie under the complex demographic history of ASW, which can be investigated and tested in additional African American samples or possible existence for epistatic fitness (Ni et al., 2020).
Data availability
Public genetic datasets used in this study can be freely downloaded from the following URLs.1000 Genomes Project: https://www.ebi.ac.uk/eva/?evastudy=PRJEB30460. CONVERGE: https://www.ebi.ac.uk/eva/?evastudy=PRJNA289433. All data generated or analysed during this study are included in the manuscript and supporting file. The CONVERGE dataset was used to generate Figure 1 and the other figures were generated from 1000 Genomes Project data.

European Nucleotide ArchiveID PRJEB30460. Variant calling on GRCh38 with the 1000 genomes samples.

European Nucleotide ArchiveID PRJNA289433. Study of Major Depression in Chinese women.
References

Patterns of linkage disequilibrium in the human genomeNature Reviews. Genetics 3:299–309.https://doi.org/10.1038/nrg777

Estimation and partitioning of (co)heritability of inflammatory bowel disease from GWAS and immunochip dataHuman Molecular Genetics 23:4710–4720.https://doi.org/10.1093/hmg/ddu174

Nextgeneration genotype imputation service and methodsNature Genetics 48:1284–1287.https://doi.org/10.1038/ng.3656

The effect of linkage on limits to artificial selectionGenetical Research 8:269–294.

The Mailman algorithm: A note on matrix–vector multiplicationInformation Processing Letters 109:179–182.https://doi.org/10.1016/j.ipl.2008.09.028

quickLD: An efficient software for linkage disequilibrium analysesMolecular Ecology Resources 21:2580–2587.https://doi.org/10.1111/17550998.13438

Modeling linkage disequilibrium increases accuracy of polygenic risk scoresAmerican Journal of Human Genetics 97:576–592.https://doi.org/10.1016/j.ajhg.2015.09.001

Linkage disequilibrium and association mappingAnnual Review of Genomics and Human Genetics 9:129–142.https://doi.org/10.1146/annurev.genom.9.081307.164347

A scalable estimator of SNP heritability for biobankscale dataBioinformatics 34:i187–i194.https://doi.org/10.1093/bioinformatics/bty253

Genomic inflation factors under polygenic inheritanceEuropean Journal of Human Genetics 19:807–812.https://doi.org/10.1038/ejhg.2011.39

Accurate and scalable construction of polygenic scores in large biobank data setsAmerican Journal of Human Genetics 106:679–693.https://doi.org/10.1016/j.ajhg.2020.03.013

A unified framework for variance component estimation with summary statistics in genomewide association studiesThe Annals of Applied Statistics 11:2027–2051.https://doi.org/10.1214/17AOAS1052
Article and author information
Author details
Funding
National Natural Science Foundation of China (31771392)
 GuoBo Chen
China National Tobacco Corporation (110202101032(JY09))
 GuoBo Chen
The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.
Acknowledgements
GBC conceived and initiated the study. XH, TNZ, and YL conducted simulation and analyzed data. GBC, XH, TNZ, GAQ, and JZ developed the software. GBC wrote the first draft of the article. All authors contributed to the writing, discussion of the paper, and validation of the results. This work was supported by the National Natural Science Foundation of China (31771392), 110202101032 (JY09), and GZYZJKJ23001. The funders played no role in the design, preparation, and submission of the article.
Ethics
Human subjects: This study investigated variation in previously published anonymized genome data from the 1000 Genomes Project.
Version history
 Preprint posted:
 Sent for peer review:
 Reviewed Preprint version 1:
 Reviewed Preprint version 2:
 Version of Record published:
 Version of Record updated:
Cite all versions
You can cite all versions using the DOI https://doi.org/10.7554/eLife.90636. This DOI represents all versions, and will always resolve to the latest one.
Copyright
© 2023, Huang, Zhu et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.
Metrics

 850
 views

 86
 downloads

 6
 citations
Views, downloads and citations are aggregated across all versions of this paper published by eLife.
Download links
Downloads (link to download the article as PDF)
Open citations (links to open the citations from this article in various online reference manager services)
Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)
Further reading

 Evolutionary Biology
 Microbiology and Infectious Disease
HERVK(HML2), the youngest clade of human endogenous retroviruses (HERVs), includes many intact or nearly intact proviruses, but no replication competent HML2 proviruses have been identified in humans. HML2related proviruses are present in other primates, including rhesus macaques, but the extent and timing of HML2 activity in macaques remains unclear. We have identified 145 HML2like proviruses in rhesus macaques, including a clade of young, rhesusspecific insertions. Age estimates, intact open reading frames, and insertional polymorphism of these insertions are consistent with recent or ongoing infectious activity in macaques. 106 of the proviruses form a clade characterized by an ~750 bp sequence between env and the 3′ long terminal repeat (LTR), derived from an ancient recombination with a HERVK(HML8)related virus. This clade is found in Old World monkeys (OWM), but not great apes, suggesting it originated after the ape/OWM split. We identified similar proviruses in whitecheeked gibbons; the gibbon insertions cluster within the OWM recombinant clade, suggesting interspecies transmission from OWM to gibbons. The LTRs of the youngest proviruses have deletions in U3, which disrupt the Rec Response Element (RcRE), required for nuclear export of unspliced viral RNA. We show that the HML8derived region functions as a Recindependent constitutive transport element (CTE), indicating the ancestral Rec–RcRE export system was replaced by a CTE mechanism.

 Evolutionary Biology
 Genetics and Genomics
The nearly neutral theory of molecular evolution posits variation among species in the effectiveness of selection. In an idealized model, the census population size determines both this minimum magnitude of the selection coefficient required for deleterious variants to be reliably purged, and the amount of neutral diversity. Empirically, an ‘effective population size’ is often estimated from the amount of putatively neutral genetic diversity and is assumed to also capture a species’ effectiveness of selection. A potentially more direct measure of the effectiveness of selection is the degree to which selection maintains preferred codons. However, past metrics that compare codon bias across species are confounded by amongspecies variation in %GC content and/or amino acid composition. Here, we propose a new Codon Adaptation Index of Species (CAIS), based on Kullback–Leibler divergence, that corrects for both confounders. We demonstrate the use of CAIS correlations, as well as the Effective Number of Codons, to show that the protein domains of more highly adapted vertebrate species evolve higher intrinsic structural disorder.