Schematic illustration of large-scale LD analysis, exemplified by the CONVERGE cohort.

A) The 22 human autosomes yield 22 intra-chromosomal and 231 inter-chromosomal LD components, shown without (left) and with (right) the scaling transformation given in Eq 8. B) Zooming into chromosome 2 (420,946 SNPs): a chromosome of relative neutrality is expected to have a self-similar structure, with many approximately strong LD grids along the diagonal and relatively weak LD grids off the diagonal. Here chromosome 2 of CONVERGE has been split into 1,000 blocks, yielding 1,000 diagonal LD grids and 499,500 off-diagonal LD grids.
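The counts quoted in this caption follow from simple combinatorics: n units (chromosomes or blocks) give n diagonal components and n(n-1)/2 off-diagonal pairs. A minimal sketch, in which the function name `ld_component_counts` is a hypothetical helper introduced only for illustration:

```python
from math import comb


def ld_component_counts(n_units: int) -> tuple[int, int]:
    """Return (diagonal, off-diagonal) counts for n_units chromosomes or blocks.

    Hypothetical helper: the diagonal holds one component per unit, and the
    off-diagonal holds one component per unordered pair of distinct units.
    """
    return n_units, comb(n_units, 2)


print(ld_component_counts(22))    # (22, 231): 22 autosomes
print(ld_component_counts(1000))  # (1000, 499500): 1,000 blocks of chromosome 2
```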

Reconciliation of LD estimators in the 26 1KG cohorts.

A) Consistency of the intra-chromosomal and inter-chromosomal LD estimated by X-LD and PLINK (--r2) in the 26 1KG cohorts. In each plot, the fitted line for the 22 intra-chromosomal estimates is in purple, and the fitted line for the 231 inter-chromosomal estimates is in green. The gray solid line, which depends on the sample size of each cohort, represents the expected relationship between the PLINK and X-LD estimates, and the two estimated regression models in the top-right corner of each plot show this consistency. The sample size of each cohort is given in parentheses. B) Distribution of R2 of the fitted lines for the intra-chromosomal and inter-chromosomal LD, based on the X-LD and PLINK estimates in the 26 cohorts.

26 1KG cohorts: MSL (Mende in Sierra Leone), GWD (Gambian in Western Division, The Gambia), YRI (Yoruba in Ibadan, Nigeria), ESN (Esan in Nigeria), ACB (African Caribbean in Barbados), LWK (Luhya in Webuye, Kenya), ASW (African Ancestry in Southwest US); CHS (Han Chinese South), CDX (Chinese Dai in Xishuangbanna, China), KHV (Kinh in Ho Chi Minh City, Vietnam), CHB (Han Chinese in Beijing, China), JPT (Japanese in Tokyo, Japan); BEB (Bengali in Bangladesh), ITU (Indian Telugu in the UK), STU (Sri Lankan Tamil in the UK), PJL (Punjabi in Lahore, Pakistan), GIH (Gujarati Indian in Houston, TX); TSI (Toscani in Italia), IBS (Iberian populations in Spain), CEU (Utah residents (CEPH) with Northern and Western European ancestry), GBR (British in England and Scotland), FIN (Finnish in Finland); MXL (Mexican Ancestry in Los Angeles, California), PUR (Puerto Rican in Puerto Rico), CLM (Colombian in Medellin, Colombia), PEL (Peruvian in Lima, Peru).

Various LD components for the 26 1KG cohorts.

A) Chromosomal-scale LD components for 5 representative cohorts (CEU, CHB, YRI, ASW, and 1KG). The upper part of each figure shows the intra-chromosomal LD (along the diagonal) and the inter-chromosomal LD (off-diagonal), and the lower part shows the scaled LD as in Eq 8. For visualization purposes, the quantity of LD before scaling is transformed to a -log10 scale, with smaller values (red hues) representing larger LD, and a value of 0 representing that all SNPs are in LD. B) The relationship between the degree of population structure and the LD components in the 26 1KG cohorts.

High-resolution illustration of LD grids for CEU, CHB, YRI, and ASW (m = 250).

For each cohort, we partition chromosomes 6 and 11 into high-resolution LD grids (each LD grid contains 250 × 250 SNP pairs). The bottom half of each figure shows the LD grids for the entire chromosome. The upper half of each figure zooms into the HLA region on chromosome 6 and the centromere region on chromosome 11 and shows the detailed LD in these regions. For visualization purposes, LD is transformed to a -log10 scale, with smaller values (red hues) representing larger LD, and a value of 0 representing that all SNPs are in LD.

LD decay analysis for 26 1KG cohorts.

A) Conventional LD decay analysis in PLINK for the 26 cohorts. To eliminate the influence of sample size, the inverse of the sample size has been subtracted from the original LD values. The YRI cohort, represented by the orange dotted line, is chosen as the reference cohort in each plot. The top-down arrow shows the order of LD decay values according to Table 4. B) Model-based LD decay analysis for the 26 1KG cohorts. For each cohort, we regressed the LD of each autosome against the inverse of its SNP number. The regression coefficient quantifies the average LD decay of the genome, and the intercept provides a direct estimate of the possible existence of long-distance LD. The values in the first three plots indicate the correlation between and LD decay score at three different physical distances, the correlation between (left-side vertical axis) and LD decay score (right-side vertical axis), and the correlation between (left-side vertical axis) and (right-side vertical axis), respectively. The last plot assesses the impact of the centromere region of chromosome 11 on the linear relationship between chromosomal LD and the inverse of the SNP number. The dark and light gray dashed lines represent the mean of the with and without the presence of the centromere region of chromosome 11.
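The two adjustments described in this caption can be sketched in a few lines: subtracting the inverse sample size from an LD value (panel A), and an ordinary least-squares fit of per-chromosome LD on the inverse SNP number (panel B). This is an illustrative sketch only, not the X-LD implementation; the function names and the toy numbers are hypothetical.

```python
def adjust_for_sample_size(ld_value: float, n_samples: int) -> float:
    """Subtract 1/n from an LD value, as described for panel A."""
    return ld_value - 1.0 / n_samples


def fit_ld_decay(snp_counts, chrom_ld):
    """Least-squares fit of chromosomal LD on 1/m.

    Returns (slope, intercept): the slope plays the role of the regression
    coefficient and the intercept the role of the long-distance LD term.
    """
    x = [1.0 / m for m in snp_counts]
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(chrom_ld) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, chrom_ld))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical toy data: four "chromosomes" whose LD is exactly 0.001 + 5/m.
snp_counts = [100, 200, 400, 800]
chrom_ld = [0.001 + 5.0 / m for m in snp_counts]
slope, intercept = fit_ld_decay(snp_counts, chrom_ld)
print(slope, intercept)
```

With real data, the inputs would be one SNP count and one LD estimate per autosome for a given cohort.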

The correlation between the inverse of the SNP number and chromosomal LD.

A) The correlation between the inverse of the SNP number and chromosomal LD in CEU, CHB, YRI, and ASW. B) A leave-one-chromosome-out strategy is adopted to evaluate the contribution of each chromosome to the correlation between the inverse of the SNP number and chromosomal LD. C) The correlation between the inverse of the SNP number and chromosomal LD in CEU, CHB, YRI, and ASW after removing the centromere region of chromosome 11. D) High-resolution illustration of LD grids for chromosome 8 in CEU, CHB, YRI, and ASW. For each cohort, we partition chromosome 8 into consecutive LD grids (each LD grid contains 250 × 250 SNP pairs). For visualization purposes, LD is transformed to a -log10 scale, with smaller values (red hues) representing larger LD, and a value of 0 representing that all SNPs are in LD.
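The leave-one-chromosome-out idea in panel B can be sketched as follows: drop one chromosome at a time and recompute the Pearson correlation on the rest. The function names and toy values are hypothetical and serve only to illustrate the procedure.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def leave_one_chromosome_out(inv_snp_number, chrom_ld):
    """Map each 1-based chromosome index to the correlation without it."""
    result = {}
    for i in range(len(inv_snp_number)):
        x = inv_snp_number[:i] + inv_snp_number[i + 1:]
        y = chrom_ld[:i] + chrom_ld[i + 1:]
        result[i + 1] = pearson(x, y)
    return result


# Hypothetical toy data: the fifth "chromosome" breaks an otherwise
# perfect linear trend, so removing it restores a correlation of 1.
inv_m = [1.0, 2.0, 3.0, 4.0, 5.0]
ld = [2.0, 4.0, 6.0, 8.0, 0.0]
loco = leave_one_chromosome_out(inv_m, ld)
for chrom, r in loco.items():
    print(chrom, round(r, 3))
```

A chromosome whose removal changes the correlation markedly is the one driving (or distorting) the overall relationship.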

Computational time for the demonstrated estimation tasks

X-LD estimation for complex LD components (2,997,635 SNPs)

Estimates of chromosomal LD for the 22 autosomes in CEU, CHB, YRI, and ASW, respectively

LD decay regression analysis for 26 cohorts

Reconciliation of LD estimators in AFR, EAS, and EUR.

In each figure, the fitted line for the 22 intra-chromosomal estimates is in purple, and the fitted line for the 231 inter-chromosomal estimates is in green. The gray solid line, which depends on the sample size, represents the expected relationship between the PLINK and X-LD estimates, and the two estimated regression models in the top-right corner show this consistency.

The computational efficiency of the X-LD algorithm.

Given the high computational cost of PLINK, only chromosome 1 was used. To evaluate computational efficiency, we added SNPs incrementally until the entire chromosome was included. The bar chart and line chart show the actual computation time and the theoretical computational complexity, respectively.

Chromosomal scale LD components for 26 cohorts in 1KG.

The upper and lower parts of each figure represent the LD before and after scaling according to Eq 8. Intra-chromosomal and inter-chromosomal LD are represented by the diagonal and the off-diagonal elements, respectively. For visualization purposes, LD before scaling is transformed to a -log10 scale, with smaller values (red hues) representing larger LD, and a value of 0 representing that all SNPs are in LD.

High-resolution illustration of LD grids for CEU, CHB, YRI, and ASW (m = 500).

For each cohort, we partitioned each chromosome into consecutive LD grids (each LD grid containing 500 SNPs). For visualization purposes, LD is transformed to a -log10-scale, with smaller values (red hues) representing larger LD, and a value of 0 representing that all SNPs are in LD.

Influence of the HLA region on chromosome 6 and the centromere region on chromosome 11 on chromosomal LD in CEU, CHB, YRI, and ASW.

For the comparison in which another region was removed, to guard against chance effects, the same number of consecutive SNPs as in the HLA region or the centromere region was removed from a randomly chosen genomic region, and this operation was repeated 100 times.
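The random-removal control described here can be sketched as drawing 100 random windows of consecutive SNPs, each the same length as the region of interest, and dropping one window per repeat. The sketch assumes the window start is drawn uniformly over the whole chromosome; whether the original analysis excluded windows overlapping the HLA or centromere region is not stated, and all names below are hypothetical.

```python
import random


def random_consecutive_windows(n_snps, window_size, n_repeats=100, seed=0):
    """Draw n_repeats half-open (start, stop) windows of consecutive SNPs.

    Assumption: starts are uniform over all positions where the window fits.
    """
    rng = random.Random(seed)
    windows = []
    for _ in range(n_repeats):
        start = rng.randint(0, n_snps - window_size)  # inclusive bounds
        windows.append((start, start + window_size))
    return windows


def remove_window(snp_indices, window):
    """Return the SNP index list with the given window removed."""
    start, stop = window
    return snp_indices[:start] + snp_indices[stop:]


# Hypothetical toy setting: 1,000 SNPs, region of interest spanning 50 SNPs.
windows = random_consecutive_windows(n_snps=1000, window_size=50)
kept = remove_window(list(range(1000)), windows[0])
print(len(windows), len(kept))
```

Chromosomal LD would then be re-estimated on each reduced SNP set and compared with the estimate obtained after removing the HLA or centromere region itself.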

The correlation between the inverse of the SNP number and chromosomal LD in 26 cohorts of 1KG.

Influence of expanding the number of SNPs on the correlation between the inverse of the SNP number and chromosomal LD in ASW.

Randomly selected SNPs that were present in ASW but not among the 2,997,635 consensus SNPs were added to the ASW cohort to demonstrate the stable pattern of chromosome 8.