Efficient estimation for large-scale linkage disequilibrium patterns of the human genome

  1. Institute of Bioinformatics, Zhejiang University, Hangzhou, Zhejiang Province, China
  2. Center for General Practice Medicine, Department of General Practice Medicine; Center for Reproductive Medicine, Department of Genetic and Genomic Medicine, and Clinical Research Institute, Zhejiang Provincial People’s Hospital, People’s Hospital of Hangzhou Medical College, Hangzhou, Zhejiang, China
  3. Alibaba Group, Hangzhou, Zhejiang, China
  4. Key Laboratory of Endocrine Gland Diseases of Zhejiang Province, Hangzhou, Zhejiang, China

Peer review process

Not revised: This Reviewed Preprint includes the authors’ original preprint (without revision), an eLife assessment, and public reviews.


Editors

  • Reviewing Editor
    Alexander Young
    University of California, Los Angeles, Los Angeles, United States of America
  • Senior Editor
    Molly Przeworski
    Columbia University, New York, United States of America

Reviewer #1 (Public Review):

Summary:
Huang and colleagues present a method for approximation of linkage disequilibrium (LD) matrices. The problem of computing LD matrices is the problem of computing a correlation matrix. In the cases considered by the authors, the number of rows (n), corresponding to individuals, is small compared to the number of columns (m), corresponding to the number of variants. Computing the correlation matrix has O(nm^2) time complexity, which is prohibitive for large samples. The authors approach this using three main strategies: 1. they compute a coarsened approximation of the LD matrix by dividing the genome into variant-wise blocks over which statistics are effectively averaged; 2. they use a trick to get the coarsened LD matrix from a coarsened genomic relatedness matrix (GRM), which, with O(n^2 m) time complexity, is faster when n << m; 3. they use the Mailman algorithm to improve the speed of basic linear algebra operations by a factor of log(max(m,n)). The authors apply this approach to several datasets.
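The GRM trick in strategy 2 can be illustrated with a minimal numpy sketch (this is an illustration of the underlying identity, not the authors' X-LD implementation): for a standardized genotype matrix X of shape n × m, the mean squared LD over a block is a normalized squared Frobenius norm of X^T X, and since ||X^T X||_F = ||X X^T||_F, it can be obtained from the much smaller n × n matrix X X^T.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 50, 2000                       # n individuals << m variants
X = rng.standard_normal((n, m))
X = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize each variant

# Direct route: form the m x m LD (correlation) matrix.
# Cost: O(n m^2) time, O(m^2) memory.
R = X.T @ X / n
mean_ld_direct = (R ** 2).mean()

# GRM route: form only the n x n matrix X X^T.
# Cost: O(n^2 m) time, O(n^2) memory.
K = X @ X.T
mean_ld_grm = (K ** 2).sum() / (n ** 2 * m ** 2)

# Both routes give the same mean squared LD, since
# ||X^T X||_F^2 = trace((X^T X)^2) = trace((X X^T)^2) = ||X X^T||_F^2.
print(np.isclose(mean_ld_direct, mean_ld_grm))
```

The same identity applies block-wise: restricting X to the variants of two blocks gives the averaged cross-block LD from correspondingly small matrix products.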

Strengths:
- the authors demonstrate that their proposed method performs in line with theoretical explanations
- the coarsened LD matrix is useful for describing global patterns of LD, which do not necessarily require variant-level resolution
- they provide an open-source implementation of their software

Weaknesses:
- the coarsened LD matrix is of limited utility outside of analyzing macroscale LD characteristics
- the method still essentially has cubic complexity, albeit with smaller constant factors, and the Mailman algorithm reduces the runtime appreciably. It would be interesting if the authors were able to apply randomized or iterative approaches to achieve more fundamental gains. The algorithm remains slow when n is large and/or the grid resolution is increased.
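One randomized approach of the kind alluded to above could be a Hutchinson-style stochastic trace estimator (a sketch under assumptions, not part of X-LD): the quantity trace((X X^T)^2), which drives the mean-LD computation, can be estimated from a handful of matrix-vector products without ever forming the n × n or m × m matrix.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 50, 2000
X = rng.standard_normal((n, m))
X = (X - X.mean(axis=0)) / X.std(axis=0)

def hutchinson_trace_K2(X, k=200, rng=rng):
    """Estimate trace((X X^T)^2) with k Rademacher probe vectors.

    Each probe needs two matrix-vector products, O(n m) time, so the
    total cost is O(k n m) and neither X X^T nor X^T X is ever formed.
    """
    n = X.shape[0]
    est = 0.0
    for _ in range(k):
        z = rng.choice([-1.0, 1.0], size=n)
        Kz = X @ (X.T @ z)        # (X X^T) z via two cheap products
        est += Kz @ Kz            # z^T K^2 z = ||K z||^2
    return est / k

exact = np.sum((X @ X.T) ** 2)    # trace(K^2) = ||K||_F^2 (K symmetric)
approx = hutchinson_trace_K2(X)
```

The estimator is unbiased, with relative error shrinking like 1/sqrt(k), so k can trade accuracy against runtime; whether this would beat the exact O(n^2 m) route for biobank-scale n is an empirical question.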

Reviewer #2 (Public Review):

Summary:
In this paper, the authors point out that the standard approach to estimating LD is inefficient for datasets with large numbers of SNPs, with a computational cost of O(nm^2), where n is the number of individuals and m is the number of SNPs. Using the known relationship between the LD matrix and the genomic-relatedness matrix, they can calculate the mean level of LD within the genome or across genomic segments with a computational cost of O(n^2m). Since in most datasets n << m, this yields a substantial reduction in cost.

Strengths:
Generally, for computational papers like this, the proof is in the pudding, and the authors appear to have been successful at their aim of producing an efficient computational tool. The most compelling evidence of this in the paper is Figure 2 and Supplementary Figure S2. In Figure 2, they report how well their X-LD estimates of LD compare to estimates based on the standard approach using PLINK, and the two show very good agreement. In Figure S2, they report the computational runtime of X-LD vs PLINK, and, as expected, X-LD is faster than PLINK whenever it evaluates LD for more than 8,000 SNPs.

Weakness:
While the X-LD software appears to work well, I had a hard time following the manuscript enough to make a very good assessment of the work. This is partly because many parameters used are not defined clearly or at all in some cases. My best effort to intuit what the parameters meant often led me to find what appeared to be errors in their derivation. As a result, I am left worrying if the performance of X-LD is due to errors cancelling out in the particular setting they consider, making it potentially prone to errors when taken to different contexts.

Impact:
I feel there would be clear value in the work done here if the writing were clearer. Currently, LD calculations are a costly step in tools like LD score regression and Bayesian prediction algorithms, so a more efficient way to conduct these calculations would be broadly useful. However, given the difficulty I had following the manuscript, I was not able to assess when the authors' approach would be appropriate for such an extension.
