Limitations of principal components in quantitative genetic association models for human studies
Abstract
Principal Component Analysis (PCA) and the Linear Mixed-effects Model (LMM), sometimes in combination, are the most common genetic association models. Previous PCA-LMM comparisons give mixed results and unclear guidance, and have several limitations, including not varying the number of principal components (PCs), simulating simple population structures, and inconsistent use of real data and power evaluations. We evaluate PCA and LMM, both varying the number of PCs, in realistic genotype and complex trait simulations including admixed families, subpopulation trees, and real multiethnic human datasets with simulated traits. We find that LMM without PCs usually performs best, with the largest effects in family simulations and in real human datasets and traits without environment effects. Poor PCA performance on human datasets is driven by large numbers of distant relatives more than by the smaller number of closer relatives. While PCA was known to fail on family data, we report strong effects of family relatedness in genetically diverse human datasets, not avoided by pruning close relatives. Environment effects driven by geography and ethnicity are better modeled with LMM including those labels instead of PCs. This work better characterizes the severe limitations of PCA, compared to LMM, in modeling the complex relatedness structures of multiethnic human data for association studies.
Editor's evaluation
This is an important paper that presents compelling arguments (based on simulation and comprehensively reviewed background theory) that Linear Mixed Models generally should perform better at correcting for genetic and environmental confounding in GWAS than more commonly used Principal Components methods.
https://doi.org/10.7554/eLife.79238.sa0

Introduction
The goal of a genetic association study is to identify loci whose genotype variation is significantly correlated to a given trait. Naive association tests assume that genotypes are drawn independently from a common allele frequency. This assumption does not hold for structured populations, which include multiethnic cohorts and admixed individuals (ancient relatedness), and for family data (recent relatedness; Astle and Balding, 2009). Association studies of admixed and multiethnic cohorts, the focus of this work, are becoming more common, are believed to be more powerful, and are necessary to bring more equity to genetic medicine (Rosenberg et al., 2010; Hoffman and Dubé, 2013; Coram et al., 2013; Medina-Gomez et al., 2015; Conomos et al., 2016a; Hodonsky et al., 2017; Martin et al., 2017a; Martin et al., 2017b; Hindorff et al., 2018; Hoffmann et al., 2018; Mogil et al., 2018; Roselli et al., 2018; Wojcik et al., 2019; Peterson et al., 2019; Zhong et al., 2019; Hu et al., 2020; Simonin-Wilmer et al., 2021; Kamariza et al., 2021; Lin et al., 2021; Mahajan et al., 2022; Hou et al., 2023a). When approaches that do not account for relatedness are applied to such data, their association statistics are miscalibrated, resulting in excess false positives and loss of power (Devlin and Roeder, 1999; Voight and Pritchard, 2005; Astle and Balding, 2009). Therefore, many specialized approaches have been developed for genetic association under relatedness, of which PCA and LMM are the most popular.
Genetic association with PCA consists of including the top eigenvectors of the population kinship matrix as covariates in a generalized linear model (Zhang et al., 2003; Price et al., 2006; Bouaziz et al., 2011). These top eigenvectors are a new set of coordinates for individuals that are commonly referred to as PCs in genetics (Patterson et al., 2006), the convention adopted here; in other fields, PCs instead denote what in genetics would be the projections of loci onto eigenvectors, which are new independent coordinates for loci (Jolliffe, 2002). The direct ancestor of PCA association is structured association, in which inferred ancestry (genetic cluster membership, often corresponding with labels such as “European”, “African”, “Asian”, etc.) or admixture proportions of these ancestries are used as regression covariates (Pritchard et al., 2000). These models are deeply connected because PCs map to ancestry empirically (Alexander et al., 2009; Zhou et al., 2016) and theoretically (McVean, 2009; Zheng and Weir, 2016; Cabreros and Storey, 2019; Chiu et al., 2022), and they work as well as global ancestry in association studies but are estimated more easily (Patterson et al., 2006; Zhao et al., 2007; Alexander et al., 2009; Bouaziz et al., 2011). Another approach closely related to PCA is non-metric multidimensional scaling (Zhu and Yu, 2009). PCs have also been proposed for modeling environment effects that are correlated to ancestry, for example, through geography (Novembre et al., 2008; Zhang and Pan, 2015; Lin et al., 2021). The strength of PCA is its simplicity: as covariates, PCs can be readily included in more complex models, such as haplotype association (Xu and Guan, 2014) and polygenic models (Qian et al., 2020). However, PCA assumes that the underlying relatedness space is low dimensional (or low rank), so that it can be well modeled with a small number of PCs, which may limit its applicability.
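The PCA association model described above can be sketched in a few lines: compute top eigenvectors of a genotype-based kinship estimate, then include them as covariates in a per-locus regression. This is a minimal illustration on toy simulated data, not the paper's pipeline; all sizes, frequencies, and the trait construction are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: n individuals from two subpopulations with different allele
# frequencies; all sizes and frequencies here are illustrative only.
n, m, r = 200, 500, 2
pop = np.repeat([0, 1], n // 2)                     # subpopulation labels
p = np.where(pop == 0, 0.2, 0.8)                    # per-individual allele freq.
X = rng.binomial(2, p[:, None], size=(n, m))        # genotypes in {0, 1, 2}

# "PCs" in the genetics sense: top eigenvectors of the n x n kinship
# estimate built from centered genotypes.
Z = X - X.mean(axis=0)
K = Z @ Z.T / m
eigvecs = np.linalg.eigh(K)[1]                      # eigenvalues ascending
pcs = eigvecs[:, ::-1][:, :r]                       # top r eigenvectors

# Per-locus test: regress the trait on [intercept, PCs, genotype] and keep
# the genotype coefficient; here the trait is confounded by structure only.
y = 0.5 * pop + rng.normal(size=n)

def genotype_beta(g):
    D = np.column_stack([np.ones(n), pcs, g])
    return np.linalg.lstsq(D, y, rcond=None)[0][-1]

betas = np.array([genotype_beta(X[:, j]) for j in range(m)])
```

With this strong two-population structure, the first PC separates the subpopulations almost perfectly, which is why a handful of PCs can absorb the confounding in low-dimensional settings.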
PCA is known to be inadequate for family data (Patterson et al., 2006; Zhu and Yu, 2009; Thornton and McPeek, 2010; Price et al., 2010), which is called ‘cryptic relatedness’ when it is unknown to the researchers, but no other troublesome cases have been confidently identified. Recent work has focused on developing more scalable versions of the PCA algorithm (Lee et al., 2012; Abraham and Inouye, 2014; Galinsky et al., 2016; Abraham et al., 2017; Agrawal et al., 2020). PCA remains a popular and powerful approach for association studies.
The other dominant association model under relatedness is the LMM, which includes a random effect parameterized by the kinship matrix. Unlike PCA, LMM does not assume that relatedness is low-dimensional, and it explicitly models families via the kinship matrix. Early LMMs used kinship matrices estimated from known pedigrees or with methods that captured recent relatedness only, and modeled population structure (ancestry) as fixed effects (Yu et al., 2006; Zhao et al., 2007; Zhu and Yu, 2009). Modern LMMs estimate kinship from genotypes using a nonparametric estimator, often referred to as a genetic relationship matrix, that captures the combined covariance due to family relatedness and ancestry (Kang et al., 2008; Astle and Balding, 2009; Ochoa and Storey, 2021). Like PCA, LMM has also been proposed for modeling environment correlated to genetics (Vilhjálmsson and Nordborg, 2013; Wang et al., 2022). The classic LMM assumes a quantitative (continuous) complex trait, the focus of our work. Although case-control (binary) traits and their underlying ascertainment are theoretically a challenge (Yang et al., 2014), LMMs have been applied successfully to balanced case-control studies (Astle and Balding, 2009; Kang et al., 2010) and simulations (Price et al., 2010; Wu et al., 2011; Sul and Eskin, 2013), and have been adapted for unbalanced case-control studies (Zhou et al., 2018). However, LMMs tend to be considerably slower than PCA and other models, so much effort has focused on improving their runtime and scalability (Aulchenko et al., 2007; Kang et al., 2008; Kang et al., 2010; Zhang et al., 2010; Lippert et al., 2011; Yang et al., 2011; Listgarten et al., 2012; Zhou and Stephens, 2012; Svishcheva et al., 2012; Loh et al., 2015; Zhou et al., 2018).
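A minimal sketch of the genotype-based kinship estimate ("genetic relationship matrix") that parameterizes the LMM's random effect, on assumed toy data. The standardization by the estimated mean and variance of each locus follows the common GRM convention; it is not the popkin estimator used in the paper, and the toy data are our assumption.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 100, 1000
p = rng.uniform(0.1, 0.9, size=m)          # locus allele frequencies
X = rng.binomial(2, p, size=(n, m))        # unrelated individuals under HWE

# Common GRM form: standardize each locus by its estimated mean 2*p and
# binomial variance 2*p*(1-p), then average cross-products over loci.
p_hat = X.mean(axis=0) / 2
Z = (X - 2 * p_hat) / np.sqrt(2 * p_hat * (1 - p_hat))
A = Z @ Z.T / m

# In the LMM, A parameterizes the random effect u of the model
#   y = D*alpha + u + e,  u ~ N(0, sg2 * A),  e ~ N(0, se2 * I),
# so family relatedness and ancestry are both captured as covariance.
```

For unrelated individuals the diagonal of `A` is near 1 and off-diagonal entries are near 0 up to sampling noise of order `1/sqrt(m)`; related pairs would instead show elevated off-diagonal values.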
An LMM variant that incorporates PCs as fixed covariates is tested thoroughly in our work. Since PCs are the top eigenvectors of the same kinship matrix estimate used in modern LMMs (Astle and Balding, 2009; Janss et al., 2012; Hoffman and Dubé, 2013; Zhang and Pan, 2015), population structure is modeled twice in an LMM with PCs. However, some previous work has found the apparent redundancy of an LMM with PCs beneficial (Price et al., 2010; Tucker et al., 2014; Zhang and Pan, 2015), while other work did not (Liu et al., 2011; Janss et al., 2012), and the approach continues to be used (Zeng et al., 2018; Mbatchou et al., 2021), although not always (Matoba et al., 2020). Recall that early LMMs used kinship to model family relatedness only, so population structure had to be modeled separately in those models, in practice as admixture fractions instead of PCs (Yu et al., 2006; Zhao et al., 2007; Zhu and Yu, 2009). The LMM with PCs (vs no PCs) is also believed to better model loci that have experienced selection (Price et al., 2010; Vilhjálmsson and Nordborg, 2013) and environment effects correlated with genetics (Zhang and Pan, 2015).
LMM and PCA are closely related models (Astle and Balding, 2009; Janss et al., 2012; Hoffman and Dubé, 2013; Zhang and Pan, 2015), so similar performance is expected, particularly under low-dimensional relatedness. Direct comparisons have yielded mixed results, with several studies finding superior performance for LMM, notably from papers promoting advances in LMMs, while many others report comparable performance (Table 1). No papers find that PCA outperforms LMM decisively, although PCA occasionally performs better in isolated and artificial cases or individual measures, often with unknown significance. Previous studies generally used either only simulated or only real genotypes, with only two studies using both. The simulated genotype studies, which tended to have low model dimensions and ${F}_{\text{ST}}$, were more likely to report ties or mixed results (6/8), whereas real genotypes tended to clearly favor LMMs (9/11). Similarly, 10/12 papers with quantitative traits favor LMMs, whereas 6/9 papers with case-control traits gave ties or mixed results—the only factor we do not explore in this work. Additionally, although all previous evaluations measured type I error (or proxies such as the genomic inflation factor (Devlin and Roeder, 1999) or QQ plots), a large fraction (6/17) did not measure power (or proxies such as ROC curves), and only four used more than one number of PCs for PCA. Lastly, no consensus has emerged as to why LMM might outperform PCA or vice versa (Price et al., 2010; Sul and Eskin, 2013; Price et al., 2013; Hoffman and Dubé, 2013), or which features of the real datasets are critical for the LMM advantage other than family relatedness, resulting in unclear guidance for using PCA.
Hence, our work includes real and simulated genotypes with higher model dimensions and ${F}_{\text{ST}}$ values matching those of multiethnic human cohorts (Ochoa and Storey, 2021; Ochoa and Storey, 2019); we vary the number of PCs and measure robust proxies for type I error control and calibrated power.
In this work, we evaluate the PCA and LMM association models under various numbers of PCs, which are included in LMMs too. We use genotype simulations (admixture, family, and subpopulation tree models) and three real datasets: the 1000 Genomes Project (Abecasis et al., 2010; Abecasis et al., 2012), the Human Genome Diversity Panel (HGDP) (Cann et al., 2002; Rosenberg et al., 2002; Bergström et al., 2020), and Human Origins (Patterson et al., 2012; Lazaridis et al., 2014; Lazaridis et al., 2016; Skoglund et al., 2016). We simulate quantitative traits from two models: fixed effect sizes (FES), which constructs coefficients inversely proportional to allele frequency, matching real data (Park et al., 2011; Zeng et al., 2018; O’Connor et al., 2019) and corresponding to high pleiotropy, strong balancing selection (Simons et al., 2018), and strong negative selection (Zeng et al., 2018; O’Connor et al., 2019), which are appropriate assumptions for diseases; and random coefficients (RC), which draws coefficients independent of allele frequency, corresponding to neutral traits (Zeng et al., 2018; Simons et al., 2018). LMM without PCs consistently performs best in simulations without environment effects, and greatly outperforms PCA in the family simulation and in all real datasets. The tree simulations, which model subpopulations with the tree but exclude family structure, do not recapitulate the real data results, suggesting that family relatedness in real data is the reason for poor PCA performance. Lastly, removing up to 4th-degree relatives in the real datasets recapitulates poor PCA performance, showing that the more numerous distant relatives explain the result, and suggesting that PCA is generally not an appropriate model for real data. We find that both LMM and PCA are able to model environment effects correlated with genetics, and LMM with PCs gains a small advantage in this setting only, but direct modeling of environment performs much better.
Altogether, we find that LMMs without PCs are generally a preferable association model, and present novel simulation and evaluation approaches to measure the performance of these and other genetic association approaches.
Results
Overview of evaluations
We use three real genotype datasets and simulated genotypes from six population structure scenarios to cover various features of interest (Table 2). We introduce them in sets of three, as they appear in the rest of our results. Population kinship matrices, which combine population and family relatedness, are estimated without bias using popkin (Ochoa and Storey, 2021; Figure 1). The first set of three simulated genotypes is based on an admixture model with 10 ancestries (Figure 1A; Ochoa and Storey, 2021; Gopalan et al., 2016; Cabreros and Storey, 2019). The ‘large’ version (1000 individuals) illustrates asymptotic performance, while the ‘small’ simulation (100 individuals) illustrates model overfitting. The ‘family’ simulation has admixed founders and draws a 20-generation random pedigree with assortative mating, resulting in a complex joint family and ancestry structure in the last generation (Figure 1B). The second set of three are the real human datasets representing global human diversity: Human Origins (Figure 1D), HGDP (Figure 1G), and 1000 Genomes (Figure 1J), which are enriched for small minor allele frequencies even after applying a MAF <1% filter (Figure 1C). Last are the subpopulation tree simulations (Figure 1F, I, L), fit to the kinship (Figure 1E, H and K) and MAF (Figure 1C) of each real human dataset, which by design do not have family structure.
All traits in this work are simulated. We repeated all evaluations on two additive quantitative trait models, fixed effect sizes (FES) and random coefficients (RC), which differ in how causal coefficients are constructed. The FES model captures the rough inverse relationship between coefficient and minor allele frequency that arises under strong negative and balancing selection and has been observed in numerous diseases and other traits (Park et al., 2011; Zeng et al., 2018; Simons et al., 2018; O’Connor et al., 2019), so it is the focus of our results. The RC model draws coefficients independent of allele frequency, corresponding to neutral traits (Zeng et al., 2018; Simons et al., 2018), which results in a wider effect size distribution that reduces association power and effective polygenicity compared to FES.
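The contrast between the two trait models can be conveyed with a short sketch. The specific `1/sqrt(p*(1-p))` magnitude for FES is our illustrative assumption (the text only specifies an inverse relationship between coefficient and allele frequency); it has the convenient property that every causal locus contributes the same expected trait variance, whereas RC coefficients drawn independently of frequency leave rare variants contributing almost nothing.

```python
import numpy as np

rng = np.random.default_rng(2)
m_causal = 100
p = rng.uniform(0.01, 0.5, size=m_causal)    # causal allele frequencies

# FES: magnitudes inverse to allele frequency, random signs; the exact
# 1/sqrt(p*(1-p)) form is our assumption for illustration.
beta_fes = rng.choice([-1.0, 1.0], size=m_causal) / np.sqrt(p * (1 - p))

# RC: coefficients drawn independently of allele frequency.
beta_rc = rng.normal(size=m_causal)

# Per-locus genetic variance under Hardy-Weinberg: 2*p*(1-p)*beta^2.
var_fes = 2 * p * (1 - p) * beta_fes**2      # constant (= 2) across loci
var_rc = 2 * p * (1 - p) * beta_rc**2        # varies widely across loci
```

The wider spread of `var_rc` is one way to see why RC traits have lower association power and lower effective polygenicity than FES traits.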
We evaluate using two complementary measures: (1) ${\text{SRMSD}}_{p}$ (p-value signed root mean square deviation) measures p-value calibration (closer to zero is better), and (2) ${\text{AUC}}_{\text{PR}}$ (precision-recall area under the curve) measures causal locus classification performance (higher is better; Figure 2). ${\text{SRMSD}}_{p}$ is a more robust alternative to the common inflation factor $\lambda $ and type I error control measures; there is a correspondence between $\lambda $ and ${\text{SRMSD}}_{p}$, with ${\text{SRMSD}}_{p}>0.01$ giving $\lambda >1.06$ (Figure 2—figure supplement 1) and thus evidence of miscalibration close to the rule of thumb of $\lambda >1.05$ (Price et al., 2010). There is also a monotonic correspondence between ${\text{SRMSD}}_{p}$ and the type I error rate (Figure 2—figure supplement 2). ${\text{AUC}}_{\text{PR}}$ has been used to evaluate association models (Rakitsch et al., 2013), and reflects calibrated statistical power (Figure 2—figure supplement 3) while being robust to miscalibrated models (Appendix 2).
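To convey the spirit of ${\text{SRMSD}}_{p}$, here is a hedged sketch: the signed RMSD between sorted null p-values and their uniform expectation, with the sign convention (positive when p-values run anti-conservative) assumed by us. The paper's Methods define the exact statistic; this is only an illustration of the idea.

```python
import math, random

def srmsd_p(pvals):
    # Signed RMSD between sorted p-values and uniform quantiles.
    # Sketch only; the sign convention is our assumption, not the
    # paper's exact definition.
    ps = sorted(pvals)
    n = len(ps)
    expected = [(i + 0.5) / n for i in range(n)]
    diffs = [obs - exp for obs, exp in zip(ps, expected)]
    rmsd = math.sqrt(sum(d * d for d in diffs) / n)
    # Positive when p-values are smaller than uniform (anti-conservative).
    return rmsd if sum(diffs) < 0 else -rmsd

random.seed(0)
null_p = [random.random() for _ in range(10000)]
calibrated = srmsd_p(null_p)                        # near 0
inflated = srmsd_p([q * q for q in null_p])         # > 0: p-values too small
deflated = srmsd_p([math.sqrt(q) for q in null_p])  # < 0: too conservative
```

Unlike a single quantile (as in $\lambda$, which only looks at the median test statistic), this averages the deviation over the whole p-value distribution, which is what makes it a more robust miscalibration summary.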
Both PCA and LMM are evaluated in each replicate dataset including a number of PCs $r$ between 0 and 90 as fixed covariates. In terms of p-value calibration, for PCA the best number of PCs $r$ (minimizing mean ${\text{SRMSD}}_{p}$ over replicates) is typically large across all datasets (Table 3), although much smaller $r$ values often performed as well (shown in following sections). Most cases have a mean ${\text{SRMSD}}_{p}<0.01$, in which case p-values are effectively calibrated. However, PCA is often miscalibrated on the family simulation and the real datasets (Table 3). In contrast, for LMM, $r=0$ (no PCs) is always best, and is always calibrated. Comparing LMM with $r=0$ to PCA with its best $r$, LMM always has significantly smaller ${\text{SRMSD}}_{p}$ than PCA or is statistically tied. For ${\text{AUC}}_{\text{PR}}$ and PCA, the best $r$ is always smaller than the best $r$ for ${\text{SRMSD}}_{p}$, so there is often a tradeoff between calibrated p-values and classification performance. For LMM, there is no tradeoff, as $r=0$ often has the best mean ${\text{AUC}}_{\text{PR}}$, and otherwise is not significantly different from the best $r$. Lastly, LMM with $r=0$ always has significantly greater or statistically tied ${\text{AUC}}_{\text{PR}}$ compared to PCA with its best $r$.
Evaluations in admixture simulations
Now we look more closely at results per dataset. The complete ${\text{SRMSD}}_{p}$ and ${\text{AUC}}_{\text{PR}}$ distributions for the admixture simulations and FES traits are in Figure 3. RC traits gave qualitatively similar results (Figure 3—figure supplement 1).
In the large admixture simulation, the ${\text{SRMSD}}_{p}$ of PCA is largest when $r=0$ (no PCs) and decreases rapidly to near zero at $r=3$, where it stays for up to $r=90$ (Figure 3A). Thus, PCA has calibrated p-values for $r\ge 3$, smaller than the theoretical optimum for this simulation of $r=K-1=9$. In contrast, the ${\text{SRMSD}}_{p}$ for LMM starts near zero for $r=0$, but becomes negative as $r$ increases (p-values are conservative). The ${\text{AUC}}_{\text{PR}}$ distribution of PCA is similarly worst at $r=0$, increases rapidly and peaks at $r=3$, then decreases slowly for $r>3$, while the ${\text{AUC}}_{\text{PR}}$ distribution for LMM starts near its maximum at $r=0$ and decreases with $r$. Although the ${\text{AUC}}_{\text{PR}}$ distributions for LMM and PCA overlap considerably at each $r$, LMM with $r=0$ has significantly greater ${\text{AUC}}_{\text{PR}}$ values than PCA with $r=3$ (Table 3). However, qualitatively PCA performs nearly as well as LMM in this simulation.
The observed robustness to large $r$ led us to consider smaller sample sizes. A model with a large number of parameters $r$ should overfit more as $r$ approaches the sample size $n$. Rather than increase $r$ beyond 90, we reduce the number of individuals to $n=100$, which is small for typical association studies but may occur in studies of rare diseases, pilot studies, or under other constraints. To compensate for the loss of power due to reducing $n$, we also reduce the number of causal loci (see Trait Simulation), which increases per-locus effect sizes. We found a large decrease in performance for both models as $r$ increases, and best performance at $r=1$ for PCA and $r=0$ for LMM (Figure 3B). Remarkably, LMM attains much larger negative ${\text{SRMSD}}_{p}$ values than in our other evaluations. LMM with $r=0$ is significantly better than PCA ($r=1$ to 4) in both measures (Table 3), but qualitatively the difference is negligible.
The family simulation adds a 20generation random family to our large admixture simulation. Only the last generation is studied for association, which contains numerous siblings, first cousins, etc., with the initial admixture structure preserved by geographically biased mating. Our evaluation reveals a sizable gap in both measures between LMM and PCA across all $r$ (Figure 3C). LMM again performs best with $r=0$ and achieves mean ${\text{SRMSD}}_{p}<0.01$. However, PCA does not achieve mean ${\text{SRMSD}}_{p}<0.01$ at any $r$, and its best mean ${\text{AUC}}_{\text{PR}}$ is considerably worse than that of LMM. Thus, LMM is conclusively superior to PCA, and the only calibrated model, when there is family structure.
Evaluations in real human genotype datasets
Next, we repeat our evaluations with real human genotype data, which differ from our simulations in their allele frequency distributions and in having more complex population structures, with greater ${F}_{\text{ST}}$, numerous correlated subpopulations, and potential cryptic family relatedness.
Human Origins has the greatest number and diversity of subpopulations. The ${\text{SRMSD}}_{p}$ and ${\text{AUC}}_{\text{PR}}$ distributions in this dataset and FES traits (Figure 4A) most resemble those from the family simulation (Figure 3C). In particular, while LMM with $r=0$ performed optimally (both measures) and satisfies mean ${\text{SRMSD}}_{p}<0.01$, PCA maintained ${\text{SRMSD}}_{p}>0.01$ for all $r$ and its ${\text{AUC}}_{\text{PR}}$ were all considerably smaller than the best ${\text{AUC}}_{\text{PR}}$ of LMM.
HGDP has the fewest individuals among the real datasets, but compared to Human Origins contains more loci and low-frequency variants. Performance (Figure 4B) again most resembled the family simulation. In particular, LMM with $r=0$ achieves mean ${\text{SRMSD}}_{p}<0.01$ (p-values are calibrated), while PCA does not, and there is a sizable ${\text{AUC}}_{\text{PR}}$ gap between LMM and PCA. Maximum ${\text{AUC}}_{\text{PR}}$ values were lowest in HGDP compared to the two other real datasets.
1000 Genomes has the fewest subpopulations but largest number of individuals per subpopulation. Thus, although this dataset has the simplest subpopulation structure among the real datasets, we find ${\text{SRMSD}}_{p}$ and ${\text{AUC}}_{\text{PR}}$ distributions (Figure 4C) that again most resemble our earlier family simulation, with mean ${\text{SRMSD}}_{p}<0.01$ for LMM only and large ${\text{AUC}}_{\text{PR}}$ gaps between LMM and PCA.
Our results are qualitatively different for RC traits, which had smaller ${\text{AUC}}_{\text{PR}}$ gaps between LMM and PCA (Figure 4—figure supplement 1). Maximum ${\text{AUC}}_{\text{PR}}$ were smaller in RC compared to FES in Human Origins and 1000 Genomes, suggesting lower power for RC traits across association models. Nevertheless, LMM with $r=0$ was significantly better than PCA for all measures in the real datasets and RC traits (Table 3).
Evaluations in subpopulation tree simulations fit to human data
To better understand which features of the real datasets lead to the large differences in performance between LMM and PCA, we carried out subpopulation tree simulations. Human subpopulations are related roughly by trees, which induce the strongest correlations, so we fit trees to each real dataset and tested whether data simulated from these complex tree structures could recapitulate our previous results (Figure 1). These tree simulations also feature non-uniform ancestral allele frequency distributions, which recapitulated some of the skew toward smaller minor allele frequencies of the real datasets (Figure 1C). The ${\text{SRMSD}}_{p}$ and ${\text{AUC}}_{\text{PR}}$ distributions for these tree simulations (Figure 5) resembled our admixture simulation more than either the family simulation (Figure 3) or the real data results (Figure 4). Both LMM with $r=0$ and PCA (various $r$) achieve mean ${\text{SRMSD}}_{p}<0.01$ (Table 3). The ${\text{AUC}}_{\text{PR}}$ distributions of both LMM and PCA track closely as $r$ is varied, although there is a small gap resulting in LMM ($r=0$) besting PCA in all three simulations. The results are qualitatively similar for RC traits (Figure 5—figure supplement 1, Table 3). Overall, these subpopulation tree simulations do not recapitulate the large LMM advantage over PCA observed on the real data.
Numerous distant relatives explain poor PCA performance in real data
In principle, PCA performance should be determined by the dimension of relatedness, or kinship matrix rank, since PCA is a low-dimensional model whereas LMM can model high-dimensional relatedness without overfitting. We used the Tracy-Widom test (Patterson et al., 2006) with $p<0.01$ to estimate kinship matrix rank as the number of significant PCs (Figure 6—figure supplement 1A). The true rank of our simulations is slightly underestimated (Table 2), but we confirm that the family simulation has the greatest rank, and the real datasets have greater estimates than their respective subpopulation tree simulations, which confirms our hypothesis to some extent. However, estimated ranks do not separate real datasets from tree simulations, as required to predict the observed PCA performance. Moreover, the HGDP and 1000 Genomes rank estimates are 45 and 61, respectively, yet PCA performed poorly for all numbers of PCs $r\le 90$ (Figure 4). The top eigenvalue explained a proportion of variance proportional to ${F}_{\text{ST}}$ (Table 2), but the rest of the top 10 eigenvalues show no clear differences between datasets, except that the small simulation had larger variances explained per eigenvalue (expected since it has fewer eigenvalues; Figure 6—figure supplement 1). Comparing cumulative variance explained versus rank fraction across all eigenvalues, all datasets increase from their starting point almost linearly until they reach 1, except that the family simulation has much greater variance explained by mid-rank eigenvalues (Figure 6—figure supplement 1). We also calculated the number of PCs that are significantly associated with the trait, and observed similar results, namely that while the family simulation has more significant PCs than the non-family admixture simulations, the real datasets and their tree-simulated counterparts have similar numbers of significant PCs (Figure 6—figure supplement 2).
Overall, there is no separation between real datasets (where PCA performed poorly) and subpopulation tree simulations (where PCA performed relatively well) in terms of their eigenvalues or kinship matrix rank estimates.
Local kinship, which is recent relatedness due to family structure excluding population structure, is the presumed cause of the performance gap between LMM and PCA observed in real datasets but not in their subpopulation tree simulation counterparts. Instead of inferring local kinship through increased kinship matrix rank, as attempted above, we now measure it directly using the KING-robust estimator (Manichaikul et al., 2010). We observe more large local kinship values in the real datasets and the family simulation compared to the other simulations (Figure 6). However, for real data this distribution depends on the subpopulation structure, since locally related pairs are most likely in the same subpopulation. Therefore, the only comparable curve to each real dataset is its corresponding subpopulation tree simulation, which matches subpopulation structure. In all real datasets, we identified highly related individual pairs with kinship above the 4th-degree relative threshold of 0.022 (Manichaikul et al., 2010; Conomos et al., 2016b). However, these highly related pairs are vastly outnumbered by more distant pairs with evidently nonzero local kinship compared to the extreme values of the tree simulations.
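The local kinship measurement can be sketched with the commonly quoted between-family KING-robust form, which counts jointly heterozygous and opposite-homozygous loci; this is our reading of the estimator, and Manichaikul et al., 2010 remains the authoritative source. The toy pairs below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
m = 5000
p = rng.uniform(0.2, 0.8, size=m)            # shared allele frequencies

def king_robust(gi, gj):
    # Between-family KING-robust kinship from genotype vectors in {0,1,2},
    # using the commonly quoted form
    #   (N_het,het - 2*N_opposite_hom) / (N_het_i + N_het_j);
    # see Manichaikul et al., 2010 for the exact derivation.
    both_het = int(np.sum((gi == 1) & (gj == 1)))
    opp_hom = int(np.sum(np.abs(gi - gj) == 2))
    n_het = int(np.sum(gi == 1)) + int(np.sum(gj == 1))
    return (both_het - 2 * opp_hom) / n_het

# Unrelated pair: independent genotypes -> local kinship near 0.
g1 = rng.binomial(2, p)
g2 = rng.binomial(2, p)

# Parent-offspring-like pair: child inherits one allele from g1 and draws
# the other from the population -> kinship near 1/4.
child = rng.binomial(1, g1 / 2) + rng.binomial(1, p)
```

A key property exploited here is that this estimator does not require allele frequency estimates, which is what makes it robust to population structure when isolating recent (family) relatedness.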
To try to improve PCA performance, we followed the standard practice of removing 4th-degree relatives, which reduced sample sizes by between 5% and 10% (Table 4). Only $r=0$ for LMM and $r=20$ for PCA were tested, as these performed well in our earlier evaluation, and only FES traits were tested because they previously displayed the large PCA-LMM performance gap. LMM significantly outperforms PCA in all these cases (Wilcoxon paired one-tailed $p<0.01$; Figure 7). Notably, PCA still had miscalibrated p-values in two of the three real datasets (${\text{SRMSD}}_{p}>0.01$), the only marginally calibrated case being HGDP, which is also the smallest of these datasets. Otherwise, ${\text{AUC}}_{\text{PR}}$ and ${\text{SRMSD}}_{p}$ ranges were similar here as in our earlier evaluation. Therefore, the removal of the small number of highly related individual pairs had a negligible effect on PCA performance, so the larger number of more distantly related pairs explains the poor PCA performance in the real datasets.
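The relative-removal step can be sketched as a greedy filter at the 0.022 kinship threshold cited above: repeatedly drop the individual involved in the most above-threshold pairs until none remain. Real pipelines (e.g. KING or PC-AiR) use their own, more careful algorithms, so this is only an illustrative sketch, and the toy kinship matrix is our assumption.

```python
KIN_CUTOFF = 0.022   # 4th-degree relative threshold from the text

def prune_relatives(kinship, cutoff=KIN_CUTOFF):
    # Greedily remove the individual with the most remaining pairs whose
    # local kinship exceeds the cutoff, until no such pair is left.
    keep = list(range(len(kinship)))
    while True:
        counts = [sum(1 for j in keep if j != i and kinship[i][j] > cutoff)
                  for i in keep]
        worst = max(counts)
        if worst == 0:
            return keep
        keep.pop(counts.index(worst))

# Toy local-kinship matrix: individuals 0 and 1 are siblings (~0.25);
# all other pairs fall below the threshold.
kin = [[0.50, 0.25, 0.00, 0.00],
       [0.25, 0.50, 0.00, 0.01],
       [0.00, 0.00, 0.50, 0.00],
       [0.00, 0.01, 0.00, 0.50]]
kept = prune_relatives(kin)   # one sibling is dropped, distant pairs remain
```

Note how the pair with kinship 0.01 survives: pruning only removes pairs above the close-relative cutoff, which is exactly why the many distant relatives in real data remain after this step.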
Low heritability and environment simulations
Our main evaluations were repeated with traits simulated under a lower heritability of ${h}^{2}=0.3$. We reduced the number of causal loci in response to this change in heritability, to yield an equal average effect size per locus compared to the previous high-heritability evaluations (see Trait Simulation). Despite that, these low-heritability evaluations measured lower ${\text{AUC}}_{\text{PR}}$ values than their high-heritability counterparts (Figure 3—figure supplement 2, Figure 3—figure supplement 3, Figure 4—figure supplement 2, Figure 4—figure supplement 3, Figure 7—figure supplement 1). The gap between LMM and PCA was reduced in these evaluations, but the main conclusion of the high-heritability evaluation holds for low heritability as well, namely that LMM with $r=0$ significantly outperforms or ties LMM with $r>0$ and PCA in all cases (Table 5).
Lastly, we simulated traits with both low heritability and large environment effects determined by geography and subpopulation labels, so they are strongly correlated to the low-dimensional population structure. For that reason, PCs may be expected to perform better in this setting (in either PCA or LMM). However, we find that both PCA and LMM (even without PCs) increase their ${\text{AUC}}_{\text{PR}}$ values compared to the low-heritability evaluations (Figure 8—figure supplement 1; Figure 8 also shows representative numbers of PCs, which performed optimally or nearly so in individual simulations shown in Figure 3—figure supplement 4, Figure 3—figure supplement 5, Figure 4—figure supplement 4, Figure 4—figure supplement 5). P-value calibration is comparable with or without environment effects, for LMM at all $r$ and for PCA once $r$ is large enough (Figure 8—figure supplement 1). These simulations are the only ones where we occasionally observed, for both metrics, a significant though small advantage of LMM with PCs versus LMM without PCs (Table 6). Additionally, on RC traits only, PCA significantly outperforms LMM in the three real human datasets (Table 6), the only cases in all of our evaluations where this is observed. For comparison, we also evaluate an ‘oracle’ LMM without PCs but with the finest group labels, the same used to simulate environment, as fixed categorical covariates (‘LMM lab.’), and see much larger ${\text{AUC}}_{\text{PR}}$ values than either LMM with PCs or PCA (Figure 8, Figure 3—figure supplement 4, Figure 3—figure supplement 5, Figure 4—figure supplement 4, Figure 4—figure supplement 5, Table 6). However, LMM with labels is often more poorly calibrated than LMM or PCA without labels, which may be because these numerous labels are modeled as fixed rather than random effects.
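The ‘LMM lab.’ idea of using the finest group labels as fixed categorical covariates can be sketched with dummy coding in a plain linear model; the labels, group effects, and sizes below are hypothetical, and a real analysis would add the genetic random effect as well.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 300
labels = rng.integers(0, 3, size=n)           # 3 hypothetical group labels
env = np.array([0.0, 1.0, -1.0])[labels]      # group-driven environment effect
y = env + rng.normal(size=n)                  # trait = environment + noise

# Dummy-code the labels as fixed categorical covariates (group 0 is the
# baseline), as in the "LMM lab." oracle, here in a plain linear model.
D = np.column_stack([np.ones(n),
                     (labels == 1).astype(float),
                     (labels == 2).astype(float)])
alpha = np.linalg.lstsq(D, y, rcond=None)[0]
resid = y - D @ alpha                         # environment regressed out
```

Because the labels match the environment groups exactly, the fitted group coefficients recover the group effects and the residual variance drops to roughly the noise level, which is the mechanism behind the large ${\text{AUC}}_{\text{PR}}$ gains of the label model over PCs.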
Overall, we find that association studies with correlated environment and genetic effects remain a challenge for PCA and LMM, that addition of PCs to an LMM improves performance only marginally, and that if the environment effect is driven by geography or ethnicity then use of those labels greatly improves performance compared to using PCs.
Discussion
Our evaluations conclusively determined that LMM without PCs performs better than PCA (for any number of PCs) across all scenarios without environment effects, including all real and simulated genotypes and two trait simulation models. Although adding a few PCs to LMM did not greatly hurt its performance (except at small sample sizes), it generally did not improve it either (Table 3, Table 5), which agrees with previous observations (Liu et al., 2011; Janss et al., 2012) but contradicts others (Zhao et al., 2007; Price et al., 2010). Our findings make sense because PCs are the top eigenvectors of the same kinship matrix that parameterizes the random effects, so including both is redundant.
The presence of environment effects that are correlated to relatedness presents the only scenario where PCA and LMM with PCs occasionally outperform LMM without PCs (Table 6). It is commonly believed that PCs model such environment effects well (Novembre et al., 2008; Zhang and Pan, 2015; Lin et al., 2021). However, we observe that LMM without PCs models environment effects nearly as well as with PCs (Figure 8), consistent with previous findings (Vilhjálmsson and Nordborg, 2013; Wang et al., 2022) and with environment inflating heritability estimates under LMM (Heckerman et al., 2016). Moreover, modeling the true environment groups as fixed categorical effects always substantially improved ${\text{AUC}}_{\text{PR}}$ compared to modeling them with PCs (Figure 8, Table 6). Modeling numerous environment groups as fixed effects does result in deflated p-values (Figure 8, Table 6), which we expect would be avoided by modeling them as random effects, a strategy we chose not to pursue here as it is both a circular evaluation (the true effects were drawn from that model) and out of scope. Overall, including PCs to model environment effects yields limited power gains, if any, even in an LMM, and is no replacement for more adequate modeling of the environment whenever possible.
Previous studies found that PCA was better calibrated than LMM for unusually differentiated markers (Price et al., 2010; Wu et al., 2011; Yang et al., 2014), which, as simulated, constitute an artificial scenario not based on a population genetics model, and which are otherwise believed to be unusual (Sul and Eskin, 2013; Price et al., 2013). Our evaluations on real human data, which contain such loci in relevant proportions if they exist, do not replicate that result. Family relatedness strongly favors LMM, an advantage that probably outweighs this potential PCA benefit in real data.
Relative to LMM, the behavior of PCA fell between two extremes. When PCA performed well, there was a small number of PCs with both calibrated p-values and ${\text{AUC}}_{\text{PR}}$ near that of LMM without PCs. Conversely, PCA performed poorly when no number of PCs had either calibrated p-values or acceptably large ${\text{AUC}}_{\text{PR}}$. There were no cases where high numbers of PCs optimized an acceptable ${\text{AUC}}_{\text{PR}}$, or cases with miscalibrated p-values but high ${\text{AUC}}_{\text{PR}}$. PCA performed well in the admixture simulations (without families, both trait models), real human genotypes with RC traits, and the subpopulation tree simulations (both trait models). Conversely, PCA performed poorly in the admixed family simulation (both trait models) and the real human genotypes with FES traits.
PCA assumes that genetic relatedness is restricted to a low-dimensional subspace, whereas LMM can handle high-dimensional relatedness. Thus, PCA performs well in the admixture simulation, which is explicitly low-dimensional (see Genotype simulation from the admixture model), and in our subpopulation tree simulations, which are likely well approximated by a few dimensions despite the large number of subpopulations because there are few long branches. Conversely, PCA performs poorly under family structure because its kinship matrix is high-dimensional (Figure 6—figure supplement 1). However, estimating the latent space dimensions of real datasets is challenging because estimated eigenvalues have biased distributions (Hayashi et al., 2018). Kinship matrix rank estimated using the Tracy–Widom test (Patterson et al., 2006) did not fully predict the datasets that PCA performs well on. In contrast, estimated local kinship finds considerable cryptic family relatedness in all real human datasets and better explains why PCA performs poorly there. The trait model also influences the relative performance of PCA, so genotype-only parameters (eigenvalues or local kinship) alone cannot tell the full story. There are related tests for numbers of dimensions that consider the trait which we did not consider, including the Bayesian information criterion for the regression of PCs against the trait (Zhu and Yu, 2009). Additionally, PCA and LMM goodness of fit could be compared using the coefficient of determination generalized for LMMs (Sun et al., 2010).
PCA is at best underpowered relative to LMMs, and at worst miscalibrated regardless of the number of PCs included, in real human genotype tests. Among our simulations, such poor performance occurred only in the admixed family simulation. Local kinship estimates reveal considerable family relatedness in the real datasets that is absent in the corresponding subpopulation tree simulations. Admixture is also absent in our tree simulations, but our simulations and theory show that admixture is well handled by PCA. Hundreds of close relative pairs have been identified in 1000 Genomes (Gazal et al., 2015; Al Khudhair et al., 2015; Fedorova et al., 2016; Schlauch et al., 2017), but their removal does not sufficiently improve PCA performance in our tests, so the larger number of more distantly related pairs is PCA’s most serious obstacle in practice. Distant relatives are expected to be numerous in any large human dataset (Henn et al., 2012; Shchur and Nielsen, 2018; Loh et al., 2018). Our FES trait tests show that family relatedness is more challenging when rarer variants have larger coefficients. Overall, the high-dimensional relatedness induced by family structure is the key challenge for PCA association in modern datasets, one that is readily overcome by LMM.
Our tests also found PCA robust to large numbers of PCs, far beyond the optimal choice, agreeing with previous anecdotal observations (Price et al., 2006; Kang et al., 2010), in contrast to using too few PCs, for which there is a large performance penalty. The exception was the small sample size simulation, where only small numbers of PCs performed well. In contrast, LMM is simpler since there is no need to choose the number of PCs. However, an LMM with a large number of covariates may have conservative p-values, as observed for LMM with large numbers of PCs, which is a weakness of the score test used by the LMM we evaluated and may be overcome with other statistical tests. Simulations or post hoc evaluations remain crucial for ensuring that statistics are calibrated.
There are several variants of the PCA and LMM analyses, most designed to better model linkage disequilibrium (LD), that we did not evaluate directly, in which PCs are no longer exactly the top eigenvectors of the kinship matrix (if estimated with different approaches), although this is not a crucial aspect of our arguments. We do not consider the case where samples are projected onto PCs estimated from an external sample (Privé et al., 2020), which is uncommon in association studies and whose primary effect is shrinkage: if all samples are projected, then they are all equally affected and larger regression coefficients compensate for the shrinkage, although this no longer holds if only a portion of the sample is projected onto the PCs of the rest. Another approach tests PCs for association against every locus in the genome in order to identify and exclude PCs that capture LD structure (which is localized) instead of ancestry (which should be present across the genome; Privé et al., 2020); a previous proposal removes LD using an autocorrelation model prior to estimating PCs (Patterson et al., 2006). These improved PCs remain inadequate models of family relatedness, so an LMM will continue to outperform them in that setting. Similarly, the leave-one-chromosome-out (LOCO) approach for estimating kinship matrices for LMMs prevents the test locus, and loci in LD with it, from being modeled by the random effect as well, avoiding what is called ‘proximal contamination’ (Lippert et al., 2011; Yang et al., 2014). While LOCO kinship estimates vary for each chromosome, they continue to model family relatedness, thus maintaining their key advantage over PCA. The LDAK model instead estimates kinship by weighting loci to account for LD (Speed et al., 2012). LD effects must be adjusted for, if present, so for unfiltered data we advise that the previous methods be applied.
However, in this work simulated genotypes do not have LD, and the real datasets were filtered to remove LD, so there is no proximal contamination here and LD confounding is minimized if present at all. These evaluations may therefore be considered the ideal situation where LD effects have been adjusted for successfully, and in this setting LMM outperforms PCA. Overall, these alternative PCs or kinship matrices differ from their basic counterparts by either the extent to which LD influences the estimates (which may be a confounder in a small portion of the genome, by definition) or by sampling noise, neither of which is expected to change our key conclusion.
One limitation of this work is the relatively small sample sizes compared to modern association studies. However, our conclusions are not expected to change with larger sample sizes, as cryptic family relatedness will continue to be abundant in such data, if not increase in abundance, and thus give LMMs an advantage over PCA (Henn et al., 2012; Shchur and Nielsen, 2018; Loh et al., 2018). One reason PCA has been favored over classic LMMs is that PCA’s runtime scales much better with increasing sample size. However, recent approaches not tested in this work have made LMMs more scalable and applicable to biobank-scale data (Loh et al., 2015; Zhou et al., 2018; Mbatchou et al., 2021), so one clear next step is carefully evaluating these approaches in simulations with larger sample sizes. A different benefit of including PCs was recently reported for BOLT-LMM: not greater power but reduced runtime, a property that may be specific to its use of scalable algorithms such as conjugate gradient and variational Bayes (Loh et al., 2018). Many of these newer LMMs also no longer follow the infinitesimal model of the basic LMM (Loh et al., 2015; Mbatchou et al., 2021), and employ novel approximations, features not evaluated in this work and worthy of future study.
Another limitation of this work is ignoring rare variants, a necessity given our smaller sample sizes, at which rare variant association is miscalibrated and underpowered. Using simulations mimicking the UK Biobank, recent work has found that rare variants can have a more pronounced structure than common variants, and that modeling this rare variant structure (with either PCA or LMM) may better model environment confounding, reduce inflation in association studies, and ameliorate stratification in polygenic risk scores (Zaidi and Mathieson, 2020). Better modeling rare variants and their structure is a key next step in association studies.
The largest limitation of our work is that we only considered quantitative traits. Previous evaluations involving case-control traits tended to report PCA–LMM ties or mixed results, an observation potentially confounded by the use of low-dimensional simulations without family relatedness (Table 1). An additional concern is case-control ascertainment bias and imbalance, which appears to affect LMMs more severely, although recent work appears to solve this problem (Yang et al., 2014; Zhou et al., 2018). Future evaluations should aim to include our simulations and real datasets, to ensure that previous results were not biased in favor of PCA by omitting family structure or the larger coefficients for rare variants that are expected for diseases under various selection models.
Overall, our results lead us to recommend LMM over PCA for association studies in general. Although PCA offers flexibility and speed compared to LMM, additional work is required to ensure that PCA is adequate, including removal of close relatives (lowering sample size and wasting resources) followed by simulations or other evaluations of statistics, and even then PCA may perform poorly in terms of both type I error control and power. The large numbers of distant relatives expected of any real dataset all but ensure that PCA will perform poorly compared to LMM (Henn et al., 2012; Shchur and Nielsen, 2018; Loh et al., 2018). Our findings also suggest that related applications such as polygenic models may enjoy gains in power and accuracy by employing an LMM instead of PCA to model relatedness (Rakitsch et al., 2013; Qian et al., 2020). PCA remains indispensable across population genetics, from visualizing population structure and performing quality control to its deep connection to admixture models, but the time has come to limit its use in association testing in favor of LMM or other, richer models capable of modeling all forms of relatedness.
Materials and methods
The complex trait model and PCA and LMM approximations
Let ${x}_{ij}\in \{0,1,2\}$ be the genotype at the biallelic locus $i$ for individual $j$, which counts the number of reference alleles. Suppose there are $n$ individuals and $m$ loci, $\mathbf{X}=({x}_{ij})$ is their $m\times n$ genotype matrix, and $\mathbf{y}$ is the length-$n$ column vector of individual trait values. The additive linear model for a quantitative (continuous) trait is:

$$\mathbf{y}=\mathbf{1}\alpha +{\mathbf{X}}^{\prime}\mathit{\beta}+{\mathbf{Z}}^{\prime}\mathit{\eta}+\mathit{\u03f5},\tag{1}$$
where $\mathbf{1}$ is a length-$n$ vector of ones, $\alpha $ is the scalar intercept coefficient, $\mathit{\beta}$ is the length-$m$ vector of locus coefficients, $\mathbf{Z}$ is a design matrix of environment effects and other covariates, $\mathit{\eta}$ is the vector of environment coefficients, $\mathit{\u03f5}$ is a length-$n$ vector of residuals, and the superscript prime symbol ($\mathrm{\prime}$) denotes matrix transposition. The residuals follow ${\u03f5}_{j}\sim \text{Normal}(0,{\sigma}_{\u03f5}^{2})$ independently per individual $j$, for some ${\sigma}_{\u03f5}^{2}$.
The full model of Equation 1, which has a coefficient for each of the $m$ loci, is underdetermined in current datasets where $m\gg n$. The PCA and LMM models, respectively, approximate the full model fit at a single locus $i$:

$$\mathbf{y}=\mathbf{1}\alpha +{\mathbf{x}}_{i}{\beta}_{i}+{\mathbf{U}}_{r}{\mathit{\gamma}}_{r}+\mathit{\u03f5},\tag{2}$$

$$\mathbf{y}=\mathbf{1}\alpha +{\mathbf{x}}_{i}{\beta}_{i}+\mathbf{s}+\mathit{\u03f5},\phantom{\rule{1em}{0ex}}\mathbf{s}\sim \text{Normal}\left(\mathbf{0},2{\sigma}_{s}^{2}{\mathbf{\Phi}}^{T}\right),\tag{3}$$
where ${\mathbf{x}}_{i}$ is the length-$n$ vector of genotypes at locus $i$ only, ${\beta}_{i}$ is the locus coefficient, ${\mathbf{U}}_{r}$ is an $n\times r$ matrix of PCs, ${\mathit{\gamma}}_{r}$ is the length-$r$ vector of PC coefficients, $\mathbf{s}$ is a length-$n$ vector of random effects, ${\mathbf{\Phi}}^{T}=({\phi}_{jk}^{T})$ is the $n\times n$ kinship matrix conditioned on the ancestral population $T$, and ${\sigma}_{s}^{2}$ is a variance factor. Both models condition the regression of the focal locus $i$ on an approximation of the total polygenic effect ${\mathbf{X}}^{\prime}\mathit{\beta}$ with the same covariance structure, which is parameterized by the kinship matrix. Under the kinship model, genotypes are random variables obeying

$$\text{E}\left[{x}_{ij}\right]=2{p}_{i}^{T},\phantom{\rule{2em}{0ex}}\text{Cov}\left({x}_{ij},{x}_{ik}\right)=4{p}_{i}^{T}\left(1-{p}_{i}^{T}\right){\phi}_{jk}^{T},\tag{4}$$
where ${p}_{i}^{T}$ is the ancestral allele frequency of locus $i$ (Malécot, 1948; Wright, 1949; Jacquard, 1970; Astle and Balding, 2009). Assuming independent loci, the covariance of the polygenic effect is

$$\text{Cov}\left({\mathbf{X}}^{\prime}\mathit{\beta}\right)=\left(\sum_{i=1}^{m}4{p}_{i}^{T}\left(1-{p}_{i}^{T}\right){\beta}_{i}^{2}\right){\mathbf{\Phi}}^{T}=2{\sigma}_{s}^{2}{\mathbf{\Phi}}^{T},$$
which is readily modeled by the LMM random effect $\mathbf{s}$, where the difference in mean is absorbed by the intercept. Alternatively, consider the eigendecomposition of the kinship matrix ${\mathbf{\Phi}}^{T}=\mathbf{U}\mathbf{\Lambda}{\mathbf{U}}^{\prime}$ where $\mathbf{U}$ is the $n\times n$ eigenvector matrix and $\mathbf{\Lambda}$ is the $n\times n$ diagonal matrix of eigenvalues. The random effect can be written as

$$\mathbf{s}=\mathbf{U}{\mathit{\gamma}}_{\text{LMM}},\phantom{\rule{1em}{0ex}}{\mathit{\gamma}}_{\text{LMM}}\sim \text{Normal}\left(\mathbf{0},2{\sigma}_{s}^{2}\mathbf{\Lambda}\right),$$
which follows from the affine transformation property of multivariate normal distributions. Therefore, the PCA term ${\mathbf{U}}_{r}{\mathit{\gamma}}_{r}$ can be derived from the above equation under the additional assumption that the kinship matrix has approximate rank $r$ and the coefficients ${\mathit{\gamma}}_{r}$ are fit without constraints. In contrast, the LMM uses all eigenvectors, while effectively shrinking their coefficients ${\mathit{\gamma}}_{\text{LMM}}$ as all random effects models do, although these parameters are marginalized (Astle and Balding, 2009; Janss et al., 2012; Hoffman and Dubé, 2013; Zhang and Pan, 2015). PCA has more parameters than LMM, so it may overfit more: ignoring the shared terms in Equation 2 and Equation 3, PCA fits $r$ parameters (length of $\mathit{\gamma}$), whereas LMMs fit only one (${\sigma}_{s}^{2}$).
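This redundancy is easy to see numerically: both models are built from the same eigendecomposition, and PCA simply truncates it. Below is a minimal numpy sketch with a hypothetical block-structured kinship matrix (values chosen only for illustration):

```python
import numpy as np

# Toy kinship matrix with two "subpopulation" blocks (hypothetical
# values, for illustration only).
n = 6
Phi = 0.5 * np.eye(n)
Phi[:3, :3] += 0.1
Phi[3:, 3:] += 0.1

# Eigendecomposition Phi = U Lambda U' (Phi is symmetric).
evals, U = np.linalg.eigh(Phi)
evals, U = evals[::-1], U[:, ::-1]   # sort eigenvalues in decreasing order

# PCA keeps only the top r eigenvectors as unconstrained fixed covariates,
# implying a rank-r approximation of the kinship matrix; the LMM random
# effect keeps all n of them, with shrunken coefficients.
r = 2
Phi_r = U[:, :r] @ np.diag(evals[:r]) @ U[:, :r].T
Phi_full = U @ np.diag(evals) @ U.T

print(np.allclose(Phi_full, Phi))            # full basis recovers Phi exactly
print(np.linalg.norm(Phi - Phi_r, 'fro'))    # residual relatedness PCA discards
```

Under family structure the eigenvalue spectrum decays slowly, so no small $r$ makes this residual negligible, which is the intuition behind the PCA failures reported above.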
In practice, the kinship matrix used for PCA and LMM is estimated with variations of a method-of-moments formula applied to standardized genotypes ${\mathbf{X}}_{S}$, which is derived from Equation 4:

$${\widehat{\phi}}_{jk}^{T}=\frac{1}{m}\sum_{i=1}^{m}\frac{\left({x}_{ij}-2{\widehat{p}}_{i}^{T}\right)\left({x}_{ik}-2{\widehat{p}}_{i}^{T}\right)}{4{\widehat{p}}_{i}^{T}\left(1-{\widehat{p}}_{i}^{T}\right)},\tag{5}$$
where the unknown ${p}_{i}^{T}$ is estimated by ${\widehat{p}}_{i}^{T}=\frac{1}{2n}{\sum}_{j=1}^{n}{x}_{ij}$ (Price et al., 2006; Kang et al., 2008; Kang et al., 2010; Yang et al., 2011; Zhou and Stephens, 2012; Yang et al., 2014; Loh et al., 2015; Sul et al., 2018; Zhou et al., 2018). However, this kinship estimator has a complex bias that differs for every individual pair, which arises due to the use of this estimated ${\widehat{p}}_{i}^{T}$(Ochoa and Storey, 2021; Ochoa and Storey, 2019). Nevertheless, in PCA and LMM these biased estimates perform as well as unbiased ones (Hou et al., 2023b).
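For concreteness, here is a numpy sketch of this standard estimator on toy unstructured genotypes (not the plink2 or GCTA implementations, which add their own filters and optimizations):

```python
import numpy as np

def estimate_kinship(X):
    """Method-of-moments kinship estimate (Equation 5) from an m x n
    genotype matrix of 0/1/2 counts.  Allele frequencies are estimated
    from the sample itself, which is the source of the bias discussed
    in the text."""
    p_hat = X.mean(axis=1) / 2                # estimated allele frequencies
    keep = (p_hat > 0) & (p_hat < 1)          # drop fixed loci
    X, p_hat = X[keep], p_hat[keep]
    X_s = (X - 2 * p_hat[:, None]) / np.sqrt(4 * p_hat * (1 - p_hat))[:, None]
    return X_s.T @ X_s / X_s.shape[0]         # n x n kinship estimate

# Unrelated, outbred individuals: diagonal near 1/2, off-diagonal near 0
# (up to the small pair-specific bias noted above).
rng = np.random.default_rng(1)
p = rng.uniform(0.1, 0.9, size=5000)
X = rng.binomial(2, p[:, None], size=(5000, 20))
K = estimate_kinship(X)
```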
We selected fast and robust software implementing the basic PCA and LMM models. PCA association was performed with plink2 (Chang et al., 2015). The quantitative trait association model is a linear regression with covariates, evaluated using the ttest. PCs were calculated with plink2, which equal the top eigenvectors of Equation 5 after removing loci with minor allele frequency $\text{MAF}<0.1$.
LMM association was performed using GCTA (Yang et al., 2011; Yang et al., 2014). Its kinship estimator equals Equation 5. PCs were calculated using GCTA from its kinship estimate. Association significance is evaluated with a score test. In the small simulation only, GCTA with large numbers of PCs had convergence and singularity errors in some replicates, which were treated as missing data.
Simulations
Every simulation was replicated 50 times, drawing anew all genotypes (except for real datasets) and traits. Below we use the notation ${f}_{A}^{B}$ for the inbreeding coefficient of a subpopulation $A$ from another subpopulation $B$ ancestral to $A$. In the special case of the total inbreeding of $A$, ${f}_{A}^{T}$, $T$ is an overall ancestral population, which is ancestral to every individual under consideration, such as the most recent common ancestor (MRCA) population.
Genotype simulation from the admixture model
The basic admixture model is as described previously (Ochoa and Storey, 2021) and is implemented in the R package bnpsd. Both Large and Family simulations have $n=1,000$ individuals, while Small has $n=100$. The number of loci is $m=100,000$. Individuals are admixed from $K=10$ intermediate subpopulations, or ancestries. Each subpopulation ${S}_{u}$ ($u\in \{1,\mathrm{\dots},K\}$) is at coordinate $u$ and has an inbreeding coefficient ${f}_{{S}_{u}}^{T}=u\tau $ for some $\tau $. Ancestry proportions ${q}_{ju}$ for individual $j$ and ${S}_{u}$ arise from a random walk with spread $\sigma $ on the 1D geography, and $\tau $ and $\sigma $ are fit to give ${F}_{\text{ST}}=0.1$ and mean kinship ${\overline{\theta}}^{T}=0.5{F}_{\text{ST}}$ for the admixed individuals (Ochoa and Storey, 2021). Random ancestral allele frequencies ${p}_{i}^{T}$, subpopulation allele frequencies ${p}_{i}^{{S}_{u}}$, individual-specific allele frequencies ${\pi}_{ij}$, and genotypes ${x}_{ij}$ are drawn from this hierarchical model:

$${p}_{i}^{{S}_{u}}\mid {p}_{i}^{T}\sim \text{Beta}\left({p}_{i}^{T}\frac{1-{f}_{{S}_{u}}^{T}}{{f}_{{S}_{u}}^{T}},\left(1-{p}_{i}^{T}\right)\frac{1-{f}_{{S}_{u}}^{T}}{{f}_{{S}_{u}}^{T}}\right),\phantom{\rule{1em}{0ex}}{\pi}_{ij}=\sum_{u=1}^{K}{q}_{ju}{p}_{i}^{{S}_{u}},\phantom{\rule{1em}{0ex}}{x}_{ij}\mid {\pi}_{ij}\sim \text{Binomial}\left(2,{\pi}_{ij}\right),$$
where this Beta is the Balding-Nichols distribution (Balding and Nichols, 1995) with mean ${p}_{i}^{T}$ and variance ${p}_{i}^{T}\left(1-{p}_{i}^{T}\right){f}_{{S}_{u}}^{T}$. Fixed loci ($i$ where ${x}_{ij}=0$ for all $j$, or ${x}_{ij}=2$ for all $j$) are drawn again from the model, starting from ${p}_{i}^{T}$, iterating until no loci are fixed. Each replicate draws new genotypes starting from ${p}_{i}^{T}$.
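This hierarchy can be sketched in a few lines of numpy. The following is a simplified stand-in for bnpsd: the uniform ancestral-frequency distribution and the Gaussian construction of the admixture proportions $\mathbf{Q}$ are illustrative assumptions, not the package's fitted random walk:

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, K = 100, 1000, 10
F_sub = np.arange(1, K + 1) * 0.01      # f_{S_u}^T = u * tau, illustrative tau

# Admixture proportions from a crude 1D geography: individual j is
# centered near subpopulation j*K/n, with Gaussian weights.
coords = np.linspace(1, K, n)[:, None]
Q = np.exp(-0.5 * (coords - np.arange(1, K + 1)) ** 2)
Q /= Q.sum(axis=1, keepdims=True)       # rows sum to one

p_anc = rng.uniform(0.01, 0.5, size=m)  # illustrative ancestral frequencies
# Balding-Nichols: Beta with mean p and variance p(1-p)f.
a = p_anc[:, None] * (1 - F_sub) / F_sub
b = (1 - p_anc[:, None]) * (1 - F_sub) / F_sub
P_sub = rng.beta(a, b)                  # m x K subpopulation frequencies
Pi = P_sub @ Q.T                        # individual-specific frequencies
X = rng.binomial(2, Pi)                 # m x n genotypes
```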
As a brief aside, we prove that using global ancestry proportions as covariates is equivalent in expectation to using PCs under the admixture model. Note that the latent space of $\mathbf{X}$, which is the subspace to which the data is constrained by the admixture model, is given by $({\pi}_{ij})$, which has $K$ dimensions (the number of columns of $\mathbf{Q}=({q}_{ju})$), so the top $K$ PCs span this space. Since associations include an intercept term ($\mathbf{1}\alpha $ in Equation 2), estimated PCs are orthogonal to $\mathbf{1}$ (note ${\widehat{\mathbf{\Phi}}}^{T}\mathbf{1}=\mathbf{0}$ because ${\mathbf{X}}_{S}\mathbf{1}=\mathbf{0}$), and each row of $\mathbf{Q}$ sums to one, so only $K-1$ PCs plus the intercept are needed to span the latent space of this admixture model.
Genotype simulation from random admixed families
We simulated a pedigree with admixed founders, no close relative pairings, assortative mating based on a 1D geography (to preserve admixture structure), random family sizes, and arbitrary numbers of generations (20 here). This simulation is implemented in the R package simfam. Generations are drawn iteratively. Generation 1 has $n=1000$ individuals from the above admixture simulation ordered by their 1D geography. Local kinship measures pedigree relatedness; in the first generation, everybody is locally unrelated and outbred. Individuals are randomly assigned sex. In the next generation, individuals are paired iteratively, removing random males from the pool of available males and pairing them with the nearest available female with local kinship $<1/{4}^{3}$ (males stay unpaired if there are no matches), until there are no more available males or females. Let $n=1000$ be the desired population size, ${n}_{m}=1$ the minimum number of children per family, and ${n}_{f}$ the number of families (paired parents) in the current generation; then the number of additional children (beyond the minimum) is drawn from $\text{Poisson}(n/{n}_{f}-{n}_{m})$. Let $\delta $ be the difference between desired and current population sizes. If $\delta >0$, then $\delta $ random families are incremented by 1. If $\delta <0$, then $|\delta |$ random families with at least ${n}_{m}+1$ children are decremented by 1. If $|\delta |$ exceeds the number of families, all families are incremented or decremented as needed and the process is iterated. Children are assigned sex randomly, and are reordered by the average coordinate of their parents. Children draw alleles from their parents independently per locus. A new random pedigree is drawn for each replicate, as well as new founder genotypes from the admixture model.
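The children-count bookkeeping described above can be sketched directly (a simplified extract of the simfam scheme; the sex assignment, pairing, and geography steps are omitted):

```python
import numpy as np

def draw_family_sizes(n, n_f, n_m=1, rng=None):
    """Draw children counts for n_f families summing exactly to the
    target population size n: a minimum of n_m children per family,
    Poisson extras, then random increments/decrements until the total
    matches (assumes n / n_f >= n_m)."""
    rng = rng or np.random.default_rng()
    sizes = n_m + rng.poisson(n / n_f - n_m, size=n_f)
    delta = n - sizes.sum()
    while delta != 0:
        if delta > 0:
            # increment up to n_f random families by one child
            k = min(delta, n_f)
            sizes[rng.choice(n_f, size=k, replace=False)] += 1
            delta -= k
        else:
            # decrement random families that are above the minimum
            eligible = np.flatnonzero(sizes > n_m)
            k = min(-delta, len(eligible))
            sizes[rng.choice(eligible, size=k, replace=False)] -= 1
            delta += k
    return sizes

sizes = draw_family_sizes(1000, 350, rng=np.random.default_rng(3))
print(sizes.sum())  # 1000 by construction
```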
Genotype simulation from a subpopulation tree model
This model draws subpopulation allele frequencies from a hierarchical model parameterized by a tree, which is also implemented in bnpsd and relies on the R package ape for general tree data structures and methods (Paradis and Schliep, 2019). The ancestral population $T$ is the root, and each node is a subpopulation ${S}_{w}$ indexed arbitrarily. Each edge between ${S}_{w}$ and its parent population ${P}_{w}$ has an inbreeding coefficient ${f}_{{S}_{w}}^{{P}_{w}}$. The ${p}_{i}^{T}$ are drawn from a given distribution, which is constructed to mimic each real dataset in Appendix 1. Given the allele frequencies ${p}_{i}^{{P}_{w}}$ of the parent population, ${S}_{w}$’s allele frequencies are drawn from:

$${p}_{i}^{{S}_{w}}\mid {p}_{i}^{{P}_{w}}\sim \text{Beta}\left({p}_{i}^{{P}_{w}}\frac{1-{f}_{{S}_{w}}^{{P}_{w}}}{{f}_{{S}_{w}}^{{P}_{w}}},\left(1-{p}_{i}^{{P}_{w}}\right)\frac{1-{f}_{{S}_{w}}^{{P}_{w}}}{{f}_{{S}_{w}}^{{P}_{w}}}\right),$$

which is the Balding-Nichols distribution with mean ${p}_{i}^{{P}_{w}}$ and variance ${p}_{i}^{{P}_{w}}\left(1-{p}_{i}^{{P}_{w}}\right){f}_{{S}_{w}}^{{P}_{w}}$.
Individuals $j$ in ${S}_{w}$ draw genotypes from its allele frequency: ${x}_{ij}\mid {p}_{i}^{{S}_{w}}\sim \text{Binomial}\left(2,{p}_{i}^{{S}_{w}}\right)$. Loci with $\text{MAF}<0.01$ are drawn again starting from the ${p}_{i}^{T}$ distribution, iterating until no such loci remain.
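In code, drawing genotypes down a tree is a recursive application of the same Balding-Nichols Beta; below is a toy two-edge chain with illustrative inbreeding coefficients (a sketch, not bnpsd's implementation):

```python
import numpy as np

def draw_tree_frequencies(p_parent, f_edge, rng):
    """Draw child-subpopulation allele frequencies from the parent's via
    the Balding-Nichols Beta (mean p_parent, variance p(1-p)*f_edge)."""
    s = (1 - f_edge) / f_edge
    return rng.beta(p_parent * s, (1 - p_parent) * s)

rng = np.random.default_rng(4)
p_T = rng.uniform(0.05, 0.5, size=20000)      # root (ancestral) frequencies
p_A = draw_tree_frequencies(p_T, 0.05, rng)   # edge T -> A
p_B = draw_tree_frequencies(p_A, 0.02, rng)   # edge A -> B, deeper in the tree
x = rng.binomial(2, p_B, size=(5, 20000))     # 5 individuals in subpopulation B
```

Each edge adds drift (variance) while preserving the mean frequency, which is why coancestry accumulates along shared branches.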
Fitting subpopulation tree to real data
We developed new methods to fit trees to real data based on unbiased kinship estimates from popkin, implemented in bnpsd. A tree with given inbreeding coefficients ${f}_{{S}_{w}}^{{P}_{w}}$ for its edges (between subpopulation ${S}_{w}$ and its parent ${P}_{w}$) gives rise to a coancestry matrix $({\vartheta}_{uv}^{T})$ over subpopulation pairs (${S}_{u},{S}_{v}$), and the goal is to recover these edge inbreeding coefficients from coancestry estimates. Coancestry values are total inbreeding coefficients of the MRCA population of each subpopulation pair. Therefore, we calculate ${f}_{{S}_{w}}^{T}$ for every ${S}_{w}$ recursively from the root as follows. Nodes with parent ${P}_{w}=T$ are already as desired. Given ${f}_{{P}_{w}}^{T}$, the desired ${f}_{{S}_{w}}^{T}$ is calculated via the ‘additive edge’ ${\delta}_{w}$ (Ochoa and Storey, 2021):

$${\delta}_{w}={f}_{{S}_{w}}^{T}-{f}_{{P}_{w}}^{T}={f}_{{S}_{w}}^{{P}_{w}}\left(1-{f}_{{P}_{w}}^{T}\right).\tag{6}$$
These ${\delta}_{w}\ge 0$ because $0\le {f}_{{S}_{w}}^{{P}_{w}},{f}_{{P}_{w}}^{T}\le 1$ for every $w$. Edge inbreeding coefficients can be recovered from additive edges: ${f}_{{S}_{w}}^{{P}_{w}}={\delta}_{w}/(1-{f}_{{P}_{w}}^{T})$. Overall, coancestry values are sums of ${\delta}_{w}$ over common ancestor nodes,

$${\vartheta}_{uv}^{T}=\sum_{w}{I}_{w}(u,v)\,{\delta}_{w},\tag{7}$$
where the sum includes all $w$, and ${I}_{w}(u,v)$ equals 1 if ${S}_{w}$ is a common ancestor of ${S}_{u},{S}_{v}$, and 0 otherwise. Note that the ${I}_{w}(u,v)$ encode the tree topology while the ${\delta}_{w}$ encode the edge values.
To estimate population-level coancestry, first kinship (${\widehat{\phi}}_{jk}^{T}$) is estimated using popkin (Ochoa and Storey, 2021). Individual coancestry (${\widehat{\theta}}_{jk}^{T}$) is estimated from kinship using

$${\widehat{\theta}}_{jk}^{T}={\widehat{\phi}}_{jk}^{T}\phantom{\rule{0.5em}{0ex}}\text{for}\phantom{\rule{0.5em}{0ex}}j\ne k,\phantom{\rule{1em}{0ex}}{\widehat{\theta}}_{jj}^{T}=2{\widehat{\phi}}_{jj}^{T}-1.$$
Lastly, coancestry ${\widehat{\vartheta}}_{uv}^{T}$ between subpopulations ${S}_{u}$ and ${S}_{v}$ is the average of the individual coancestry values:

$${\widehat{\vartheta}}_{uv}^{T}=\frac{1}{{n}_{u}{n}_{v}}\sum_{j\in {S}_{u}}\sum_{k\in {S}_{v}}{\widehat{\theta}}_{jk}^{T},$$

where ${n}_{u}$ is the number of individuals in ${S}_{u}$.
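Putting the last two steps together, a numpy sketch (popkin's own implementation may treat the diagonal and the $u=v$ averages slightly differently):

```python
import numpy as np

def coancestry_from_kinship(Phi, labels):
    """Convert an n x n kinship estimate into subpopulation-level
    coancestry: replace self-kinship phi_jj = (1+f_j)/2 on the diagonal
    with inbreeding f_j = 2*phi_jj - 1, then average over every
    subpopulation pair."""
    Theta = Phi.copy()
    np.fill_diagonal(Theta, 2 * np.diag(Phi) - 1)
    labels = np.asarray(labels)
    groups = sorted(set(labels.tolist()))
    k = len(groups)
    V = np.empty((k, k))
    for a, ga in enumerate(groups):
        for b, gb in enumerate(groups):
            V[a, b] = Theta[np.ix_(labels == ga, labels == gb)].mean()
    return V

# Sanity check: locally unrelated, outbred individuals (phi_jj = 1/2,
# phi_jk = 0) have zero coancestry everywhere.
Phi0 = np.eye(4) * 0.5
V0 = coancestry_from_kinship(Phi0, ["a", "a", "b", "b"])
```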
Topology is estimated with hierarchical clustering using the weighted pair group method with arithmetic mean (Sokal and Michener, 1958), with distance function $d({S}_{u},{S}_{v})=\underset{{u}^{\prime},{v}^{\prime}}{\text{max}}\left\{{\widehat{\vartheta}}_{{u}^{\prime}{v}^{\prime}}^{T}\right\}-{\widehat{\vartheta}}_{uv}^{T}$, which succeeds due to the monotonic relationship between node depth and coancestry (Equation 7). This algorithm recovers the true topology from the true coancestry values, and performs well for estimates from genotypes.
To estimate tree edge lengths, first the ${\delta}_{w}$ are estimated from ${\widehat{\vartheta}}_{uv}^{T}$ and the topology using Equation 7 and non-negative least squares linear regression (Lawson and Hanson, 1974; implemented in nnls; Mullen, 2012) to yield non-negative ${\delta}_{w}$, and ${f}_{{S}_{w}}^{{P}_{w}}$ are calculated from ${\delta}_{w}$ by inverting the additive-edge relation: ${f}_{{S}_{w}}^{{P}_{w}}={\delta}_{w}/(1-{f}_{{P}_{w}}^{T})$. To account for small biases in coancestry estimation, an intercept term ${\delta}_{0}$ is included (${I}_{0}(u,v)=1$ for all $u,v$), and when converting ${\delta}_{w}$ to ${f}_{{S}_{w}}^{{P}_{w}}$, ${\delta}_{0}$ is treated as an additional edge to the root, but is ignored when drawing allele frequencies from the tree.
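As a worked example of the edge fit, the indicator system of Equation 7 for a three-leaf tree ((A,B),C) is tiny, and a dependency-free projected-gradient NNLS (a simple stand-in for the Lawson-Hanson algorithm in the nnls package) recovers the additive edges from noiseless coancestry values:

```python
import numpy as np

def nnls_pg(A, b, iters=20000):
    """Projected-gradient solver for min ||A x - b||^2 subject to x >= 0;
    a basic stand-in for Lawson-Hanson NNLS."""
    x = np.zeros(A.shape[1])
    step = 1.0 / np.linalg.norm(A, 2) ** 2   # 1 / sigma_max(A)^2
    for _ in range(iters):
        x = np.maximum(0.0, x - step * (A.T @ (A @ x - b)))
    return x

# Tree ((A,B),C): columns are edges to A, B, C and to the internal AB
# ancestor.  Rows are subpopulation pairs (AA, AB, AC, BB, BC, CC); a 1
# means the edge's child node is a common ancestor of the pair.
I = np.array([
    [1, 0, 0, 1],   # AA: edge A plus the shared AB ancestor edge
    [0, 0, 0, 1],   # AB: only the shared AB ancestor edge
    [0, 0, 0, 0],   # AC: MRCA is the root, no edges below it
    [0, 1, 0, 1],   # BB
    [0, 0, 0, 0],   # BC
    [0, 0, 1, 0],   # CC
], dtype=float)
delta_true = np.array([0.10, 0.20, 0.30, 0.05])
theta = I @ delta_true            # noiseless coancestry values (Equation 7)
delta_hat = nnls_pg(I, theta)     # recovers delta_true
```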
Trait simulation
Traits are simulated from the quantitative trait model of Equation 1, with novel bias corrections for simulating the desired heritability from real data relying on the unbiased kinship estimator popkin (Ochoa and Storey, 2021). This simulation is implemented in the R package simtrait. All simulations have a fixed narrow-sense heritability of ${h}^{2}$, a variance proportion due to environment effects ${\sigma}_{\eta}^{2}$, and residuals are drawn from ${\u03f5}_{j}\sim \text{Normal}(0,{\sigma}_{\u03f5}^{2})$ with ${\sigma}_{\u03f5}^{2}=1-{h}^{2}-{\sigma}_{\eta}^{2}$. The number of causal loci ${m}_{1}$, which determines the average coefficient size, is chosen with the heuristic formula ${m}_{1}=\mathrm{round}(n{h}^{2}/8)$, which empirically balances power well with varying $n$ and ${h}^{2}$. The set of causal loci $C$ is drawn anew for each replicate, from loci with $\text{MAF}\ge 0.01$ to avoid rare causal variants, which are not discoverable by PCA or LMM at the sample sizes we considered. Letting ${v}_{i}^{T}={p}_{i}^{T}\left(1-{p}_{i}^{T}\right)$, the effect size of locus $i$ equals $2{v}_{i}^{T}{\beta}_{i}^{2}$, its contribution to the trait variance (Park et al., 2010). Under the fixed effect sizes (FES) model, initial causal coefficients are

$${\beta}_{i}=\frac{1}{\sqrt{{v}_{i}^{T}}}$$
for known ${p}_{i}^{T}$; otherwise ${v}_{i}^{T}$ is replaced by the unbiased estimator (Ochoa and Storey, 2021) ${\widehat{v}}_{i}^{T}={\widehat{p}}_{i}^{T}\left(1-{\widehat{p}}_{i}^{T}\right)/(1-{\overline{\phi}}^{T})$, where ${\overline{\phi}}^{T}$ is the mean kinship estimated with popkin. Each causal coefficient is multiplied by –1 with probability 0.5. Alternatively, under the random coefficients (RC) model, initial causal coefficients are drawn independently from ${\beta}_{i}\sim \text{Normal}(0,1)$. For both models, the initial genetic variance is ${\sigma}_{0}^{2}={\sum}_{i\in C}2{v}_{i}^{T}{\beta}_{i}^{2}$, replacing ${v}_{i}^{T}$ with ${\widehat{v}}_{i}^{T}$ for unknown ${p}_{i}^{T}$ (so ${\sigma}_{0}^{2}$ is an unbiased estimate), so we multiply every initial ${\beta}_{i}$ by $\frac{h}{{\sigma}_{0}}$ to obtain the desired heritability. Lastly, for known ${p}_{i}^{T}$, the intercept coefficient is $\alpha =-{\sum}_{i\in C}2{p}_{i}^{T}{\beta}_{i}$. When the ${p}_{i}^{T}$ are unknown, ${\widehat{p}}_{i}^{T}$ should not simply replace ${p}_{i}^{T}$, since that distorts the trait covariance (for the same reason the standard kinship estimator in Equation 5 is biased), which is avoided with an adjusted intercept estimate.
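The rescaling logic is easiest to see in code. Below is a numpy sketch of the FES construction with known allele frequencies on unstructured toy genotypes (not the simtrait implementation); the exact initial constant for ${\beta}_{i}$ is immaterial because of the final rescaling:

```python
import numpy as np

rng = np.random.default_rng(5)
n, m, h2 = 500, 5000, 0.3
m1 = round(n * h2 / 8)                        # heuristic number of causal loci

p = rng.uniform(0.05, 0.5, size=m)            # known allele frequencies here
X = rng.binomial(2, p[:, None], size=(m, n))  # unstructured toy genotypes
maf = np.minimum(p, 1 - p)
causal = rng.choice(np.flatnonzero(maf >= 0.01), size=m1, replace=False)

v = p[causal] * (1 - p[causal])
# FES: equal per-locus effect sizes 2*v*beta^2, i.e. beta proportional to
# 1/sqrt(v), with random signs; RC would instead draw beta ~ Normal(0, 1).
beta = rng.choice([-1.0, 1.0], size=m1) / np.sqrt(v)
sigma0_sq = np.sum(2 * v * beta ** 2)         # initial genetic variance
beta *= np.sqrt(h2 / sigma0_sq)               # rescale to the target h^2

g = X[causal].T @ beta                        # genetic component of the trait
y = g - g.mean() + rng.normal(0, np.sqrt(1 - h2), size=n)
```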
Simulations optionally included multiple environment group effects, similarly to previous models (Zhang and Pan, 2015; Wang et al., 2022), as follows. Each independent environment $i$ has predefined groups, and each group $g$ has random coefficients drawn independently from ${\eta}_{gi}\sim \text{Normal}(0,{\sigma}_{\eta i}^{2})$, where ${\sigma}_{\eta i}^{2}$ is a specified variance proportion for environment $i$. $\mathbf{Z}$ has individuals along columns and environment groups along rows, and it contains indicator variables: 1 if the individual belongs to the environment group, 0 otherwise.
We performed trait simulations with the following variance parameters (Table 7): high heritability used ${h}^{2}=0.8$ and no environment effects; low heritability used ${h}^{2}=0.3$ and no environment effects; lastly, environment used ${h}^{2}=0.3,{\sigma}_{\eta 1}^{2}=0.3,{\sigma}_{\eta 2}^{2}=0.2$ (total ${\sigma}_{\eta}^{2}={\sigma}_{\eta 1}^{2}+{\sigma}_{\eta 2}^{2}=0.5$). For real genotype datasets, the groups are the continental (environment 1) and fine-grained (environment 2) subpopulation labels given (see next subsection). For simulated genotypes, we created these labels by grouping on the index $j$ (geographical coordinate) of each simulated individual, assigning group $g=\text{ceiling}(j{k}_{i}/n)$, where ${k}_{i}$ is the number of groups in environment $i$; we selected ${k}_{1}=5$ and ${k}_{2}=25$ to mimic the number of groups in each level of 1000 Genomes (Table 2).
Real human genotype datasets
The three datasets were processed as before (Ochoa and Storey, 2019; summarized below), except with an additional filter so loci are in approximate linkage equilibrium and rare variants are removed. All processing was performed with plink2 (Chang et al., 2015), and analysis was enabled by the R packages BEDMatrix (Grueneberg and de Los Campos, 2019) and genio. Each dataset groups individuals in a two-level hierarchy: continental and fine-grained subpopulations. Final dataset sizes are in Table 2.
We obtained the full (including nonpublic) Human Origins by contacting the authors and agreeing to their usage restrictions. The Pacific data (Skoglund et al., 2016) was obtained separately from the rest (Lazaridis et al., 2014; Lazaridis et al., 2016), and datasets were merged using the intersection of loci. We removed ancient individuals, and individuals from singleton and nonnative subpopulations. Nonautosomal loci were removed. Our analysis of both the wholegenome sequencing (WGS) version of HGDP (Bergström et al., 2020) and the highcoverage NYGC version of 1000 Genomes (Fairley et al., 2020) was restricted to autosomal biallelic SNP loci with filter “PASS”.
Since our evaluations assume uncorrelated loci, we filtered each real dataset with plink2 using the parameters “--indep-pairwise 1000kb 0.3”, which iteratively removes loci that have a squared correlation coefficient greater than 0.3 with another locus within 1000 kb, iterating until no such loci remain. Since all real datasets have numerous rare variants, while PCA and LMM are not able to detect associations involving rare variants, we removed all loci with $\text{MAF}<0.01$. Lastly, only HGDP had loci with over 10% missingness removed, as they were otherwise 17% of remaining loci (for Human Origins and 1000 Genomes such loci were under 1% so they were not removed). Kinship matrix rank and eigenvalues were calculated from popkin kinship estimates. Eigenvalues were assigned p-values with twstats of the Eigensoft package (Patterson et al., 2006), and kinship matrix rank was estimated as the largest number of consecutive eigenvalues from the start that all satisfy $p<0.01$ (p-values did not increase monotonically). For the evaluation with close relatives removed, each dataset was filtered with plink2 with option “--king-cutoff” at 0.02209709 ($={2}^{-11/2}$) for removing up to 4th-degree relatives using KING-robust (Manichaikul et al., 2010), and the $\text{MAF}\ge 0.01$ filter was reapplied (Table 4).
Evaluation of performance
All approaches are evaluated using two complementary metrics: ${\text{SRMSD}}_{p}$ quantifies p-value uniformity, and ${\text{AUC}}_{\text{PR}}$ measures causal locus classification performance, which reflects power while ranking miscalibrated models fairly. These measures are more robust alternatives to previous measures from the literature (Appendix 2), and are implemented in simtrait.
P-values for continuous test statistics have a uniform distribution when the null hypothesis holds, a crucial assumption for type I error and FDR control (Storey, 2003; Storey and Tibshirani, 2003). We use the Signed Root Mean Square Deviation (${\text{SRMSD}}_{p}$) to measure the difference between the observed null p-value quantiles and the expected uniform quantiles:
$${\text{SRMSD}}_{p}=\mathrm{sgn}\left({u}_{\text{median}}-{p}_{\text{median}}\right)\sqrt{\frac{1}{{m}_{0}}\sum _{i=1}^{{m}_{0}}{\left({p}_{(i)}-{u}_{i}\right)}^{2}},$$

where ${m}_{0}=m-{m}_{1}$ is the number of null (non-causal) loci, here $i$ indexes null loci only, ${p}_{(i)}$ is the $i$th ordered null p-value, ${u}_{i}=(i-0.5)/{m}_{0}$ is its expectation, ${p}_{\text{median}}$ is the median observed null p-value, ${u}_{\text{median}}=\frac{1}{2}$ is its expectation, and $\mathrm{sgn}$ is the sign function (1 if ${u}_{\text{median}}\ge {p}_{\text{median}}$, –1 otherwise). Thus, ${\text{SRMSD}}_{p}=0$ corresponds to calibrated p-values, ${\text{SRMSD}}_{p}>0$ indicates anti-conservative p-values, and ${\text{SRMSD}}_{p}<0$ indicates conservative p-values. The maximum ${\text{SRMSD}}_{p}$ is achieved when all p-values are zero (the limit of anti-conservative p-values), which for infinite loci approaches $\sqrt{1/3}\approx 0.577$.
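For concreteness, the statistic can be computed directly from the null p-values. Below is a minimal Python sketch (the paper's implementation is in the simtrait R package; the function and variable names here are our own):

```python
import math

def srmsd_p(null_pvalues):
    """Signed RMSD between sorted null p-values and uniform quantiles.

    The sign is positive (anti-conservative) when the median null p-value
    falls at or below its expectation of 1/2, and negative otherwise.
    """
    p = sorted(null_pvalues)
    m0 = len(p)
    u = [(i + 0.5) / m0 for i in range(m0)]  # expected uniform quantiles
    rmsd = math.sqrt(sum((pi - ui) ** 2 for pi, ui in zip(p, u)) / m0)
    # median of the observed null p-values
    mid = p[m0 // 2] if m0 % 2 == 1 else (p[m0 // 2 - 1] + p[m0 // 2]) / 2
    sign = 1.0 if 0.5 >= mid else -1.0
    return sign * rmsd
```

Calibrated (uniform) p-values yield values near zero; all-zero p-values approach the maximum of $1/\sqrt{3}\approx 0.577$.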
The same value with a negative sign occurs for all pvalues of 1.
Precision and recall are standard performance measures for binary classifiers that do not require calibrated p-values (Grau et al., 2015). Given the total numbers of true positives (TP), false positives (FP), and false negatives (FN) at some threshold or parameter $t$, precision and recall are

$$\text{Precision}=\frac{\text{TP}}{\text{TP}+\text{FP}},\qquad \text{Recall}=\frac{\text{TP}}{\text{TP}+\text{FN}}.$$
Precision and Recall trace a curve as $t$ is varied, and the area under this curve is ${\text{AUC}}_{\text{PR}}$. We use the R package PRROC to integrate the correct nonlinear piecewise function when interpolating between points. A model obtains the maximum ${\text{AUC}}_{\text{PR}}=1$ if there is a $t$ that classifies all loci perfectly. In contrast, the worst models, which classify at random, have an expected precision ($={\text{AUC}}_{\text{PR}}$) equal to the overall proportion of causal loci: ${m}_{1}/m$.
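As an illustration of how these points are traced, the sketch below computes precision-recall pairs from per-locus scores and causal labels, and approximates the area with the trapezoidal rule (the paper instead uses the PRROC R package, which performs the correct nonlinear interpolation; this simplified version is only for intuition):

```python
def precision_recall_curve(scores, is_causal):
    """Precision-recall points, thresholding scores from high to low.

    scores: larger = stronger association (e.g. -log10 p-values).
    is_causal: parallel booleans marking the true causal loci.
    """
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    total_causal = sum(is_causal)
    tp = fp = 0
    points = []
    for i in order:
        if is_causal[i]:
            tp += 1
        else:
            fp += 1
        points.append((tp / total_causal, tp / (tp + fp)))  # (recall, precision)
    return points

def auc_pr_trapezoid(points):
    """Approximate area under the precision-recall curve."""
    area = 0.0
    prev_r, prev_p = 0.0, points[0][1]
    for r, p in points:
        area += (r - prev_r) * (p + prev_p) / 2
        prev_r, prev_p = r, p
    return area
```

A perfect ranking of causal loci above null loci yields an area of 1, as the text describes.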
Data and code availability
The data and code generated during this study are available on GitHub at https://github.com/OchoaLab/pca-assoc-paper (copy archived at Ochoa, 2023). The public subset of Human Origins is available on the Reich Lab website at https://reich.hms.harvard.edu/datasets; non-public samples must be requested from David Reich. The WGS version of HGDP was downloaded from the Wellcome Sanger Institute FTP site at ftp://ngs.sanger.ac.uk/production/hgdp/hgdp_wgs.20190516/. The high-coverage version of the 1000 Genomes Project was downloaded from ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/20190425_NYGC_GATK/.
Web resources
plink2, https://www.cog-genomics.org/plink/2.0/; GCTA, https://yanglab.westlake.edu.cn/software/gcta/; Eigensoft, https://github.com/DReichLab/EIG; bnpsd, https://cran.r-project.org/package=bnpsd; simfam, https://cran.r-project.org/package=simfam; simtrait, https://cran.r-project.org/package=simtrait; genio, https://cran.r-project.org/package=genio; popkin, https://cran.r-project.org/package=popkin; ape, https://cran.r-project.org/package=ape; nnls, https://cran.r-project.org/package=nnls; PRROC, https://cran.r-project.org/package=PRROC; BEDMatrix, https://cran.r-project.org/package=BEDMatrix.
Appendix 1
Fitting ancestral allele frequency distribution to real data
We calculated the ${\widehat{p}}_{i}^{T}$ distributions of each real dataset. However, population structure increases the variance of these sample ${\widehat{p}}_{i}^{T}$ relative to the true ${p}_{i}^{T}$ (Ochoa and Storey, 2021). We present a new algorithm for constructing a distribution based on the input data but with the lower variance of the true ancestral distribution. Suppose the ${p}_{i}^{T}$ distribution over loci $i$ satisfies $\mathrm{E}\left[{p}_{i}^{T}\right]=\frac{1}{2}$ and $\mathrm{Var}\left({p}_{i}^{T}\right)={V}^{T}$. The sample allele frequency ${\widehat{p}}_{i}^{T}$, conditioned on ${p}_{i}^{T}$, satisfies

$$\mathrm{E}\left[{\widehat{p}}_{i}^{T}\mid {p}_{i}^{T}\right]={p}_{i}^{T},\qquad \mathrm{Var}\left({\widehat{p}}_{i}^{T}\mid {p}_{i}^{T}\right)={p}_{i}^{T}\left(1-{p}_{i}^{T}\right){\overline{\phi}}^{T},$$
where ${\overline{\phi}}^{T}=\frac{1}{{n}^{2}}{\sum}_{j=1}^{n}{\sum}_{k=1}^{n}{\phi}_{jk}^{T}$ is the mean kinship over all individuals (Ochoa and Storey, 2021). The unconditional moments of ${\widehat{p}}_{i}^{T}$ follow from the laws of total expectation and variance: $\mathrm{E}\left[{\widehat{p}}_{i}^{T}\right]=\frac{1}{2}$ and

$${W}^{T}=\mathrm{Var}\left({\widehat{p}}_{i}^{T}\right)={V}^{T}+\left(\frac{1}{4}-{V}^{T}\right){\overline{\phi}}^{T}.$$
Since ${V}^{T}\le \frac{1}{4}$ and ${\overline{\phi}}^{T}\ge 0$, then ${W}^{T}\ge {V}^{T}$. Thus, the goal is to construct a new distribution with the original, lower variance of

$${V}^{T}=\frac{{W}^{T}-\frac{1}{4}{\overline{\phi}}^{T}}{1-{\overline{\phi}}^{T}}.$$
We use the unbiased estimator ${\widehat{W}}^{T}=\frac{1}{m}{\sum}_{i=1}^{m}{\left({\widehat{p}}_{i}^{T}-\frac{1}{2}\right)}^{2}$, while ${\overline{\phi}}^{T}$ is calculated from the tree parameters: the subpopulation coancestry matrix (Equation 7) is expanded from subpopulations to individuals, the diagonal is converted to kinship (reversing Equation 5), and the matrix is averaged. However, since our model ignores the MAF filters imposed in our simulations, ${\overline{\phi}}^{T}$ was adjusted. For Human Origins the true model ${\overline{\phi}}^{T}$ of 0.143 was used. For 1000 Genomes and HGDP the true ${\overline{\phi}}^{T}$ are 0.126 and 0.124, respectively, but 0.4 for both produced a better fit.
Lastly, we construct new allele frequencies,

$${p}^{*}=w\,{\widehat{p}}_{i}^{T}+(1-w)\,q,$$
by a weighted average of ${\widehat{p}}_{i}^{T}$ and $q\in (0,1)$ drawn independently from a different distribution. $\mathrm{E}[q]=\frac{1}{2}$ is required to have $\mathrm{E}\left[{p}^{*}\right]=\frac{1}{2}$. The resulting variance is

$$\mathrm{Var}\left({p}^{*}\right)={w}^{2}{W}^{T}+{(1-w)}^{2}\mathrm{Var}(q),$$
which we equate to the desired ${V}^{T}$ (Equation 9) and solve for $w$. For simplicity, we also set $\mathrm{Var}(q)={V}^{T}$, which is achieved with:
Although $w=0$ yields $\mathrm{Var}({p}^{*})={V}^{T}$, we use the second root of the quadratic equation so that ${\widehat{p}}_{i}^{T}$ is used:

$$w=\frac{2{V}^{T}}{{W}^{T}+{V}^{T}}.$$
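The construction can be verified numerically: with $\mathrm{Var}(q)={V}^{T}$ and the mixing weight set to the nonzero root $w=2{V}^{T}/({W}^{T}+{V}^{T})$ of the variance-matching condition, the mixture recovers the target variance. A quick check (illustrative values only, not the authors' code):

```python
def mixture_weight(W, V):
    # Nonzero root of w^2 W + (1 - w)^2 V = V, the variance-matching quadratic.
    return 2 * V / (W + V)

def mixture_variance(W, V, w):
    # Variance of p* = w * p_hat + (1 - w) * q, with p_hat and q independent,
    # Var(p_hat) = W and Var(q) = V.
    return w ** 2 * W + (1 - w) ** 2 * V

# Example: inflated sample variance W above the target ancestral variance V.
W, V = 0.10, 0.04
w = mixture_weight(W, V)
```

Both roots ($w=0$ and the one above) satisfy the constraint; the nonzero root is preferred because it retains information from the observed ${\widehat{p}}_{i}^{T}$.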
Appendix 2
Comparisons between ${\text{SRMSD}}_{p}$, ${\text{AUC}}_{\text{PR}}$, and evaluation measures from the literature
2.1 The inflation factor $\lambda $
Test statistic inflation has been used to measure model calibration (Astle and Balding, 2009; Price et al., 2010). The inflation factor $\lambda$ is defined as the median ${\chi}^{2}$ association statistic divided by the theoretical median under the null hypothesis (Devlin and Roeder, 1999). To compare p-values from non-${\chi}^{2}$ tests (such as t-statistics), $\lambda$ can be calculated from p-values using

$$\lambda =\frac{{F}^{-1}\left(1-{p}_{\text{median}}\right)}{{F}^{-1}\left(1-{u}_{\text{median}}\right)},$$
where ${p}_{\text{median}}$ is the median observed p-value (including causal loci), ${u}_{\text{median}}=\frac{1}{2}$ is its null expectation, and $F$ is the ${\chi}^{2}$ cumulative distribution function (${F}^{-1}$ is its quantile function).
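This conversion is easy to sketch with only the standard library, using the fact that a 1-degree-of-freedom ${\chi}^{2}$ variable is the square of a standard normal (a minimal illustration, not the authors' implementation):

```python
from statistics import NormalDist, median

def chi2_quantile_1df(q):
    """Quantile of the chi-squared distribution with 1 degree of freedom.

    If X ~ chi2(1) then X = Z^2 with Z standard normal, so
    F^{-1}(q) = Phi^{-1}((1 + q) / 2) squared.
    """
    z = NormalDist().inv_cdf((1 + q) / 2)
    return z * z

def inflation_factor(pvalues):
    """lambda = median chi2 statistic over the theoretical null median (1 df)."""
    p_med = median(pvalues)
    return chi2_quantile_1df(1 - p_med) / chi2_quantile_1df(0.5)
```

A median p-value of exactly 1/2 gives $\lambda =1$; medians below 1/2 give $\lambda >1$ (inflation), and medians above 1/2 give $\lambda <1$ (deflation).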
To compare $\lambda$ and ${\text{SRMSD}}_{p}$ directly, for simplicity assume that all p-values are null. In this case, calibrated p-values give $\lambda =1$ and ${\text{SRMSD}}_{p}=0$. However, non-uniform p-values with the expected median, such as from genomic control (Devlin and Roeder, 1999), result in $\lambda =1$ but ${\text{SRMSD}}_{p}\ne 0$ except for uniform p-values, a key flaw of $\lambda$ that ${\text{SRMSD}}_{p}$ overcomes. Inflated statistics (anti-conservative p-values) give $\lambda >1$ and ${\text{SRMSD}}_{p}>0$. Deflated statistics (conservative p-values) give $\lambda <1$ and ${\text{SRMSD}}_{p}<0$. Thus, $\lambda \ne 1$ always implies ${\text{SRMSD}}_{p}\ne 0$ (where $\lambda -1$ and ${\text{SRMSD}}_{p}$ have the same sign), but not the other way around. Overall, $\lambda$ depends only on the median p-value, while ${\text{SRMSD}}_{p}$ uses the complete distribution. However, ${\text{SRMSD}}_{p}$ requires knowing which loci are null, so unlike $\lambda$ it is only applicable to simulated traits.
2.2 Empirical comparison of ${\text{SRMSD}}_{p}$ and $\lambda $
There is a near one-to-one correspondence between $\lambda$ and ${\text{SRMSD}}_{p}$ in our data (Figure 2—figure supplement 1). PCA tended to be inflated ($\lambda >1$ and ${\text{SRMSD}}_{p}>0$) whereas LMM tended to be deflated ($\lambda <1$ and ${\text{SRMSD}}_{p}<0$); otherwise the data for both models fall on the same contiguous curve. We fit a sigmoidal function to these data,

$${\text{SRMSD}}_{p}(\lambda )=a\,\frac{{\lambda}^{b}-1}{{\lambda}^{b}+1},$$
which for $a,b>0$ satisfies ${\text{SRMSD}}_{p}(\lambda =1)=0$ and reflects $\mathrm{log}(\lambda )$ about zero ($\lambda =1$):

$${\text{SRMSD}}_{p}\left(1/\lambda \right)=-{\text{SRMSD}}_{p}(\lambda ).$$
We fit this model to $\lambda >1$ only since it was less noisy and of greater interest, and obtained the curve shown in Figure 2—figure supplement 1 with $a=0.564$ and $b=0.619$. The value $\lambda =1.05$, a common threshold for benign inflation (Price et al., 2010), corresponds to ${\text{SRMSD}}_{p}=0.0085$ according to Equation 10. Conversely, ${\text{SRMSD}}_{p}=0.01$, serving as a simpler rule of thumb, corresponds to $\lambda =1.06$.
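The stated properties (zero at $\lambda =1$, antisymmetry in $\mathrm{log}\lambda$) together with the reported fitted values are consistent with the sigmoid ${\text{SRMSD}}_{p}(\lambda )=a({\lambda}^{b}-1)/({\lambda}^{b}+1)$; we note this functional form is reconstructed from those constraints rather than quoted verbatim. A numeric check reproduces the reported correspondence:

```python
def srmsd_of_lambda(lam, a=0.564, b=0.619):
    # Sigmoid mapping the inflation factor lambda to SRMSD_p; satisfies
    # f(1) = 0 and f(1/lam) = -f(lam), approaching +/- a in the limits.
    x = lam ** b
    return a * (x - 1) / (x + 1)

value_at_threshold = srmsd_of_lambda(1.05)  # approx 0.0085
```

With the fitted $a=0.564$ and $b=0.619$, $\lambda =1.05$ maps to approximately 0.0085, matching the correspondence stated above.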
2.3 Type I error rate
The type I error rate is the proportion of null p-values with $p\le t$. Calibrated p-values have a type I error rate near $t$, which may be evaluated with a binomial test. This measure may give different results for different $t$, for example being significantly miscalibrated only for large $t$ (due to lack of power for smaller $t$), and it requires large simulations to estimate well as it depends on the tail of the distribution. In contrast, ${\text{SRMSD}}_{p}$ uses the entire distribution so it is easier to estimate: ${\text{SRMSD}}_{p}=0$ guarantees calibrated type I error rates at all $t$, while large ${\text{SRMSD}}_{p}$ indicates incorrect type I error rates for a range of $t$. Empirically, we find the expected agreement and monotonic relationship between ${\text{SRMSD}}_{p}$ and type I error rate (Figure 2—figure supplement 2).
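To make the relationship concrete, the sketch below estimates the type I error rate from simulated null p-values; calibrated (uniform) nulls give a rate near $t$ at every threshold, while anti-conservative p-values (here, squared uniforms, a hypothetical example of our own) exceed it:

```python
import random

def type_i_error_rate(null_pvalues, t):
    """Proportion of null p-values declared significant at threshold t."""
    return sum(p <= t for p in null_pvalues) / len(null_pvalues)

random.seed(7)
null_p = [random.random() for _ in range(100000)]  # calibrated (uniform) nulls
anticons_p = [p * p for p in null_p]               # anti-conservative nulls
```

For the uniform nulls the estimated rate tracks $t$ at every threshold, whereas the squared p-values have rate $\approx \sqrt{t}>t$, illustrating miscalibration across a range of thresholds.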
2.4 Statistical power and comparison to ${\text{AUC}}_{\text{PR}}$
Power is the probability that a test is declared significant when the alternative hypothesis ${H}_{1}$ holds. At a p-value threshold $t$, power equals

$$\text{Power}(t)=\mathrm{Pr}\left(p\le t\mid {H}_{1}\right)=F(t).$$
$F(t)$ is a cumulative distribution function, so it is monotonically increasing and has an inverse. Like type I error control, power may rank models differently depending on $t$, and it is also harder to estimate than ${\text{AUC}}_{\text{PR}}$ because power depends on the tail of the distribution.
Power is not meaningful when p-values are not calibrated. To establish a clear connection to ${\text{AUC}}_{\text{PR}}$, assume calibrated (uniform) null p-values: $\mathrm{Pr}\left(p\le t\mid {H}_{0}\right)=t$. TPs, FPs, and FNs at $t$ are

$$\text{TP}={\pi}_{1}F(t),\qquad \text{FP}={\pi}_{0}t,\qquad \text{FN}={\pi}_{1}\left(1-F(t)\right),$$
where ${\pi}_{0}=\mathrm{Pr}({H}_{0})$ is the proportion of null cases and ${\pi}_{1}=1-{\pi}_{0}$ of alternative cases. Therefore,

$$\text{Recall}=F(t),\qquad \text{Precision}=\frac{{\pi}_{1}F(t)}{{\pi}_{1}F(t)+{\pi}_{0}t}.$$
Noting that $t={F}^{-1}(\text{Recall})$, precision can be written as a function of recall, the power function, and constants:

$$\text{Precision}(\text{Recall})=\frac{{\pi}_{1}\,\text{Recall}}{{\pi}_{1}\,\text{Recall}+{\pi}_{0}{F}^{-1}(\text{Recall})}.$$
This last form leads most clearly to ${\text{AUC}}_{\text{PR}}={\int}_{0}^{1}\text{Precision}(\text{Recall})\,d\text{Recall}$.
Lastly, consider a simple yet common case in which model $A$ is uniformly more powerful than model $B$: ${F}_{A}(t)>{F}_{B}(t)$ for every $t$. Therefore ${F}_{A}^{-1}(\text{Recall})<{F}_{B}^{-1}(\text{Recall})$ for every recall value. This ensures that the precision of $A$ is greater than that of $B$ at every recall value, so ${\text{AUC}}_{\text{PR}}$ is greater for $A$ than for $B$. Thus, ${\text{AUC}}_{\text{PR}}$ ranks calibrated models according to power.
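This ranking argument can be checked numerically under a simple illustrative power function, $F(t)={t}^{a}$ with $a<1$ (our assumption for this sketch, not the paper's model): a uniformly more powerful model (smaller $a$) attains higher precision at every recall and hence a larger ${\text{AUC}}_{\text{PR}}$, while $a=1$ (no power beyond chance) recovers the random baseline ${\pi}_{1}$.

```python
def auc_pr(power_exponent, pi1=0.1, steps=100000):
    """Integrate Precision(Recall) for the power function F(t) = t**power_exponent.

    pi1 is the proportion of causal (alternative) loci; midpoint rule.
    """
    pi0 = 1 - pi1
    area = 0.0
    for k in range(steps):
        recall = (k + 0.5) / steps          # midpoint of each subinterval
        t = recall ** (1 / power_exponent)  # t = F^{-1}(recall)
        precision = pi1 * recall / (pi1 * recall + pi0 * t)
        area += precision / steps
    return area

auc_strong = auc_pr(0.2)  # model A: uniformly more powerful (F_A(t) > F_B(t))
auc_weak = auc_pr(0.5)    # model B
```

For $a=0.5$ the integral has the closed form $\mathrm{ln}(10)/9\approx 0.256$ (with ${\pi}_{1}=0.1$), which the numeric integration reproduces; the stronger model's area is larger, and both exceed the random baseline of 0.1.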
Empirically, we find the predicted positive correlation between ${\text{AUC}}_{\text{PR}}$ and calibrated power (Figure 2—figure supplement 3). The correlation is clear when considered separately per dataset, but the slope varies per dataset, which is expected because the proportion of alternative cases ${\pi}_{1}$ varies per dataset.
Data availability
The current manuscript is a computational study, so no data have been generated for this manuscript. Code is available at https://github.com/OchoaLab/pca-assoc-paper (copy archived at Ochoa, 2023).

International Genome Sample Resource, ID NYGC_GATK/. 1000 Genomes Project, high-coverage version.

Wellcome Sanger Institute, ID wgs.20190516/. Human Genome Diversity Panel, whole-genome sequencing version.
References

Fast model-based estimation of ancestry in unrelated individuals. Genome Research 19:1655–1664. https://doi.org/10.1101/gr.094052.109

Inference of distant genetic relations in humans using "1000 Genomes". Genome Biology and Evolution 7:481–492. https://doi.org/10.1093/gbe/evv003

Population structure and cryptic relatedness in genetic association studies. Statistical Science 24:451–471. https://doi.org/10.1214/09STS307

Inferring population structure in biobank-scale genomic data. American Journal of Human Genetics 109:727–737. https://doi.org/10.1016/j.ajhg.2022.02.015

Genetic diversity and association studies in US Hispanic/Latino populations: applications in the Hispanic Community Health Study/Study of Latinos. The American Journal of Human Genetics 98:165–184. https://doi.org/10.1016/j.ajhg.2015.12.001

Model-free estimation of recent genetic relatedness. The American Journal of Human Genetics 98:127–148. https://doi.org/10.1016/j.ajhg.2015.11.022

Genomic control for association studies. Biometrics 55:997–1004. https://doi.org/10.1111/j.0006341x.1999.00997.x

The International Genome Sample Resource (IGSR) collection of open human genomic variation resources. Nucleic Acids Research 48:D941–D947. https://doi.org/10.1093/nar/gkz836

Atlas of cryptic genetic relatedness among 1000 human genomes. Genome Biology and Evolution 8:777–790. https://doi.org/10.1093/gbe/evw034

Fast principal-component analysis reveals convergent evolution of ADH1B in Europe and East Asia. American Journal of Human Genetics 98:456–472. https://doi.org/10.1016/j.ajhg.2015.12.022

High level of inbreeding in final phase of 1000 Genomes Project. Scientific Reports 5:17453. https://doi.org/10.1038/srep17453

Scaling probabilistic models of genetic variation to millions of humans. Nature Genetics 48:1587–1590. https://doi.org/10.1038/ng.3710

BGData: a suite of R packages for genomic analysis with big data. G3: Genes, Genomes, Genetics 9:1377–1383. https://doi.org/10.1534/g3.119.400018

Book: On the bias in eigenvalues of sample covariance matrix. In: Wiberg M, Culpepper S, Janssen R, González J, Molenaar D, editors. Quantitative Psychology, Springer Proceedings in Mathematics & Statistics. Cham: Springer International Publishing. pp. 221–233. https://doi.org/10.1007/9783319772493_19

Prioritizing diversity in human genomics research. Nature Reviews Genetics 19:175–185. https://doi.org/10.1038/nrg.2017.89

Misuse of the term ‘trans-ethnic’ in genomics research. Nature Genetics 53:1520–1521. https://doi.org/10.1038/s41588021009526

Fast linear mixed models for genome-wide association studies. Nature Methods 8:833–835. https://doi.org/10.1038/nmeth.1681

Controlling population structure in human genetic association studies with samples of unrelated individuals. Statistics and Its Interface 4:317–326. https://doi.org/10.4310/sii.2011.v4.n3.a6

Mixed-model association for biobank-scale datasets. Nature Genetics 50:906–908. https://doi.org/10.1038/s4158801801446

Human demographic history impacts genetic risk prediction across diverse populations. American Journal of Human Genetics 100:635–649. https://doi.org/10.1016/j.ajhg.2017.03.004

Challenges in conducting genome-wide association studies in highly admixed multi-ethnic populations: the Generation R Study. European Journal of Epidemiology 30:317–330. https://doi.org/10.1007/s1065401599984

Software: Stokkum IHMv. nnls: the Lawson-Hanson algorithm for non-negative least squares (NNLS). The Comprehensive R Archive Network.

Software: pca-assoc-paper, version swh:1:rev:8549eafe6c27583894640e6cd8639232ed15cade. Software Heritage.

Extreme polygenicity of complex traits is explained by negative selection. American Journal of Human Genetics 105:456–476. https://doi.org/10.1016/j.ajhg.2019.07.003

Ancient admixture in human history. Genetics 192:1065–1093. https://doi.org/10.1534/genetics.112.145037

New approaches to population stratification in genome-wide association studies. Nature Reviews Genetics 11:459–463. https://doi.org/10.1038/nrg2813

Association mapping in structured populations. American Journal of Human Genetics 67:170–181. https://doi.org/10.1086/302959

Multi-ethnic genome-wide association study for atrial fibrillation. Nature Genetics 50:1225–1233. https://doi.org/10.1038/s4158801801339

Genetic structure of human populations. Science 298:2381–2385. https://doi.org/10.1126/science.1078311

Genome-wide association studies in diverse populations. Nature Reviews Genetics 11:356–366. https://doi.org/10.1038/nrg2760

On the number of siblings and p-th cousins in a large population sample. Journal of Mathematical Biology 77:1279–1298. https://doi.org/10.1007/s0028501812528

A statistical method for evaluating systematic relationships. University of Kansas Science Bulletin 38:1409–1438.

Testing for genetic associations in arbitrarily structured populations. Nature Genetics 47:550–554. https://doi.org/10.1038/ng.3244

Improved heritability estimation from genome-wide SNPs. American Journal of Human Genetics 91:1011–1021. https://doi.org/10.1016/j.ajhg.2012.10.010

The positive false discovery rate: a Bayesian interpretation and the q-value. The Annals of Statistics 31:2013–2035. https://doi.org/10.1214/aos/1074290335

Mixed models can correct for population structure for genomic regions under selection. Nature Reviews Genetics 14:300. https://doi.org/10.1038/nrg2813c1

Rapid variance components–based method for whole-genome association analysis. Nature Genetics 44:1166–1170. https://doi.org/10.1038/ng.2410

ROADTRIPS: case-control association testing with partially or completely unknown population and pedigree structure. American Journal of Human Genetics 86:172–184. https://doi.org/10.1016/j.ajhg.2010.01.001

The nature of confounding in genome-wide association studies. Nature Reviews Genetics 14:1–2. https://doi.org/10.1038/nrg3382

Trade-offs of linear mixed models in genome-wide association studies. Journal of Computational Biology 29:233–242. https://doi.org/10.1089/cmb.2021.0157

The genetical structure of populations. Annals of Eugenics 15:323–354. https://doi.org/10.1111/j.14691809.1949.tb02451.x

GCTA: a tool for genome-wide complex trait analysis. The American Journal of Human Genetics 88:76–82. https://doi.org/10.1016/j.ajhg.2010.11.011

Mixed linear model approach adapted for genome-wide association studies. Nature Genetics 42:355–360. https://doi.org/10.1038/ng.546

Eigenanalysis of SNP data with an identity by descent interpretation. Theoretical Population Biology 107:65–76. https://doi.org/10.1016/j.tpb.2015.09.004

Genome-wide efficient mixed-model analysis for association studies. Nature Genetics 44:821–824. https://doi.org/10.1038/ng.2310

Strong selection at MHC in Mexicans since admixture. PLOS Genetics 12:e1005847. https://doi.org/10.1371/journal.pgen.1005847
Decision letter

Magnus Nordborg, Reviewing Editor; Gregor Mendel Institute, Austria

Detlef Weigel, Senior Editor; Max Planck Institute for Biology Tübingen, Germany

Magnus Nordborg, Reviewer; Gregor Mendel Institute, Austria
Our editorial process produces two outputs: (i) public reviews designed to be posted alongside the preprint for the benefit of readers; (ii) feedback on the manuscript for the authors, including requests for revisions, shown below. We also include an acceptance summary that explains what the editors found interesting or important about the work.
Decision letter after peer review:
Thank you for submitting your article "Limitations of principal components in quantitative genetic association models for human studies" for consideration by eLife. Your article has been reviewed by 4 peer reviewers, including Magnus Nordborg as the Reviewing Editor and Reviewer #1, and the evaluation has been overseen by Detlef Weigel as the Senior Editor.
The reviewers have discussed their reviews with one another, and the Reviewing Editor has drafted this to help you prepare a revised submission. Individual reviews are also included as they are generally helpful. As you will see, there is strong support for your work, but it is clear that improvements are in order to justify your general claims.
Essential revisions:
1) You must at least discuss that the scenarios you test are in a sense as unrealistic as the (much more limited) simulations previously run, both in terms of sample size and population structure. Actual human GWAS are run "within populations" to minimize environmental confounding (more on this below) as well as long-range LD, and involve much larger sample sizes. Does this affect your conclusions? Related to this, the historical context for several methods you cite is not given.
2) Practical considerations for why PCs were/are used are not discussed either: unbalanced designs, meta-studies, sample sizes.
3) You are not discussing environmental confounding, which is not only likely more of a problem in human GWAS than any of the other things discussed here, but could potentially justify a combination of LMM and PCs. Selection could have a similar effect, as discussed in the original LMM paper by Yu et al. (2006). Ideally, this should be simulated, but it certainly must be discussed, and the conclusions must take this into account.
4) You are similarly not addressing the issue of rare alleles and heterogeneity, which is a major preoccupation in human genetics. How does genetic architecture affect your conclusions? If you could address this, it would be a very nice (and timely) contribution (whereas the manuscript as currently written feels like it's fighting a 10-year-old battle).
5) Finally, the description of the theory is often hard to follow, relies heavily on a particular model (ancient allele frequencies), and appears to be wrong in a few specific cases (see comments below). The results table is likewise inscrutable, and it is difficult to relate your simulations to reallife scenarios.
Reviewer #1 (Recommendations for the authors):
At the very least you need to comment on the practicability and relevance issues noted above.
I would also recommend adjusting the writing to make your (otherwise lucid) review of the theory more accessible to a broader audience. Not only do you assume considerable knowledge of mathematical statistics, but you are also deeply rooted in a classical quantitative genetics framework. Terms like "inbreeding edges" are likely to befuddle all but a tiny fraction of humanity!
Reviewer #2 (Recommendations for the authors):
1. Environmental effects are known to correlate with population structure. A silly example of this is chopstick skills, which obviously correlate with broad levels of East Asian ancestry, e.g. in a global population sample it would likely correlate with the first two top PCs, and additional PCs would not explain more. Hence, for such a trait the generative model could really just be the top PCs, in which case including top PCs as covariates would be optimal. An LMM would in this example probably try to account for a much more complicated structure in addition to the top PCs. Hence, in summary, I would appreciate some simulations where you have environmental contributions that are perhaps strongly correlated with individual PCs, as I believe such scenarios may exist in real data analyses.
2. I really appreciate the detailed simulations in this study, but perhaps it would make sense to simulate traits given UKB genotypes? Also, and perhaps more importantly, how about also evaluating the two approaches on UKB data, to see if you really find more hits using LMMs? (similar to Loh et al., NG 2018)
3. You note that PCs are the top eigenvectors of the kinship matrix. This is usually true, but not always, as when deriving PCs one should ideally apply some LD adjustment to avoid PCs capturing long-range LD regions, which can otherwise reduce power to detect variants in long-range regions. See e.g. Patterson et al. (PLoS Genet 2006) or Privé et al. (Bioinformatics 2020).
4. Following up on the previous comment, I wonder if the conclusions in this paper change if PCs are derived the way they often are, i.e. excluding long-range LD regions and/or using some LD adjustments.
5. Some of the LMMs cited are not classical LMMs, in the sense that they do not assume an infinitesimal genetic architecture. E.g. BOLT-LMM (Loh et al., Nat Genet 2015) assumes a mixture of two Gaussians as a prior for the effects, which they show can improve power further. Mbatchou et al. (Nat Genet 2021) also use blockwise ridge regression (infinitesimal model), which is effectively a more flexible prior. Using these, the relationship between the PCs and the model becomes yet more complicated. Similarly, the GCTA mixed model GWAS uses a LOCO approach to improve power, which I believe means that a slightly different kinship is used for each chromosome.
6. Your derivation assumes that there are some ancestral allele frequencies underlying the true model, but that seems perhaps unnecessary to me because these allele frequencies only represent the "correct" weights for the variants, and we know that they are probably wrong anyway. Indeed alternative and likely better weightings can be used, see Speed and Balding (AJHG 2012, NGR 2015, NG 2019). If you instead assume the sample frequencies are the right frequencies, then all the math becomes simple by standardizing the variants.
7. Another publication that also provides some similar theory on the relationship between PCA and mixed models is Janss et al., (Genetics 2012). (Interestingly, although their work is very nice their PCs did however have major longrange LD issues.)
8. The authors mention that LMMs are not well suited for unbalanced samples, e.g. where the case-control ratio in the sample is <5%, and cite Zhou et al., Nat Genet 2018 as a solution. However, before Zhou et al. there were no computationally efficient generalized mixed models capable of being applied to UKB sample sizes, which could help explain why LMMs haven't been adopted as quickly as one might have expected.
9. I believe there is some concern regarding meta-analyzing LMM GWAS summary statistics, e.g. whether one should use the actual sample sizes or the effective sample sizes (which I believe is the best approach). I believe this is probably the main reason why LMMs are not as widely used as we would like. I would appreciate some discussion or reflection on this.
Reviewer #3 (Recommendations for the authors):
The presentation of the results in Table 3 is nearly incomprehensible.
The authors are incorrect about no other troublesome cases on line 48. The differential variance between populations can induce bias (see papers from Xihong Lin).
The authors' discussion of the benefits of including sample PCs in LMM (e.g. lines 468–472 on p. 28) is misinformed. In particular, it is a critical step for methods like BOLT.
The authors say "lowdimensional" when they mean "lowrank".
The Sul and Eskin paper does not result in a tie. It states that creating a second kinship matrix from the PCs can address the issue presented in Price et al. entirely with LMMs.
It would also be interesting to know to what extent the singular value distribution matters versus the rank? The authors have some discussion of this in 5b but it is not well developed.
Reviewer #4 (Recommendations for the authors):
I have the following comments for consideration:
Historical context: The overall reasoning for controlling population stratification and relative kinship for different types of populations was given in [26]. It would be helpful to revise the introduction to focus more on the original research papers and their reasoning: [5] (PCA), [9] (Q), [26] (LMM, Q in MM), and [16] (PC in LMM and actual association study). The later studies with LMM from the groups behind GCTA should be mentioned as the (emerging/emerged) trend, in addition to those comparison studies.
A couple of relevant studies may be examined and discussed, in terms of different types of populations, determining the number of PCs to be included in the LMM, model comparison, and model fitting (redundancy). https://doi.org/10.1534/genetics.108.098863 https://doi.org/10.1038/hdy.2010.11
One question is the justification for examining a single heritability of 0.8. This level is generally regarded as very high. It would be more convincing to see the results when h2 = 0.3–0.6, particularly if this is set for a large population with different subgroups.
The choice of m1=n/10 is not adequately justified. Even though we typically assume that there are many loci and the detection power increases with larger sample sizes, the actual detected numbers of loci in empirical studies are lower than that. With the current set m1 (Table 2), what were the power values, and are they close to the empirical studies?
"The largest limitation of our work is that we only considered quantitative traits". This may be expanded to include that with the overall simulation scheme, you assume the causal variants are functioning across different groups of a large population (2.2.5 Trait Simulation)? Real data with measured traits may be different. This can also be different for different types of complex diseases. Some clear context information should be given at the beginning too.
Among these simulated population and real data, have you examined whether a small number of PCs are indeed significant to explain some trait variation? I think this was the original thinking of applying PCs to correct the population stratification.
L46 and L487. Assume the low-dimensional part of the relatedness matters in terms of trait differences among groups, which need to be adjusted.
L5357. This is not an accurate summary of the relevant papers. Earlier papers set up the general LMM framework with different components that may be included or included and individual components that can have varied forms or reported the first actual association study with the LMM framework. These two are earlier papers that (re)introduce LMM to the association studies, and markers were used for kinship, structure, and PCA.
L6470. Redundancy is not an issue since the objective is to control false positives using two types of covariates.
L166. Large "and"(?) Family.
L282. Theoretical and empirical evidence of these two needs to be provided.
Table 3. Not clear about the asterisk.
https://doi.org/10.7554/eLife.79238.sa1
Author response
Essential revisions:
1) You must at least discuss that the scenarios you test are in a sense as unrealistic as the (much more limited) simulations previously run, both in terms of sample size and population structure. Actual human GWAS are run "within populations" to minimize environmental confounding (more on this below) as well as long-range LD, and involve much larger sample sizes. Does this affect your conclusions? Related to this, the historical context for several methods you cite is not given.
We added to our discussion comments on sample size, which is smaller than in many of the largest modern studies. However, we do not expect our key conclusion to change for larger sample sizes, since cryptic family relatedness is not only present there but is expected to be even more abundant than at smaller sample sizes.
Many modern GWAS are multiethnic or include admixed individuals, so this is not an artificial setting that has no bearing on real analyses. We now cite 21 recent papers on this topic to back up this assertion. We focused on these more extreme cases as this is where challenges to PCA and LMM performance are most expected. We agree that environmental confounding is an important issue for multiethnic studies, and have now included numerous simulations with considerable environment effects.
Our simulated genotypes do not have LD, and our real data was pruned to exclude short-range LD in order to simplify evaluations. The reviewers point out several variants of PC and kinship estimation that can improve handling of LD, and we now incorporate those in our discussion. We agree LD should be handled with appropriate estimators when present, but overall we do not think the presence of LD changes our key conclusion, namely that LMM outperforms PCA and that this is mostly due to the presence of cryptic family relatedness.
For clarity, we opted to focus on describing the most common modern versions of PCA and LMM which are tested, as opposed to numerous older variants we did not test here, which modeled population and family structure in somewhat different ways. We provide historical context strictly as needed to avoid confusion and keep our already lengthy work as brief as possible.
2) Practical considerations for why PCs were/are used are not discussed either: unbalanced designs, meta-studies, sample sizes.
We added a discussion paragraph on sample sizes and related questions. Indeed, PCA scales better with sample size than classic LMMs. Many recent LMMs now scale better too, although some do so by changing the inference model somewhat (by taking new shortcuts, for example), so those new models deserve further study. However, cryptic family relatedness is expected to become more abundant at larger sample sizes, so PCA should perform worse there.
We also briefly discuss unbalanced case-control designs. However, quantitative trait studies, the sole focus of this work, do not typically exhibit unbalanced designs, so this interesting question is out of scope.
Meta-analysis does not give either PCA or LMM an advantage, as it can be applied to summary statistics from either model or even a combination of the two.
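To illustrate why the association model used per cohort does not matter here, a minimal sketch (numpy only; all numbers hypothetical) of fixed-effects inverse-variance-weighted meta-analysis, which operates purely on summary statistics:

```python
import numpy as np

def ivw_meta(betas, ses):
    """Fixed-effects inverse-variance-weighted meta-analysis: pool
    per-cohort effect estimates using precision weights.  Works the
    same whether each cohort's summary statistics came from a PCA or
    an LMM association model."""
    betas = np.asarray(betas, dtype=float)
    w = 1.0 / np.asarray(ses, dtype=float) ** 2  # inverse-variance weights
    beta_meta = np.sum(w * betas) / np.sum(w)
    se_meta = np.sqrt(1.0 / np.sum(w))
    return beta_meta, se_meta

# two hypothetical cohorts, e.g. one analyzed with PCA, one with an LMM
beta_meta, se_meta = ivw_meta([0.10, 0.20], [0.05, 0.05])
```

With equal standard errors the pooled effect is simply the average, and the pooled standard error shrinks as cohorts are added.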
3) You are not discussing environmental confounding, which is not only likely more of a problem in human GWAS than any of the other things discussed here, but could potentially justify a combination of LMM and PCs. Selection could have a similar effect, as discussed in the original LMM paper by Yu et al. (2006). Ideally, this should be simulated, but it certainly must be discussed, and the conclusions must take this into account.
We added numerous simulations with large environment effects. Although an LMM with PCs can occasionally significantly outperform an LMM without PCs, in these cases the difference in performance tended to be small, and including environment variables as fixed effects in our regression improves performance much more than using PCs. Therefore, we are hesitant to recommend always including PCs in an LMM since this tends to be a poor solution to the problem of environmental confounding, and more explicit modeling of environment effects is instead recommended.
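To make the point concrete, here is a toy sketch (simulated data; group sizes, frequencies, and effect sizes are all hypothetical) of how a categorical environment label enters an association test as a fixed-effect covariate, removing confounding that PCs are often asked to absorb:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
group = rng.integers(0, 2, size=n)   # hypothetical environment/ethnicity label
# allele frequency differs by group, so genotype is confounded with group
geno = rng.binomial(2, np.where(group == 1, 0.7, 0.3)).astype(float)
# trait has a pure environment (group) effect and NO genotype effect
trait = 2.0 * group + rng.normal(size=n)

def ols(X, y):
    """Ordinary least-squares coefficients for design matrix X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

ones = np.ones(n)
# naive test: genotype coefficient absorbs the environment effect (false positive)
b_naive = ols(np.column_stack([ones, geno]), trait)[1]
# adjusted test: the group label enters as a fixed-effect covariate
b_adj = ols(np.column_stack([ones, group, geno]), trait)[2]
```

Here `b_naive` is far from zero despite the genotype having no causal effect, while `b_adj` is near zero once the label is modeled directly.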
4) You are similarly not addressing the issue of rare alleles and heterogeneity, which is a major preoccupation in human genetics. How does genetic architecture affect your conclusions? If you could address this, it would be a very nice (and timely) contribution (whereas the manuscript as currently written feels like it's fighting a 10-year-old battle).
We now include rare variants in our discussion. In our evaluations we had to ignore rare variants because at our small sample sizes we have no power to detect association at these variants with low counts (and p-values were often miscalibrated in those cases, a well-known issue). However, we agree that modeling rare variants is important in biobank-scale analyses, and their covariance structure is different than for common variants (Zaidi and Mathieson, 2020), which raises very interesting challenges we hope to tackle in future work.
By heterogeneity, one reviewer clarifies below to mean effect size heterogeneity across populations, which is the focus of much recent work. However, this question is not relevant to our LMM vs PCA comparisons, as neither of these basic models is equipped to model effect size heterogeneity, neither is expected a priori to perform better than the other, and both models can be extended to model ancestry-specific effects in the way it is done in TRACTOR, for example.
5) Finally, the description of the theory is often hard to follow, relies heavily on a particular model (ancient allele frequencies), and appears to be wrong in a few specific cases (see comments below). The results table is likewise inscrutable, and it is difficult to relate your simulations to reallife scenarios.
We updated the descriptions of our theory as suggested in the more specific comments below, and there we also address the particulars of our model. We improved Table 3 with the main statistical evaluations.
Reviewer #1 (Recommendations for the authors):
At the very least you need to comment on the practicability and relevance issues noted above.
I would also recommend adjusting the writing to make your (otherwise lucid) review of the theory more accessible to a broader audience. Not only do you assume considerable knowledge of mathematical statistics, but you are also deeply rooted in a classical quantitative genetics framework. Terms like "inbreeding edges" are likely to befuddle all but a tiny fraction of humanity!
We clarified the concepts highlighted and additionally edited based on feedback from additional members of our lab who reviewed the revised manuscript.
Reviewer #2 (Recommendations for the authors):
1. Environmental effects are known to correlate with population structure. A silly example of this is chopstick skills, which obviously correlate with broad levels of East Asian ancestry, e.g. in a global population sample it would likely correlate with the first two top PCs, and additional PCs would not explain more. Hence, for such a trait the generative model could really just be the top PCs, in which case including top PCs as covariates would be optimal. An LMM would in this example probably try to account for a much more complicated structure in addition to the top PCs. Hence, in summary, I would appreciate some simulations where you have environmental contributions that are perhaps strongly correlated with individual PCs, as I believe such scenarios may exist in real data analyses.
We added numerous simulations with random environment effects generated from the population labels given by the three real datasets (and with labels that group individuals by geographic coordinates for the simulated datasets). These environment effects are strongly correlated with ancestry and are also low-dimensional, which could give PCA an advantage based on those properties, as you and many others expect. By necessity, the environment simulations have a lower heritability, to accommodate large environment effects. Thus, for comparison, we also simulated numerous low heritability traits without environment, whose results are comparable to the original high heritability simulations. Surprisingly, we find that LMMs fit environment effects nearly as well as PCA, since both increase their AUCs substantially compared to the low heritability simulations (Figure 8 Figure supp 1). We do find that with environment, occasionally, an LMM with PCs slightly but significantly outperforms an LMM without PCs, and in other cases PCA similarly outperforms LMM. But we do want to emphasize that the effect is often not there, and that in general LMM without PCs performs nearly as well as the best competing method, and significant differences are invariably small. Lastly, we included an LMM with the true group labels as fixed covariates, and in most cases that was the best method and its AUCs were greater by considerable amounts. Overall, we do see this potential advantage to including PCs, but it is in no way a reliable solution, and it is no replacement for more direct modeling of environment effects.
2. I really appreciate the detailed simulations in this study, but perhaps it would make sense to simulate traits given UKB genotypes? Also, and perhaps more importantly, how about also evaluating the two approaches on UKB data, to see if you really find more hits using LMMs? (similar to Loh et al., NG 2018)
We are interested in this question of larger datasets, but it is out of scope of the current work. One immediate concern is that the LMM used in our paper (GCTA) cannot be used in these larger datasets. GCTA is in a sense a classic LMM that in our evaluations performs as well as older methods (EMMAX, GEMMA) but is faster than those, so we have no concerns that either of those would have similar power if we were able to run them on UKB (but this is not possible). In contrast, the scalable LMM approaches applicable to the UKB (BOLT-LMM, SAIGE, REGENIE, and others) achieve scalability by making additional approximations that older LMMs do not use, which worry us as they may potentially reduce power and could confound the results. BOLT-LMM in particular has performed poorly in our hands in simulations with n = 1000 individuals (unpublished results), so this is not a merely hypothetical concern. A fair and complete evaluation that considers all of these variables is an entire line of work separate from the current evaluation, but it is definitely the next step.
That said, we believe the main conclusion will not change on UKB or other large datasets. In particular, we expect cryptic family relatedness to remain a serious problem, and perhaps have a larger impact as the abundance of distant relatives only increases with sample size.
3. You note that PCs are the top eigenvectors of the kinship matrix. This is usually true, but not always, as when deriving PCs one should ideally apply some LD adjustment to avoid PCs capturing long-range LD regions, which can otherwise reduce power to detect variants in long-range LD regions. See e.g. Patterson et al. (PLoS Genet 2006) or Privé et al. (Bioinformatics 2020).
4. Following up on the previous comment, I wonder if the conclusions in this paper change if PCs are derived the way they often are, i.e. excluding long-range LD regions and/or using some LD adjustments.
We incorporated these variants of the PCA approach into our discussion, and in particular added the following conclusion: “These improved PCs remain inadequate models of family relatedness, so an LMM will continue to outperform them in that setting.”
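For concreteness, a minimal numpy sketch of the unadjusted construction the reviewer refers to (PCs as the top eigenvectors of a kinship matrix), using a toy kinship matrix and no LD adjustment:

```python
import numpy as np

def pcs_from_kinship(K, r):
    """Return the top-r eigenvectors (the PCs, up to scaling) and
    eigenvalues of a symmetric kinship matrix K."""
    evals, evecs = np.linalg.eigh(K)      # eigh returns ascending eigenvalues
    order = np.argsort(evals)[::-1][:r]   # take the largest r
    return evecs[:, order], evals[order]

# toy kinship matrix: two families of two relatives each
K = np.array([
    [1.0, 0.5, 0.0, 0.0],
    [0.5, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.5],
    [0.0, 0.0, 0.5, 1.0],
])
pcs, evals_top = pcs_from_kinship(K, 2)
```

The LD-aware variants cited above modify how K (or the PCs) are estimated from genotypes, but the family block structure in this toy K is exactly what no small number of PCs can capture.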
5. Some of the LMMs cited are not classical LMMs, in the sense that they do not assume an infinitesimal genetic architecture. E.g. BOLT-LMM (Loh et al., Nat Genet 2015) assumes a mixture of two Gaussians as a prior for the effects, which they show can improve power further. Mbatchou et al. (Nat Genet 2021) also use blockwise ridge regression (infinitesimal model), which is effectively a more flexible prior. Using these, the relationship between the PCs and the model becomes yet more complicated. Similarly, the GCTA mixed model GWAS uses a LOCO approach to improve power, which I believe means that a slightly different kinship is used for each chromosome.
We incorporated the LOCO approach into our discussion, and added the following conclusion: “While LOCO kinship estimates vary for each chromosome, they continue to model family relatedness, thus maintaining their key advantage over PCA.”
We also incorporated BOLT-LMM and REGENIE into a different part of our discussion, and agreed with your conclusion: “Many of these newer LMMs also no longer follow the infinitesimal model of the basic LMM (Loh et al., 2015; Mbatchou et al., 2021), and employ novel approximations, which are features not evaluated in this work and worthy of future study.”
6. Your derivation assumes that there are some ancestral allele frequencies underlying the true model, but that seems perhaps unnecessary to me because these allele frequencies only represent the "correct" weights for the variants, and we know that they are probably wrong anyway. Indeed alternative and likely better weightings can be used, see Speed and Balding (AJHG 2012, NRG 2015, NG 2019). If you instead assume the sample frequencies are the right frequencies, then all the math becomes simple by standardizing the variants.
Our methods contained early on the following explanation: “However, this kinship estimator has a complex bias that differs for every individual pair, which arises due to the use of this estimated p̂_{i}^{T} (Ochoa and Storey, 2021, 2019). Nevertheless, in PCA and LMM these biased estimates perform as well as unbiased ones (Hou and Ochoa, 2023).”
To further explain, our recent work (Ochoa and Storey, 2021) has shown that sample allele frequencies are not the right frequencies, although that is a pervasive assumption, and in particular kinship estimates are biased when making this precise assumption. In that work we developed a new estimator, popkin, that is unbiased, and which overcomes the problem that the true ancestral allele frequencies are unknown by estimating kinship without having to estimate such allele frequencies at all. We use popkin to visualize population structure in Figure 1 and to calculate eigenvalues later. However, our follow-up work found that PCA and LMM work just fine with standard biased kinship estimates, because the intercept compensates (in a complicated way) for the bias in PCs or random effects (Hou and Ochoa, 2023). Nevertheless, it is very important to distinguish between true and estimated allele frequencies, as biases creep in when such distinctions are ignored, regardless of whether this is ultimately the correct model or not (no model is completely correct, of course).
Since Speed and Balding (2012) is relevant to a new related discussion about PCA and LMM variants, all of which attempt to improve LD modeling, we added this paper to that discussion as well.
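As a concrete reference point, a simplified sketch of the standard estimator under discussion, which standardizes genotypes by their sample allele frequencies (shown on the kinship scale where a non-inbred individual's self-kinship is 1/2; the function and variable names are ours, not from any package):

```python
import numpy as np

def kinship_estimate(X):
    """Standard kinship estimator: standardize each locus by its
    SAMPLE allele frequency and average across loci.  Using sample
    rather than true ancestral frequencies is the source of the
    per-pair bias discussed above."""
    p_hat = X.mean(axis=0) / 2.0                  # sample allele frequencies
    keep = (p_hat > 0.0) & (p_hat < 1.0)          # drop fixed loci
    Z = (X[:, keep] - 2.0 * p_hat[keep]) / np.sqrt(
        4.0 * p_hat[keep] * (1.0 - p_hat[keep]))
    return Z @ Z.T / keep.sum()

rng = np.random.default_rng(0)
# unrelated individuals drawn from a single panmictic population
X = rng.binomial(2, 0.5, size=(50, 5000)).astype(float)
K = kinship_estimate(X)
```

Because each locus is centered by its own sample mean, the rows of the estimate sum to zero exactly, which forces the off-diagonal entries to be slightly negative even for truly unrelated individuals; this is one visible face of the bias that popkin avoids.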
7. Another publication that also provides some similar theory on the relationship between PCA and mixed models is Janss et al. (Genetics 2012). (Interestingly, although their work is very nice, their PCs did however have major long-range LD issues.)
We incorporated Janss et al. (2012) into our paper, cited in the introduction, methods, and discussion.
8. The authors mention that LMMs are not well suited for unbalanced samples, e.g. where the case-control ratio in the sample is <5%, and cite Zhou et al., Nat Genet 2018 as a solution. However, before Zhou et al. there were no computationally efficient generalized mixed models capable of being applied to UKB sample sizes, which could help explain why LMMs haven't been adopted as quickly as one might have expected.
We agree, and added a discussion paragraph concerning the sample size limitations of the present work that echoes these points.
9. I believe there is some concern regarding meta-analyzing LMM GWAS summary statistics, e.g. whether one should use the actual sample sizes or the effective sample sizes (which I believe is the best approach). I believe this is probably the main reason why LMMs are not as widely used as we would like. I would appreciate some discussion or reflection on this.
We agree these are important questions to be analyzed in future work, but metaanalysis is a very different kind of model that falls outside the scope of this present work.
Reviewer #3 (Recommendations for the authors):
The presentation of the results in Table 3 is nearly incomprehensible.
We modified Table 3 considerably for greater clarity. We simplified the statistical tests to center on the hypothesis that LMM with no PCs is better than all other cases (before, we instead sought to identify smaller numbers of PCs that performed the same, statistically, as the best number of PCs, but that was both confusing and did not directly address our central hypothesis). For additional clarity we now include all p-values. We now use a Bonferroni threshold to declare significance (the original used a fixed p-value threshold).
The authors are incorrect about no other troublesome cases on line 48. The differential variance between populations can induce bias (see papers from Xihong Lin).
We apologize, we were unable to identify the precise papers from Xihong Lin you are referring to, using your keywords of PCA and differential variance between populations. However, we stand by our comment, in that numerous papers have proposed various explanations for poor PCA performance that have not stood the test of time, and family relatedness is the only explanation that recurs and that numerous papers/authors agree upon.
The authors' discussion of the benefits of including sample PCs in LMM (e.g. lines 468-472 on p. 28) is misinformed. In particular, it is a critical step for methods like BOLT.
We added the following to the discussion (it appears a few paragraphs after the lines cited above): “A different benefit of including PCs was recently reported for BOLT-LMM, which does not result in greater power but rather in reduced runtime, a property that may be specific to its use of scalable algorithms such as conjugate gradient and variational Bayes (Loh et al., 2018).”
To further clarify, the BOLT-LMM authors make no claim that there are gains in performance (which in our paper refers exclusively to calibration and power) due to inclusion of PCs; it is solely a runtime advantage. Further, this appears to be a property of the specific algorithm used by BOLT-LMM, and not a general property of LMMs. Thus, our original statement stands.
The authors say "low-dimensional" when they mean "low-rank".
Low-rank is more correct mathematically, so we included a clarification in the first mention of “low-dimensional” relatedness. We also replaced some mentions of dimension with “matrix rank” when that is more precise, and in other cases used the phrase “model dimension” when that was appropriate. We also entirely replaced the vague term “dimensionality” with more precise terms, which varied depending on context. However, we thought “low-rank” may be too technical to use exclusively, so we did not fully excise the more intuitive “low-dimensional” expression from our work.
The Sul and Eskin paper does not result in a tie. It states that creating a second kinship matrix from the PCs can address the issue presented in Price et al. entirely with LMMs.
You are correct. However, we sought to indicate whether a standard LMM (with a single, standard kinship matrix) outperformed PCA, and in that regard it was a tie. LMMs with multiple kinship matrices are not commonly used and are not the subject of our investigation.
It would also be interesting to know to what extent the singular value distribution matters versus the rank? The authors have some discussion of this in 5b but it is not well developed.
We indeed consider this in what is now Figure 6 Figure supp 1, which shows the cumulative distribution of eigenvalues. However, that figure shows that none of the datasets stand out in having exceptional distributions except for the admixed family simulation. Since these distributions do not clearly separate the cases where PCA performed poorly from the rest, we decided that a brief mention of this analysis in the main text was enough to state that it was done but it was inconclusive. We think that spending additional time on this tangent is distracting from the stronger results we obtained from looking at the local kinship distributions.
Reviewer #4 (Recommendations for the authors):
I have the following comments for consideration:
Historical context: The overall reasoning for controlling population stratification and relative kinship for different types of populations was given in [26]. It would be helpful to revise the introduction part to focus more on the original research papers and their reasonings: [5] (PCA), [9] (Q), [26] (LMM, Q in MM), and [16] (PC in LMM and actual association study). The later studies with LMM from the groups behind GCTA should be mentioned as the (emerging/emerged) trend, in addition to those comparison studies.
For reference, these are the citations (as numbers may have changed in revision):
[5]: Price et al., (2006): PCA
[9]: Pritchard et al., (2000): Q
[16]: Zhao et al., (2007): PC in LMM and actual association
[26]: Yu et al., (2006): LMM, Q in MM
Earlier draft versions of our introduction presented the historical context in more detail, similarly to what you are suggesting, but after much consideration we converged on the current version, which we believe is appropriately more focused on the current approaches that were tested. In particular, our paper does not test the Q association model directly, and only tests one newer LMM that uses the standard kinship estimator (unlike the above early LMMs), so we thought spending more time explaining previous approaches that we are not testing could be distracting and confusing to readers. We thought the text flowed better the way it is written, not only primarily describing current approaches, but also describing PCA and LMM papers separately rather than chronologically (the timelines overlap considerably). We chose to focus on explaining the methods and their properties clearly instead. We do cover the conclusions of those early papers and highlight the differences between those and current approaches, but we do so strictly as needed (not chronologically) and as concisely as possible for clarity.
A couple of relevant research papers may be examined and discussed, in terms of different types of populations, determining the number of PCs to be included in the LMM, model comparison, and model fitting (redundancy). https://doi.org/10.1534/genetics.108.098863 https://doi.org/10.1038/hdy.2010.11
Zhu and Yu, 2009: Thank you for making us aware of this excellent evaluation, we included it in Table 1, and cited it in numerous locations in the introduction and discussion.
Sun et al., 2010: We now cite this paper in the discussion: “Additionally, PCA and LMM goodness of fit could be compared using the coefficient of determination generalized for LMMs (Sun et al., 2010).” Although we would have liked to try calculating these R_{LR}^{2} in all of our simulations, for the baseline PCA and LMM models without loci, it comes at a considerable computational expense for LMMs, especially for the largest datasets, so we decided not to pursue it for this work. Furthermore, as variance components are estimated in GCTA using REML, we are also unsure if those estimates are appropriate for use in R_{LR}^{2} (technically estimates must be ML, not REML).
One question is the justification for examining a single heritability of 0.8. This level is generally regarded as very high. It would be more convincing to see the results when h^2 = 0.3-0.6, particularly if this is set for a large population with different subgroups.
We repeated most of our simulations with a heritability of 0.3, and all of our conclusions hold.
The choice of m_{1} = n/10 is not adequately justified. Even though we typically assume that there are many loci and the detection power increases with larger sample sizes, the actual detected numbers of loci in empirical studies are lower than that. With the currently set m_{1} (Table 2), what were the power values, and are they close to those of the empirical studies?
Since we now consider varying heritability, we extended our heuristic formula to “m_{1} = round(nh^{2}/8), which empirically balances power well with varying n and h^{2}.” Note that for the original heritability of h^{2} = 0.8 the above formula reduces to the original one. This heuristic formula aims to maintain similar power (or AUCs) across studies with different parameters, and it works well enough empirically, though it is not exact and we did observe that AUCs vary somewhat between studies. More parameters surely matter for determining AUC.
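The heuristic is trivial to state in code; a sketch (the function name is ours):

```python
def n_causal_loci(n, h2):
    """Heuristic number of causal loci, m1 = round(n * h2 / 8),
    which reduces to n/10 at the original heritability h2 = 0.8."""
    return round(n * h2 / 8)

m1_orig = n_causal_loci(1000, 0.8)  # original setting, equals n/10
m1_low = n_causal_loci(1000, 0.3)   # low-heritability setting
```

Note that Python's `round` rounds half-integers to the nearest even integer, which is immaterial at these scales.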
We do not recover most of the causal loci we are simulating, reflected in the fact that AUCs are near 0.1 or 0.2 (this is, roughly speaking, close to the proportion of causal loci we are recovering confidently, which are clearly separated from the noise). We added a new figure that compares AUC to power, and find rough agreement between AUC and calibrated power calculated at an empirical type I error rate of 1e-4 (Figure 2 Figure supp 3). Therefore, our power at that level is roughly 0.1 to 0.2. In contrast, other empirical studies try to select parameters so that power is near 50%, although the p-value thresholds at which power is measured vary considerably in those previous studies. We think our simulation is more realistic in the sense that most causal loci are not recovered by GWAS, but their presence is important as they add to the background polygenic effect, so it makes sense to model a large number of causal loci even if we can only hope to recover a small proportion of them.
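For clarity, a sketch of calibrated power at an empirical type I error rate (simulated p-values; the real evaluation uses GWAS statistics, not this toy generator):

```python
import numpy as np

def calibrated_power(p_null, p_causal, alpha=1e-4):
    """Power at an EMPIRICAL type I error rate alpha: the significance
    threshold is the alpha-quantile of the null p-values, so a
    miscalibrated test cannot inflate the power estimate."""
    threshold = np.quantile(p_null, alpha)
    return float(np.mean(p_causal <= threshold))

rng = np.random.default_rng(0)
p_null = rng.uniform(size=100_000)        # calibrated null p-values
p_causal = rng.uniform(size=1_000) ** 20  # causal p-values skewed toward zero
power = calibrated_power(p_null, p_causal)
```

Because the threshold comes from the null distribution itself, inflation or deflation of the test statistics shifts both sets of p-values together and cancels out of the power estimate.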
"The largest limitation of our work is that we only considered quantitative traits". This may be expanded to include that with the overall simulation scheme, you assume the causal variants are functioning across different groups of a large population (2.2.5 Trait Simulation)? Real data with measured traits may be different. This can also be different for different types of complex diseases. Some clear context information should be given at the beginning too.
By this we assume you mean that we did not simulate ancestry-specific effect size heterogeneity (in other words, our coefficients are the same for all individuals, of all ancestries). Although some work has suggested that effect sizes can vary across ancestries, other recent work shows that this is not common (K. Hou et al., 2023). However, this is not a limitation in the context of our paper, as neither the basic PCA nor the basic LMM is equipped to handle effect size heterogeneity, so adding such details to our simulation is not expected to benefit either approach. There are more recent approaches that do model effect size heterogeneity, such as TRACTOR, but they are not the focus of this evaluation. Additionally, both PCA and LMM can, in principle, be extended to model effect heterogeneity in exactly the same way as TRACTOR (which is a fixed effects model most like PCA).
Among these simulated population and real data, have you examined whether a small number of PCs are indeed significant to explain some trait variation? I think this was the original thinking of applying PCs to correct the population stratification.
We now performed these tests (Figure 6 Figure supp 2) and they always yielded small numbers of PCs that were significant: minimum 1 observed for the small sample size admixture simulation, most other cases had several significant PCs, maximum below 20 observed for the Admixed Family simulation and for Human Origins (both real and the tree simulated version). Overall they were roughly consistent with the numbers of PCs needed for the PCA method to maximize its performance in each of these datasets.
We introduced the following text in the section where we discuss the latent dimension of datasets and whether they could be used to predict when PCA would perform poorly (they don’t): “We also calculated the number of PCs that are significantly associated with the trait, and observed similar results, namely that while the family simulation has more significant PCs than the non-family admixture simulations, the real datasets and their tree simulated counterparts have similar numbers of significant PCs (Figure 6 Figure supp 2).”
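A minimal sketch of such a test, in our own simplified form (marginal regression of the trait on each PC with a Bonferroni correction and a normal approximation to the t-test; not the exact procedure in the manuscript):

```python
import math
import numpy as np

def n_significant_pcs(pcs, trait, alpha=0.05):
    """Count PCs significantly associated with the trait, testing each
    PC marginally and applying a Bonferroni correction.  Uses a normal
    approximation to the t distribution, adequate at these sample sizes."""
    n, r = pcs.shape
    count = 0
    for j in range(r):
        rho = np.corrcoef(pcs[:, j], trait)[0, 1]
        t = rho * math.sqrt((n - 2) / (1.0 - rho ** 2))
        p = math.erfc(abs(t) / math.sqrt(2.0))  # two-sided normal tail
        if p < alpha / r:
            count += 1
    return count

rng = np.random.default_rng(0)
n, r = 500, 10
pcs = rng.normal(size=(n, r))                  # stand-ins for estimated PCs
trait = 3.0 * pcs[:, 0] + rng.normal(size=n)   # only the first PC matters here
k = n_significant_pcs(pcs, trait)
```

In this toy setup only the first PC carries signal, so the count is small, mirroring the small numbers of significant PCs we observed across datasets.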
L46 and L487. Assume the low-dimension part of the relatedness matters in terms of trait differences among groups, which need to be adjusted.
It is unnecessary to state that PCA assumes that the effect on the trait is also low-dimensional. Under nonzero heritability, high-dimensional relatedness (family structure) always affects the trait. It is impossible for the effect to be restricted to a low-dimensional component, because close relatives will share an outsize proportion of variants (whether they are common or rare in different groups) and thus have much more similar traits than random people (in the same group or otherwise).
L53-57. This is not an accurate summary of the relevant papers. Earlier papers set up the general LMM framework, with different components that may be included or excluded and individual components that can take varied forms, or reported the first actual association study with the LMM framework. These two are earlier papers that (re)introduce the LMM to association studies, and markers were used for kinship, structure, and PCA.
We reworded the first sentence to: “Early LMMs used kinship matrices estimated from known pedigrees or using methods that captured recent relatedness only, and modeled population structure (ancestry) as fixed effects (Yu et al., 2006; Zhao et al., 2007; Zhu et al., 2009).” The intent of that sentence was not to describe early LMMs entirely, or as abstract frameworks, but to point out that the specific kinship estimates used then in practice were different than the ones commonly used now, though we agree the original version was misleading.
L64-70. Redundancy is not an issue since the objective is to control false positives using two types of covariates.
We agree, redundancy does not often cause problems in practice, but without a clear benefit it’s not worth the clear risk of loss of power (which we demonstrate for large numbers of PCs) and increased computational burden.
L166. Large "and"(?) Family.
We reworded to say “Both Large and Family simulations …”.
L282. Theoretical and empirical evidence of these two needs to be provided.
We previously provided theoretical justification for our SRMSDp and AUC metrics in Appendix B, showing that SRMSDp reflects null p-value calibration and agrees with the inflation factor and type I error rate, while AUC corresponds to calibrated power. We also provided Figure 2 Figure supp 1, which showed an excellent monotonic correspondence between SRMSDp and the inflation factor. We added two supplementary figures that provide additional empirical justification. Figure 2 Figure supp 2 shows that SRMSDp and type I error are monotonically related, and that they agree as expected for calibrated models. Figure 2 Figure supp 3 shows a positive correlation between AUC and calibrated power, which holds well separately per dataset, although the slope of this relationship varies between datasets because they have different proportions of loci where the alternative hypothesis holds.
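For readers unfamiliar with the metric, one plausible numpy sketch of a signed RMSD between null p-values and their uniform quantiles (the exact SRMSDp definition is in Appendix B of the manuscript; the sign convention used here, negative for anti-conservative p-values, is an assumption of this sketch):

```python
import numpy as np

def signed_rmsd_p(p_null):
    """Signed RMSD between sorted null p-values and their expected
    uniform quantiles.  Zero means calibrated; negative means null
    p-values are systematically too small (anti-conservative), and
    positive means too large (conservative)."""
    p = np.sort(np.asarray(p_null, dtype=float))
    n = len(p)
    expected = (np.arange(1, n + 1) - 0.5) / n   # uniform quantiles
    rmsd = np.sqrt(np.mean((p - expected) ** 2))
    return float(np.sign(np.median(p) - 0.5) * rmsd)

u = np.linspace(0.001, 0.999, 999)   # perfectly calibrated null p-values
calibrated = signed_rmsd_p(u)
inflated = signed_rmsd_p(u ** 2)     # p-values shrunk toward zero
```

The calibrated case sits at zero while squaring the p-values (an anti-conservative distortion) produces a clearly negative value, which is the kind of monotonic behavior the supplementary figures verify against the inflation factor and type I error.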
Table 3. Not clear about the asterisk.
This table has been reorganized at the request of another reviewer. The asterisk denoted that the model with that number of PCs was calibrated (determined by whether the mean absolute SRMSDp was below 0.01). In the new table there is instead a column that denotes whether each model is calibrated as a boolean (True or False).
https://doi.org/10.7554/eLife.79238.sa2