Limitations of principal components in quantitative genetic association models for human studies

Abstract

Principal Component Analysis (PCA) and the Linear Mixed-effects Model (LMM), sometimes in combination, are the most common genetic association models. Previous PCA-LMM comparisons give mixed results and unclear guidance, and have several limitations, including not varying the number of principal components (PCs), simulating simple population structures, and inconsistent use of real data and power evaluations. We evaluate PCA and LMM, both with varying numbers of PCs, in realistic genotype and complex trait simulations including admixed families, subpopulation trees, and real multiethnic human datasets with simulated traits. We find that LMM without PCs usually performs best, with the largest effects in family simulations and real human datasets and traits without environment effects. Poor PCA performance on human datasets is driven by large numbers of distant relatives more than the smaller number of closer relatives. While PCA was known to fail on family data, we report strong effects of family relatedness in genetically diverse human datasets, not avoided by pruning close relatives. Environment effects driven by geography and ethnicity are better modeled with LMM including those labels instead of PCs. This work better characterizes the severe limitations of PCA compared to LMM in modeling the complex relatedness structures of multiethnic human data for association studies.

Editor's evaluation

This is an important paper that presents compelling arguments (based on simulation and comprehensively reviewed background theory) that Linear Mixed Models generally should perform better at correcting for genetic and environmental confounding in GWAS than more commonly used Principal Components methods.

https://doi.org/10.7554/eLife.79238.sa0

Introduction

The goal of a genetic association study is to identify loci whose genotype variation is significantly correlated with a given trait. Naive association tests assume that genotypes are drawn independently from a common allele frequency. This assumption does not hold for structured populations, which include multiethnic cohorts and admixed individuals (ancient relatedness), and for family data (recent relatedness; Astle and Balding, 2009). Association studies of admixed and multiethnic cohorts, the focus of this work, are becoming more common, are believed to be more powerful, and are necessary to bring more equity to genetic medicine (Rosenberg et al., 2010; Hoffman and Dubé, 2013; Coram et al., 2013; Medina-Gomez et al., 2015; Conomos et al., 2016a; Hodonsky et al., 2017; Martin et al., 2017a; Martin et al., 2017b; Hindorff et al., 2018; Hoffmann et al., 2018; Mogil et al., 2018; Roselli et al., 2018; Wojcik et al., 2019; Peterson et al., 2019; Zhong et al., 2019; Hu et al., 2020; Simonin-Wilmer et al., 2021; Kamariza et al., 2021; Lin et al., 2021; Mahajan et al., 2022; Hou et al., 2023a). When insufficient approaches are applied to data with relatedness, their association statistics are miscalibrated, resulting in excess false positives and loss of power (Devlin and Roeder, 1999; Voight and Pritchard, 2005; Astle and Balding, 2009). Therefore, many specialized approaches have been developed for genetic association under relatedness, of which PCA and LMM are the most popular.

Genetic association with PCA consists of including the top eigenvectors of the population kinship matrix as covariates in a generalized linear model (Zhang et al., 2003; Price et al., 2006; Bouaziz et al., 2011). These top eigenvectors are a new set of coordinates for individuals that are commonly referred to as PCs in genetics (Patterson et al., 2006), the convention adopted here, but in other fields PCs instead denote what in genetics would be the projections of loci onto eigenvectors, which are new independent coordinates for loci (Jolliffe, 2002). The direct ancestor of PCA association is structured association, in which inferred ancestry (genetic cluster membership, often corresponding with labels such as “European”, “African”, “Asian”, etc.) or admixture proportions of these ancestries are used as regression covariates (Pritchard et al., 2000). These models are deeply connected because PCs map to ancestry empirically (Alexander et al., 2009; Zhou et al., 2016) and theoretically (McVean, 2009; Zheng and Weir, 2016; Cabreros and Storey, 2019; Chiu et al., 2022), and they work as well as global ancestry in association studies but are estimated more easily (Patterson et al., 2006; Zhao et al., 2007; Alexander et al., 2009; Bouaziz et al., 2011). Another approach closely related to PCA is nonmetric multidimensional scaling (Zhu and Yu, 2009). PCs are also proposed for modeling environment effects that are correlated to ancestry, for example, through geography (Novembre et al., 2008; Zhang and Pan, 2015; Lin et al., 2021). The strength of PCA is its simplicity: as covariates, PCs can be readily included in more complex models, such as haplotype association (Xu and Guan, 2014) and polygenic models (Qian et al., 2020). However, PCA assumes that the underlying relatedness space is low dimensional (or low rank), so that it can be well modeled with a small number of PCs, which may limit its applicability.
PCA is known to be inadequate for family data (Patterson et al., 2006; Zhu and Yu, 2009; Thornton and McPeek, 2010; Price et al., 2010), which is called ‘cryptic relatedness’ when it is unknown to the researchers, but no other troublesome cases have been confidently identified. Recent work has focused on developing more scalable versions of the PCA algorithm (Lee et al., 2012; Abraham and Inouye, 2014; Galinsky et al., 2016; Abraham et al., 2017; Agrawal et al., 2020). PCA remains a popular and powerful approach for association studies.

The other dominant association model under relatedness is the LMM, which includes a random effect parameterized by the kinship matrix. Unlike PCA, LMM does not assume that relatedness is low-dimensional, and explicitly models families via the kinship matrix. Early LMMs used kinship matrices estimated from known pedigrees or using methods that captured recent relatedness only, and modeled population structure (ancestry) as fixed effects (Yu et al., 2006; Zhao et al., 2007; Zhu and Yu, 2009). Modern LMMs estimate kinship from genotypes using a non-parametric estimator, often referred to as a genetic relationship matrix, that captures the combined covariance due to family relatedness and ancestry (Kang et al., 2008; Astle and Balding, 2009; Ochoa and Storey, 2021). Like PCA, LMM has also been proposed for modeling environment correlated to genetics (Vilhjálmsson and Nordborg, 2013; Wang et al., 2022). The classic LMM assumes a quantitative (continuous) complex trait, the focus of our work. Although case-control (binary) traits and their underlying ascertainment are theoretically a challenge (Yang et al., 2014), LMMs have been applied successfully to balanced case-control studies (Astle and Balding, 2009; Kang et al., 2010) and simulations (Price et al., 2010; Wu et al., 2011; Sul and Eskin, 2013), and have been adapted for unbalanced case-control studies (Zhou et al., 2018). However, LMMs tend to be considerably slower than PCA and other models, so much effort has focused on improving their runtime and scalability (Aulchenko et al., 2007; Kang et al., 2008; Kang et al., 2010; Zhang et al., 2010; Lippert et al., 2011; Yang et al., 2011; Listgarten et al., 2012; Zhou and Stephens, 2012; Svishcheva et al., 2012; Loh et al., 2015; Zhou et al., 2018).

An LMM variant that incorporates PCs as fixed covariates is tested thoroughly in our work. Since PCs are the top eigenvectors of the same kinship matrix estimate used in modern LMMs (Astle and Balding, 2009; Janss et al., 2012; Hoffman and Dubé, 2013; Zhang and Pan, 2015), population structure is modeled twice in an LMM with PCs. However, some previous work has found the apparent redundancy of an LMM with PCs beneficial (Price et al., 2010; Tucker et al., 2014; Zhang and Pan, 2015), while other work has not (Liu et al., 2011; Janss et al., 2012), and the approach continues to be used (Zeng et al., 2018; Mbatchou et al., 2021), although not always (Matoba et al., 2020). Recall that early LMMs used kinship to model family relatedness only, so population structure had to be modeled separately in those models, in practice as admixture fractions instead of PCs (Yu et al., 2006; Zhao et al., 2007; Zhu and Yu, 2009). The LMM with PCs (vs no PCs) is also believed to help better model loci that have experienced selection (Price et al., 2010; Vilhjálmsson and Nordborg, 2013) and environment effects correlated with genetics (Zhang and Pan, 2015).
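
For concreteness, the three association models discussed above can be written in a common notation. This is a sketch in a standard textbook form, not a quotation of this paper's Methods. The symbols are assumptions introduced here for illustration: $\mathbf{y}$ is the trait vector, $\mathbf{x}_i$ the genotype vector at the tested locus $i$, $\mathbf{U}_r$ the matrix whose columns are the top $r$ PCs, $\mathbf{\Phi}$ the kinship matrix, and $\sigma_s^2$, $\sigma_\epsilon^2$ the variance components.

$$
\begin{aligned}
\text{PCA:} \quad & \mathbf{y} = \mathbf{1}\alpha + \mathbf{x}_i \beta_i + \mathbf{U}_r \boldsymbol{\gamma}_r + \boldsymbol{\epsilon}, \\
\text{LMM:} \quad & \mathbf{y} = \mathbf{1}\alpha + \mathbf{x}_i \beta_i + \mathbf{s} + \boldsymbol{\epsilon}, \\
\text{LMM with PCs:} \quad & \mathbf{y} = \mathbf{1}\alpha + \mathbf{x}_i \beta_i + \mathbf{U}_r \boldsymbol{\gamma}_r + \mathbf{s} + \boldsymbol{\epsilon}, \\
& \mathbf{s} \sim \text{Normal}\left(\mathbf{0},\, 2 \sigma_s^2 \mathbf{\Phi}\right), \quad
\boldsymbol{\epsilon} \sim \text{Normal}\left(\mathbf{0},\, \sigma_\epsilon^2 \mathbf{I}\right).
\end{aligned}
$$

In each case the association test is of the null hypothesis $\beta_i = 0$. Written this way, the apparent redundancy is visible: $\mathbf{U}_r$ is derived from the same $\mathbf{\Phi}$ that parameterizes the random effect $\mathbf{s}$, and setting $r = 0$ in the last form recovers the plain LMM.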

LMM and PCA are closely related models (Astle and Balding, 2009; Janss et al., 2012; Hoffman and Dubé, 2013; Zhang and Pan, 2015), so similar performance is expected, particularly under low-dimensional relatedness. Direct comparisons have yielded mixed results, with several studies finding superior performance for LMM, notably from papers promoting advances in LMMs, while many others report comparable performance (Table 1). No papers find that PCA outperforms LMM decisively, although PCA occasionally performs better in isolated and artificial cases or individual measures, often with unknown significance. Previous studies generally used either only simulated or only real genotypes, with only two studies using both. The simulated genotype studies, which tended to have low model dimensions and $F_{ST}$, were more likely to report ties or mixed results (6/8), whereas real genotypes tended to clearly favor LMMs (9/11). Similarly, 10/12 papers with quantitative traits favor LMMs, whereas 6/9 papers with case-control traits gave ties or mixed results; trait type is the only factor we do not explore in this work. Additionally, although all previous evaluations measured type I error (or proxies such as genomic inflation factors [Devlin and Roeder, 1999] or QQ plots), a large fraction (6/17) did not measure power (or proxies such as ROC curves), and only four used more than one number of PCs for PCA. Lastly, no consensus has emerged as to why LMM might outperform PCA or vice versa (Price et al., 2010; Sul and Eskin, 2013; Price et al., 2013; Hoffman and Dubé, 2013), or which features of the real datasets are critical for the LMM advantage other than family relatedness, resulting in unclear guidance for using PCA.
Hence, our work includes real and simulated genotypes with higher model dimensions and $F_{ST}$ matching that of multiethnic human cohorts (Ochoa and Storey, 2021; Ochoa and Storey, 2019); we also vary the number of PCs and measure robust proxies for type I error control and calibrated power.

Table 1

Previous PCA-LMM evaluations in the literature.

	Sim. Genotypes			General
Publication	Type*	$K$ ^†	$F_{ST}$ ^‡	Real ^§	Trait ^¶	Power	PCs ($r$)	Best
Zhao et al., 2007				✓	Q	✓	8	LMM
Zhu and Yu, 2009	I, A, F	3, 8	≤0.15	✓	Q	✓	1–22	LMM
Astle and Balding, 2009	I	3	0.10		CC	✓	10	Tie
Kang et al., 2010				✓	Both		2–100	LMM
Price et al., 2010	I, F	2	0.01		CC		1	Mixed
Wu et al., 2011	I, A	2–4	0.01		CC	✓	10	Mixed
Liu et al., 2011	S, A	2–3	R		Q	✓	10	Tie
Sul and Eskin, 2013	I	2	0.01		CC		1	Tie
Tucker et al., 2014	I	2	0.05	✓	Both	✓	5	Tie
Yang et al., 2014				✓	CC	✓	5	Tie
Song et al., 2015	S, A	2–3	R		Q		3	LMM
Loh et al., 2015				✓	Q	✓	10	LMM
Zhang and Pan, 2015				✓	Q	✓	20–100	LMM
Liu et al., 2016				✓	Q	✓	3–6	LMM
Sul et al., 2018				✓	Q		100	LMM
Loh et al., 2018				✓	Both	✓	20	LMM
Mbatchou et al., 2021				✓	Both		1	LMM
This work	A, T, F	10–243	≤0.25	✓	Q	✓	0–90	LMM

* Genotype simulation types. I: Independent subpopulations; S: subpopulations (with parameters drawn from real data); A: Admixture; T: Subpopulation Tree; F: Family.
† Model dimension (number of subpopulations or ancestries).
‡ R: simulated parameters based on real data, $F_{ST}$ not reported.
§ Evaluations using unmodified real genotypes.
¶ Q: quantitative; CC: case-control.

In this work, we evaluate the PCA and LMM association models under various numbers of PCs, which are included in LMMs too. We use genotype simulations (admixture, family, and subpopulation tree models) and three real datasets: the 1000 Genomes Project (Abecasis et al., 2010; Abecasis et al., 2012), the Human Genome Diversity Panel (HGDP) (Cann et al., 2002; Rosenberg et al., 2002; Bergström et al., 2020), and Human Origins (Patterson et al., 2012; Lazaridis et al., 2014; Lazaridis et al., 2016; Skoglund et al., 2016). We simulate quantitative traits from two models: fixed effect sizes (FES), which constructs coefficients inverse to allele frequency, matches real data (Park et al., 2011; Zeng et al., 2018; O’Connor et al., 2019), and corresponds to high pleiotropy and strong balancing selection (Simons et al., 2018) and strong negative selection (Zeng et al., 2018; O’Connor et al., 2019), which are appropriate assumptions for diseases; and random coefficients (RC), which draws coefficients independently of allele frequency and corresponds to neutral traits (Zeng et al., 2018; Simons et al., 2018). LMM without PCs consistently performs best in simulations without environment, and greatly outperforms PCA in the family simulation and in all real datasets. The tree simulations, which model subpopulations with a tree but exclude family structure, do not recapitulate the real data results, suggesting that family relatedness in real data is the reason for poor PCA performance. Lastly, removing up to 4th degree relatives in the real datasets recapitulates poor PCA performance, showing that the more numerous distant relatives explain the result, and suggesting that PCA is generally not an appropriate model for real data. We find that both LMM and PCA are able to model environment effects correlated with genetics, and LMM with PCs gains a small advantage in this setting only, but direct modeling of environment performs much better.
Altogether, we find that LMMs without PCs are generally a preferable association model, and present novel simulation and evaluation approaches to measure the performance of these and other genetic association approaches.

Results

Overview of evaluations

We use three real genotype datasets and simulated genotypes from six population structure scenarios to cover various features of interest (Table 2). We introduce them in sets of three, as they appear in the rest of our results. Population kinship matrices, which combine population and family relatedness, are estimated without bias using popkin (Ochoa and Storey, 2021; Figure 1). The first set of three simulated genotypes is based on an admixture model with 10 ancestries (Figure 1A; Ochoa and Storey, 2021; Gopalan et al., 2016; Cabreros and Storey, 2019). The ‘large’ version (1000 individuals) illustrates asymptotic performance, while the ‘small’ simulation (100 individuals) illustrates model overfitting. The ‘family’ simulation has admixed founders and draws a 20-generation random pedigree with assortative mating, resulting in a complex joint family and ancestry structure in the last generation (Figure 1B). The second set of three consists of the real human datasets representing global human diversity: Human Origins (Figure 1D), HGDP (Figure 1G), and 1000 Genomes (Figure 1J), which are enriched for small minor allele frequencies even after removing loci with MAF < 1% (Figure 1C). Last are subpopulation tree simulations (Figure 1F, I, L) fit to the kinship (Figure 1E, H and K) and MAF (Figure 1C) of each real human dataset, which by design do not have family structure.

Table 2

Features of simulated and real human genotype datasets.

Dataset	Type	Loci ($m$)	Ind. ($n$)	Subpops.* ($K$)	Causal loci^† ($m_{1}$)	$F_{ST}$ ^‡
Admix. Large sim.	Admix.	100 000	1000	10	100	0.1
Admix. Small sim.	Admix.	100 000	100	10	10	0.1
Admix. Family sim.	Admix.+Pedig.	100 000	1000	10	100	0.1
Human Origins	Real	190 394	2922	11–243	292	0.28
HGDP	Real	771 322	929	7–54	93	0.28
1000 Genomes	Real	1 111 266	2504	5–26	250	0.22
Human Origins sim.	Tree	190 394	2922	243	292	0.23
HGDP sim.	Tree	771 322	929	54	93	0.25
1000 Genomes sim.	Tree	1 111 266	2504	26	250	0.21

* For admixed family, ignores additional model dimension of 20 generation pedigree structure. For real datasets, lower range is continental subpopulations, upper range is number of fine-grained subpopulations.
† $m_{1} = \mathrm{round}(n h^{2} / 8)$ to balance power across datasets, shown for $h^{2} = 0.8$ only.
‡ Model parameter for simulations, estimated value on real datasets.
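
The causal-locus rule in the Table 2 footnote can be checked with a few lines of Python. This is a minimal sketch: the function name `causal_loci` is hypothetical, and the sample sizes are copied from Table 2.

```python
def causal_loci(n: int, h2: float) -> int:
    """Number of causal loci m1 = round(n * h2 / 8), as in the Table 2 footnote."""
    # Python's round() breaks exact .5 ties toward the even integer;
    # none of the Table 2 values land on such a tie.
    return round(n * h2 / 8)


# Sample sizes n copied from Table 2 (the tree simulations reuse the real n).
sample_sizes = {
    "Admix. Large sim.": 1000,
    "Admix. Small sim.": 100,
    "Admix. Family sim.": 1000,
    "Human Origins": 2922,
    "HGDP": 929,
    "1000 Genomes": 2504,
}

for name, n in sample_sizes.items():
    print(f"{name}: m1 = {causal_loci(n, h2=0.8)}")
```

With $h^{2} = 0.8$ this reproduces the "Causal loci" column of Table 2, so the number of causal loci grows in proportion to sample size.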

Figure 1

Population structures of simulated and real human genotype datasets.

First two columns are population kinship matrices as heatmaps: individuals along x- and y-axis, kinship as color. Diagonal shows inbreeding values. (A) Admixture scenario for both Large and Small simulations. (B) Last generation of 20-generation admixed family, shows larger kinship values near diagonal corresponding to siblings, first cousins, etc. (C) Minor allele frequency (MAF) distributions. Real datasets and subpopulation tree simulations had $MAF \geq 0.01$ filter. (D) Human Origins is an array dataset of a large diversity of global populations. (G) Human Genome Diversity Panel (HGDP) is a WGS dataset from global native populations. (J) 1000 Genomes Project is a WGS dataset of global cosmopolitan populations. (F, I, L) Trees between subpopulations fit to real data. (E, H, K) Simulations from trees fit to the real data recapitulate subpopulation structure.

All traits in this work are simulated. We repeated all evaluations on two additive quantitative trait models, fixed effect sizes (FES) and random coefficients (RC), which differ in how causal coefficients are constructed. The FES model captures the rough inverse relationship between coefficient and minor allele frequency that arises under strong negative and balancing selection and has been observed in numerous diseases and other traits (Park et al., 2011; Zeng et al., 2018; Simons et al., 2018; O’Connor et al., 2019), so it is the focus of our results. The RC model draws coefficients independent of allele frequency, corresponding to neutral traits (Zeng et al., 2018; Simons et al., 2018), which results in a wider effect size distribution that reduces association power and effective polygenicity compared to FES.
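
The difference between the two trait models can be illustrated with a small Python sketch. The specific forms used here are assumptions made for illustration, not details stated in this section: FES coefficients are given magnitude $1/\sqrt{2p(1-p)}$ with a random sign, RC coefficients are standard normal draws, and the genetic variance is approximated as $\sum_i 2 p_i (1 - p_i) \beta_i^2$, which ignores population structure. All function names and the allele frequencies are hypothetical.

```python
import math
import random


def fes_coefficients(freqs, rng):
    """FES sketch: magnitude inverse to allele frequency, with a random sign.

    The exact form 1 / sqrt(2 p (1 - p)) is an assumption for illustration.
    """
    return [rng.choice((-1.0, 1.0)) / math.sqrt(2.0 * p * (1.0 - p)) for p in freqs]


def rc_coefficients(freqs, rng):
    """RC sketch: coefficients drawn independently of allele frequency."""
    return [rng.gauss(0.0, 1.0) for _ in freqs]


def genetic_variance(betas, freqs):
    """Additive variance assuming independent loci and no population structure."""
    return sum(2.0 * p * (1.0 - p) * b * b for b, p in zip(betas, freqs))


def scale_to_heritability(betas, freqs, h2):
    """Rescale coefficients so the approximate genetic variance equals h2."""
    factor = math.sqrt(h2 / genetic_variance(betas, freqs))
    return [b * factor for b in betas]


rng = random.Random(1)
freqs = [0.01, 0.05, 0.2, 0.5]  # hypothetical causal allele frequencies

fes = scale_to_heritability(fes_coefficients(freqs, rng), freqs, h2=0.8)
rc = scale_to_heritability(rc_coefficients(freqs, rng), freqs, h2=0.8)

for p, b_fes, b_rc in zip(freqs, fes, rc):
    print(f"p = {p:.2f}  FES beta = {b_fes:+.3f}  RC beta = {b_rc:+.3f}")
```

Under these assumptions every FES locus contributes the same variance, whereas RC contributions vary from locus to locus, which is one way to read the wider effect size distribution and lower effective polygenicity described above.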

We evaluate using two complementary measures: (1) ${SRMSD}_{p}$ (p-value signed root mean square deviation) measures p-value calibration (closer to zero is better), and (2) ${AUC}_{PR}$ (precision-recall area under the curve) measures causal locus classification performance (higher is better; Figure 2). ${SRMSD}_{p}$ is a more robust alternative to the common inflation factor $λ$ and type I error control measures; there is a correspondence between $λ$ and ${SRMSD}_{p}$ , with ${SRMSD}_{p} > 0.01$ giving $λ > 1.06$ (Figure 2—figure supplement 1) and thus evidence of miscalibration close to the rule of thumb of $λ > 1.05$ (Price et al., 2010). There is also a monotonic correspondence between ${SRMSD}_{p}$ and type I error rate (Figure 2—figure supplement 2). ${AUC}_{PR}$ has been used to evaluate association models (Rakitsch et al., 2013), and reflects calibrated statistical power (Figure 2—figure supplement 3) while being robust to miscalibrated models (Appendix 2).
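
Both measures are simple to compute from a vector of p-values and a causal/null label per locus. The sketch below is an illustration under stated assumptions rather than the exact implementation used in this work. It takes ${SRMSD}_{p}$ to be the root mean square deviation between sorted null p-values and uniform quantiles $(i - 0.5)/m_0$, signed by the difference of medians so that conservative p-values give negative values. It computes ${AUC}_{PR}$ by trapezoidal integration with an assumed starting precision of 1 and no special handling of tied p-values. Function names are hypothetical.

```python
import math
from statistics import median


def srmsd_p(null_pvalues):
    """Signed RMSD between observed null p-values and uniform quantiles.

    Positive: p-values smaller than expected (anti-conservative).
    Negative: p-values larger than expected (conservative).
    """
    observed = sorted(null_pvalues)
    m0 = len(observed)
    expected = [(i + 0.5) / m0 for i in range(m0)]
    rmsd = math.sqrt(sum((u - p) ** 2 for u, p in zip(expected, observed)) / m0)
    diff = median(expected) - median(observed)
    sign = (diff > 0) - (diff < 0)
    return sign * rmsd


def auc_pr(pvalues, is_causal):
    """Area under the precision-recall curve, ranking loci by ascending p-value."""
    order = sorted(range(len(pvalues)), key=lambda i: pvalues[i])
    n_causal = sum(is_causal)
    true_pos = 0
    prev_recall, prev_precision = 0.0, 1.0  # assumed starting point of the curve
    area = 0.0
    for rank, i in enumerate(order, start=1):
        if is_causal[i]:
            true_pos += 1
        recall = true_pos / n_causal
        precision = true_pos / rank
        # Trapezoid between consecutive points of the curve.
        area += (recall - prev_recall) * (precision + prev_precision) / 2.0
        prev_recall, prev_precision = recall, precision
    return area


m0 = 1000
uniform = [(i + 0.5) / m0 for i in range(m0)]
print("calibrated:       ", srmsd_p(uniform))
print("anti-conservative:", srmsd_p([u ** 2 for u in uniform]))
print("conservative:     ", srmsd_p([u ** 0.5 for u in uniform]))
print("perfect ranking:  ", auc_pr([0.001, 0.002, 0.5, 0.9], [True, True, False, False]))
```

Because ${AUC}_{PR}$ depends only on the ranking of loci, it is unchanged by any monotonic distortion of the p-values, which matches the robustness to miscalibration noted above.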

Figure 2 with 3 supplements

Illustration of evaluation measures.

Three archetypal models illustrate our complementary measures: M1 is ideal, M2 overfits slightly, M3 is naive. (A) QQ plot of p-values of “null” (non-causal) loci. M1 has desired uniform p-values, M2/M3 are miscalibrated. (B) ${SRMSD}_{p}$ (p-value Signed Root Mean Square Deviation) measures signed distance between observed and expected null p-values (closer to zero is better). (C) Precision and Recall (PR) measure causal locus classification performance (higher is better). (D) ${AUC}_{PR}$ (Area Under the PR Curve) reflects power (higher is better).

Both PCA and LMM are evaluated in each replicate dataset including a number of PCs $r$ between 0 and 90 as fixed covariates. In terms of p-value calibration, for PCA the best number of PCs $r$ (minimizing mean $| {SRMSD}_{p} |$ over replicates) is typically large across all datasets (Table 3), although much smaller $r$ values often performed as well (shown in following sections). Most cases have a mean $| {SRMSD}_{p} | < 0.01$ , whose p-values are effectively calibrated. However, PCA is often miscalibrated on the family simulation and real datasets (Table 3). In contrast, for LMM, $r = 0$ (no PCs) is always best, and is always calibrated. Comparing LMM with $r = 0$ to PCA with its best $r$ , LMM always has significantly smaller $| {SRMSD}_{p} |$ than PCA or is statistically tied. For ${AUC}_{PR}$ and PCA, the best $r$ is always smaller than the best $r$ for $| {SRMSD}_{p} |$ , so there is often a tradeoff between calibrated p-values versus classification performance. For LMM, there is no tradeoff, as $r = 0$ often has the best mean ${AUC}_{PR}$ , and otherwise is not significantly different from the best $r$ . Lastly, LMM with $r = 0$ always has significantly greater or statistically tied ${AUC}_{PR}$ than PCA with its best $r$ .

Table 3

Overview of PCA and LMM evaluations for high heritability simulations.

			LMM $r = 0$ vs best $r$			PCA vs LMM $r = 0$
Dataset	Metric	Trait*	Cal.^†	Best $r$ ^‡	P-value ^§	Best $r$ ^‡	Cal.^†	P-value ^§	Best model ^¶
Admix. Large sim.	$| {SRMSD}_{p} |$	FES	True	0	1	12	True	0.036	Tie
Admix. Small sim.	$| {SRMSD}_{p} |$	FES	True	0	1	4	True	0.055	Tie
Admix. Family sim.	$| {SRMSD}_{p} |$	FES	True	0	1	90	False	3.9e-10*	LMM
Human Origins	$| {SRMSD}_{p} |$	FES	True	0	1	89	False	3.9e-10*	LMM
HGDP	$| {SRMSD}_{p} |$	FES	True	0	1	87	True	4.4e-10*	LMM
1000 Genomes	$| {SRMSD}_{p} |$	FES	True	0	1	90	False	3.9e-10*	LMM
Human Origins sim.	$| {SRMSD}_{p} |$	FES	True	0	1	88	True	0.017	Tie
HGDP sim.	$| {SRMSD}_{p} |$	FES	True	0	1	47	True	0.046	Tie
1000 Genomes sim.	$| {SRMSD}_{p} |$	FES	True	0	1	78	True	9.6e-10*	LMM
Admix. Large sim.	$| {SRMSD}_{p} |$	RC	True	0	1	26	True	0.11	Tie
Admix. Small sim.	$| {SRMSD}_{p} |$	RC	True	0	1	4	True	0.00097	Tie
Admix. Family sim.	$| {SRMSD}_{p} |$	RC	True	0	1	90	False	3.9e-10*	LMM
Human Origins	$| {SRMSD}_{p} |$	RC	True	0	1	90	True	0.00065	Tie
HGDP	$| {SRMSD}_{p} |$	RC	True	0	1	37	True	1.5e-05*	LMM
1000 Genomes	$| {SRMSD}_{p} |$	RC	True	0	1	76	True	3.9e-10*	LMM
Human Origins sim.	$| {SRMSD}_{p} |$	RC	True	0	1	85	True	0.14	Tie
HGDP sim.	$| {SRMSD}_{p} |$	RC	True	0	1	44	True	8.8e-07*	LMM
1000 Genomes sim.	$| {SRMSD}_{p} |$	RC	True	0	1	90	True	3.9e-10*	LMM
Admix. Large sim.	${AUC}_{PR}$	FES		0	1	3		5.9e-06*	LMM
Admix. Small sim.	${AUC}_{PR}$	FES		0	1	2		0.025	Tie
Admix. Family sim.	${AUC}_{PR}$	FES		1	0.35	22		3.9e-10*	LMM
Human Origins	${AUC}_{PR}$	FES		0	1	34		3.9e-10*	LMM
HGDP	${AUC}_{PR}$	FES		1	0.33	16		4.4e-10*	LMM
1000 Genomes	${AUC}_{PR}$	FES		1	0.11	8		3.9e-10*	LMM
Human Origins sim.	${AUC}_{PR}$	FES		0	1	36		3.9e-10*	LMM
HGDP sim.	${AUC}_{PR}$	FES		0	1	17		1.7e-05*	LMM
1000 Genomes sim.	${AUC}_{PR}$	FES		0	1	10		5e-10*	LMM
Admix. Large sim.	${AUC}_{PR}$	RC		0	1	3		1.4e-05*	LMM
Admix. Small sim.	${AUC}_{PR}$	RC		0	1	1		0.095	Tie
Admix. Family sim.	${AUC}_{PR}$	RC		0	1	34		3.9e-10*	LMM
Human Origins	${AUC}_{PR}$	RC		3	0.4	36		9.6e-10*	LMM
HGDP	${AUC}_{PR}$	RC		4	0.21	16		0.013	Tie
1000 Genomes	${AUC}_{PR}$	RC		5	0.004	9		0.00043	Tie
Human Origins sim.	${AUC}_{PR}$	RC		0	1	37		4.1e-10*	LMM
HGDP sim.	${AUC}_{PR}$	RC		3	0.087	17		0.0014	Tie
1000 Genomes sim.	${AUC}_{PR}$	RC		3	0.37	10		8.5e-10*	LMM

* FES: Fixed Effect Sizes, RC: Random Coefficients.
† Calibrated: whether mean $| {SRMSD}_{p} | < 0.01$ over 50 replicates.
‡ Value of $r$ (number of PCs) with minimum mean $| {SRMSD}_{p} |$ or maximum mean ${AUC}_{PR}$.
§ Wilcoxon paired 1-tailed test of distributions ($| {SRMSD}_{p} |$ or ${AUC}_{PR}$) between models in header. Asterisk marks significant value using Bonferroni threshold ($p < α / n_{tests}$ with $α = 0.01$ and $n_{tests} = 72$ is the number of tests in this table).
¶ Tie if no significant difference using Bonferroni threshold.
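
The "Best model" column follows mechanically from the Bonferroni rule in the Table 3 footnotes. The sketch below applies that rule to a few p-values copied from the PCA vs LMM column of Table 3. The function and variable names are hypothetical, and the Wilcoxon test itself is not reimplemented here.

```python
ALPHA = 0.01
N_TESTS = 72  # number of tests in Table 3: 36 rows, two comparisons per row
THRESHOLD = ALPHA / N_TESTS  # about 1.39e-4


def best_model(p_value, threshold=THRESHOLD):
    """Return 'LMM' if the one-tailed test favoring LMM (r = 0) is significant, else 'Tie'."""
    return "LMM" if p_value < threshold else "Tie"


# P-values copied from the "PCA vs LMM r = 0" column of Table 3.
examples = {
    "Admix. Large sim., |SRMSD_p|, FES": 0.036,
    "HGDP, |SRMSD_p|, RC": 1.5e-05,
    "1000 Genomes, AUC_PR, RC": 0.00043,
}

print(f"Bonferroni threshold: {THRESHOLD:.3g}")
for label, p in examples.items():
    print(f"{label}: p = {p:g} -> {best_model(p)}")
```

Note that p-values such as 0.00043 would pass an uncorrected $α = 0.01$ cutoff but are reported as ties once the correction for 72 tests is applied.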

Evaluations in admixture simulations

Now we look more closely at results per dataset. The complete ${SRMSD}_{p}$ and ${AUC}_{PR}$ distributions for the admixture simulations and FES traits are in Figure 3. RC traits gave qualitatively similar results (Figure 3—figure supplement 1).

Figure 3 with 5 supplements

Evaluations in admixture simulations with FES traits, high heritability.

PCA and LMM models have varying number of PCs ($r \in \{0, \dots, 90\}$ on x-axis), with the distributions (y-axis) of ${SRMSD}_{p}$ (top subpanel) and ${AUC}_{PR}$ (bottom subpanel) for 50 replicates. Best performance is zero ${SRMSD}_{p}$ and large ${AUC}_{PR}$. Zero and maximum median ${AUC}_{PR}$ values are marked with horizontal gray dashed lines, and $| {SRMSD}_{p} | < 0.01$ is marked with a light gray area. LMM performs best with $r = 0$, PCA with various $r$. (A) Large simulation ($n = 1000$ individuals). (B) Small simulation ($n = 100$) shows overfitting for large $r$. (C) Family simulation ($n = 1000$) has admixed founders and large numbers of close relatives from a realistic random 20-generation pedigree. PCA performs poorly compared to LMM: ${SRMSD}_{p} > 0$ for all $r$ and large ${AUC}_{PR}$ gap.

In the large admixture simulation, the ${SRMSD}_{p}$ of PCA is largest when $r = 0$ (no PCs) and decreases rapidly to near zero at $r = 3$ , where it stays for up to $r = 90$ (Figure 3A). Thus, PCA has calibrated p-values for $r \geq 3$ , smaller than the theoretical optimum for this simulation of $r = K - 1 = 9$ . In contrast, the ${SRMSD}_{p}$ for LMM starts near zero for $r = 0$ , but becomes negative as $r$ increases (p-values are conservative). The ${AUC}_{PR}$ distribution of PCA is similarly worst at $r = 0$ , increases rapidly and peaks at $r = 3$ , then decreases slowly for $r > 3$ , while the ${AUC}_{PR}$ distribution for LMM starts near its maximum at $r = 0$ and decreases with $r$ . Although the ${AUC}_{PR}$ distributions for LMM and PCA overlap considerably at each $r$ , LMM with $r = 0$ has significantly greater ${AUC}_{PR}$ values than PCA with $r = 3$ (Table 3). However, qualitatively PCA performs nearly as well as LMM in this simulation.

The observed robustness to large $r$ led us to consider smaller sample sizes. A model with large numbers of parameters $r$ should overfit more as $r$ approaches the sample size $n$. Rather than increase $r$ beyond 90, we reduce individuals to $n = 100$, which is small for typical association studies but may occur in studies of rare diseases, pilot studies, or under other constraints. To compensate for the loss of power due to reducing $n$, we also reduce the number of causal loci (see Trait Simulation), which increases per-locus effect sizes. We found a large decrease in performance for both models as $r$ increases, and best performance for $r = 1$ for PCA and $r = 0$ for LMM (Figure 3B). Remarkably, LMM attains much larger negative ${SRMSD}_{p}$ values than in our other evaluations. LMM with $r = 0$ is statistically tied with PCA at its best $r$ ($r = 1$ to 4) in both measures under the Bonferroni threshold (Table 3), and qualitatively the difference is negligible.

The family simulation adds a 20-generation random family to our large admixture simulation. Only the last generation is studied for association, which contains numerous siblings, first cousins, etc., with the initial admixture structure preserved by geographically biased mating. Our evaluation reveals a sizable gap in both measures between LMM and PCA across all $r$ (Figure 3C). LMM again performs best with $r = 0$ and achieves mean $| {SRMSD}_{p} | < 0.01$ . However, PCA does not achieve mean $| {SRMSD}_{p} | < 0.01$ at any $r$ , and its best mean ${AUC}_{PR}$ is considerably worse than that of LMM. Thus, LMM is conclusively superior to PCA, and the only calibrated model, when there is family structure.

Evaluations in real human genotype datasets

Next, we repeat our evaluations with real human genotype data, which differ from our simulations in their allele frequency distributions and in having more complex population structures with greater $F_{ST}$, numerous correlated subpopulations, and potential cryptic family relatedness.

Human Origins has the greatest number and diversity of subpopulations. The ${SRMSD}_{p}$ and ${AUC}_{PR}$ distributions in this dataset and FES traits (Figure 4A) most resemble those from the family simulation (Figure 3C). In particular, while LMM with $r = 0$ performs optimally (both measures) and satisfies mean $| {SRMSD}_{p} | < 0.01$, PCA maintains ${SRMSD}_{p} > 0.01$ for all $r$ and its ${AUC}_{PR}$ values are all considerably smaller than the best ${AUC}_{PR}$ of LMM.

Figure 4 with 5 supplements

Evaluations in real human genotype datasets with FES traits, high heritability.

Same setup as Figure 3, see that for details. These datasets strongly favor LMM with no PCs over PCA, with distributions that most resemble the family simulation. (A) Human Origins. (B) Human Genome Diversity Panel (HGDP). (C) 1000 Genomes Project.

HGDP has the fewest individuals among real datasets, but compared to Human Origins contains more loci and low-frequency variants. Performance (Figure 4B) again most resembled the family simulations. In particular, LMM with $r = 0$ achieves mean $| {SRMSD}_{p} | < 0.01$ (p-values are calibrated), while PCA does not, and there is a sizable ${AUC}_{PR}$ gap between LMM and PCA. Maximum ${AUC}_{PR}$ values were lowest in HGDP compared to the two other real datasets.

1000 Genomes has the fewest subpopulations but largest number of individuals per subpopulation. Thus, although this dataset has the simplest subpopulation structure among the real datasets, we find ${SRMSD}_{p}$ and ${AUC}_{PR}$ distributions (Figure 4C) that again most resemble our earlier family simulation, with mean $| {SRMSD}_{p} | < 0.01$ for LMM only and large ${AUC}_{PR}$ gaps between LMM and PCA.

Our results are qualitatively different for RC traits, which had smaller ${AUC}_{PR}$ gaps between LMM and PCA (Figure 4—figure supplement 1). Maximum ${AUC}_{PR}$ values were smaller in RC compared to FES in Human Origins and 1000 Genomes, suggesting lower power for RC traits across association models. Nevertheless, LMM with $r = 0$ was significantly better than PCA or statistically tied for all measures in the real datasets and RC traits (Table 3).

Evaluations in subpopulation tree simulations fit to human data

To better understand which features of the real datasets lead to the large differences in performance between LMM and PCA, we carried out subpopulation tree simulations. Human subpopulations are related roughly by trees, which induce the strongest correlations, so we fit trees to each real dataset and tested if data simulated from these complex tree structures could recapitulate our previous results (Figure 1). These tree simulations also feature non-uniform ancestral allele frequency distributions, which recapitulated some of the skew for smaller minor allele frequencies of the real datasets (Figure 1C). The ${SRMSD}_{p}$ and ${AUC}_{PR}$ distributions for these tree simulations (Figure 5) resembled our admixture simulation more than either the family simulation (Figure 3) or real data results (Figure 4). Both LMM with $r = 0$ and PCA (various $r$ ) achieve mean $| {SRMSD}_{p} | < 0.01$ (Table 3). The ${AUC}_{PR}$ distributions of both LMM and PCA track closely as $r$ is varied, although there is a small gap resulting in LMM ( $r = 0$ ) besting PCA in all three simulations. The results are qualitatively similar for RC traits (Figure 5—figure supplement 1, Table 3). Overall, these subpopulation tree simulations do not recapitulate the large LMM advantage over PCA observed on the real data.

Figure 5 with 1 supplement

Evaluations in subpopulation tree simulations fit to human data with FES traits, high heritability.

Same setup as Figure 3, see that for details. These tree simulations, which exclude family structure by design, do not explain the large gaps in LMM-PCA performance observed in the real data. (A) Human Origins tree simulation. (B) Human Genome Diversity Panel (HGDP) tree simulation. (C) 1000 Genomes Project tree simulation.

Numerous distant relatives explain poor PCA performance in real data

In principle, PCA performance should be determined by the dimension of relatedness, or kinship matrix rank, since PCA is a low-dimensional model whereas LMM can model high-dimensional relatedness without overfitting. We used the Tracy-Widom test (Patterson et al., 2006) with $p < 0.01$ to estimate kinship matrix rank as the number of significant PCs (Figure 6—figure supplement 1A). The true rank of our simulations is slightly underestimated (Table 2), but we confirm that the family simulation has the greatest rank, and real datasets have greater estimates than their respective subpopulation tree simulations, which confirms our hypothesis to some extent. However, estimated ranks do not separate real datasets from tree simulations, as required to predict the observed PCA performance. Moreover, the HGDP and 1000 Genomes rank estimates are 45 and 61, respectively, yet PCA performed poorly for all $r \leq 90$ numbers of PCs (Figure 4). The top eigenvalue explained a proportion of variance proportional to $F_{ST}$ (Table 2), but the rest of the top 10 eigenvalues show no clear differences between datasets, except the small simulation had larger variances explained per eigenvalue (expected since it has fewer eigenvalues; Figure 6—figure supplement 1). Comparing cumulative variance explained versus rank fraction across all eigenvalues, all datasets increase from their starting point almost linearly until they reach 1, except the family simulation has much greater variance explained by mid-rank eigenvalues (Figure 6—figure supplement 1). We also calculated the number of PCs that are significantly associated with the trait, and observed similar results, namely that while the family simulation has more significant PCs than the non-family admixture simulations, the real datasets and their tree simulated counterparts have similar numbers of significant PCs (Figure 6—figure supplement 2). 
Overall, there is no separation between real datasets (where PCA performed poorly) and subpopulation tree simulations (where PCA performed relatively well) in terms of their eigenvalues or kinship matrix rank estimates.
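The comparison of cumulative variance explained against rank fraction can be sketched as follows. This is a minimal illustration only: the function name and the toy eigenvalues are hypothetical, and it assumes non-negative eigenvalues of a kinship matrix that have already been computed elsewhere.

```python
def cumulative_variance_explained(eigenvalues):
    """Return (rank_fraction, cumulative_fraction) pairs, largest eigenvalue first."""
    vals = sorted(eigenvalues, reverse=True)
    total = sum(vals)
    if total <= 0:
        raise ValueError("eigenvalues must have a positive sum")
    curve = []
    running = 0.0
    for k, val in enumerate(vals, start=1):
        running += val
        # x: fraction of eigenvalues used so far; y: fraction of variance explained
        curve.append((k / len(vals), running / total))
    return curve


# Toy spectra (made-up numbers, not from any dataset in this study).
low_dimensional = [8.0, 1.0, 0.5, 0.5]  # one dominant eigenvalue
flat = [2.5, 2.5, 2.5, 2.5]             # variance spread evenly across ranks

curve_low = cumulative_variance_explained(low_dimensional)
curve_flat = cumulative_variance_explained(flat)
```

In this toy case the flat spectrum rises linearly to 1, whereas the low-dimensional spectrum reaches most of its variance at the first eigenvalue.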

Local kinship, which is recent relatedness due to family structure excluding population structure, is the presumed cause of the LMM to PCA performance gap observed in real datasets but not their subpopulation tree simulation counterparts. Instead of inferring local kinship through increased kinship matrix rank, as attempted in the previous paragraph, we now measure it directly using the KING-robust estimator (Manichaikul et al., 2010). We observe more large local kinship values in the real datasets and the family simulation than in the other simulations (Figure 6). However, for real data this distribution depends on the subpopulation structure, since locally related pairs are most likely in the same subpopulation. Therefore, the only curve comparable to each real dataset is that of its corresponding subpopulation tree simulation, which matches subpopulation structure. In all real datasets, we identified highly related individual pairs with kinship above the 4th degree relative threshold of 0.022 (Manichaikul et al., 2010; Conomos et al., 2016b). However, these highly related pairs are vastly outnumbered by more distant pairs with evident non-zero local kinship as compared to the extreme tree simulation values.
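As a rough illustration of how such a distribution can be summarized, the sketch below computes the complementary cumulative fraction of pairwise kinship values above a threshold. It assumes a precomputed symmetric matrix of local kinship estimates (the KING-robust estimator itself is not implemented here), and the function names and toy values are hypothetical.

```python
FOURTH_DEGREE_THRESHOLD = 0.022  # threshold quoted in the text


def lower_triangle(kinship):
    """Pairwise values below the diagonal, so self kinship is excluded."""
    return [kinship[j][k] for j in range(len(kinship)) for k in range(j)]


def complementary_cdf(values, threshold):
    """Fraction of values strictly greater than the threshold."""
    return sum(1 for v in values if v > threshold) / len(values)


# Toy 4 x 4 matrix of local kinship estimates (made-up numbers).
# Negative estimates are possible and are counted in the denominator.
kinship = [
    [0.5, 0.25, 0.01, -0.002],
    [0.25, 0.5, 0.03, 0.0],
    [0.01, 0.03, 0.5, 0.004],
    [-0.002, 0.0, 0.004, 0.5],
]

pairs = lower_triangle(kinship)
frac_close = complementary_cdf(pairs, FOURTH_DEGREE_THRESHOLD)
frac_nonzero = complementary_cdf(pairs, 0.001)
```

Evaluating `complementary_cdf` over a grid of thresholds gives one curve per dataset, which is the kind of comparison described above.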

Figure 6 with 2 supplements

Local kinship distributions.

Curves are complementary cumulative distribution of lower triangular kinship matrix (self kinship excluded) from KING-robust estimator. Note log x-axis; negative estimates are counted but not shown. Most values are below 4th degree relative threshold. Each real dataset has a greater cumulative than its subpopulation tree simulations.

To try to improve PCA performance, we followed the standard practice of removing 4th degree relatives, which reduced sample sizes by roughly 5% to 10% (Table 4). Only $r = 0$ for LMM and $r = 20$ for PCA were tested, as these performed well in our earlier evaluation, and only FES traits were tested because they previously displayed the large PCA-LMM performance gap. LMM significantly outperforms PCA in all these cases (Wilcoxon paired 1-tailed $p < 0.01$ ; Figure 7). Notably, PCA still had miscalibrated p-values in two of the three real datasets ( $| {SRMSD}_{p} | > 0.01$ ), the only marginally calibrated case being HGDP, which is also the smallest of these datasets. Otherwise, ${AUC}_{PR}$ and ${SRMSD}_{p}$ ranges were similar to those in our earlier evaluation. Therefore, the removal of the small number of highly related individual pairs had a negligible effect on PCA performance, so the larger number of more distantly related pairs explains the poor PCA performance in the real datasets.

Figure 7 with 1 supplement

Evaluation in real datasets excluding 4th degree relatives, FES traits, high heritability.

Each dataset is a column, rows are measures. Boxplot whiskers are extrema over 50 replicates. First row has $| {SRMSD}_{p} | < 0.01$ band marked as gray area.

Table 4

Dataset sizes after 4th degree relative filter.

Dataset	Loci ( $m$ )	Ind. ( $n$ )	Ind. removed (%)
Human Origins	189 722	2636	9.8
HGDP	758 009	847	8.8
1000 Genomes	1 097 415	2390	4.6

Low heritability and environment simulations

Our main evaluations were repeated with traits simulated under a lower heritability value of $h^{2} = 0.3$ . We reduced the number of causal loci in response to this change in heritability, to result in equal average effect size per locus compared to the previous high heritability evaluations (see Trait Simulation). Despite that, these low heritability evaluations measured lower ${AUC}_{PR}$ values than their high heritability counterparts (Figure 3—figure supplement 2, Figure 3—figure supplement 3, Figure 4—figure supplement 2, Figure 4—figure supplement 3, Figure 7—figure supplement 1). The gap between LMM and PCA was reduced in these evaluations, but the main conclusion of the high heritability evaluation holds for low heritability as well, namely that LMM with $r = 0$ significantly outperforms or ties LMM with $r > 0$ and PCA in all cases (Table 5).

Table 5

Overview of PCA and LMM evaluations for low heritability simulations.

			LMM $r = 0$ vs best $r$			PCA vs LMM $r = 0$
Dataset	Metric	Trait*	Cal.^†	Best $r$ ^‡	p-value ^§	Best $r$ ^‡	Cal.^†	p-value ^§	Best model ^¶
Admix. Large sim.	$| {SRMSD}_{p} |$	FES	True	0	1	62	True	0.00012*	LMM
Admix. Small sim.	$| {SRMSD}_{p} |$	FES	True	0	1	3	True	0.27	Tie
Admix. Family sim.	$| {SRMSD}_{p} |$	FES	True	0	1	90	False	3.9e-10*	LMM
Human Origins	$| {SRMSD}_{p} |$	FES	True	0	1	81	True	3.9e-10*	LMM
HGDP	$| {SRMSD}_{p} |$	FES	True	0	1	37	True	6.2e-09*	LMM
1000 Genomes	$| {SRMSD}_{p} |$	FES	True	0	1	84	True	3.9e-10*	LMM
Admix. Large sim.	$| {SRMSD}_{p} |$	RC	True	0	1	35	True	0.00094	Tie
Admix. Small sim.	$| {SRMSD}_{p} |$	RC	True	0	1	3	True	0.087	Tie
Admix. Family sim.	$| {SRMSD}_{p} |$	RC	True	0	1	90	False	4.1e-10*	LMM
Human Origins	$| {SRMSD}_{p} |$	RC	True	0	1	75	True	0.00016*	LMM
HGDP	$| {SRMSD}_{p} |$	RC	True	0	1	23	True	1.7e-05*	LMM
1000 Genomes	$| {SRMSD}_{p} |$	RC	True	0	1	41	True	6.7e-10*	LMM
Admix. Large sim.	${AUC}_{PR}$	FES		0	1	3		0.11	Tie
Admix. Small sim.	${AUC}_{PR}$	FES		0	1	0		0.58	Tie
Admix. Family sim.	${AUC}_{PR}$	FES		0	1	7		2.2e-06*	LMM
Human Origins	${AUC}_{PR}$	FES		0	1	16		8e-10*	LMM
HGDP	${AUC}_{PR}$	FES		11	0.68	6		0.0043	Tie
1000 Genomes	${AUC}_{PR}$	FES		6	0.34	4		2.3e-07*	LMM
Admix. Large sim.	${AUC}_{PR}$	RC		0	1	3		0.14	Tie
Admix. Small sim.	${AUC}_{PR}$	RC		0	1	0		0.1	Tie
Admix. Family sim.	${AUC}_{PR}$	RC		0	1	5		1.9e-06*	LMM
Human Origins	${AUC}_{PR}$	RC		4	0.16	12		0.003	Tie
HGDP	${AUC}_{PR}$	RC		2	0.14	5		0.14	Tie
1000 Genomes	${AUC}_{PR}$	RC		0	1	4		0.078	Tie

* FES: Fixed Effect Sizes, RC: Random Coefficients.

† Calibrated: whether mean $| {SRMSD}_{p} | < 0.01$ over 50 replicates.

‡ Value of $r$ (number of PCs) with minimum mean $| {SRMSD}_{p} |$ or maximum mean ${AUC}_{PR}$ .

§ Wilcoxon paired 1-tailed test of distributions ( $| {SRMSD}_{p} |$ or ${AUC}_{PR}$ ) between models in header. Asterisk marks significant value using Bonferroni threshold ( $p < α / n_{tests}$ with $α = 0.01$ and $n_{tests} = 48$ is the number of tests in this table).

¶ Tie if no significant difference using Bonferroni threshold.
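The Bonferroni rule in the footnote above can be checked with a short calculation. The sketch below is illustrative only: the function names are hypothetical, and the two p-values are taken from the table above purely as examples.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold alpha / n_tests."""
    if n_tests < 1:
        raise ValueError("n_tests must be a positive integer")
    return alpha / n_tests


def is_significant(p_value: float, alpha: float = 0.01, n_tests: int = 48) -> bool:
    """True when p_value falls below the Bonferroni threshold."""
    return p_value < bonferroni_threshold(alpha, n_tests)


# Values from the footnote above: alpha = 0.01 and 48 tests.
threshold = bonferroni_threshold(0.01, 48)

# Two p-values reported in the table above, used only as examples.
marked = is_significant(0.00012)    # carries an asterisk in the table
unmarked = is_significant(0.00094)  # listed without an asterisk
```

With 48 tests the threshold is about 2.1e-4, which is why a p-value of 0.00094 is reported as a tie even though it is below 0.01.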

Lastly, we simulated traits with both low heritability and large environment effects determined by geography and subpopulation labels, so they are strongly correlated to the low-dimensional population structure. For that reason, PCs may be expected to perform better in this setting (in either PCA or LMM). However, we find that both PCA and LMM (even without PCs) increase their ${AUC}_{PR}$ values compared to the low-heritability evaluations (Figure 8—figure supplement 1; Figure 8 also shows representative numbers of PCs, which performed optimally or nearly so in individual simulations shown in Figure 3—figure supplement 4, Figure 3—figure supplement 5, Figure 4—figure supplement 4, Figure 4—figure supplement 5). p-Value calibration is comparable with or without environment effects, for LMM for all $r$ and for PCA once $r$ is large enough (Figure 8—figure supplement 1). These simulations are the only ones where we occasionally observed, for both metrics, a significant, though small, advantage of LMM with PCs versus LMM without PCs (Table 6). Additionally, on RC traits only, PCA significantly outperforms LMM in the three real human datasets (Table 6), the only cases in all of our evaluations where this is observed. For comparison, we also evaluate an ‘oracle’ LMM without PCs but with the finest group labels, the same used to simulate environment, as fixed categorical covariates (‘LMM lab.’), and see much larger ${AUC}_{PR}$ values than either LMM with PCs or PCA (Figure 8, Figure 3—figure supplement 4, Figure 3—figure supplement 5, Figure 4—figure supplement 4, Figure 4—figure supplement 5, Table 6). However, LMM with labels is often more poorly calibrated than LMM or PCA without labels, which may be because these numerous labels are inappropriately modeled as fixed rather than random effects. 
Overall, we find that association studies with correlated environment and genetic effects remain a challenge for PCA and LMM, that addition of PCs to an LMM improves performance only marginally, and that if the environment effect is driven by geography or ethnicity then use of those labels greatly improves performance compared to using PCs.

Figure 8 with 1 supplement

Evaluation in real datasets excluding 4th degree relatives, FES traits, environment.

Traits simulated with environment effects, otherwise the same as Figure 7. ‘LMM lab.’ includes as fixed effects true groups from which environment was simulated.

Table 6

Overview of PCA and LMM evaluations for environment simulations.

			LMM $r = 0$ vs best $r$			PCA vs LMM $r = 0$				LMM lab. $r = 0$ vs PCA/LMM
Dataset	Metric	Trait*	Cal.^†	$r$ ^‡	p-value ^§	$r$ ^‡	Cal.^†	p-value ^§	Best ^¶	Cal.^†	p-value ^§	Best ^¶
Admix. Large sim.	$| {SRMSD}_{p} |$	FES	True	0	1	83	True	0.38	Tie	True	1.8e-14*	PCA/LMM
Admix. Small sim.	$| {SRMSD}_{p} |$	FES	True	0	1	90	True	0.001	Tie	False	1.4e-14*	PCA/LMM
Admix. Family sim.	$| {SRMSD}_{p} |$	FES	True	4	0.18	90	False	3.9e-10*	LMM	True	0.066	LMM/LMM lab.
Human Origins	$| {SRMSD}_{p} |$	FES	True	9	3.9e-05*	90	False	1.4e-08*	LMM	False	3.9e-10*	LMM
HGDP	$| {SRMSD}_{p} |$	FES	True	0	1	90	True	0.0037	Tie	False	2.1e-09*	PCA/LMM
1000 Genomes	$| {SRMSD}_{p} |$	FES	False	8	8.8e-08*	85	True	0.053	Tie	True	3.9e-10*	LMM lab.
Admix. Large sim.	$| {SRMSD}_{p} |$	RC	True	0	1	60	True	0.033	Tie	True	6.3e-10*	PCA/LMM
Admix. Small sim.	$| {SRMSD}_{p} |$	RC	True	0	1	9	True	0.85	Tie	False	1.4e-14*	PCA/LMM
Admix. Family sim.	$| {SRMSD}_{p} |$	RC	True	5	0.14	90	False	3.9e-10*	LMM	True	0.011	LMM/LMM lab.
Human Origins	$| {SRMSD}_{p} |$	RC	False	9	1.1e-08*	90	True	2.3e-07*	PCA	False	3.9e-10*	PCA
HGDP	$| {SRMSD}_{p} |$	RC	True	0	1	89	True	6.5e-09*	PCA	False	3.9e-10*	PCA
1000 Genomes	$| {SRMSD}_{p} |$	RC	False	8	1.6e-08*	88	True	4.9e-09*	PCA	True	0.09	PCA/LMM lab.
Admix. Large sim.	${AUC}_{PR}$	FES		4	2.4e-06*	6		0.0021	Tie		1.8e-15*	LMM lab.
Admix. Small sim.	${AUC}_{PR}$	FES		3	0.055	4		0.033	Tie		0.28	Tie
Admix. Family sim.	${AUC}_{PR}$	FES		12	7e-04	63		3.9e-10*	LMM		3.9e-10*	LMM lab.
Human Origins	${AUC}_{PR}$	FES		20	3.7e-06*	90		1.4e-05*	LMM		3.9e-10*	LMM lab.
HGDP	${AUC}_{PR}$	FES		12	4.3e-06*	45		0.0044	Tie		3.9e-10*	LMM lab.
1000 Genomes	${AUC}_{PR}$	FES		9	1.9e-08*	55		0.028	Tie		3.9e-10*	LMM lab.
Admix. Large sim.	${AUC}_{PR}$	RC		4	0.00085	5		0.0018	Tie		5e-10*	LMM lab.
Admix. Small sim.	${AUC}_{PR}$	RC		2	0.13	5		0.093	Tie		0.0028	Tie
Admix. Family sim.	${AUC}_{PR}$	RC		9	0.01	86		1.7e-09*	LMM		3.9e-10*	LMM lab.
Human Origins	${AUC}_{PR}$	RC		22	0.0039	90		1e-06*	PCA		3.9e-10*	LMM lab.
HGDP	${AUC}_{PR}$	RC		19	0.0057	64		2.8e-05*	PCA		3e-07*	LMM lab.
1000 Genomes	${AUC}_{PR}$	RC		9	8.7e-05*	87		1.2e-09*	PCA		4.4e-10*	LMM lab.

* FES: Fixed Effect Sizes, RC: Random Coefficients.

† Calibrated: whether mean $| {SRMSD}_{p} | < 0.01$ over 50 replicates.

‡ Value of $r$ (number of PCs) with minimum mean $| {SRMSD}_{p} |$ or maximum mean ${AUC}_{PR}$ .

§ Wilcoxon paired 1-tailed test of distributions ( $| {SRMSD}_{p} |$ or ${AUC}_{PR}$ ) between models in header. Asterisk marks significant value using Bonferroni threshold ( $p < α / n_{tests}$ with $α = 0.01$ and $n_{tests} = 72$ is the number of tests in this table).

¶ Tie if no significant difference using Bonferroni threshold; in last column, pairwise ties are specified and “Tie” is three-way tie.

Discussion

Our evaluations conclusively determined that LMM without PCs performs better than PCA (for any number of PCs) across all scenarios without environment effects, including all real and simulated genotypes and two trait simulation models. Although the addition of a few PCs to LMM does not greatly hurt its performance (except for small sample sizes), they generally did not improve it either (Table 3, Table 5), which agrees with previous observations (Liu et al., 2011; Janss et al., 2012) but contradicts others (Zhao et al., 2007; Price et al., 2010). Our findings make sense since PCs are the eigenvectors of the same kinship matrix that parameterized random effects, so including both is redundant.

The presence of environment effects that are correlated to relatedness presents the only scenario where occasionally PCA and LMM with PCs outperform LMM without PCs (Table 6). It is commonly believed that PCs model such environment effects well (Novembre et al., 2008; Zhang and Pan, 2015; Lin et al., 2021). However, we observe that LMM without PCs models environment effects nearly as well as with PCs (Figure 8), consistent with previous findings (Vilhjálmsson and Nordborg, 2013; Wang et al., 2022) and with environment inflating heritability estimates using LMM (Heckerman et al., 2016). Moreover, modeling the true environment groups as fixed categorical effects always substantially improved ${AUC}_{PR}$ compared to modeling them with PCs (Figure 8, Table 6). Modeling numerous environment groups as fixed effects does result in deflated p-values (Figure 8, Table 6), which we expect would be avoided by modeling them as random effects, a strategy we chose not to pursue here as it is both a circular evaluation (the true effects were drawn from that model) and out of scope. Overall, including PCs to model environment effects yields limited power gains if at all, even in an LMM, and is no replacement for more adequate modeling of environment whenever possible.

Previous studies found that PCA was better calibrated than LMM for unusually differentiated markers (Price et al., 2010; Wu et al., 2011; Yang et al., 2014), which as simulated were an artificial scenario not based on a population genetics model, and are otherwise believed to be unusual (Sul and Eskin, 2013; Price et al., 2013). Our evaluations on real human data, which contain such loci in relevant proportions if they exist, do not replicate that result. Family relatedness strongly favors LMM, an advantage that probably outweighs this potential PCA benefit in real data.

Relative to LMM, the behavior of PCA fell between two extremes. When PCA performed well, there was a small number of PCs with both calibrated p-values and ${AUC}_{PR}$ near that of LMM without PCs. Conversely, PCA performed poorly when no number of PCs had either calibrated p-values or acceptably large ${AUC}_{PR}$ . There were no cases where high numbers of PCs optimized an acceptable ${AUC}_{PR}$ , or cases with miscalibrated p-values but high ${AUC}_{PR}$ . PCA performed well in the admixture simulations (without families, both trait models), real human genotypes with RC traits, and the subpopulation tree simulations (both trait models). Conversely, PCA performed poorly in the admixed family simulation (both trait models) and the real human genotypes with FES traits.

PCA assumes that genetic relatedness is restricted to a low-dimensional subspace, whereas LMM can handle high-dimensional relatedness. Thus, PCA performs well in the admixture simulation, which is explicitly low-dimensional (see Genotype simulation from the admixture model), and our subpopulation tree simulations, which are likely well approximated by a few dimensions despite the large number of subpopulations because there are few long branches. Conversely, PCA performs poorly under family structure because its kinship matrix is high-dimensional (Figure 6—figure supplement 1). However, estimating the latent space dimensions of real datasets is challenging because estimated eigenvalues have biased distributions (Hayashi et al., 2018). Kinship matrix rank estimated using the Tracy-Widom test (Patterson et al., 2006) did not fully predict the datasets that PCA performs well on. In contrast, estimated local kinship finds considerable cryptic family relatedness in all real human datasets and better explains why PCA performs poorly there. The trait model also influences the relative performance of PCA, so genotype-only parameters (eigenvalues or local kinship) alone cannot tell the full story. There are related tests for numbers of dimensions that consider the trait which we did not consider, including the Bayesian information criterion for the regression with PCs against the trait (Zhu and Yu, 2009). Additionally, PCA and LMM goodness of fit could be compared using the coefficient of determination generalized for LMMs (Sun et al., 2010).

PCA is at best underpowered relative to LMMs, and at worst miscalibrated regardless of the numbers of PCs included, in real human genotype tests. Among our simulations, such poor performance occurred only in the admixed family simulation. Local kinship estimates reveal considerable family relatedness in the real datasets that is absent in the corresponding subpopulation tree simulations. Admixture is also absent in our tree simulations, but our simulations and theory show that admixture is well handled by PCA. Hundreds of close relative pairs have been identified in 1000 Genomes (Gazal et al., 2015; Al Khudhair et al., 2015; Fedorova et al., 2016; Schlauch et al., 2017), but their removal does not improve PCA performance sufficiently in our tests, so the larger number of more distantly related pairs is PCA’s most serious obstacle in practice. Distant relatives are expected to be numerous in any large human dataset (Henn et al., 2012; Shchur and Nielsen, 2018; Loh et al., 2018). Our FES trait tests show that family relatedness is more challenging when rarer variants have larger coefficients. Overall, the high dimensionality of relatedness induced by family structure is the key challenge for PCA association in modern datasets, one that is readily overcome by LMM.

Our tests also found PCA robust to large numbers of PCs, far beyond the optimal choice, agreeing with previous anecdotal observations (Price et al., 2006; Kang et al., 2010), in contrast to using too few PCs for which there is a large performance penalty. The exception was the small sample size simulation, where only small numbers of PCs performed well. In contrast, LMM is simpler since there is no need to choose the number of PCs. However, an LMM with a large number of covariates may have conservative p-values, as observed for LMM with large numbers of PCs, which is a weakness of the score test used by the LMM we evaluated that may be overcome with other statistical tests. Simulations or post hoc evaluations remain crucial for ensuring that statistics are calibrated.

There are several variants of the PCA and LMM analyses, most designed for better modeling linkage disequilibrium (LD), that we did not evaluate directly, in which PCs are no longer exactly the top eigenvectors of the kinship matrix (if estimated with different approaches), although this is not a crucial aspect of our arguments. We do not consider the case where samples are projected onto PCs estimated from an external sample (Privé et al., 2020), which is uncommon in association studies, and whose primary effect is shrinkage, so if all samples are projected then they are all equally affected and larger regression coefficients compensate for the shrinkage, although this will no longer be the case if only a portion of the sample is projected onto the PCs of the rest of the sample. Another approach tests PCs for association against every locus in the genome in order to identify and exclude PCs that capture LD structure (which is localized) instead of ancestry (which should be present across the genome; Privé et al., 2020); a previous proposal removes LD using an autocorrelation model prior to estimating PCs (Patterson et al., 2006). These improved PCs remain inadequate models of family relatedness, so an LMM will continue to outperform them in that setting. Similarly, the leave-one-chromosome-out (LOCO) approach for estimating kinship matrices for LMMs prevents the test locus and loci in LD with it from being modeled by the random effect as well, which is called ‘proximal contamination’ (Lippert et al., 2011; Yang et al., 2014). While LOCO kinship estimates vary for each chromosome, they continue to model family relatedness, thus maintaining their key advantage over PCA. The LDAK model estimates kinship instead by weighing loci taking LD into account (Speed et al., 2012). LD effects must be adjusted for, if present, so in unfiltered data we advise the previous methods be applied. 
However, in this work, simulated genotypes do not have LD, and the real datasets were filtered to remove LD, so here there is no proximal contamination and LD confounding is minimized if present at all, so these evaluations may be considered the ideal situation where LD effects have been adjusted successfully, and in this setting LMM outperforms PCA. Overall, these alternative PCs or kinship matrices differ from their basic counterparts by either the extent to which LD influences the estimates (which may be a confounder in a small portion of the genome, by definition) or by sampling noise, neither of which are expected to change our key conclusion.

One limitation of this work is its relatively small sample sizes compared to modern association studies. However, our conclusions are not expected to change with larger sample sizes, as cryptic family relatedness will continue to be abundant in such data, if not increase in abundance, and thus give LMMs an advantage over PCA (Henn et al., 2012; Shchur and Nielsen, 2018; Loh et al., 2018). One reason PCA has been favored over classic LMMs is that PCA’s runtime scales much better with increasing sample size. However, recent approaches not tested in this work have made LMMs more scalable and applicable to biobank-scale data (Loh et al., 2015; Zhou et al., 2018; Mbatchou et al., 2021), so one clear next step is carefully evaluating these approaches in simulations with larger sample sizes. A different benefit of including PCs was recently reported for BOLT-LMM: it does not result in greater power but rather in reduced runtime, a property that may be specific to its use of scalable algorithms such as conjugate gradient and variational Bayes (Loh et al., 2018). Many of these newer LMMs also no longer follow the infinitesimal model of the basic LMM (Loh et al., 2015; Mbatchou et al., 2021), and employ novel approximations, which are features not evaluated in this work and worthy of future study.

Another limitation of this work is ignoring rare variants, a necessity given our smaller sample sizes, where rare variant association is miscalibrated and underpowered. Using simulations mimicking the UK Biobank, recent work has found that rare variants can have a more pronounced structure than common variants, and that modeling this rare variant structure (with either PCA or LMM) may better model environment confounding, reduce inflation in association studies, and ameliorate stratification in polygenic risk scores (Zaidi and Mathieson, 2020). Better modeling rare variants and their structure is a key next step in association studies.

The largest limitation of our work is that we only considered quantitative traits. Previous evaluations involving case-control traits tended to report PCA-LMM ties or mixed results, an observation potentially confounded by the use of low-dimensional simulations without family relatedness (Table 1). An additional concern is case-control ascertainment bias and imbalance, which appears to affect LMMs more severely, although recent work appears to solve this problem (Yang et al., 2014; Zhou et al., 2018). Future evaluations should aim to include our simulations and real datasets, to ensure that previous results were not biased in favor of PCA by not simulating family structure or larger coefficients for rare variants that are expected for diseases by various selection models.

Overall, our results lead us to recommend LMM over PCA for association studies in general. Although PCA offers flexibility and speed compared to LMM, additional work is required to ensure that PCA is adequate, including removal of close relatives (lowering sample size and wasting resources) followed by simulations or other evaluations of statistics, and even then PCA may perform poorly in terms of both type I error control and power. The large numbers of distant relatives expected of any real dataset all but ensure that PCA will perform poorly compared to LMM (Henn et al., 2012; Shchur and Nielsen, 2018; Loh et al., 2018). Our findings also suggest that related applications such as polygenic models may enjoy gains in power and accuracy by employing an LMM instead of PCA to model relatedness (Rakitsch et al., 2013; Qian et al., 2020). PCA remains indispensable across population genetics, from visualizing population structure and performing quality control to its deep connection to admixture models, but the time has come to limit its use in association testing in favor of LMM or other, richer models capable of modeling all forms of relatedness.

Materials and methods

The complex trait model and PCA and LMM approximations

Let $x_{i j} \in \{0, 1, 2\}$ be the genotype at the biallelic locus $i$ for individual $j$ , which counts the number of reference alleles. Suppose there are $n$ individuals and $m$ loci, $X = (x_{i j})$ is their $m \times n$ genotype matrix, and $y$ is the length- $n$ column vector of individual trait values. The additive linear model for a quantitative (continuous) trait is:

$$y = 1 α + X^{'} β + Z^{'} η + ϵ, \tag{1}$$

where 1 is a length- $n$ vector of ones, $α$ is the scalar intercept coefficient, $β$ is the length- $m$ vector of locus coefficients, $Z$ is a design matrix of environment effects and other covariates, $η$ is the vector of environment coefficients, $ϵ$ is a length- $n$ vector of residuals, and the superscript prime symbol ( $'$ ) denotes matrix transposition. The residuals follow $ϵ_{j} \sim Normal (0, σ_{ϵ}^{2})$ independently per individual $j$ , for some $σ_{ϵ}^{2}$ .
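To make the matrix orientation in the additive model concrete, here is a minimal sketch that evaluates it for a toy genotype matrix. It is an illustration under stated assumptions, not the simulation procedure used in this work: the environment term $Z^{'} η$ is omitted, and the function name and all numeric values are hypothetical.

```python
import random


def simulate_trait(X, beta, alpha=0.0, sigma_eps=1.0, rng=None):
    """Evaluate y = 1*alpha + X' beta + eps, with X stored as m loci by n individuals.

    Assumption for this sketch: no environment term (Z' eta is dropped).
    """
    rng = rng or random.Random(0)
    m = len(X)
    n = len(X[0])
    y = []
    for j in range(n):
        # Column j of X holds the genotypes of individual j across all m loci.
        genetic = sum(X[i][j] * beta[i] for i in range(m))
        eps = rng.gauss(0.0, sigma_eps)  # independent residual per individual
        y.append(alpha + genetic + eps)
    return y


# Toy data: m = 2 loci, n = 3 individuals, genotypes in {0, 1, 2}.
X = [
    [0, 1, 2],
    [2, 2, 0],
]
beta = [0.5, -1.0]

# With zero residual variance the trait is exactly the intercept plus genetic effect.
y_noiseless = simulate_trait(X, beta, alpha=1.0, sigma_eps=0.0)
y_noisy = simulate_trait(X, beta, alpha=1.0, sigma_eps=1.0, rng=random.Random(42))
```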

The full model of Equation 1, which has a coefficient for each of the $m$ loci, is underdetermined in current datasets where $m ≫ n$ . The PCA and LMM models, respectively, approximate the full model fit at a single locus $i$ :

$$\text{PCA:} \quad y = 1 α + x_{i} β_{i} + U_{r} γ_{r} + Z^{'} η + ϵ, \tag{2}$$

$$\text{LMM:} \quad y = 1 α + x_{i} β_{i} + s + Z^{'} η + ϵ, \quad s \sim \mathrm{Normal} (0, 2 σ_{s}^{2} Φ^{T}), \tag{3}$$

where $x_{i}$ is the length- $n$ vector of genotypes at locus $i$ only, $β_{i}$ is the locus coefficient, $U_{r}$ is an $n \times r$ matrix of PCs, $γ_{r}$ is the length- $r$ vector of PC coefficients, $s$ is a length- $n$ vector of random effects, $Φ^{T} = (φ_{j k}^{T})$ is the $n \times n$ kinship matrix conditioned on the ancestral population $T$ , and $σ_{s}^{2}$ is a variance factor. Both models condition the regression of the focal locus $i$ on an approximation of the total polygenic effect $X^{'} β$ with the same covariance structure, which is parameterized by the kinship matrix. Under the kinship model, genotypes are random variables obeying

$$\mathrm{E} [x_{i} | T] = 2 p_{i}^{T} 1, \quad \mathrm{Cov} (x_{i} | T) = 4 p_{i}^{T} (1 - p_{i}^{T}) Φ^{T}, \tag{4}$$

where $p_{i}^{T}$ is the ancestral allele frequency of locus $i$ (Malécot, 1948; Wright, 1949; Jacquard, 1970; Astle and Balding, 2009). Assuming independent loci, the covariance of the polygenic effect is

$$\mathrm{Cov} (X^{'} β) = 2 σ_{s}^{2} Φ^{T}, \quad σ_{s}^{2} = \sum_{i = 1}^{m} 2 p_{i}^{T} (1 - p_{i}^{T}) β_{i}^{2},$$

which is readily modeled by the LMM random effect $s$ , where the difference in mean is absorbed by the intercept. Alternatively, consider the eigendecomposition of the kinship matrix $Φ^{T} = U Λ U^{'}$ where $U$ is the $n \times n$ eigenvector matrix and $Λ$ is the $n \times n$ diagonal matrix of eigenvalues. The random effect can be written as

$$s = U γ_{LMM}, \quad γ_{LMM} \sim \mathrm{Normal} (0, 2 σ_{s}^{2} Λ),$$

which follows from the affine transformation property of multivariate normal distributions. Therefore, the PCA term $U_{r} γ_{r}$ can be derived from the above equation under the additional assumption that the kinship matrix has approximate rank $r$ and the coefficients $γ_{r}$ are fit without constraints. In contrast, the LMM uses all eigenvectors, while effectively shrinking their coefficients $γ_{LMM}$ as all random effects models do, although these parameters are marginalized (Astle and Balding, 2009; Janss et al., 2012; Hoffman and Dubé, 2013; Zhang and Pan, 2015). PCA has more parameters than LMM, so it may overfit more: ignoring the shared terms in Equation 2 and Equation 3, PCA fits $r$ parameters (length of $γ$ ), whereas LMMs fit only one ( $σ_{s}^{2}$ ).
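As a brief check of the two covariance statements above, the following derivation uses only the definitions already given. It is a sketch that treats the coefficients $β$ as fixed and the loci as independent conditional on $T$, and it writes $X^{'} β = \sum_{i} x_{i} β_{i}$.

```latex
\begin{aligned}
\mathrm{Cov}(X' \beta \mid T)
  &= \sum_{i=1}^{m} \beta_i^2 \, \mathrm{Cov}(x_i \mid T)
   = \sum_{i=1}^{m} \beta_i^2 \, 4 p_i^T (1 - p_i^T) \, \Phi^T
   = 2 \sigma_s^2 \Phi^T, \\
\mathrm{Cov}(U \gamma_{LMM})
  &= U \, \mathrm{Cov}(\gamma_{LMM}) \, U'
   = 2 \sigma_s^2 \, U \Lambda U'
   = 2 \sigma_s^2 \Phi^T .
\end{aligned}
```

The first line recovers the stated definition of $σ_{s}^{2}$ , and the second confirms that the eigenvector form of the random effect has the same covariance as $s$ in the LMM.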

In practice, the kinship matrix used for PCA and LMM is estimated with variations of a method-of-moments formula applied to standardized genotypes $X_{S}$ , which is derived from Equation 4:

$$X_{S} = \left( \frac{x_{i j} - 2 {\hat{p}}_{i}^{T}}{\sqrt{4 {\hat{p}}_{i}^{T} (1 - {\hat{p}}_{i}^{T})}} \right), \quad {\hat{Φ}}^{T} = \frac{1}{m} X_{S}^{'} X_{S}, \tag{5}$$

where the unknown $p_{i}^{T}$ is estimated by ${\hat{p}}_{i}^{T} = \frac{1}{2 n} \sum_{j = 1}^{n} x_{i j}$ (Price et al., 2006; Kang et al., 2008; Kang et al., 2010; Yang et al., 2011; Zhou and Stephens, 2012; Yang et al., 2014; Loh et al., 2015; Sul et al., 2018; Zhou et al., 2018). However, this kinship estimator has a complex bias that differs for every individual pair, which arises due to the use of this estimated ${\hat{p}}_{i}^{T}$ (Ochoa and Storey, 2021; Ochoa and Storey, 2019). Nevertheless, in PCA and LMM these biased estimates perform as well as unbiased ones (Hou et al., 2023b).
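The estimator above can be written out directly for a small genotype matrix. The sketch below is a plain illustration of the formula, not the plink2 or GCTA implementation. The function name is hypothetical, and skipping loci whose estimated frequency is 0 or 1 (where the standardization is undefined) is an assumption made here.

```python
import math


def kinship_estimate(X):
    """Method-of-moments kinship estimate from an m loci by n individuals matrix."""
    n = len(X[0])
    X_std = []
    for row in X:
        p_hat = sum(row) / (2 * n)  # sample allele frequency at this locus
        if p_hat <= 0.0 or p_hat >= 1.0:
            continue  # assumption: drop fixed loci, which cannot be standardized
        scale = math.sqrt(4 * p_hat * (1 - p_hat))
        X_std.append([(x - 2 * p_hat) / scale for x in row])
    m = len(X_std)
    if m == 0:
        raise ValueError("no polymorphic loci to estimate kinship from")
    # (1 / m) * X_S' X_S, an n x n matrix
    return [
        [sum(r[j] * r[k] for r in X_std) / m for k in range(n)]
        for j in range(n)
    ]


# Toy genotypes: 4 loci by 4 individuals; the last locus is fixed and is skipped.
X = [
    [0, 1, 2, 1],
    [2, 2, 0, 1],
    [1, 0, 0, 2],
    [0, 0, 0, 0],
]
phi_hat = kinship_estimate(X)
row_sums = [sum(row) for row in phi_hat]
```

Because each locus is centered by its own sample mean, every row of this estimate sums to zero (up to rounding). True kinship values are non-negative with positive self kinship, so this is one simple way to see that the estimates cannot all be unbiased.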

We selected fast and robust software implementing the basic PCA and LMM models. PCA association was performed with plink2 (Chang et al., 2015). The quantitative trait association model is a linear regression with covariates, evaluated using the t-test. PCs were calculated with plink2, which equal the top eigenvectors of Equation 5 after removing loci with minor allele frequency $MAF < 0.1$ .

LMM association was performed using GCTA (Yang et al., 2011; Yang et al., 2014). Its kinship estimator equals Equation 5. PCs were calculated using GCTA from its kinship estimate. Association significance is evaluated with a score test. In the small simulation only, GCTA with large numbers of PCs had convergence and singularity errors in some replicates, which were treated as missing data.

Simulations

Every simulation was replicated 50 times, drawing anew all genotypes (except for real datasets) and traits. Below we use the notation $f_{A}^{B}$ for the inbreeding coefficient of a subpopulation $A$ from another subpopulation $B$ ancestral to $A$ . In the special case of the total inbreeding of $A$ , $f_{A}^{T}$ , $T$ is an overall ancestral population, which is ancestral to every individual under consideration, such as the most recent common ancestor (MRCA) population.

Genotype simulation from the admixture model

The basic admixture model is as described previously (Ochoa and Storey, 2021) and is implemented in the R package bnpsd. Both Large and Family simulations have $n = 1000$ individuals, while Small has $n = 100$ . The number of loci is $m = 100{,}000$ . Individuals are admixed from $K = 10$ intermediate subpopulations, or ancestries. Each subpopulation $S_{u}$ ( $u \in \{1, \dots, K\}$ ) is at coordinate $u$ and has an inbreeding coefficient $f_{S_{u}}^{T} = u τ$ for some $τ$ . Ancestry proportions $q_{j u}$ for individual $j$ and $S_{u}$ arise from a random walk with spread $σ$ on the 1D geography, and $τ$ and $σ$ are fit to give $F_{ST} = 0.1$ and mean kinship ${\bar{θ}}^{T} = 0.5 F_{ST}$ for the admixed individuals (Ochoa and Storey, 2021). Random ancestral allele frequencies $p_{i}^{T}$ , subpopulation allele frequencies $p_{i}^{S_{u}}$ , individual-specific allele frequencies $π_{i j}$ , and genotypes $x_{i j}$ are drawn from this hierarchical model:

\begin{aligned} p_{i}^{T} & \sim Uniform (0.01, 0.5), \\ p_{i}^{S_{u}} | p_{i}^{T} & \sim Beta (p_{i}^{T} (\frac{1}{f_{S_{u}}^{T}} - 1), (1 - p_{i}^{T}) (\frac{1}{f_{S_{u}}^{T}} - 1)), \\ π_{i j} & = \sum_{u = 1}^{K} q_{j u} p_{i}^{S_{u}}, \\ x_{i j} | π_{i j} & \sim Binomial (2, π_{i j}), \end{aligned}

where this Beta is the Balding-Nichols distribution (Balding and Nichols, 1995) with mean $p_{i}^{T}$ and variance $p_{i}^{T} (1 - p_{i}^{T}) f_{S_{u}}^{T}$ . Fixed loci ( $i$ where $x_{i j} = 0$ for all $j$ , or $x_{i j} = 2$ for all $j$ ) are drawn again from the model, starting from $p_{i}^{T}$ , iterating until no loci are fixed. Each replicate draws new genotypes starting from $p_{i}^{T}$ .
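The hierarchical draw for a single locus, including the redraw of fixed loci, can be sketched in Python as follows. This is an illustration only and not the bnpsd implementation: the function name `draw_locus`, the toy ancestry proportions `q_toy` (specified by hand rather than generated by a random walk), and the toy inbreeding coefficients `f_sub` are all hypothetical.

```python
import random

def draw_locus(q, f, rng):
    """Draw genotypes at one locus for individuals with ancestry proportions q.

    q: list of per-individual lists of K ancestry proportions (each sums to 1).
    f: list of K subpopulation inbreeding coefficients, each strictly in (0, 1).
    """
    while True:
        p_anc = rng.uniform(0.01, 0.5)  # ancestral allele frequency
        # Balding-Nichols (Beta) draw for each subpopulation
        p_sub = [
            rng.betavariate(p_anc * (1 / fu - 1), (1 - p_anc) * (1 / fu - 1))
            for fu in f
        ]
        genotypes = []
        for qj in q:
            # individual-specific allele frequency
            pi = sum(qju * pu for qju, pu in zip(qj, p_sub))
            # Binomial(2, pi) as the sum of two Bernoulli draws
            genotypes.append(int(rng.random() < pi) + int(rng.random() < pi))
        # accept only loci that are not fixed; otherwise redraw from p_anc
        if 0 < sum(genotypes) < 2 * len(genotypes):
            return genotypes

rng = random.Random(1)
f_sub = [0.05, 0.10, 0.15]  # toy values of the form u * tau with tau = 0.05
q_toy = [
    [1.0, 0.0, 0.0],
    [0.5, 0.5, 0.0],
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],
]
genotypes = [draw_locus(q_toy, f_sub, rng) for _ in range(20)]
```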

As a brief aside, we prove that using global ancestry proportions as covariates is equivalent in expectation to using PCs under the admixture model. Note that the latent space of $X$ , which is the subspace to which the data is constrained by the admixture model, is given by $(π_{i j})$ , which has $K$ dimensions (number of columns of $Q = (q_{j u})$ ), so the top $K$ PCs span this space. Since associations include an intercept term ( $1 α$ in Equation 2), estimated PCs are orthogonal to 1 (note ${\hat{Φ}}^{T} 1 = 0$ because $X_{S} 1 = 0$ ), and each row of $Q$ sums to one, only $K - 1$ PCs plus the intercept are needed to span the latent space of this admixture model.

Genotype simulation from random admixed families

We simulated a pedigree with admixed founders, no close relative pairings, assortative mating based on a 1D geography (to preserve admixture structure), random family sizes, and arbitrary numbers of generations (20 here). This simulation is implemented in the R package simfam. Generations are drawn iteratively. Generation 1 has $n = 1000$ individuals from the above admixture simulation ordered by their 1D geography. Local kinship measures pedigree relatedness; in the first generation, everybody is locally unrelated and outbred. Individuals are randomly assigned sex. In the next generation, individuals are paired iteratively, removing random males from the pool of available males and pairing them with the nearest available female with local kinship $< 1 / 4^{3}$ (males stay unpaired if there are no matches), until there are no more available males or females. Let $n = 1000$ be the desired population size, $n_{m} = 1$ the minimum number of children per family, and $n_{f}$ the number of families (paired parents) in the current generation; then the number of additional children (beyond the minimum) is drawn from $Poisson (n / n_{f} - n_{m})$ . Let $δ$ be the difference between desired and current population sizes. If $δ > 0$ , then $δ$ random families are incremented by 1. If $δ < 0$ , then $| δ |$ random families with at least $n_{m} + 1$ children are decremented by 1. If $| δ |$ exceeds the number of families, all families are incremented or decremented as needed and the process is iterated. Children are assigned sex randomly, and are reordered by the average coordinate of their parents. Children draw alleles from their parents independently per locus. A new random pedigree is drawn for each replicate, as well as new founder genotypes from the admixture model.
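The family-size step described above can be sketched in Python as follows. This is a simplified illustration and not the simfam implementation: the function names are hypothetical, the Poisson sampler is Knuth's textbook algorithm (the Python standard library has none), and the sketch assumes $n \geq n_{f} n_{m}$ so that the Poisson mean is non-negative.

```python
import math
import random

def poisson(lam, rng):
    """Knuth's Poisson sampler; adequate for the small means used here."""
    limit = math.exp(-lam)
    k = 0
    prod = rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def draw_family_sizes(n, n_families, n_min, rng):
    """Draw children per family so that the sizes sum exactly to n."""
    sizes = [n_min + poisson(n / n_families - n_min, rng) for _ in range(n_families)]
    delta = n - sum(sizes)
    while delta != 0:
        if delta > 0:
            # too few children: increment random families by 1
            chosen = rng.sample(range(n_families), min(delta, n_families))
            for i in chosen:
                sizes[i] += 1
        else:
            # too many children: decrement random families above the minimum
            eligible = [i for i, s in enumerate(sizes) if s >= n_min + 1]
            chosen = rng.sample(eligible, min(-delta, len(eligible)))
            for i in chosen:
                sizes[i] -= 1
        delta = n - sum(sizes)
    return sizes

# Hypothetical example: 400 families, desired population size 1000
sizes = draw_family_sizes(1000, 400, 1, random.Random(0))
```

The loop repeats when $| δ |$ exceeds the number of available families, so the total always reaches the desired size while every family keeps at least the minimum number of children.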

Genotype simulation from a subpopulation tree model

This model draws subpopulation allele frequencies from a hierarchical model parameterized by a tree, which is also implemented in bnpsd and relies on the R package ape for general tree data structures and methods (Paradis and Schliep, 2019). The ancestral population $T$ is the root, and each node is a subpopulation $S_{w}$ indexed arbitrarily. Each edge between $S_{w}$ and its parent population $P_{w}$ has an inbreeding coefficient $f_{S_{w}}^{P_{w}}$ . The ancestral allele frequencies $p_{i}^{T}$ are drawn from a given distribution, which is constructed to mimic each real dataset in Appendix 1. Given the allele frequencies $p_{i}^{P_{w}}$ of the parent population, $S_{w}$ ’s allele frequencies are drawn from:

p_{i}^{S_{w}} | p_{i}^{P_{w}} \sim Beta (p_{i}^{P_{w}} (\frac{1}{f_{S_{w}}^{P_{w}}} - 1), (1 - p_{i}^{P_{w}}) (\frac{1}{f_{S_{w}}^{P_{w}}} - 1)) .

Individuals $j$ in $S_{w}$ draw genotypes from its allele frequency: $x_{i j} | p_{i}^{S_{w}} \sim Binomial (2, p_{i}^{S_{w}})$ . Loci with $MAF < 0.01$ are drawn again starting from the $p_{i}^{T}$ distribution, iterating until no such loci remain.

Fitting subpopulation tree to real data

We developed new methods to fit trees to real data based on unbiased kinship estimates from popkin, implemented in bnpsd. A tree with given inbreeding coefficients $f_{S_{w}}^{P_{w}}$ for its edges (between subpopulation $S_{w}$ and its parent $P_{w}$ ) gives rise to a coancestry matrix $ϑ_{u v}^{T}$ for a subpopulation pair ( $S_{u}, S_{v}$ ), and the goal is to recover these edge inbreeding coefficients from coancestry estimates. Coancestry values are total inbreeding coefficients of the MRCA population of each subpopulation pair. Therefore, we calculate $f_{S_{w}}^{T}$ for every $S_{w}$ recursively from the root as follows. Nodes with parent $P_{w} = T$ are already as desired. Given $f_{P_{w}}^{T}$ , the desired $f_{S_{w}}^{T}$ is calculated via the ‘additive edge’ $δ_{w}$ (Ochoa and Storey, 2021):

f_{S_{w}}^{T} = f_{P_{w}}^{T} + δ_{w}, δ_{w} = f_{S_{w}}^{P_{w}} (1 - f_{P_{w}}^{T}) .

These $δ_{w} \geq 0$ because $0 \leq f_{S_{w}}^{P_{w}}, f_{P_{w}}^{T} \leq 1$ for every $w$ . Edge inbreeding coefficients can be recovered from additive edges: $f_{S_{w}}^{P_{w}} = δ_{w} / (1 - f_{P_{w}}^{T})$ . Overall, coancestry values are sums of $δ_{w}$ over common ancestor nodes,

ϑ_{u v}^{T} = \sum_{w} δ_{w} I_{w} (u, v),

where the sum includes all $w$ , and $I_{w} (u, v)$ equals 1 if $S_{w}$ is a common ancestor of $S_{u}, S_{v}$ , 0 otherwise. Note that $I_{w} (u, v)$ reflects tree topology and $δ_{w}$ edge values.
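The relationship between edge inbreeding coefficients, additive edges, and coancestry can be checked on a small example. The Python sketch below uses a hypothetical four-node tree with made-up edge values, and it assumes that each node counts as its own ancestor, so the self-coancestry of a subpopulation equals its total inbreeding coefficient.

```python
# Hypothetical tree: T is the root; S2 and S3 are children of S1; S4 is a child of T
parent = {"S1": "T", "S2": "S1", "S3": "S1", "S4": "T"}
f_edge = {"S1": 0.1, "S2": 0.2, "S3": 0.3, "S4": 0.05}  # edge inbreeding coefficients

def total_inbreeding(w):
    """Total inbreeding of node w relative to T, computed recursively from the root."""
    if w == "T":
        return 0.0
    f_parent = total_inbreeding(parent[w])
    return f_parent + f_edge[w] * (1 - f_parent)

def additive_edge(w):
    """Additive edge delta_w = f_edge[w] * (1 - total inbreeding of the parent)."""
    return f_edge[w] * (1 - total_inbreeding(parent[w]))

def ancestors(w):
    """Node w together with all of its ancestors, excluding the root T."""
    out = set()
    while w != "T":
        out.add(w)
        w = parent[w]
    return out

def coancestry(u, v):
    """Sum of additive edges over the common ancestors of u and v."""
    return sum(additive_edge(w) for w in ancestors(u) & ancestors(v))

theta_23 = coancestry("S2", "S3")  # only common ancestor is S1
theta_22 = coancestry("S2", "S2")  # equals total inbreeding of S2
theta_24 = coancestry("S2", "S4")  # no common ancestor below the root
```

In this example the coancestry of S2 and S3 equals the total inbreeding of their most recent common ancestor S1, and the edge coefficient of S2 is recovered as `additive_edge("S2") / (1 - total_inbreeding("S1"))`.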

To estimate population-level coancestry, first kinship ( ${\hat{φ}}_{j k}^{T}$ ) is estimated using popkin (Ochoa and Storey, 2021). Individual coancestry ( ${\hat{θ}}_{j k}^{T}$ ) is estimated from kinship using

{\hat{θ}}_{j k}^{T} = \begin{cases} {\hat{φ}}_{j k}^{T} & \text{if } k \neq j, \\ {\hat{f}}_{j}^{T} = 2 {\hat{φ}}_{j j}^{T} - 1 & \text{if } k = j . \end{cases}

Lastly, coancestry values ${\hat{ϑ}}_{u v}^{T}$ between subpopulations are averages of individual coancestry values:

{\hat{ϑ}}_{u v}^{T} = \frac{1}{| S_{u} | | S_{v} |} \sum_{j \in S_{u}} \sum_{k \in S_{v}} {\hat{θ}}_{j k}^{T} .
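The two steps above (kinship to individual coancestry, then averaging within subpopulation pairs) can be sketched in Python. The kinship matrix and labels below are made-up toy values rather than popkin output, and the function names are hypothetical.

```python
def individual_coancestry(kinship):
    """Replace each diagonal kinship value by the inbreeding coefficient 2 * phi_jj - 1."""
    n = len(kinship)
    return [
        [2 * kinship[j][j] - 1 if j == k else kinship[j][k] for k in range(n)]
        for j in range(n)
    ]

def subpop_coancestry(theta, labels):
    """Average individual coancestry over all pairs (j in S_u, k in S_v)."""
    groups = sorted(set(labels))
    members = {g: [j for j, lab in enumerate(labels) if lab == g] for g in groups}
    return {
        (u, v): sum(theta[j][k] for j in members[u] for k in members[v])
        / (len(members[u]) * len(members[v]))
        for u in groups
        for v in groups
    }

# Toy kinship matrix for two individuals in subpopulation A and two in B
kinship_toy = [
    [0.55, 0.10, 0.02, 0.02],
    [0.10, 0.55, 0.02, 0.02],
    [0.02, 0.02, 0.60, 0.20],
    [0.02, 0.02, 0.20, 0.60],
]
labels_toy = ["A", "A", "B", "B"]
theta_toy = individual_coancestry(kinship_toy)
coanc_toy = subpop_coancestry(theta_toy, labels_toy)
```

Note that the within-subpopulation averages include the diagonal (self) terms, as in the double sum above.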

Topology is estimated with hierarchical clustering using the weighted pair group method with arithmetic mean (Sokal and Michener, 1958), with distance function $d (S_{u}, S_{v}) = \max \{ {\hat{ϑ}}_{u v}^{T} \} - {\hat{ϑ}}_{u v}^{T}$ , which succeeds due to the monotonic relationship between node depth and coancestry (Equation 7). This algorithm recovers the true topology from the true coancestry values, and performs well for estimates from genotypes.

To estimate tree edge lengths, first $δ_{w}$ are estimated from ${\hat{ϑ}}_{u v}^{T}$ and the topology using Equation 7 and non-negative least squares linear regression (Lawson and Hanson, 1974) (implemented in nnls; Mullen, 2012) to yield non-negative $δ_{w}$ , and $f_{S_{w}}^{P_{w}}$ are calculated from $δ_{w}$ by inverting the additive edge relation, $f_{S_{w}}^{P_{w}} = δ_{w} / (1 - f_{P_{w}}^{T})$ . To account for small biases in coancestry estimation, an intercept term $δ_{0}$ is included ( $I_{0} (u, v) = 1$ for all $u, v$ ), and when converting $δ_{w}$ to $f_{S_{w}}^{P_{w}}$ , $δ_{0}$ is treated as an additional edge to the root, but is ignored when drawing allele frequencies from the tree.

Trait simulation

Traits are simulated from the quantitative trait model of Equation 1, with novel bias corrections for simulating the desired heritability from real data relying on the unbiased kinship estimator popkin (Ochoa and Storey, 2021). This simulation is implemented in the R package simtrait. All simulations have a fixed narrow-sense heritability of $h^{2}$ , a variance proportion due to environment effects $σ_{η}^{2}$ , and residuals are drawn from $ϵ_{j} \sim Normal (0, σ_{ϵ}^{2})$ with $σ_{ϵ}^{2} = 1 - h^{2} - σ_{η}^{2}$ . The number of causal loci $m_{1}$ , which determines the average coefficient size, is chosen with the heuristic formula $m_{1} = round (n h^{2} / 8)$ , which empirically balances power well with varying $n$ and $h^{2}$ . The set of causal loci $C$ is drawn anew for each replicate, from loci with $MAF \geq 0.01$ to avoid rare causal variants, which are not discoverable by PCA or LMM at the sample sizes we considered. Letting $v_{i}^{T} = p_{i}^{T} (1 - p_{i}^{T})$ , the effect size of locus $i$ equals $2 v_{i}^{T} β_{i}^{2}$ , its contribution to the trait variance (Park et al., 2010). Under the fixed effect sizes (FES) model, initial causal coefficients are

β_{i} = \frac{1}{\sqrt{2 v_{i}^{T}}}

for known $p_{i}^{T}$ ; otherwise $v_{i}^{T}$ is replaced by the unbiased estimator (Ochoa and Storey, 2021) ${\hat{v}}_{i}^{T} = {\hat{p}}_{i}^{T} (1 - {\hat{p}}_{i}^{T}) / (1 - {\bar{φ}}^{T}),$ where ${\bar{φ}}^{T}$ is the mean kinship estimated with popkin. Each causal locus is multiplied by –1 with probability 0.5. Alternatively, under the random coefficients (RC) model, initial causal coefficients are drawn independently from $β_{i} \sim Normal (0, 1)$ . For both models, the initial genetic variance is $σ_{0}^{2} = \sum_{i \in C} 2 v_{i}^{T} β_{i}^{2},$ replacing $v_{i}^{T}$ with ${\hat{v}}_{i}^{T}$ for unknown $p_{i}^{T}$ (so $σ_{0}^{2}$ is an unbiased estimate), so we multiply every initial $β_{i}$ by $\frac{h}{σ_{0}}$ to have the desired heritability. Lastly, for known $p_{i}^{T}$ , the intercept coefficient is $α = - \sum_{i \in C} 2 p_{i}^{T} β_{i} .$ When $p_{i}^{T}$ are unknown, ${\hat{p}}_{i}^{T}$ should not replace $p_{i}^{T}$ since that distorts the trait covariance (for the same reason the standard kinship estimator in Equation 5 is biased), which is avoided with

α = - \frac{2}{m_{1}} (\sum_{i \in C} {\hat{p}}_{i}^{T}) (\sum_{i \in C} β_{i}) .
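For the case of known ancestral allele frequencies, the FES construction and the heritability rescaling can be sketched in Python as follows. This is an illustration only and not the simtrait implementation; the function name `fes_coefficients` and the toy allele frequencies are hypothetical.

```python
import math
import random

def fes_coefficients(p_anc, h2, rng):
    """Fixed effect sizes (FES) coefficients for known ancestral allele frequencies."""
    v = [p * (1 - p) for p in p_anc]
    # initial coefficients 1 / sqrt(2 v), each with a random sign
    beta = [rng.choice((-1, 1)) / math.sqrt(2 * vi) for vi in v]
    # initial genetic variance, then rescale by h / sigma_0
    sigma0_sq = sum(2 * vi * b * b for vi, b in zip(v, beta))
    scale = math.sqrt(h2 / sigma0_sq)
    beta = [b * scale for b in beta]
    # intercept that centers the genetic effect
    alpha = -sum(2 * p * b for p, b in zip(p_anc, beta))
    return beta, alpha

rng = random.Random(0)
n, h2 = 1000, 0.8
m1 = round(n * h2 / 8)  # heuristic number of causal loci
p_causal = [rng.uniform(0.01, 0.5) for _ in range(m1)]  # toy ancestral frequencies
beta, alpha = fes_coefficients(p_causal, h2, rng)
genetic_variance = sum(2 * p * (1 - p) * b * b for p, b in zip(p_causal, beta))
```

After rescaling, the genetic variance equals the desired heritability, and every causal locus contributes the same share $h^{2} / m_{1}$ , which is what defines the FES model.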

Simulations optionally included multiple environment group effects, similarly to previous models (Zhang and Pan, 2015; Wang et al., 2022), as follows. Each independent environment $i$ has predefined groups, and each group $g$ has random coefficients drawn independently from $η_{g i} \sim Normal (0, σ_{η i}^{2})$ where $σ_{η i}^{2}$ is a specified variance proportion for environment $i$ . $Z$ has individuals along columns and environment-groups along rows, and it contains indicator variables: 1 if the individual belongs to the environment-group, 0 otherwise.

We performed trait simulations with the following variance parameters (Table 7): high heritability used $h^{2} = 0.8$ and no environment effects; low heritability used $h^{2} = 0.3$ and no environment effects; lastly, environment used $h^{2} = 0.3, σ_{η 1}^{2} = 0.3, σ_{η 2}^{2} = 0.2$ (total $σ_{η}^{2} = σ_{η 1}^{2} + σ_{η 2}^{2} = 0.5$ ). For real genotype datasets, the groups are the continental (environment 1) and fine-grained (environment 2) subpopulation labels given (see next subsection). For simulated genotypes, we created these labels by grouping by the index $j$ (geographical coordinate) of each simulated individual, assigning group $g = \lceil j k_{i} / n \rceil$ where $k_{i}$ is the number of groups in environment $i$ , and we selected $k_{1} = 5$ and $k_{2} = 25$ to mimic the number of groups in each level of 1000 Genomes (Table 2).
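The group assignment for simulated genotypes can be written as a short Python sketch. It assumes that individuals are indexed from 1 to $n$ , and the function name `environment_group` is hypothetical.

```python
from collections import Counter

def environment_group(j, k, n):
    """Group label ceiling(j * k / n) for individual j in 1..n, using integer arithmetic."""
    return -(-j * k // n)

n = 1000
groups_env1 = [environment_group(j, 5, n) for j in range(1, n + 1)]   # k_1 = 5
groups_env2 = [environment_group(j, 25, n) for j in range(1, n + 1)]  # k_2 = 25
counts_env1 = Counter(groups_env1)
counts_env2 = Counter(groups_env2)
```

With $n = 1000$ this yields 5 groups of 200 consecutive individuals and 25 groups of 40, and each fine-grained group is nested within one coarse group.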

Table 7

Variance parameters of trait simulations.

Trait variance type	$h^{2}$	$σ_{η}^{2}$	$σ_{ϵ}^{2}$
High heritability	0.8	0.0	0.2
Low heritability	0.3	0.0	0.7
Environment	0.3	0.5	0.2

Real human genotype datasets

The three datasets were processed as before (Ochoa and Storey, 2019; summarized below), except with an additional filter so loci are in approximate linkage equilibrium and rare variants are removed. All processing was performed with plink2 (Chang et al., 2015), and analysis was uniquely enabled by the R packages BEDMatrix (Grueneberg and de Los Campos, 2019) and genio. Each dataset groups individuals in a two-level hierarchy: continental and fine-grained subpopulations. Final dataset sizes are in Table 2.

We obtained the full (including non-public) Human Origins by contacting the authors and agreeing to their usage restrictions. The Pacific data (Skoglund et al., 2016) was obtained separately from the rest (Lazaridis et al., 2014; Lazaridis et al., 2016), and datasets were merged using the intersection of loci. We removed ancient individuals, and individuals from singleton and non-native subpopulations. Non-autosomal loci were removed. Our analysis of both the whole-genome sequencing (WGS) version of HGDP (Bergström et al., 2020) and the high-coverage NYGC version of 1000 Genomes (Fairley et al., 2020) was restricted to autosomal biallelic SNP loci with filter “PASS”.

Since our evaluations assume uncorrelated loci, we filtered each real dataset with plink2 using parameters “--indep-pairwise 1000kb 0.3”, which iteratively removes loci that have a greater than 0.3 squared correlation coefficient with another locus that is within 1000 kb, until no such loci remain. Since all real datasets have numerous rare variants, while PCA and LMM are not able to detect associations involving rare variants, we removed all loci with $MAF < 0.01$ . Lastly, only HGDP had loci with over 10% missingness removed, as they were otherwise 17% of remaining loci (for Human Origins and 1000 Genomes they were under 1% of loci so they were not removed). Kinship matrix rank and eigenvalues were calculated from popkin kinship estimates. Eigenvalues were assigned p-values with twstats of the Eigensoft package (Patterson et al., 2006), and kinship matrix rank was estimated as the largest number of consecutive eigenvalues from the start that all satisfy $p < 0.01$ (p-values did not increase monotonically). For the evaluation with close relatives removed, each dataset was filtered with plink2 with option “--king-cutoff” with cutoff 0.02209709 ( $= 2^{- 11 / 2}$ ) for removing up to 4th degree relatives using KING-robust (Manichaikul et al., 2010), and the $MAF < 0.01$ filter was reapplied (Table 4).
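Two of the quantities above are easy to reproduce in Python. The sketch below counts the leading run of significant eigenvalue p-values and evaluates the kinship cutoff quoted above. The function names and example p-values are hypothetical, and the general per-degree cutoff formula $2^{- (2 d + 3) / 2}$ is an assumption extrapolated from the 4th degree value given in the text.

```python
def kinship_rank(p_values, threshold=0.01):
    """Number of consecutive eigenvalue p-values, from the start, below the threshold."""
    rank = 0
    for p in p_values:
        if p >= threshold:
            break
        rank += 1
    return rank

def king_cutoff(degree):
    """Assumed kinship cutoff for removing relatives up to the given degree."""
    return 2 ** (-(2 * degree + 3) / 2)

# Hypothetical p-values ordered from the largest eigenvalue down;
# the last one is significant again but is not counted
example_p = [1e-10, 1e-5, 0.003, 0.2, 0.001]
rank = kinship_rank(example_p)
cutoff_4th = king_cutoff(4)  # 2 ** (-11 / 2)
```

Stopping at the first non-significant eigenvalue matters because, as noted above, the p-values do not increase monotonically.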

Evaluation of performance

All approaches are evaluated using two complementary metrics: ${SRMSD}_{p}$ quantifies p-value uniformity, and ${AUC}_{PR}$ measures causal locus classification performance and reflects power while ranking miscalibrated models fairly. These measures are more robust alternatives to previous measures from the literature (Appendix 2), and are implemented in simtrait.

P-values for continuous test statistics have a uniform distribution when the null hypothesis holds, a crucial assumption for type I error and FDR control (Storey, 2003; Storey and Tibshirani, 2003). We use the Signed Root Mean Square Deviation ( ${SRMSD}_{p}$ ) to measure the difference between the observed null p-value quantiles and the expected uniform quantiles:

{SRMSD}_{p} = sgn (u_{median} - p_{median}) \sqrt{\frac{1}{m_{0}} \sum_{i = 1}^{m_{0}} {(u_{i} - p_{(i)})}^{2}},

where $m_{0} = m - m_{1}$ is the number of null (non-causal) loci, here $i$ indexes null loci only, $p_{(i)}$ is the $i$ th ordered null p-value, $u_{i} = (i - 0.5) / m_{0}$ is its expectation, $p_{median}$ is the median observed null p-value, $u_{median} = \frac{1}{2}$ is its expectation, and sgn is the sign function (1 if $u_{median} \geq p_{median}$ , –1 otherwise). Thus, ${SRMSD}_{p} = 0$ corresponds to calibrated p-values, ${SRMSD}_{p} > 0$ indicates anti-conservative p-values, and ${SRMSD}_{p} < 0$ indicates conservative p-values. The maximum ${SRMSD}_{p}$ is achieved when all p-values are zero (the limit of anti-conservative p-values), which for infinite loci approaches

{SRMSD}_{p} \to \sqrt{\int_{0}^{1} u^{2} d u} = \frac{1}{\sqrt{3}} \approx 0.577.

The same value with a negative sign occurs for all p-values of 1.
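The definition above translates directly into a few lines of Python. The sketch below is an illustration, not the simtrait implementation, and the function name `srmsd_p` is hypothetical.

```python
import math
import statistics

def srmsd_p(null_p_values):
    """Signed root mean square deviation between null p-values and uniform quantiles."""
    p_sorted = sorted(null_p_values)
    m0 = len(p_sorted)
    u = [(i - 0.5) / m0 for i in range(1, m0 + 1)]  # expected uniform quantiles
    rmsd = math.sqrt(sum((ui - pi) ** 2 for ui, pi in zip(u, p_sorted)) / m0)
    # positive when the observed median is at or below the expected median of 1/2
    sign = 1 if 0.5 >= statistics.median(p_sorted) else -1
    return sign * rmsd

m0 = 1000
calibrated = srmsd_p([(i - 0.5) / m0 for i in range(1, m0 + 1)])  # exactly uniform quantiles
most_inflated = srmsd_p([0.0] * m0)   # all p-values equal to zero
most_deflated = srmsd_p([1.0] * m0)   # all p-values equal to one
```

The two extreme cases recover the limits $\pm 1 / \sqrt{3}$ to within the finite-sample correction.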

Precision and recall are standard performance measures for binary classifiers that do not require calibrated p-values (Grau et al., 2015). Given the total numbers of true positives (TP), false positives (FP) and false negatives (FN) at some threshold or parameter $t$ , precision and recall are

\begin{aligned} Precision (t) & = \frac{TP (t)}{TP (t) + FP (t)}, \\ Recall (t) & = \frac{TP (t)}{TP (t) + FN (t)} . \end{aligned}

Precision and Recall trace a curve as $t$ is varied, and the area under this curve is ${AUC}_{PR}$ . We use the R package PRROC to integrate the correct non-linear piecewise function when interpolating between points. A model obtains the maximum ${AUC}_{PR} = 1$ if there is a $t$ that classifies all loci perfectly. In contrast, the worst models, which classify at random, have an expected precision ( $= {AUC}_{PR}$ ) equal to the overall proportion of causal loci: $m_{1} / m$ .
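To make the construction of the curve concrete, the Python sketch below ranks loci by p-value, records precision and recall after each locus is declared significant, and sums the area with a simple step rule (average precision). This step rule is a swapped-in simplification and is not the non-linear interpolation that PRROC performs, so values can differ slightly from the ones reported here. The function names and toy inputs are hypothetical, and ties between p-values are not treated specially.

```python
def precision_recall_points(p_values, is_causal):
    """(recall, precision) after each locus, in order of increasing p-value."""
    order = sorted(range(len(p_values)), key=lambda i: p_values[i])
    m1 = sum(is_causal)
    tp = fp = 0
    points = []
    for i in order:
        if is_causal[i]:
            tp += 1
        else:
            fp += 1
        points.append((tp / m1, tp / (tp + fp)))
    return points

def average_precision(points):
    """Step-rule area under the precision-recall curve."""
    area = 0.0
    prev_recall = 0.0
    for recall, precision in points:
        area += (recall - prev_recall) * precision
        prev_recall = recall
    return area

p_toy = [0.001, 0.002, 0.5, 0.9]
# causal loci have the smallest p-values: perfect classification
auc_perfect = average_precision(precision_recall_points(p_toy, [True, True, False, False]))
# causal loci have the largest p-values: poor classification
auc_poor = average_precision(precision_recall_points(p_toy, [False, False, True, True]))
```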

Data and code availability

The data and code generated during this study are available on GitHub at https://github.com/OchoaLab/pca-assoc-paper (copy archived at Ochoa, 2023). The public subset of Human Origins is available on the Reich Lab website at https://reich.hms.harvard.edu/datasets; non-public samples have to be requested from David Reich. The WGS version of HGDP was downloaded from the Wellcome Sanger Institute FTP site at ftp://ngs.sanger.ac.uk/production/hgdp/hgdp_wgs.20190516/. The high-coverage version of the 1000 Genomes Project was downloaded from ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/20190425_NYGC_GATK/.

Web resources

Appendix 1

Fitting ancestral allele frequency distribution to real data

We calculated ${\hat{p}}_{i}^{T}$ distributions of each real dataset. However, population structure increases the variance of these sample ${\hat{p}}_{i}^{T}$ relative to the true $p_{i}^{T}$ (Ochoa and Storey, 2021). We present a new algorithm for constructing a new distribution based on the input data but with the lower variance of the true ancestral distribution. Suppose the $p_{i}^{T}$ distribution over loci $i$ satisfies $E [p_{i}^{T}] = \frac{1}{2}$ and $Var (p_{i}^{T}) = V^{T}$ . The sample allele frequency ${\hat{p}}_{i}^{T}$ , conditioned on $p_{i}^{T}$ , satisfies

E [{\hat{p}}_{i}^{T} | p_{i}^{T}] = p_{i}^{T}, Var ({\hat{p}}_{i}^{T} | p_{i}^{T}) = p_{i}^{T} (1 - p_{i}^{T}) {\bar{φ}}^{T},

where ${\bar{φ}}^{T} = \frac{1}{n^{2}} \sum_{j = 1}^{n} \sum_{k = 1}^{n} φ_{j k}^{T}$ is the mean kinship over all individuals (Ochoa and Storey, 2021). The unconditional moments of ${\hat{p}}_{i}^{T}$ follow from the laws of total expectation and variance: $E [{\hat{p}}_{i}^{T}] = \frac{1}{2}$ and

W^{T} = Var ({\hat{p}}_{i}^{T}) = {\bar{φ}}^{T} \frac{1}{4} + (1 - {\bar{φ}}^{T}) V^{T} .

Since $V^{T} \leq \frac{1}{4}$ and ${\bar{φ}}^{T} \geq 0$ , then $W^{T} \geq V^{T}$ . Thus, the goal is to construct a new distribution with the original, lower variance of

V^{T} = \frac{W^{T} - \frac{1}{4} {\bar{φ}}^{T}}{1 - {\bar{φ}}^{T}} .

We use the unbiased estimator ${\hat{W}}^{T} = \frac{1}{m} \sum_{i = 1}^{m} {({\hat{p}}_{i}^{T} - \frac{1}{2})}^{2}$ , while ${\bar{φ}}^{T}$ is calculated from the tree parameters: the subpopulation coancestry matrix (Equation 7) is expanded from subpopulations to individuals, its diagonal is converted from inbreeding to kinship ( $φ_{j j}^{T} = (1 + f_{j}^{T}) / 2$ ), and the matrix is averaged. However, since our model ignores the MAF filters imposed in our simulations, ${\bar{φ}}^{T}$ was adjusted. For Human Origins the true model ${\bar{φ}}^{T}$ of 0.143 was used. For 1000 Genomes and HGDP the true ${\bar{φ}}^{T}$ are 0.126 and 0.124, respectively, but 0.4 for both produced a better fit.

Lastly, we construct new allele frequencies,

p^{*} = w {\hat{p}}_{i}^{T} + (1 - w) q,

by a weighted average of ${\hat{p}}_{i}^{T}$ and $q \in (0, 1)$ drawn independently from a different distribution. $E [q] = \frac{1}{2}$ is required to have $E [p^{*}] = \frac{1}{2}$ . The resulting variance is

Var (p^{*}) = w^{2} W^{T} + (1 - w)^{2} Var (q),

which we equate to the desired $V^{T}$ (Equation 9) and solve for $w$ . For simplicity, we also set $Var (q) = V^{T}$ , which is achieved with:

q \sim Beta (\frac{1}{2} (\frac{1}{4 V^{T}} - 1), \frac{1}{2} (\frac{1}{4 V^{T}} - 1)) .

Although $w = 0$ yields $Var (p^{*}) = V^{T}$ , that solution discards the data, so we use the second root of the quadratic equation, which retains ${\hat{p}}_{i}^{T}$ :

w = \frac{2 V^{T}}{W^{T} + V^{T}} .
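The algebra of this appendix can be verified numerically. In the Python sketch below the sample variance `W` is a made-up value, the mean kinship is the Human Origins value quoted above, and the function names are hypothetical. It checks that the non-zero root for $w$ gives the target variance, and that the symmetric Beta distribution for $q$ has variance $V^{T}$ .

```python
def ancestral_variance(W, mean_kinship):
    """Target variance V from the sample variance W and the mean kinship."""
    return (W - mean_kinship / 4) / (1 - mean_kinship)

def mixing_weight(W, V):
    """Non-zero root of w^2 W + (1 - w)^2 V = V."""
    return 2 * V / (W + V)

def symmetric_beta_shape(V):
    """Shape a of Beta(a, a), which has mean 1/2 and variance 1 / (4 (2 a + 1))."""
    return 0.5 * (1 / (4 * V) - 1)

W = 0.05               # hypothetical sample variance of allele frequencies
mean_kinship = 0.143   # value quoted above for Human Origins
V = ancestral_variance(W, mean_kinship)
w = mixing_weight(W, V)
var_mix = w ** 2 * W + (1 - w) ** 2 * V  # variance of the weighted average
a = symmetric_beta_shape(V)
var_q = 1 / (4 * (2 * a + 1))            # variance of q
```

Since $W^{T} \geq V^{T}$ , the weight satisfies $0 < w \leq 1$ , so the new frequencies are a proper convex combination of the data and the Beta draws.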

Appendix 2

Comparisons between ${SRMSD}_{p}$ , ${AUC}_{PR}$ , and evaluation measures from the literature

2.1 The inflation factor $λ$

Test statistic inflation has been used to measure model calibration (Astle and Balding, 2009; Price et al., 2010). The inflation factor $λ$ is defined as the median $χ^{2}$ association statistic divided by the theoretical median under the null hypothesis (Devlin and Roeder, 1999). To compare p-values from non- $χ^{2}$ tests (such as t-statistics), $λ$ can be calculated from p-values using

λ = \frac{F^{- 1} (1 - p_{median})}{F^{- 1} (1 - u_{median})},

where $p_{median}$ is the median observed p-value (including causal loci), $u_{median} = \frac{1}{2}$ is its null expectation, and $F$ is the $χ^{2}$ cumulative distribution function ( $F^{- 1}$ is the quantile function).
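A minimal Python sketch of this calculation follows. It assumes a $χ^{2}$ distribution with 1 degree of freedom, which the text does not state explicitly, so that the quantile function can be obtained from the standard normal quantile. The function names are hypothetical, and a median p-value of exactly 0 or 1 is not handled.

```python
from statistics import NormalDist, median

def chi2_1df_quantile(q):
    """Quantile function of the chi-squared distribution with 1 degree of freedom."""
    # if Z is standard normal, Z^2 is chi-squared with 1 df, so F(x) = 2 * Phi(sqrt(x)) - 1
    z = NormalDist().inv_cdf((1 + q) / 2)
    return z * z

def inflation_factor(p_values):
    """Inflation factor computed from the median p-value."""
    p_median = median(p_values)
    return chi2_1df_quantile(1 - p_median) / chi2_1df_quantile(0.5)

null_median = chi2_1df_quantile(0.5)                  # theoretical null median
lambda_calibrated = inflation_factor([0.1, 0.5, 0.9])  # median 0.5
lambda_inflated = inflation_factor([0.05, 0.3, 0.6])   # median 0.3
lambda_deflated = inflation_factor([0.2, 0.7, 0.95])   # median 0.7
```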

To compare $λ$ and ${SRMSD}_{p}$ directly, for simplicity assume that all p-values are null. In this case, calibrated p-values give $λ = 1$ and ${SRMSD}_{p} = 0$ . However, non-uniform p-values with the expected median, such as from genomic control (Devlin and Roeder, 1999), result in $λ = 1$ , but ${SRMSD}_{p} \neq 0$ except for uniform p-values, a key flaw of $λ$ that ${SRMSD}_{p}$ overcomes. Inflated statistics (anti-conservative p-values) give $λ > 1$ and ${SRMSD}_{p} > 0$ . Deflated statistics (conservative p-values) give $λ < 1$ and ${SRMSD}_{p} < 0$ . Thus, $λ \neq 1$ always implies ${SRMSD}_{p} \neq 0$ (where $λ - 1$ and ${SRMSD}_{p}$ have the same sign), but not the other way around. Overall, $λ$ depends only on the median p-value, while ${SRMSD}_{p}$ uses the complete distribution. However, ${SRMSD}_{p}$ requires knowing which loci are null, so unlike $λ$ it is only applicable to simulated traits.

2.2 Empirical comparison of ${SRMSD}_{p}$ and $λ$

There is a near one-to-one correspondence between $λ$ and ${SRMSD}_{p}$ in our data (Figure 2—figure supplement 1). PCA tended to be inflated ( $λ > 1$ and ${SRMSD}_{p} > 0$ ) whereas LMM tended to be deflated ( $λ < 1$ and ${SRMSD}_{p} < 0$ ), otherwise the data for both models fall on the same contiguous curve. We fit a sigmoidal function to this data,

{SRMSD}_{p} (λ) = a \frac{λ^{b} - 1}{λ^{b} + 1},

which for $a, b > 0$ satisfies ${SRMSD}_{p} (λ = 1) = 0$ and reflects $\log (λ)$ about zero ( $λ = 1$ ):

{SRMSD}_{p} (\log (λ) = - x) = - {SRMSD}_{p} (\log (λ) = x) .

We fit this model to $λ > 1$ only since it was less noisy and of greater interest, and obtained the curve shown in Figure 2—figure supplement 1 with $a = 0.564$ and $b = 0.619$ . The value $λ = 1.05$ , a common threshold for benign inflation (Price et al., 2010), corresponds to ${SRMSD}_{p} = 0.0085$ according to Equation 10. Conversely, ${SRMSD}_{p} = 0.01$ , serving as a simpler rule of thumb, corresponds to $λ = 1.06$ .
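The two conversions quoted above can be reproduced from the fitted parameters with a short Python sketch. The function names are hypothetical, and the inverse is obtained by solving the sigmoid for $λ$ .

```python
A_FIT, B_FIT = 0.564, 0.619  # fitted parameters reported above

def srmsd_from_lambda(lam, a=A_FIT, b=B_FIT):
    """Fitted sigmoid mapping the inflation factor to SRMSD_p."""
    y = lam ** b
    return a * (y - 1) / (y + 1)

def lambda_from_srmsd(s, a=A_FIT, b=B_FIT):
    """Inverse of the fitted sigmoid, valid for |s| < a."""
    r = s / a
    return ((1 + r) / (1 - r)) ** (1 / b)

srmsd_at_105 = srmsd_from_lambda(1.05)
lambda_at_001 = lambda_from_srmsd(0.01)
```

Because the curve saturates at $\pm a$ , the fitted maximum of 0.564 is close to the theoretical limit $1 / \sqrt{3} \approx 0.577$ of ${SRMSD}_{p}$ .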

2.3 Type I error rate

The type I error rate is the proportion of null p-values with $p \leq t$ . Calibrated p-values have type I error rate near $t$ , which may be evaluated with a binomial test. This measure may give different results for different $t$ , for example be significantly miscalibrated only for large $t$ (due to lack of power for smaller $t$ ), and it requires large simulations to estimate well as it depends on the tail of the distribution. In contrast, ${SRMSD}_{p}$ uses the entire distribution so it is easier to estimate, ${SRMSD}_{p} = 0$ guarantees calibrated type I error rates at all $t$ , while large $| {SRMSD}_{p} |$ indicates incorrect type I errors for a range of $t$ . Empirically, we find the expected agreement and monotonic relationship between ${SRMSD}_{p}$ and type I error rate (Figure 2—figure supplement 2).

2.4 Statistical power and comparison to ${AUC}_{PR}$

Power is the probability that a test is declared significant when the alternative hypothesis $H_{1}$ holds. At a p-value threshold $t$ , power equals

F (t) = Pr (p < t | H_{1}) .

$F (t)$ is a cumulative function, so it is monotonically increasing and has an inverse. Like type I error control, power may rank models differently depending on $t$ , and it is also harder to estimate than ${AUC}_{PR}$ because power depends on the tail of the distribution.

Power is not meaningful when p-values are not calibrated. To establish a clear connection to ${AUC}_{PR}$ , assume calibrated (uniform) null p-values: $Pr (p < t | H_{0}) = t$ . TPs, FPs, and FNs at $t$ are

\begin{aligned} TP (t) & = m π_{1} F (t), \\ FP (t) & = m π_{0} t, \\ FN (t) & = m π_{1} (1 - F (t)), \end{aligned}

where $π_{0} = \Pr (H_{0})$ is the proportion of null cases and $π_{1} = 1 - π_{0}$ of alternative cases. Therefore,

\begin{aligned} Precision (t) & = \frac{π_{1} F (t)}{π_{1} F (t) + π_{0} t}, \\ Recall (t) & = F (t) . \end{aligned}

Noting that $t = F^{- 1} (Recall)$ , precision can be written as a function of recall, the power function, and constants:

Precision (Recall) = \frac{π_{1} Recall}{π_{1} Recall + π_{0} F^{- 1} (Recall)} .

This last form leads most clearly to ${AUC}_{PR} = \int_{0}^{1} Precision (Recall) d Recall$ .

Lastly, consider a simple yet common case in which model $A$ is uniformly more powerful than model $B : F_{A} (t) > F_{B} (t)$ for every $t$ . Therefore $F_{A}^{- 1} (R e c a l l) < F_{B}^{- 1} (R e c a l l)$ for every recall value. This ensures that the precision of $A$ is greater than that of $B$ at every recall value, so ${AUC}_{PR}$ is greater for $A$ than $B$ . Thus, ${AUC}_{PR}$ ranks calibrated models according to power.
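The ranking argument can be illustrated numerically. The Python sketch below assumes hypothetical power functions of the form $F (t) = t^{c}$ with $0 < c \leq 1$ (smaller $c$ is uniformly more powerful, and $c = 1$ has no power beyond chance), a made-up proportion of alternative cases, and a midpoint rule for the integral. None of these choices come from the analyses in this paper.

```python
def precision_at_recall(recall, pi1, inv_power):
    """Precision as a function of recall for calibrated null p-values."""
    pi0 = 1 - pi1
    return pi1 * recall / (pi1 * recall + pi0 * inv_power(recall))

def auc_pr(pi1, inv_power, steps=10000):
    """Midpoint-rule integral of precision over recall in (0, 1)."""
    width = 1 / steps
    return width * sum(
        precision_at_recall((k + 0.5) * width, pi1, inv_power) for k in range(steps)
    )

pi1 = 0.01  # hypothetical proportion of alternative cases
# inverse power functions: if F(t) = t^c then F^{-1}(r) = r^(1/c)
auc_a = auc_pr(pi1, lambda r: r ** (1 / 0.1))  # model A: F(t) = t^0.1
auc_b = auc_pr(pi1, lambda r: r ** (1 / 0.5))  # model B: F(t) = t^0.5
auc_chance = auc_pr(pi1, lambda r: r)          # F(t) = t: classification at random
```

Model A has the larger area, and the chance-level model has an area equal to the proportion of alternative cases, matching the baseline for random classifiers.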

Empirically, we find the predicted positive correlation between ${AUC}_{PR}$ and calibrated power (Figure 2—figure supplement 3). The correlation is clear when considered separately per dataset, but the slope varies per dataset, which is expected because the proportion of alternative cases $π_{1}$ varies per dataset.

Data availability

The current manuscript is a computational study, so no data have been generated for this manuscript. Code is available at https://github.com/OchoaLab/pca-assoc-paper (copy archived at Ochoa, 2023).

The following previously published data sets were used

Fairley S (2020) International Genome Sample Resource. ID NYGC_GATK/. 1000 Genomes Project, high-coverage version. ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/20190425_NYGC_GATK/

Bergstrom A (2020) Wellcome Sanger Institute. ID wgs.20190516/. Human Genome Diversity Panel, whole-genome sequencing version. ftp://ngs.sanger.ac.uk/production/hgdp/hgdp_wgs.20190516/

Lazaridis I (2016) David Reich Lab. ID datasets. Human Origins. https://reich.hms.harvard.edu/datasets

References

(2010) A map of human genome variation from population-scale sequencing. Nature 467:1061–1073. https://doi.org/10.1038/nature09534

(2012) An integrated map of genetic variation from 1,092 human genomes. Nature 491:56–65. https://doi.org/10.1038/nature11632

Abraham G, Inouye M (2014) Fast principal component analysis of large-scale genome-wide data. PLOS ONE 9:e93766. https://doi.org/10.1371/journal.pone.0093766

Abraham G, Qiu Y, Inouye M (2017) FlashPCA2: principal component analysis of Biobank-scale genotype datasets. Bioinformatics 33:2776–2778. https://doi.org/10.1093/bioinformatics/btx299

Agrawal A, Chiu AM, Le M, Halperin E, Sankararaman S (2020) Scalable probabilistic PCA for large-scale genetic variation data. PLOS Genetics 16:e1008773. https://doi.org/10.1371/journal.pgen.1008773

(2009) Fast model-based estimation of ancestry in unrelated individuals. Genome Research 19:1655–1664. https://doi.org/10.1101/gr.094052.109

Al Khudhair A, Qiu S, Wyse M, Chowdhury S, Cheng X, Bekbolsynov D, Saha-Mandal A, Dutta R, Fedorova L, Fedorov A (2015) Inference of distant genetic relations in humans using "1000 Genomes". Genome Biology and Evolution 7:481–492. https://doi.org/10.1093/gbe/evv003

Astle W, Balding DJ (2009) Population structure and cryptic relatedness in genetic association studies. Statistical Science 24:451–471. https://doi.org/10.1214/09-STS307

(2007) Genomewide rapid association using mixed model and regression: a fast and simple method for genomewide pedigree-based quantitative trait loci association analysis. Genetics 177:577–585. https://doi.org/10.1534/genetics.107.075614

Balding DJ, Nichols RA (1995) A method for quantifying differentiation between populations at multi-allelic loci and its implications for investigating identity and paternity. Genetica 96:3–12. https://doi.org/10.1007/BF01441146

Bergström A, McCarthy SA, Hui R, Almarri MA, Ayub Q, Danecek P, Chen Y, Felkel S, Hallast P, Kamm J, Blanché H, Deleuze JF, Cann H, Mallick S, Reich D, Sandhu MS, Skoglund P, Scally A, Xue Y, Durbin R, Tyler-Smith C (2020) Insights into human genetic variation and population history from 929 diverse genomes. Science 367:eaay5012. https://doi.org/10.1126/science.aay5012

(2011) Accounting for population stratification in practice: a comparison of the main strategies dedicated to genome-wide association studies. PLOS ONE 6:e28845. https://doi.org/10.1371/journal.pone.0028845

Cabreros I, Storey JD (2019) A likelihood-free estimator of population structure bridging admixture models and principal components analysis. Genetics 212:1009–1029. https://doi.org/10.1534/genetics.119.302159

Cann HM, de Toma C, Cazes L, Legrand MF, Morel V, Piouffre L, Bodmer J, Bodmer WF, Bonne-Tamir B, Cambon-Thomsen A, Chen Z, Chu J, Carcassi C, Contu L, Du R, Excoffier L, Ferrara GB, Friedlaender JS, Groot H, Gurwitz D, Jenkins T, Herrera RJ, Huang X, Kidd J, Kidd KK, Langaney A, Lin AA, Mehdi SQ, Parham P, Piazza A, Pistillo MP, Qian Y, Shu Q, Xu J, Zhu S, Weber JL, Greely HT, Feldman MW, Thomas G, Dausset J, Cavalli-Sforza LL (2002) A human genome diversity cell line panel. Science 296:261–262. https://doi.org/10.1126/science.296.5566.261b

Chang CC, Chow CC, Tellier LC, Vattikuti S, Purcell SM, Lee JJ (2015) Second-generation PLINK: rising to the challenge of larger and richer datasets. GigaScience 4:7. https://doi.org/10.1186/s13742-015-0047-8

(2022) Inferring population structure in biobank-scale genomic data. American Journal of Human Genetics 109:727–737. https://doi.org/10.1016/j.ajhg.2022.02.015

1. Conomos MP
2. Laurie CA
3. Stilp AM
4. Gogarten SM
5. McHugh CP
6. Nelson SC
7. Sofer T
8. Fernández-Rhodes L
9. Justice AE
10. Graff M
11. Young KL
12. Seyerle AA
13. Avery CL
14. Taylor KD
15. Rotter JI
16. Talavera GA
17. Daviglus ML
18. Wassertheil-Smoller S
19. Schneiderman N
20. Heiss G
21. Kaplan RC
22. Franceschini N
23. Reiner AP
24. Shaffer JR
25. Barr RG
26. Kerr KF
27. Browning SR
28. Browning BL
29. Weir BS
30. Avilés-Santa ML
31. Papanicolaou GJ
32. Lumley T
33. Szpiro AA
34. North KE
35. Rice K
36. Thornton TA
37. Laurie CC
(2016a) Genetic diversity and Association studies in US Hispanic/Latino populations: Applications in the Hispanic community health study/study of Latinos
The American Journal of Human Genetics 98:165–184.

https://doi.org/10.1016/j.ajhg.2015.12.001
- PubMed
- Google Scholar
(2016b) Model-free estimation of recent genetic relatedness
The American Journal of Human Genetics 98:127–148.

https://doi.org/10.1016/j.ajhg.2015.11.022
- PubMed
- Google Scholar
1. Coram MA
2. Duan Q
3. Hoffmann TJ
4. Thornton T
5. Knowles JW
6. Johnson NA
7. Ochs-Balcom HM
8. Donlon TA
9. Martin LW
10. Eaton CB
11. Robinson JG
12. Risch NJ
13. Zhu X
14. Kooperberg C
15. Li Y
16. Reiner AP
17. Tang H
(2013) Genome-wide characterization of shared and distinct genetic components that influence blood lipid levels in ethnically diverse human populations
American Journal of Human Genetics 92:904–916.

https://doi.org/10.1016/j.ajhg.2013.04.025
- PubMed
- Google Scholar
1. Devlin B
2. Roeder K
(1999) Genomic control for Association studies
Biometrics 55:997–1004.

https://doi.org/10.1111/j.0006-341x.1999.00997.x
- PubMed
- Google Scholar
(2020) The International genome sample resource (IGSR) collection of open human Genomic variation resources
Nucleic Acids Research 48:D941–D947.

https://doi.org/10.1093/nar/gkz836
- PubMed
- Google Scholar
1. Fedorova L
2. Qiu S
3. Dutta R
4. Fedorov A
(2016) Atlas of cryptic genetic relatedness among 1000 human Genomes
Genome Biology and Evolution 8:777–790.

https://doi.org/10.1093/gbe/evw034
- PubMed
- Google Scholar
1. Galinsky KJ
2. Bhatia G
3. Loh PR
4. Georgiev S
5. Mukherjee S
6. Patterson NJ
7. Price AL
(2016) Fast principal-component analysis reveals CONVERGENT evolution of Adh1B in Europe and East Asia
American Journal of Human Genetics 98:456–472.

https://doi.org/10.1016/j.ajhg.2015.12.022
- PubMed
- Google Scholar
(2015) High level of inbreeding in final phase of 1000 Genomes project
Scientific Reports 5:17453.

https://doi.org/10.1038/srep17453
- PubMed
- Google Scholar
1. Gopalan P
2. Hao W
3. Blei DM
4. Storey JD
(2016) Scaling probabilistic models of genetic variation to millions of humans
Nature Genetics 48:1587–1590.

https://doi.org/10.1038/ng.3710
- PubMed
- Google Scholar
(2015) PRROC: Computing and Visualizing precision-recall and receiver operating characteristic curves in R
Bioinformatics 31:2595–2597.

https://doi.org/10.1093/bioinformatics/btv153
- PubMed
- Google Scholar
1. Grueneberg A
2. de Los Campos G
(2019) Bgdata - A suite of R packages for Genomic analysis with big data
G3: Genes, Genomes, Genetics 9:1377–1383.

https://doi.org/10.1534/g3.119.400018
- PubMed
- Google Scholar
Book
(2018) On the bias in Eigenvalues of sample covariance matrix
In: Wiberg M, Culpepper S, Janssen R, González J, Molenaar D, editors. Quantitative Psychology Springer Proceedings in Mathematics & Statistics. Cham: Springer International Publishing. pp. 221–233.

https://doi.org/10.1007/978-3-319-77249-3_19
- Google Scholar
1. Heckerman D
2. Gurdasani D
3. Kadie C
4. Pomilla C
5. Carstensen T
6. Martin H
7. Ekoru K
8. Nsubuga RN
9. Ssenyomo G
10. Kamali A
11. Kaleebu P
12. Widmer C
13. Sandhu MS
(2016) Linear mixed model for Heritability estimation that explicitly addresses environmental variation
PNAS 113:7377–7382.

https://doi.org/10.1073/pnas.1510497113
- PubMed
- Google Scholar
1. Henn BM
2. Hon L
3. Macpherson JM
4. Eriksson N
5. Saxonov S
6. Pe’er I
7. Mountain JL
(2012) Cryptic distant relatives are common in both isolated and cosmopolitan genetic samples
PLOS ONE 7:e34267.

https://doi.org/10.1371/journal.pone.0034267
- PubMed
- Google Scholar
1. Hindorff LA
2. Bonham VL
3. Brody LC
4. Ginoza MEC
5. Hutter CM
6. Manolio TA
7. Green ED
(2018) Prioritizing diversity in human Genomics research
Nature Reviews Genetics 19:175–185.

https://doi.org/10.1038/nrg.2017.89
- PubMed
- Google Scholar
1. Hodonsky CJ
2. Jain D
3. Schick UM
4. Morrison JV
5. Brown L
6. McHugh CP
7. Schurmann C
8. Chen DD
9. Liu YM
10. Auer PL
11. Laurie CA
12. Taylor KD
13. Browning BL
14. Li Y
15. Papanicolaou G
16. Rotter JI
17. Kurita R
18. Nakamura Y
19. Browning SR
20. Loos RJF
21. North KE
22. Laurie CC
23. Thornton TA
24. Pankratz N
25. Bauer DE
26. Sofer T
27. Reiner AP
28. Williams SM
(2017) Genome-wide Association study of red blood cell traits in Hispanics/Latinos: The Hispanic community health study/study of Latinos
PLOS Genetics 13:e1006760.

https://doi.org/10.1371/journal.pgen.1006760
- PubMed
- Google Scholar
1. Hoffman GE
2. Dubé M-P
(2013) Correcting for population structure and kinship using the linear mixed model: theory and extensions
PLOS ONE 8:e75707.

https://doi.org/10.1371/journal.pone.0075707
- Google Scholar
1. Hoffmann TJ
2. Choquet H
3. Yin J
4. Banda Y
5. Kvale MN
6. Glymour M
7. Schaefer C
8. Risch N
9. Jorgenson E
(2018) A large Multiethnic genome-wide Association study of adult body mass index identifies novel Loci
Genetics 210:499–515.

https://doi.org/10.1534/genetics.118.301479
- PubMed
- Google Scholar
1. Hou K
2. Ding Y
3. Xu Z
4. Wu Y
5. Bhattacharya A
6. Mester R
7. Belbin GM
8. Buyske S
9. Conti DV
10. Darst BF
11. Fornage M
12. Gignoux C
13. Guo X
14. Haiman C
15. Kenny EE
16. Kim M
17. Kooperberg C
18. Lange L
19. Manichaikul A
20. North KE
21. Peters U
22. Rasmussen-Torvik LJ
23. Rich SS
24. Rotter JI
25. Wheeler HE
26. Wojcik GL
27. Zhou Y
28. Sankararaman S
29. Pasaniuc B
(2023a) Causal effects on complex traits are similar for common variants across segments of different Continental Ancestries within admixed individuals
Nature Genetics 55:549–558.

https://doi.org/10.1038/s41588-023-01338-6
- PubMed
- Google Scholar
1. Hou Z
2. Ochoa A
3. Browning S
(2023b) Genetic Association models are robust to common population kinship estimation biases
GENETICS 224:iyad030.

https://doi.org/10.1093/genetics/iyad030
- PubMed
- Google Scholar
1. Hu Y
2. Graff M
3. Haessler J
4. Buyske S
5. Bien SA
6. Tao R
7. Highland HM
8. Nishimura KK
9. Zubair N
10. Lu Y
11. Verbanck M
12. Hilliard AT
13. Klarin D
14. Damrauer SM
15. Ho Y-L
16. Wilson PWF
17. Chang K-M
18. Tsao PS
19. Cho K
20. O’Donnell CJ
21. Assimes TL
22. Petty LE
23. Below JE
24. Dikilitas O
25. Schaid DJ
26. Kosel ML
27. Kullo IJ
28. Rasmussen-Torvik LJ
29. Jarvik GP
30. Feng Q
31. Wei W-Q
32. Larson EB
33. Mentch FD
34. Almoguera B
35. Sleiman PM
36. Raffield LM
37. Correa A
38. Martin LW
39. Daviglus M
40. Matise TC
41. Ambite JL
42. Carlson CS
43. Do R
44. Loos RJF
45. Wilkens LR
46. Le Marchand L
47. Haiman C
48. Stram DO
49. Hindorff LA
50. North KE
51. Kooperberg C
52. Cheng I
53. Peters U
54. VA Million Veteran Program
(2020) Minority-centric meta-analyses of blood lipid levels identify novel Loci in the population architecture using Genomics and epidemiology (page) study
PLOS Genetics 16:e1008684.

https://doi.org/10.1371/journal.pgen.1008684
- PubMed
- Google Scholar
Book
1. Jacquard A
(1970)
Structures Génétiques Des Populations

Paris: Masson et Cie.
- Google Scholar
(2012) Inferences from Genomic models in stratified populations
Genetics 192:693–704.

https://doi.org/10.1534/genetics.112.141143
- PubMed
- Google Scholar
Book
1. Jolliffe IT
(2002)
Principal Component Analysis

New York: Springer-Verlag.
- Google Scholar
(2021) Misuse of the term ‘Trans-ethnic’ in Genomics research
Nature Genetics 53:1520–1521.

https://doi.org/10.1038/s41588-021-00952-6
- PubMed
- Google Scholar
1. Kang H.M
2. Zaitlen NA
3. Wade CM
4. Kirby A
5. Heckerman D
6. Daly MJ
7. Eskin E
(2008) Efficient control of population structure in model organism association mapping
Genetics 178:1709–1723.

https://doi.org/10.1534/genetics.107.080101
- Google Scholar
1. Kang HM
2. Sul JH
3. Service SK
4. Zaitlen NA
5. Kong S-Y
6. Freimer NB
7. Sabatti C
8. Eskin E
(2010) Variance component model to account for sample structure in genome-wide Association studies
Nature Genetics 42:348–354.

https://doi.org/10.1038/ng.548
- PubMed
- Google Scholar
Book
1. Lawson CL
2. Hanson RJ
(1974)
Solving Least Squares Problems

Englewood Cliffs: Prentice Hall.
- Google Scholar
1. Lazaridis I
2. Patterson N
3. Mittnik A
4. Renaud G
5. Mallick S
6. Kirsanow K
7. Sudmant PH
8. Schraiber JG
9. Castellano S
10. Lipson M
11. Berger B
12. Economou C
13. Bollongino R
14. Fu Q
15. Bos KI
16. Nordenfelt S
17. Li H
18. de Filippo C
19. Prüfer K
20. Sawyer S
21. Posth C
22. Haak W
23. Hallgren F
24. Fornander E
25. Rohland N
26. Delsate D
27. Francken M
28. Guinet J-M
29. Wahl J
30. Ayodo G
31. Babiker HA
32. Bailliet G
33. Balanovska E
34. Balanovsky O
35. Barrantes R
36. Bedoya G
37. Ben-Ami H
38. Bene J
39. Berrada F
40. Bravi CM
41. Brisighelli F
42. Busby GBJ
43. Cali F
44. Churnosov M
45. Cole DEC
46. Corach D
47. Damba L
48. van Driem G
49. Dryomov S
50. Dugoujon J-M
51. Fedorova SA
52. Gallego Romero I
53. Gubina M
54. Hammer M
55. Henn BM
56. Hervig T
57. Hodoglugil U
58. Jha AR
59. Karachanak-Yankova S
60. Khusainova R
61. Khusnutdinova E
62. Kittles R
63. Kivisild T
64. Klitz W
65. Kučinskas V
66. Kushniarevich A
67. Laredj L
68. Litvinov S
69. Loukidis T
70. Mahley RW
71. Melegh B
72. Metspalu E
73. Molina J
74. Mountain J
75. Näkkäläjärvi K
76. Nesheva D
77. Nyambo T
78. Osipova L
79. Parik J
80. Platonov F
81. Posukh O
82. Romano V
83. Rothhammer F
84. Rudan I
85. Ruizbakiev R
86. Sahakyan H
87. Sajantila A
88. Salas A
89. Starikovskaya EB
90. Tarekegn A
91. Toncheva D
92. Turdikulova S
93. Uktveryte I
94. Utevska O
95. Vasquez R
96. Villena M
97. Voevoda M
98. Winkler CA
99. Yepiskoposyan L
100. Zalloua P
101. Zemunik T
102. Cooper A
103. Capelli C
104. Thomas MG
105. Ruiz-Linares A
106. Tishkoff SA
107. Singh L
108. Thangaraj K
109. Villems R
110. Comas D
111. Sukernik R
112. Metspalu M
113. Meyer M
114. Eichler EE
115. Burger J
116. Slatkin M
117. Pääbo S
118. Kelso J
119. Reich D
120. Krause J
(2014) Ancient human genomes suggest three ancestral populations for present-day Europeans
Nature 513:409–413.

https://doi.org/10.1038/nature13673
- Google Scholar
1. Lazaridis I
2. Nadel D
3. Rollefson G
4. Merrett DC
5. Rohland N
6. Mallick S
7. Fernandes D
8. Novak M
9. Gamarra B
10. Sirak K
11. Connell S
12. Stewardson K
13. Harney E
14. Fu Q
15. Gonzalez-Fortes G
16. Jones ER
17. Roodenberg SA
18. Lengyel G
19. Bocquentin F
20. Gasparian B
21. Monge JM
22. Gregg M
23. Eshed V
24. Mizrahi A-S
25. Meiklejohn C
26. Gerritsen F
27. Bejenaru L
28. Blüher M
29. Campbell A
30. Cavalleri G
31. Comas D
32. Froguel P
33. Gilbert E
34. Kerr SM
35. Kovacs P
36. Krause J
37. McGettigan D
38. Merrigan M
39. Merriwether DA
40. O’Reilly S
41. Richards MB
42. Semino O
43. Shamoon-Pour M
44. Stefanescu G
45. Stumvoll M
46. Tönjes A
47. Torroni A
48. Wilson JF
49. Yengo L
50. Hovhannisyan NA
51. Patterson N
52. Pinhasi R
53. Reich D
(2016) Genomic insights into the origin of farming in the ancient near East
Nature 536:419–424.

https://doi.org/10.1038/nature19310
- PubMed
- Google Scholar
1. Lee S
2. Epstein MP
3. Duncan R
4. Lin X
(2012) Sparse principal component analysis for identifying ancestry-informative markers in genome-wide Association studies
Genetic Epidemiology 36:293–302.

https://doi.org/10.1002/gepi.21621
- PubMed
- Google Scholar
1. Lin M
2. Park DS
3. Zaitlen NA
4. Henn BM
5. Gignoux CR
(2021) Admixed populations improve power for variant discovery and Portability in genome-wide Association studies
Frontiers in Genetics 12:673167.

https://doi.org/10.3389/fgene.2021.673167
- PubMed
- Google Scholar
1. Lippert C
2. Listgarten J
3. Liu Y
4. Kadie CM
5. Davidson RI
6. Heckerman D
(2011) Fast linear mixed models for genome-wide Association studies
Nature Methods 8:833–835.

https://doi.org/10.1038/nmeth.1681
- PubMed
- Google Scholar
(2012) Improved linear mixed models for genome-wide Association studies
Nature Methods 9:525–526.

https://doi.org/10.1038/nmeth.2037
- PubMed
- Google Scholar
1. Liu N
2. Zhao H
3. Patki A
4. Limdi NA
5. Allison DB
(2011) Controlling population structure in human genetic association studies with samples of unrelated individuals
Statistics and Its Interface 4:317–326.

https://doi.org/10.4310/sii.2011.v4.n3.a6
- PubMed
- Google Scholar
1. Liu X
2. Huang M
3. Fan B
4. Buckler ES
5. Zhang Z
6. Listgarten J
(2016) Iterative usage of fixed and random effect models for powerful and efficient genome-wide Association studies
PLOS Genetics 12:e1005767.

https://doi.org/10.1371/journal.pgen.1005767
- PubMed
- Google Scholar
1. Loh PR
2. Tucker G
3. Bulik-Sullivan BK
4. Vilhjálmsson BJ
5. Finucane HK
6. Salem RM
7. Chasman DI
8. Ridker PM
9. Neale BM
10. Berger B
11. Patterson N
12. Price AL
(2015) Efficient Bayesian mixed-model analysis increases association power in large cohorts
Nature Genetics 47:284–290.

https://doi.org/10.1038/ng.3190
- Google Scholar
1. Loh PR
2. Kichaev G
3. Gazal S
4. Schoech AP
5. Price AL
(2018) Mixed-model association for biobank-scale datasets
Nature Genetics 50:906–908.

https://doi.org/10.1038/s41588-018-0144-6
- PubMed
- Google Scholar
1. Mahajan A
2. Spracklen CN
3. Zhang W
4. Ng MCY
5. Petty LE
6. Kitajima H
7. Yu GZ
8. Rüeger S
9. Speidel L
10. Kim YJ
11. Horikoshi M
12. Mercader JM
13. Taliun D
14. Moon S
15. Kwak S-H
16. Robertson NR
17. Rayner NW
18. Loh M
19. Kim B-J
20. Chiou J
21. Miguel-Escalada I
22. Della Briotta Parolo P
23. Lin K
24. Bragg F
25. Preuss MH
26. Takeuchi F
27. Nano J
28. Guo X
29. Lamri A
30. Nakatochi M
31. Scott RA
32. Lee J-J
33. Huerta-Chagoya A
34. Graff M
35. Chai J-F
36. Parra EJ
37. Yao J
38. Bielak LF
39. Tabara Y
40. Hai Y
41. Steinthorsdottir V
42. Cook JP
43. Kals M
44. Grarup N
45. Schmidt EM
46. Pan I
47. Sofer T
48. Wuttke M
49. Sarnowski C
50. Gieger C
51. Nousome D
52. Trompet S
53. Long J
54. Sun M
55. Tong L
56. Chen W-M
57. Ahmad M
58. Noordam R
59. Lim VJY
60. Tam CHT
61. Joo YY
62. Chen C-H
63. Raffield LM
64. Lecoeur C
65. Prins BP
66. Nicolas A
67. Yanek LR
68. Chen G
69. Jensen RA
70. Tajuddin S
71. Kabagambe EK
72. An P
73. Xiang AH
74. Choi HS
75. Cade BE
76. Tan J
77. Flanagan J
78. Abaitua F
79. Adair LS
80. Adeyemo A
81. Aguilar-Salinas CA
82. Akiyama M
83. Anand SS
84. Bertoni A
85. Bian Z
86. Bork-Jensen J
87. Brandslund I
88. Brody JA
89. Brummett CM
90. Buchanan TA
91. Canouil M
92. Chan JCN
93. Chang L-C
94. Chee M-L
95. Chen J
96. Chen S-H
97. Chen Y-T
98. Chen Z
99. Chuang L-M
100. Cushman M
101. Das SK
102. de Silva HJ
103. Dedoussis G
104. Dimitrov L
105. Doumatey AP
106. Du S
107. Duan Q
108. Eckardt K-U
109. Emery LS
110. Evans DS
111. Evans MK
112. Fischer K
113. Floyd JS
114. Ford I
115. Fornage M
116. Franco OH
117. Frayling TM
118. Freedman BI
119. Fuchsberger C
120. Genter P
121. Gerstein HC
122. Giedraitis V
123. González-Villalpando C
124. González-Villalpando ME
125. Goodarzi MO
126. Gordon-Larsen P
127. Gorkin D
128. Gross M
129. Guo Y
130. Hackinger S
131. Han S
132. Hattersley AT
133. Herder C
134. Howard A-G
135. Hsueh W
136. Huang M
137. Huang W
138. Hung Y-J
139. Hwang MY
140. Hwu C-M
141. Ichihara S
142. Ikram MA
143. Ingelsson M
144. Islam MT
145. Isono M
146. Jang H-M
147. Jasmine F
148. Jiang G
149. Jonas JB
150. Jørgensen ME
151. Jørgensen T
152. Kamatani Y
153. Kandeel FR
154. Kasturiratne A
155. Katsuya T
156. Kaur V
157. Kawaguchi T
158. Keaton JM
159. Kho AN
160. Khor C-C
161. Kibriya MG
162. Kim D-H
163. Kohara K
164. Kriebel J
165. Kronenberg F
166. Kuusisto J
167. Läll K
168. Lange LA
169. Lee M-S
170. Lee NR
171. Leong A
172. Li L
173. Li Y
174. Li-Gao R
175. Ligthart S
176. Lindgren CM
177. Linneberg A
178. Liu C-T
179. Liu J
180. Locke AE
181. Louie T
182. Luan J
183. Luk AO
184. Luo X
185. Lv J
186. Lyssenko V
187. Mamakou V
188. Mani KR
189. Meitinger T
190. Metspalu A
191. Morris AD
192. Nadkarni GN
193. Nadler JL
194. Nalls MA
195. Nayak U
196. Nongmaithem SS
197. Ntalla I
198. Okada Y
199. Orozco L
200. Patel SR
201. Pereira MA
202. Peters A
203. Pirie FJ
204. Porneala B
205. Prasad G
206. Preissl S
207. Rasmussen-Torvik LJ
208. Reiner AP
209. Roden M
210. Rohde R
211. Roll K
212. Sabanayagam C
213. Sander M
214. Sandow K
215. Sattar N
216. Schönherr S
217. Schurmann C
218. Shahriar M
219. Shi J
220. Shin DM
221. Shriner D
222. Smith JA
223. So WY
224. Stančáková A
225. Stilp AM
226. Strauch K
227. Suzuki K
228. Takahashi A
229. Taylor KD
230. Thorand B
231. Thorleifsson G
232. Thorsteinsdottir U
233. Tomlinson B
234. Torres JM
235. Tsai F-J
236. Tuomilehto J
237. Tusie-Luna T
238. Udler MS
239. Valladares-Salgado A
240. van Dam RM
241. van Klinken JB
242. Varma R
243. Vujkovic M
244. Wacher-Rodarte N
245. Wheeler E
246. Whitsel EA
247. Wickremasinghe AR
248. van Dijk KW
249. Witte DR
250. Yajnik CS
251. Yamamoto K
252. Yamauchi T
253. Yengo L
254. Yoon K
255. Yu C
256. Yuan J-M
257. Yusuf S
258. Zhang L
259. Zheng W
260. FinnGen
261. eMERGE Consortium
262. Raffel LJ
263. Igase M
264. Ipp E
265. Redline S
266. Cho YS
267. Lind L
268. Province MA
269. Hanis CL
270. Peyser PA
271. Ingelsson E
272. Zonderman AB
273. Psaty BM
274. Wang Y-X
275. Rotimi CN
276. Becker DM
277. Matsuda F
278. Liu Y
279. Zeggini E
280. Yokota M
281. Rich SS
282. Kooperberg C
283. Pankow JS
284. Engert JC
285. Chen Y-DI
286. Froguel P
287. Wilson JG
288. Sheu WHH
289. Kardia SLR
290. Wu J-Y
291. Hayes MG
292. Ma RCW
293. Wong T-Y
294. Groop L
295. Mook-Kanamori DO
296. Chandak GR
297. Collins FS
298. Bharadwaj D
299. Paré G
300. Sale MM
301. Ahsan H
302. Motala AA
303. Shu X-O
304. Park K-S
305. Jukema JW
306. Cruz M
307. McKean-Cowdin R
308. Grallert H
309. Cheng C-Y
310. Bottinger EP
311. Dehghan A
312. Tai E-S
313. Dupuis J
314. Kato N
315. Laakso M
316. Köttgen A
317. Koh W-P
318. Palmer CNA
319. Liu S
320. Abecasis G
321. Kooner JS
322. Loos RJF
323. North KE
324. Haiman CA
325. Florez JC
326. Saleheen D
327. Hansen T
328. Pedersen O
329. Mägi R
330. Langenberg C
331. Wareham NJ
332. Maeda S
333. Kadowaki T
334. Lee J
335. Millwood IY
336. Walters RG
337. Stefansson K
338. Myers SR
339. Ferrer J
340. Gaulton KJ
341. Meigs JB
342. Mohlke KL
343. Gloyn AL
344. Bowden DW
345. Below JE
346. Chambers JC
347. Sim X
348. Boehnke M
349. Rotter JI
350. McCarthy MI
351. Morris AP
(2022) Multi-ancestry genetic study of type 2 diabetes highlights the power of diverse populations for discovery and translation
Nature Genetics 54:560–572.

https://doi.org/10.1038/s41588-022-01058-3
- PubMed
- Google Scholar
Book
1. Malécot G
(1948)
Mathématiques de l’hérédité

Paris: Masson et Cie.
- Google Scholar
1. Manichaikul A
2. Mychaleckyj JC
3. Rich SS
4. Daly K
5. Sale M
6. Chen WM
(2010) Robust relationship inference in genome-wide Association studies
Bioinformatics 26:2867–2873.

https://doi.org/10.1093/bioinformatics/btq559
- PubMed
- Google Scholar
1. Martin AR
2. Gignoux CR
3. Walters RK
4. Wojcik GL
5. Neale BM
6. Gravel S
7. Daly MJ
8. Bustamante CD
9. Kenny EE
(2017a) Human demographic history impacts genetic risk prediction across diverse populations
American Journal of Human Genetics 100:635–649.

https://doi.org/10.1016/j.ajhg.2017.03.004
- PubMed
- Google Scholar
1. Martin AR
2. Lin M
3. Granka JM
4. Myrick JW
5. Liu X
6. Sockell A
7. Atkinson EG
8. Werely CJ
9. Möller M
10. Sandhu MS
11. Kingsley DM
12. Hoal EG
13. Liu X
14. Daly MJ
15. Feldman MW
16. Gignoux CR
17. Bustamante CD
18. Henn BM
(2017b) An unexpectedly complex architecture for skin Pigmentation in Africans
Cell 171:1340–1353.

https://doi.org/10.1016/j.cell.2017.11.015
- PubMed
- Google Scholar
1. Matoba N
2. Akiyama M
3. Ishigaki K
4. Kanai M
5. Takahashi A
6. Momozawa Y
7. Ikegawa S
8. Ikeda M
9. Iwata N
10. Hirata M
11. Matsuda K
12. Murakami Y
13. Kubo M
14. Kamatani Y
15. Okada Y
(2020) GWAS of 165,084 Japanese individuals identified nine Loci associated with dietary habits
Nature Human Behaviour 4:308–316.

https://doi.org/10.1038/s41562-019-0805-1
- PubMed
- Google Scholar
1. Mbatchou J
2. Barnard L
3. Backman J
4. Marcketta A
5. Kosmicki JA
6. Ziyatdinov A
7. Benner C
8. O’Dushlaine C
9. Barber M
10. Boutkov B
11. Habegger L
12. Ferreira M
13. Baras A
14. Reid J
15. Abecasis G
16. Maxwell E
17. Marchini J
(2021) Computationally efficient whole-genome regression for quantitative and binary traits
Nature Genetics 53:1097–1103.

https://doi.org/10.1038/s41588-021-00870-7
- PubMed
- Google Scholar
1. McVean G
(2009) A Genealogical interpretation of principal components analysis
PLOS Genetics 5:e1000686.

https://doi.org/10.1371/journal.pgen.1000686
- PubMed
- Google Scholar
(2015) Challenges in conducting genome-wide Association studies in highly admixed multi-ethnic populations: The generation R study
European Journal of Epidemiology 30:317–330.

https://doi.org/10.1007/s10654-015-9998-4
- PubMed
- Google Scholar
1. Mogil LS
2. Andaleon A
3. Badalamenti A
4. Dickinson SP
5. Guo X
6. Rotter JI
7. Johnson WC
8. Im HK
9. Liu Y
10. Wheeler HE
(2018) Genetic architecture of gene expression traits across diverse populations
PLOS Genetics 14:e1007586.

https://doi.org/10.1371/journal.pgen.1007586
- PubMed
- Google Scholar
Software
1. Mullen KM
(2012) Stokkum Ihmv, Nnls: The Lawson-Hanson algorithm for non-negative least squares (NNLS)
The Comprehensive R Archive Network.

https://CRAN.R-project.org/package=nnls
1. Novembre J
2. Johnson T
3. Bryc K
4. Kutalik Z
5. Boyko AR
6. Auton A
7. Indap A
8. King KS
9. Bergmann S
10. Nelson MR
11. Stephens M
12. Bustamante CD
(2008) Genes mirror geography within Europe
Nature 456:98–101.

https://doi.org/10.1038/nature07331
- PubMed
- Google Scholar
Preprint
1. Ochoa A
2. Storey JD
(2019) New Kinship and FST Estimates Reveal Higher Levels of Differentiation in the Global Human Population
bioRxiv.

https://doi.org/10.1101/653279
- Google Scholar
1. Ochoa A
2. Storey JD
(2021) Estimating FST and kinship for arbitrary population structures
PLOS Genetics 17:e1009241.

https://doi.org/10.1371/journal.pgen.1009241
- PubMed
- Google Scholar
Software
1. Ochoa A
(2023) Pca-Assoc-paper, version swh:1:rev:8549eafe6c27583894640e6cd8639232ed15cade
Software Heritage.

https://archive.softwareheritage.org/swh:1:dir:54f4600c823ac0f1c3b17eb03185aa49a8232d56;origin=https://github.com/OchoaLab/pca-assoc-paper;visit=swh:1:snp:fcc0d7bc50b88ce0b091fd4a89d811fb26f3ddd7;anchor=swh:1:rev:8549eafe6c27583894640e6cd8639232ed15cade
(2019) Extreme Polygenicity of complex traits is explained by negative selection
American Journal of Human Genetics 105:456–476.

https://doi.org/10.1016/j.ajhg.2019.07.003
- PubMed
- Google Scholar
1. Paradis E
2. Schliep K
(2019) Ape 5.0: An environment for modern Phylogenetics and evolutionary analyses in R
Bioinformatics 35:526–528.

https://doi.org/10.1093/bioinformatics/bty633
- PubMed
- Google Scholar
1. Park JH
2. Wacholder S
3. Gail MH
4. Peters U
5. Jacobs KB
6. Chanock SJ
7. Chatterjee N
(2010) Estimation of effect size distribution from genome-wide Association studies and implications for future discoveries
Nature Genetics 42:570–575.

https://doi.org/10.1038/ng.610
- PubMed
- Google Scholar
1. Park JH
2. Gail MH
3. Weinberg CR
4. Carroll RJ
5. Chung CC
6. Wang Z
7. Chanock SJ
8. Fraumeni JF
9. Chatterjee N
(2011) Distribution of allele frequencies and effect sizes and their interrelationships for common genetic susceptibility variants
PNAS 108:18026–18031.

https://doi.org/10.1073/pnas.1114759108
- PubMed
- Google Scholar
(2006) Population structure and Eigenanalysis
PLOS Genetics 2:e190.

https://doi.org/10.1371/journal.pgen.0020190
- PubMed
- Google Scholar
1. Patterson N
2. Moorjani P
3. Luo Y
4. Mallick S
5. Rohland N
6. Zhan Y
7. Genschoreck T
8. Webster T
9. Reich D
(2012) Ancient admixture in human history
Genetics 192:1065–1093.

https://doi.org/10.1534/genetics.112.145037
- Google Scholar
1. Peterson RE
2. Kuchenbaecker K
3. Walters RK
4. Chen C-Y
5. Popejoy AB
6. Periyasamy S
7. Lam M
8. Iyegbe C
9. Strawbridge RJ
10. Brick L
11. Carey CE
12. Martin AR
13. Meyers JL
14. Su J
15. Chen J
16. Edwards AC
17. Kalungi A
18. Koen N
19. Majara L
20. Schwarz E
21. Smoller JW
22. Stahl EA
23. Sullivan PF
24. Vassos E
25. Mowry B
26. Prieto ML
27. Cuellar-Barboza A
28. Bigdeli TB
29. Edenberg HJ
30. Huang H
31. Duncan LE
(2019) Genome-wide Association studies in Ancestrally diverse populations: Opportunities, methods, pitfalls, and recommendations
Cell 179:589–603.

https://doi.org/10.1016/j.cell.2019.08.051
- PubMed
- Google Scholar
(2006) Principal components analysis corrects for stratification in genome-wide Association studies
Nature Genetics 38:904–909.

https://doi.org/10.1038/ng1847
- PubMed
- Google Scholar
(2010) New approaches to population stratification in genome-wide association studies
Nature Reviews Genetics 11:459–463.

https://doi.org/10.1038/nrg2813
- PubMed
- Google Scholar
(2013) Response to Sul and Eskin
Nature Reviews Genetics 14:300.

https://doi.org/10.1038/nrg2813-c2
- PubMed
- Google Scholar
(2000) Association mapping in structured populations
American Journal of Human Genetics 67:170–181.

https://doi.org/10.1086/302959
- PubMed
- Google Scholar
(2020) Efficient Toolkit implementing best practices for principal component analysis of population genetic data
Bioinformatics 36:4449–4457.

https://doi.org/10.1093/bioinformatics/btaa520
- PubMed
- Google Scholar
1. Qian J
2. Tanigawa Y
3. Du W
4. Aguirre M
5. Chang C
6. Tibshirani R
7. Rivas MA
8. Hastie T
(2020) A fast and Scalable framework for large-scale and Ultrahigh-dimensional sparse regression with application to the UK Biobank
PLOS Genetics 16:e1009141.

https://doi.org/10.1371/journal.pgen.1009141
- PubMed
- Google Scholar
(2013) A lasso multi-marker mixed model for Association mapping with population structure correction
Bioinformatics 29:206–214.

https://doi.org/10.1093/bioinformatics/bts669
- PubMed
- Google Scholar
1. Roselli C
2. Chaffin MD
3. Weng LC
4. Aeschbacher S
5. Ahlberg G
6. Albert CM
7. Almgren P
8. Alonso A
9. Anderson CD
10. Aragam KG
11. Arking DE
12. Barnard J
13. Bartz TM
14. Benjamin EJ
15. Bihlmeyer NA
16. Bis JC
17. Bloom HL
18. Boerwinkle E
19. Bottinger EB
20. Brody JA
21. Calkins H
22. Campbell A
23. Cappola TP
24. Carlquist J
25. Chasman DI
26. Chen LY
27. Chen YDI
28. Choi EK
29. Choi SH
30. Christophersen IE
31. Chung MK
32. Cole JW
33. Conen D
34. Cook J
35. Crijns HJ
36. Cutler MJ
37. Damrauer SM
38. Daniels BR
39. Darbar D
40. Delgado G
41. Denny JC
42. Dichgans M
43. Dörr M
44. Dudink EA
45. Dudley SC
46. Esa N
47. Esko T
48. Eskola M
49. Fatkin D
50. Felix SB
51. Ford I
52. Franco OH
53. Geelhoed B
54. Grewal RP
55. Gudnason V
56. Guo X
57. Gupta N
58. Gustafsson S
59. Gutmann R
60. Hamsten A
61. Harris TB
62. Hayward C
63. Heckbert SR
64. Hernesniemi J
65. Hocking LJ
66. Hofman A
67. Horimoto A
68. Huang J
69. Huang PL
70. Huffman J
71. Ingelsson E
72. Ipek EG
73. Ito K
74. Jimenez-Conde J
75. Johnson R
76. Jukema JW
77. Kääb S
78. Kähönen M
79. Kamatani Y
80. Kane JP
81. Kastrati A
82. Kathiresan S
83. Katschnig-Winter P
84. Kavousi M
85. Kessler T
86. Kietselaer BL
87. Kirchhof P
88. Kleber ME
89. Knight S
90. Krieger JE
91. Kubo M
92. Launer LJ
93. Laurikka J
94. Lehtimäki T
95. Leineweber K
96. Lemaitre RN
97. Li M
98. Lim HE
99. Lin HJ
100. Lin H
101. Lind L
102. Lindgren CM
103. Lokki ML
104. London B
105. Loos RJF
106. Low SK
107. Lu Y
108. Lyytikäinen LP
109. Macfarlane PW
110. Magnusson PK
111. Mahajan A
112. Malik R
113. Mansur AJ
114. Marcus GM
115. Margolin L
116. Margulies KB
117. März W
118. McManus DD
119. Melander O
120. Mohanty S
121. Montgomery JA
122. Morley MP
123. Morris AP
124. Müller-Nurasyid M
125. Natale A
126. Nazarian S
127. Neumann B
128. Newton-Cheh C
129. Niemeijer MN
130. Nikus K
131. Nilsson P
132. Noordam R
133. Oellers H
134. Olesen MS
135. Orho-Melander M
136. Padmanabhan S
137. Pak HN
138. Paré G
139. Pedersen NL
140. Pera J
141. Pereira A
142. Porteous D
143. Psaty BM
144. Pulit SL
145. Pullinger CR
146. Rader DJ
147. Refsgaard L
148. Ribasés M
149. Ridker PM
150. Rienstra M
151. Risch L
152. Roden DM
153. Rosand J
154. Rosenberg MA
155. Rost N
156. Rotter JI
157. Saba S
158. Sandhu RK
159. Schnabel RB
160. Schramm K
161. Schunkert H
162. Schurman C
163. Scott SA
164. Seppälä I
165. Shaffer C
166. Shah S
167. Shalaby AA
168. Shim J
169. Shoemaker MB
170. Siland JE
171. Sinisalo J
172. Sinner MF
173. Slowik A
174. Smith AV
175. Smith BH
176. Smith JG
177. Smith JD
178. Smith NL
179. Soliman EZ
180. Sotoodehnia N
181. Stricker BH
182. Sun A
183. Sun H
184. Svendsen JH
185. Tanaka T
186. Tanriverdi K
187. Taylor KD
188. Teder-Laving M
189. Teumer A
190. Thériault S
191. Trompet S
192. Tucker NR
193. Tveit A
194. Uitterlinden AG
195. Van Der Harst P
196. Van Gelder IC
197. Van Wagoner DR
198. Verweij N
199. Vlachopoulou E
200. Völker U
201. Wang B
202. Weeke PE
203. Weijs B
204. Weiss R
205. Weiss S
206. Wells QS
207. Wiggins KL
208. Wong JA
209. Woo D
210. Worrall BB
211. Yang PS
212. Yao J
213. Yoneda ZT
214. Zeller T
215. Zeng L
216. Lubitz SA
217. Lunetta KL
218. Ellinor PT
(2018) Multi-ethnic genome-wide Association study for atrial fibrillation
Nature Genetics 50:1225–1233.

https://doi.org/10.1038/s41588-018-0133-9
- PubMed
- Google Scholar
(2002) Genetic structure of human populations
Science 298:2381–2385.

https://doi.org/10.1126/science.1078311
- PubMed
- Google Scholar
(2010) Genome-wide Association studies in diverse populations
Nature Reviews Genetics 11:356–366.

https://doi.org/10.1038/nrg2760
- PubMed
- Google Scholar
(2017) Identification of genetic Outliers due to sub-structure and cryptic relationships
Bioinformatics 33:1972–1979.

https://doi.org/10.1093/bioinformatics/btx109
- PubMed
- Google Scholar
1. Shchur V
2. Nielsen R
(2018) On the number of siblings and p-th cousins in a large population sample
Journal of Mathematical Biology 77:1279–1298.

https://doi.org/10.1007/s00285-018-1252-8
(2021) An overview of strategies for detecting genotype-phenotype associations across ancestrally diverse populations
Frontiers in Genetics 12:703901.

https://doi.org/10.3389/fgene.2021.703901
(2018) A population genetic interpretation of GWAS findings for human quantitative traits
PLOS Biology 16:e2002985.

https://doi.org/10.1371/journal.pbio.2002985
1. Skoglund P
2. Posth C
3. Sirak K
4. Spriggs M
5. Valentin F
6. Bedford S
7. Clark GR
8. Reepmeyer C
9. Petchey F
10. Fernandes D
11. Fu Q
12. Harney E
13. Lipson M
14. Mallick S
15. Novak M
16. Rohland N
17. Stewardson K
18. Abdullah S
19. Cox MP
20. Friedlaender FR
21. Friedlaender JS
22. Kivisild T
23. Koki G
24. Kusuma P
25. Merriwether DA
26. Ricaut F-X
27. Wee JTS
28. Patterson N
29. Krause J
30. Pinhasi R
31. Reich D
(2016) Genomic insights into the peopling of the southwest Pacific
Nature 538:510–513.

https://doi.org/10.1038/nature19844
1. Sokal RR
2. Michener CD
(1958) A statistical method for evaluating systematic relationships
Univ Kansas, Sci Bull 38:1409–1438.
1. Song M
2. Hao W
3. Storey JD
(2015) Testing for genetic associations in arbitrarily structured populations
Nature Genetics 47:550–554.

https://doi.org/10.1038/ng.3244
(2012) Improved heritability estimation from genome-wide SNPs
American Journal of Human Genetics 91:1011–1021.

https://doi.org/10.1016/j.ajhg.2012.10.010
1. Storey JD
(2003) The positive false discovery rate: A Bayesian interpretation and the q-value
The Annals of Statistics 31:2013–2035.

https://doi.org/10.1214/aos/1074290335
1. Storey JD
2. Tibshirani R
(2003) Statistical significance for genomewide studies
PNAS 100:9440–9445.

https://doi.org/10.1073/pnas.1530509100
1. Sul JH
2. Eskin E
(2013) Mixed models can correct for population structure for genomic regions under selection
Nature Reviews Genetics 14:300.

https://doi.org/10.1038/nrg2813-c1
1. Sul JH
2. Martin LS
3. Eskin E
(2018) Population structure in genetic studies: Confounding factors and mixed models
PLOS Genetics 14:e1007309.

https://doi.org/10.1371/journal.pgen.1007309
1. Sun G
2. Zhu C
3. Kramer MH
4. Yang SS
5. Song W
6. Piepho HP
7. Yu J
(2010) Variation explained in mixed-model association mapping
Heredity 105:333–340.

https://doi.org/10.1038/hdy.2010.11
(2012) Rapid variance components–based method for whole-genome association analysis
Nature Genetics 44:1166–1170.

https://doi.org/10.1038/ng.2410
1. Thornton T
2. McPeek MS
(2010) ROADTRIPS: Case-control association testing with partially or completely unknown population and pedigree structure
American Journal of Human Genetics 86:172–184.

https://doi.org/10.1016/j.ajhg.2010.01.001
(2014) Improving the power of GWAS and avoiding confounding from population stratification with PC-select
Genetics 197:1045–1049.

https://doi.org/10.1534/genetics.114.164285
1. Vilhjálmsson BJ
2. Nordborg M
(2013) The nature of confounding in genome-wide association studies
Nature Reviews Genetics 14:1–2.

https://doi.org/10.1038/nrg3382
1. Voight BF
2. Pritchard JK
(2005) Confounding from cryptic relatedness in case-control association studies
PLOS Genetics 1:e32.

https://doi.org/10.1371/journal.pgen.0010032
1. Wang H
2. Aragam B
3. Xing EP
(2022) Trade-offs of linear mixed models in genome-wide association studies
Journal of Computational Biology 29:233–242.

https://doi.org/10.1089/cmb.2021.0157
1. Wojcik GL
2. Graff M
3. Nishimura KK
4. Tao R
5. Haessler J
6. Gignoux CR
7. Highland HM
8. Patel YM
9. Sorokin EP
10. Avery CL
11. Belbin GM
12. Bien SA
13. Cheng I
14. Cullina S
15. Hodonsky CJ
16. Hu Y
17. Huckins LM
18. Jeff J
19. Justice AE
20. Kocarnik JM
21. Lim U
22. Lin BM
23. Lu Y
24. Nelson SC
25. Park S-SL
26. Poisner H
27. Preuss MH
28. Richard MA
29. Schurmann C
30. Setiawan VW
31. Sockell A
32. Vahi K
33. Verbanck M
34. Vishnu A
35. Walker RW
36. Young KL
37. Zubair N
38. Acuña-Alonso V
39. Ambite JL
40. Barnes KC
41. Boerwinkle E
42. Bottinger EP
43. Bustamante CD
44. Caberto C
45. Canizales-Quinteros S
46. Conomos MP
47. Deelman E
48. Do R
49. Doheny K
50. Fernández-Rhodes L
51. Fornage M
52. Hailu B
53. Heiss G
54. Henn BM
55. Hindorff LA
56. Jackson RD
57. Laurie CA
58. Laurie CC
59. Li Y
60. Lin D-Y
61. Moreno-Estrada A
62. Nadkarni G
63. Norman PJ
64. Pooler LC
65. Reiner AP
66. Romm J
67. Sabatti C
68. Sandoval K
69. Sheng X
70. Stahl EA
71. Stram DO
72. Thornton TA
73. Wassel CL
74. Wilkens LR
75. Winkler CA
76. Yoneyama S
77. Buyske S
78. Haiman CA
79. Kooperberg C
80. Le Marchand L
81. Loos RJF
82. Matise TC
83. North KE
84. Peters U
85. Kenny EE
86. Carlson CS
(2019) Genetic analyses of diverse populations improves discovery for complex traits
Nature 570:514–518.

https://doi.org/10.1038/s41586-019-1310-4
1. Wright S
(1949) The genetical structure of populations
Annals of Eugenics 15:323–354.

https://doi.org/10.1111/j.1469-1809.1949.tb02451.x
1. Wu C
2. DeWan A
3. Hoh J
4. Wang Z
(2011) A comparison of association methods correcting for population stratification in case-control studies
Annals of Human Genetics 75:418–427.

https://doi.org/10.1111/j.1469-1809.2010.00639.x
1. Xu H
2. Guan Y
(2014) Detecting local haplotype sharing and haplotype association
Genetics 197:823–838.

https://doi.org/10.1534/genetics.114.164814
1. Yang J
2. Lee SH
3. Goddard ME
4. Visscher PM
(2011) GCTA: a tool for genome-wide complex trait analysis
The American Journal of Human Genetics 88:76–82.

https://doi.org/10.1016/j.ajhg.2010.11.011
1. Yang J
2. Zaitlen NA
3. Goddard ME
4. Visscher PM
5. Price AL
(2014) Advantages and pitfalls in the application of mixed-model association methods
Nature Genetics 46:100–106.

https://doi.org/10.1038/ng.2876
1. Yu J
2. Pressoir G
3. Briggs WH
4. Vroh Bi I
5. Yamasaki M
6. Doebley JF
7. McMullen MD
8. Gaut BS
9. Nielsen DM
10. Holland JB
11. Kresovich S
12. Buckler ES
(2006) A unified mixed-model method for association mapping that accounts for multiple levels of relatedness
Nature Genetics 38:203–208.

https://doi.org/10.1038/ng1702
1. Zaidi AA
2. Mathieson I
(2020) Demographic history mediates the effect of stratification on polygenic scores
eLife 9:e61548.

https://doi.org/10.7554/eLife.61548
1. Zeng J
2. de Vlaming R
3. Wu Y
4. Robinson MR
5. Lloyd-Jones LR
6. Yengo L
7. Yap CX
8. Xue A
9. Sidorenko J
10. McRae AF
11. Powell JE
12. Montgomery GW
13. Metspalu A
14. Esko T
15. Gibson G
16. Wray NR
17. Visscher PM
18. Yang J
(2018) Signatures of negative selection in the genetic architecture of human complex traits
Nature Genetics 50:746–753.

https://doi.org/10.1038/s41588-018-0101-4
1. Zhang S
2. Zhu X
3. Zhao H
(2003) On a semiparametric test to detect associations between quantitative traits and candidate genes using unrelated individuals
Genetic Epidemiology 24:44–56.

https://doi.org/10.1002/gepi.10196
1. Zhang Z
2. Ersoz E
3. Lai CQ
4. Todhunter RJ
5. Tiwari HK
6. Gore MA
7. Bradbury PJ
8. Yu J
9. Arnett DK
10. Ordovas JM
11. Buckler ES
(2010) Mixed linear model approach adapted for genome-wide association studies
Nature Genetics 42:355–360.

https://doi.org/10.1038/ng.546
1. Zhang Y
2. Pan W
(2015) Principal component regression and linear mixed model in association analysis of structured samples: Competitors or complements
Genetic Epidemiology 39:149–155.

https://doi.org/10.1002/gepi.21879
1. Zhao K
2. Aranzana MJ
3. Kim S
4. Lister C
5. Shindo C
6. Tang C
7. Toomajian C
8. Zheng H
9. Dean C
10. Marjoram P
11. Nordborg M
(2007) An Arabidopsis example of association mapping in structured samples
PLOS Genetics 3:e4.

https://doi.org/10.1371/journal.pgen.0030004
1. Zheng X
2. Weir BS
(2016) Eigenanalysis of SNP data with an identity by descent interpretation
Theoretical Population Biology 107:65–76.

https://doi.org/10.1016/j.tpb.2015.09.004
(2019) On using local ancestry to characterize the genetic architecture of human traits: Genetic regulation of gene expression in multiethnic or admixed populations
American Journal of Human Genetics 104:1097–1115.

https://doi.org/10.1016/j.ajhg.2019.04.009
1. Zhou X
2. Stephens M
(2012) Genome-wide efficient mixed-model analysis for association studies
Nature Genetics 44:821–824.

https://doi.org/10.1038/ng.2310
1. Zhou Q
2. Zhao L
3. Guan Y
4. Akey JM
(2016) Strong selection at MHC in Mexicans since admixture
PLOS Genetics 12:e1005847.

https://doi.org/10.1371/journal.pgen.1005847
1. Zhou W
2. Nielsen JB
3. Fritsche LG
4. Dey R
5. Gabrielsen ME
6. Wolford BN
7. LeFaive J
8. VandeHaar P
9. Gagliano SA
10. Gifford A
11. Bastarache LA
12. Wei WQ
13. Denny JC
14. Lin M
15. Hveem K
16. Kang HM
17. Abecasis GR
18. Willer CJ
19. Lee S
(2018) Efficiently controlling for case-control imbalance and sample relatedness in large-scale genetic association studies
Nature Genetics 50:1335–1341.

https://doi.org/10.1038/s41588-018-0184-y
1. Zhu C
2. Yu J
(2009) Nonmetric multidimensional scaling corrects for population structure in association mapping with different sample types
Genetics 182:875–888.

https://doi.org/10.1534/genetics.108.098863

Article and author information

Author details

Yiqi Yao

Department of Biostatistics and Bioinformatics, Duke University, Durham, United States

Present address
BenHealth Consulting, Shanghai, Shanghai, China

Contribution
Software, Formal analysis, Investigation, Visualization, Writing – original draft, Writing – review and editing

Competing interests
Yiqi Yao is affiliated with BenHealth Consulting. The author has no financial interests to declare.

Alejandro Ochoa

1. Department of Biostatistics and Bioinformatics, Duke University, Durham, United States
2. Duke Center for Statistical Genetics and Genomics, Duke University, Durham, United States
Contribution
Conceptualization, Resources, Data curation, Software, Formal analysis, Supervision, Funding acquisition, Validation, Investigation, Visualization, Methodology, Writing – original draft, Project administration, Writing – review and editing

For correspondence
alejandro.ochoa@duke.edu

Competing interests
No competing interests declared

ORCID iD: 0000-0003-4928-3403

Funding

Whitehead Foundation

Alejandro Ochoa

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

Thanks to Tiffany Tu, Ratchanon Pornmongkolsuk, and Zhuoran Hou for feedback on this article. This work was funded in part by the Duke University School of Medicine Whitehead Scholars Program, a gift from the Whitehead Charitable Foundation. The 1000 Genomes data were generated at the New York Genome Center with funds provided by NHGRI Grant 3UM1HG008901-03S1.

Copyright

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.