Limitations of principal components in quantitative genetic association models for human studies

  1. Yiqi Yao
  2. Alejandro Ochoa (corresponding author)
  1. Department of Biostatistics and Bioinformatics, Duke University, United States
  2. Duke Center for Statistical Genetics and Genomics, Duke University, United States
8 figures, 7 tables and 1 additional file

Figures

Figure 1
Population structures of simulated and real human genotype datasets.

First two columns are population kinship matrices as heatmaps: individuals along the x- and y-axes, kinship as color. Diagonal shows inbreeding values. (A) Admixture scenario for both Large and Small simulations. (B) Last generation of the 20-generation admixed family, showing larger kinship values near the diagonal that correspond to siblings, first cousins, etc. (C) Minor allele frequency (MAF) distributions. Real datasets and subpopulation tree simulations had a MAF ≥ 0.01 filter. (D) Human Origins is an array dataset of a large diversity of global populations. (G) Human Genome Diversity Panel (HGDP) is a whole-genome sequencing (WGS) dataset from global native populations. (J) 1000 Genomes Project is a WGS dataset of global cosmopolitan populations. (F, I, L) Trees between subpopulations fit to real data. (E, H, K) Simulations from trees fit to the real data recapitulate subpopulation structure.

Figure 2 with 3 supplements
Illustration of evaluation measures.

Three archetypal models illustrate our complementary measures: M1 is ideal, M2 overfits slightly, M3 is naive. (A) QQ plot of p-values of “null” (non-causal) loci. M1 has the desired uniform p-values; M2 and M3 are miscalibrated. (B) SRMSDp (p-value Signed Root Mean Square Deviation) measures signed distance between observed and expected null p-values (closer to zero is better). (C) Precision and Recall (PR) measure causal locus classification performance (higher is better). (D) AUCPR (Area Under the PR Curve) reflects power (higher is better).
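The two measures in this caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: expected null p-values are taken as the uniform quantiles (i - 0.5)/m0, the sign of SRMSDp is taken as positive when observed null p-values are smaller than expected (consistent with the supplement captions, where inflated type I error corresponds to SRMSDp>0), and the area under the PR curve is computed as a simple step-wise sum (average precision), which may differ from other interpolation schemes. The function names are illustrative.

```python
import math
from statistics import median


def srmsd_p(null_pvalues):
    """Signed root mean square deviation of null p-values from uniform quantiles."""
    p = sorted(null_pvalues)
    m0 = len(p)
    # Assumed expected quantiles of Uniform(0, 1): (i - 0.5) / m0 for i = 1..m0
    u = [(i + 0.5) / m0 for i in range(m0)]
    rmsd = math.sqrt(sum((ui - pi) ** 2 for ui, pi in zip(u, p)) / m0)
    # Assumed sign convention: positive when observed p-values are too small
    diff = median(u) - median(p)
    sign = (diff > 0) - (diff < 0)
    return sign * rmsd


def aucpr(pvalues, is_causal):
    """Step-wise area under the precision-recall curve (average precision)."""
    order = sorted(range(len(pvalues)), key=lambda i: pvalues[i])
    n_causal = sum(is_causal)
    true_positives = 0
    area = 0.0
    for rank, i in enumerate(order, start=1):
        if is_causal[i]:
            true_positives += 1
            # precision at this rank, weighted by the recall step 1 / n_causal
            area += (true_positives / rank) / n_causal
    return area
```

Under these assumptions a perfectly uniform set of null p-values gives SRMSDp of zero, and a ranking that places every causal locus ahead of every null locus gives an AUCPR of one.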

Figure 2—figure supplement 1
Comparison between SRMSDp and inflation factor.

Each point is a pair of statistics for one replicate, one association model (PCA or LMM with some number of PCs r), one trait model (FES vs RC, all heritability/environments tested), and one dataset (color coded by dataset). Note log y-axis. The sigmoidal curve in Equation 10 is fit to the data.
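For reference, the inflation factor is commonly defined as the median observed chi-squared statistic divided by the median of the null chi-squared distribution with one degree of freedom. The sketch below assumes two-sided p-values from a test with one degree of freedom and uses only the Python standard library; the function name is illustrative, and the sigmoidal fit mentioned in the caption is not reproduced here.

```python
from statistics import NormalDist, median


def inflation_factor(null_pvalues):
    """Median chi-squared(1) statistic over its expected median under the null."""
    norm = NormalDist()
    # A two-sided p-value p corresponds to the chi-squared(1) statistic z**2,
    # with z = Phi^{-1}(1 - p / 2). The map is monotone, so the median statistic
    # is obtained from the median p-value (up to the averaging convention
    # used for even-sized samples).
    p_median = median(null_pvalues)
    chi2_observed = norm.inv_cdf(1 - p_median / 2) ** 2
    chi2_expected = norm.inv_cdf(0.75) ** 2  # median of chi-squared(1), about 0.455
    return chi2_observed / chi2_expected
```

A median null p-value of 0.5 gives an inflation factor of 1, while smaller null p-values give values above 1.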

Figure 2—figure supplement 2
Comparison between SRMSDp and type I error rate.

Type I error rate calculated at a p-value threshold of 1e-2 (horizontal dashed gray line). Thus, a calibrated model has a type I error rate of 1e-2 and SRMSDp=0 (where the dashed lines meet). As expected, increased type I error rates correspond to SRMSDp>0, while reduced type I error rates correspond to SRMSDp<0. Each point is a pair of statistics for one replicate, one association model (PCA or LMM with some number of PCs r), one trait model (FES vs RC, all heritability/environments tested), and one dataset (color coded by dataset). Note log y-axis.
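The type I error rate described above is the fraction of null loci whose p-value falls at or below the threshold. A minimal Python sketch (the function name is illustrative, not from the article):

```python
def type_i_error_rate(null_pvalues, threshold=1e-2):
    """Fraction of null (non-causal) loci declared significant at the threshold."""
    n_false_positives = sum(p <= threshold for p in null_pvalues)
    return n_false_positives / len(null_pvalues)
```

For calibrated (uniform) null p-values this fraction is expected to be close to the threshold itself, which is the point where the dashed lines meet in the figure.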

Figure 2—figure supplement 3
Comparison between AUCPR and calibrated power.

Calibrated power is power calculated at an empirical type I error threshold of 1e-4. Each point is a pair of statistics for one replicate, one association model (PCA or LMM with some number of PCs r), one trait model (FES vs RC, all heritability/environments tested), and one dataset (color coded by dataset). Gray dashed line is y=x line.
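One way to compute calibrated power is to pick the p-value threshold from the null loci themselves so that the empirical type I error rate matches the target, then count the causal loci that pass it. The sketch below is one possible reading of that definition, not the authors' code: the threshold is taken as the k-th smallest null p-value with k = floor(alpha × number of null loci), and ties among null p-values are ignored.

```python
import math


def calibrated_power(null_pvalues, causal_pvalues, alpha=1e-4):
    """Power at the threshold whose empirical type I error rate equals alpha."""
    null_sorted = sorted(null_pvalues)
    # Number of false positives allowed; small tolerance guards float rounding
    k = math.floor(alpha * len(null_sorted) + 1e-9)
    if k == 0:
        # Too few null loci to reach this error rate: no locus can be called
        return 0.0
    threshold = null_sorted[k - 1]
    n_detected = sum(p <= threshold for p in causal_pvalues)
    return n_detected / len(causal_pvalues)
```

With alpha=1e-4, at least 10,000 null loci are needed for the threshold to be defined under this reading.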

Figure 3 with 5 supplements
Evaluations in admixture simulations with FES traits, high heritability.

PCA and LMM models have varying numbers of PCs (r ∈ {0, ..., 90} on x-axis), with the distributions (y-axis) of SRMSDp (top subpanel) and AUCPR (bottom subpanel) for 50 replicates. Best performance is zero SRMSDp and large AUCPR. Zero and maximum median AUCPR values are marked with horizontal gray dashed lines, and |SRMSDp|<0.01 is marked with a light gray area. LMM performs best with r=0, PCA with various r. (A) Large simulation (n=1,000 individuals). (B) Small simulation (n=100) shows overfitting for large r. (C) Family simulation (n=1,000) has admixed founders and large numbers of close relatives from a realistic random 20-generation pedigree. PCA performs poorly compared to LMM: SRMSDp>0 for all r and large AUCPR gap.

Figure 3—figure supplement 1
Evaluations in admixture simulations with RC traits, high heritability.
Figure 3—figure supplement 2
Evaluations in admixture simulations with FES traits, low heritability.
Figure 3—figure supplement 3
Evaluations in admixture simulations with RC traits, low heritability.
Figure 3—figure supplement 4
Evaluations in admixture simulations with FES traits, environment.

‘LMM lab.’ was only tested with r=0.

Figure 3—figure supplement 5
Evaluations in admixture simulations with RC traits, environment.

‘LMM lab.’ was only tested with r=0.

Figure 4 with 5 supplements
Evaluations in real human genotype datasets with FES traits, high heritability.

Same setup as Figure 3; see its caption for details. These datasets strongly favor LMM with no PCs over PCA, with distributions that most resemble the family simulation. (A) Human Origins. (B) Human Genome Diversity Panel (HGDP). (C) 1000 Genomes Project.

Figure 4—figure supplement 1
Evaluations in real human genotype datasets with RC traits, high heritability.
Figure 4—figure supplement 2
Evaluations in real human genotype datasets with FES traits, low heritability.
Figure 4—figure supplement 3
Evaluations in real human genotype datasets with RC traits, low heritability.
Figure 4—figure supplement 4
Evaluations in real human genotype datasets with FES traits, environment.

‘LMM lab.’ was only tested with r=0.

Figure 4—figure supplement 5
Evaluations in real human genotype datasets with RC traits, environment.

‘LMM lab.’ was only tested with r=0.

Figure 5 with 1 supplement
Evaluations in subpopulation tree simulations fit to human data with FES traits, high heritability.

Same setup as Figure 3; see its caption for details. These tree simulations, which exclude family structure by design, do not explain the large gaps in LMM-PCA performance observed in the real data. (A) Human Origins tree simulation. (B) Human Genome Diversity Panel (HGDP) tree simulation. (C) 1000 Genomes Project tree simulation.

Figure 5—figure supplement 1
Evaluations in subpopulation tree simulations fit to human data with RC traits, high heritability.
Figure 6 with 2 supplements
Local kinship distributions.

Curves are complementary cumulative distributions of the lower triangle of each kinship matrix (self-kinship excluded), estimated with the KING-robust estimator. Note log x-axis; negative estimates are counted but not shown. Most values are below the 4th degree relative threshold. Each real dataset has a greater complementary cumulative distribution than its subpopulation tree simulation.
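The curves in this figure can be sketched as follows: collect the strictly lower-triangular entries of a kinship matrix (one value per distinct pair, so self-kinship is excluded) and report the fraction exceeding each threshold. The 4th degree cutoff of 2^-5.5 (about 0.022) is an assumption here, extrapolated from the usual KING inference ranges; the article's exact cutoff is not stated in this caption. Function and constant names are illustrative.

```python
# Assumed 4th degree relative cutoff, following the usual KING convention
FOURTH_DEGREE_THRESHOLD = 2 ** -5.5


def kinship_ccdf(kinship, thresholds):
    """Fraction of distinct pairs with kinship above each threshold."""
    n = len(kinship)
    # Strictly lower triangle (i > j): one value per pair, no self-kinship
    pairs = [kinship[i][j] for i in range(n) for j in range(i)]
    return [sum(k > t for k in pairs) / len(pairs) for t in thresholds]
```

Negative estimates stay in the denominator, matching the note that they are counted but not shown.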

Figure 6—figure supplement 1
Estimated relatedness dimensions of datasets.

(A) Kinship matrix rank estimated with the Tracy-Widom test with p<0.01. (B) Cumulative variance explained versus eigenvalue rank fraction. (C) Variance explained by first 10 eigenvalues.

Figure 6—figure supplement 2
Number of PCs significantly associated with traits.

PCs are tested sequentially using ordinary linear regression, with the kth PC tested conditionally on the previous k−1 PCs and the intercept. Q-values are estimated from the 90 p-values (one for each PC in a given dataset and replicate) using the R package qvalue assuming π0=1 (necessary since the default π0 estimates were unreliable for such small numbers of p-values and occasionally produced errors), and an FDR threshold of 0.05 is used to determine the number of significant PCs. Distribution per dataset is over its 50 replicates. Shown are results for FES traits with h²=0.8 (the results for RC were very similar, not shown).
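With π0 fixed to 1, q-values reduce to Benjamini-Hochberg adjusted p-values. The Python sketch below illustrates that calculation and the resulting count of significant PCs; it is a stand-in for the R package named in the caption, not a port of it, and it takes the per-PC p-values as given rather than performing the sequential regressions. Function names are illustrative.

```python
def qvalues_pi0_one(pvalues):
    """Q-values assuming pi0 = 1 (Benjamini-Hochberg adjusted p-values)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest to the smallest p-value so q-values stay monotone
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        q[i] = running_min
    return q


def count_significant(pvalues, fdr=0.05):
    """Number of tests (here, PCs) with q-value at or below the FDR threshold."""
    return sum(q <= fdr for q in qvalues_pi0_one(pvalues))
```

The output keeps the input order, so q[k] corresponds to the (k+1)th PC tested.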

Figure 7 with 1 supplement
Evaluation in real datasets excluding 4th degree relatives, FES traits, high heritability.

Each dataset is a column, rows are measures. Boxplot whiskers are extrema over 50 replicates. First row has |SRMSDp|<0.01 band marked as gray area.

Figure 7—figure supplement 1
Evaluation in real datasets excluding 4th degree relatives, FES traits, low heritability.
Figure 8 with 1 supplement
Evaluation in real datasets excluding 4th degree relatives, FES traits, environment.

Traits are simulated with environment effects; otherwise the setup is the same as Figure 7. ‘LMM lab.’ includes as fixed effects the true groups from which the environment effects were simulated.

Figure 8—figure supplement 1
Comparison of performance in low heritability vs environment simulations.

Each curve traces the mean value over replicates of either SRMSDp or AUCPR as the number of PCs r increases from r=0 (marked with an “x”) to r=90 (unmarked end), with low heritability simulations on the x-axis and environment simulations on the y-axis. Each curve corresponds to one dataset (color) and one association model (solid or dashed line). Columns: (A) FES and (B) RC traits show similar results. First row: for PCA (dashed), SRMSDp is higher (worse) in environment simulations for low r but becomes equal in both simulations once r is sufficiently large; for LMM (solid), SRMSDp is equal in both simulations for all r and all datasets. Second row: for PCA, AUCPR is higher (better) in low heritability simulations for low r but becomes higher in environment simulations once r is sufficiently large; for LMM, AUCPR is higher in environment simulations for all r and all datasets.

Tables

Table 1
Previous PCA-LMM evaluations in the literature.
Sim. Genotypes | General
Publication | Type* | K | FST | Real § | Trait | Power | PCs (r) | Best
Zhao et al., 2007 |  |  |  |  | Q |  | 8 | LMM
Zhu and Yu, 2009 | I, A, F | 3, 8 | ≤0.15 |  | Q |  | 1–22 | LMM
Astle and Balding, 2009 | I | 3 | 0.10 |  | CC |  | 10 | Tie
Kang et al., 2010 |  |  |  |  | Both |  | 2–100 | LMM
Price et al., 2010 | I, F | 2 | 0.01 |  | CC |  | 1 | Mixed
Wu et al., 2011 | I, A | 2–4 | 0.01 |  | CC |  | 10 | Mixed
Liu et al., 2011 | S, A | 2–3 | R |  | Q |  | 10 | Tie
Sul and Eskin, 2013 | I | 2 | 0.01 |  | CC |  | 1 | Tie
Tucker et al., 2014 | I | 2 | 0.05 |  | Both |  | 5 | Tie
Yang et al., 2014 |  |  |  |  | CC |  | 5 | Tie
Song et al., 2015 | S, A | 2–3 | R |  | Q |  | 3 | LMM
Loh et al., 2015 |  |  |  |  | Q |  | 10 | LMM
Zhang and Pan, 2015 |  |  |  |  | Q |  | 20–100 | LMM
Liu et al., 2016 |  |  |  |  | Q |  | 3–6 | LMM
Sul et al., 2018 |  |  |  |  | Q |  | 100 | LMM
Loh et al., 2018 |  |  |  |  | Both |  | 20 | LMM
Mbatchou et al., 2021 |  |  |  |  | Both |  | 1 | LMM
This work | A, T, F | 10–243 | ≤0.25 |  | Q |  | 0–90 | LMM

  1. * Genotype simulation types. I: Independent subpopulations; S: subpopulations (with parameters drawn from real data); A: Admixture; T: Subpopulation Tree; F: Family.
  2. Model dimension (number of subpopulations or ancestries).
  3. R: simulated parameters based on real data, FST not reported.
  4. § Evaluations using unmodified real genotypes.
  5. Q: quantitative; CC: case-control.

Table 2
Features of simulated and real human genotype datasets.
Dataset | Type | Loci (m) | Ind. (n) | Subpops.* (K) | Causal loci (m1) | FST
Admix. Large sim. | Admix. | 100 000 | 1000 | 10 | 100 | 0.1
Admix. Small sim. | Admix. | 100 000 | 100 | 10 | 10 | 0.1
Admix. Family sim. | Admix.+Pedig. | 100 000 | 1000 | 10 | 100 | 0.1
Human Origins | Real | 190 394 | 2922 | 11–243 | 292 | 0.28
HGDP | Real | 771 322 | 929 | 7–54 | 93 | 0.28
1000 Genomes | Real | 1 111 266 | 2504 | 5–26 | 250 | 0.22
Human Origins sim. | Tree | 190 394 | 2922 | 243 | 292 | 0.23
HGDP sim. | Tree | 771 322 | 929 | 54 | 93 | 0.25
1000 Genomes sim. | Tree | 1 111 266 | 2504 | 26 | 250 | 0.21

  1. * For admixed family, ignores additional model dimension of 20 generation pedigree structure. For real datasets, lower range is continental subpopulations, upper range is number of fine-grained subpopulations.
  2. m1 = round(n h²/8) to balance power across datasets, shown for h² = 0.8 only.
  3. Model parameter for simulations, estimated value on real datasets.
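The footnote formula m1 = round(n h²/8) can be checked directly against the sample sizes in Table 2. A minimal Python sketch (the function name is illustrative):

```python
def n_causal_loci(n_individuals, heritability):
    """Number of causal loci, m1 = round(n * h^2 / 8), per the Table 2 footnote."""
    return round(n_individuals * heritability / 8)


# Sample sizes from Table 2, evaluated at h^2 = 0.8
for name, n in [("Admix. Large sim.", 1000), ("Admix. Small sim.", 100),
                ("Human Origins", 2922), ("HGDP", 929), ("1000 Genomes", 2504)]:
    print(name, n_causal_loci(n, 0.8))
```

At h² = 0.8 this gives 100, 10, 292, 93 and 250 causal loci, matching the m1 column of the table.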

Table 3
Overview of PCA and LMM evaluations for high heritability simulations.
LMM r=0 vs best r | PCA vs LMM r=0
Dataset | Metric | Trait* | Cal. | Best r | P-value § | Best r | Cal. | P-value § | Best model
Admix. Large sim.|SRMSDp|FESTrue0112True0.036Tie
Admix. Small sim.|SRMSDp|FESTrue014True0.055Tie
Admix. Family sim.|SRMSDp|FESTrue0190False3.9e-10*LMM
Human Origins|SRMSDp|FESTrue0189False3.9e-10*LMM
HGDP|SRMSDp|FESTrue0187True4.4e-10*LMM
1000 Genomes|SRMSDp|FESTrue0190False3.9e-10*LMM
Human Origins sim.|SRMSDp|FESTrue0188True0.017Tie
HGDP sim.|SRMSDp|FESTrue0147True0.046Tie
1000 Genomes sim.|SRMSDp|FESTrue0178True9.6e-10*LMM
Admix. Large sim.|SRMSDp|RCTrue0126True0.11Tie
Admix. Small sim.|SRMSDp|RCTrue014True0.00097Tie
Admix. Family sim.|SRMSDp|RCTrue0190False3.9e-10*LMM
Human Origins|SRMSDp|RCTrue0190True0.00065Tie
HGDP|SRMSDp|RCTrue0137True1.5e-05*LMM
1000 Genomes|SRMSDp|RCTrue0176True3.9e-10*LMM
Human Origins sim.|SRMSDp|RCTrue0185True0.14Tie
HGDP sim.|SRMSDp|RCTrue0144True8.8e-07*LMM
1000 Genomes sim.|SRMSDp|RCTrue0190True3.9e-10*LMM
Admix. Large sim.AUCPRFES0135.9e-06*LMM
Admix. Small sim.AUCPRFES0120.025Tie
Admix. Family sim.AUCPRFES10.35223.9e-10*LMM
Human OriginsAUCPRFES01343.9e-10*LMM
HGDPAUCPRFES10.33164.4e-10*LMM
1000 GenomesAUCPRFES10.1183.9e-10*LMM
Human Origins sim.AUCPRFES01363.9e-10*LMM
HGDP sim.AUCPRFES01171.7e-05*LMM
1000 Genomes sim.AUCPRFES01105e-10*LMM
Admix. Large sim.AUCPRRC0131.4e-05*LMM
Admix. Small sim.AUCPRRC0110.095Tie
Admix. Family sim.AUCPRRC01343.9e-10*LMM
Human OriginsAUCPRRC30.4369.6e-10*LMM
HGDPAUCPRRC40.21160.013Tie
1000 GenomesAUCPRRC50.00490.00043Tie
Human Origins sim.AUCPRRC01374.1e-10*LMM
HGDP sim.AUCPRRC30.087170.0014Tie
1000 Genomes sim.AUCPRRC30.37108.5e-10*LMM
  1. * FES: Fixed Effect Sizes, RC: Random Coefficients.
  2. Calibrated: whether mean |SRMSDp|<0.01 over 50 replicates.
  3. Value of r (number of PCs) with minimum mean |SRMSDp| or maximum mean AUCPR.
  4. § Wilcoxon paired 1-tailed test of distributions (|SRMSDp| or AUCPR) between models in header. Asterisk marks significant value using Bonferroni threshold (p<α/ntests, where α=0.01 and ntests=72 is the number of tests in this table).
  5. Tie if no significant difference using Bonferroni threshold.
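The Bonferroni rule in the footnote can be written out as a short check. This Python sketch uses α=0.01 and ntests=72 from the footnote; the function names are illustrative.

```python
def bonferroni_threshold(alpha=0.01, n_tests=72):
    """Per-test significance threshold, alpha / n_tests."""
    return alpha / n_tests


def is_significant(p_value, alpha=0.01, n_tests=72):
    """True if the p-value passes the Bonferroni threshold (marked with an asterisk)."""
    return p_value < bonferroni_threshold(alpha, n_tests)
```

With 72 tests the threshold is about 1.4e-4, so a p-value of 1.5e-05 is marked significant while 0.00097 is reported as a tie, as in the table rows above.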

Table 4
Dataset sizes after 4th degree relative filter.
Dataset | Loci (m) | Ind. (n) | Ind. removed (%)
Human Origins | 189 722 | 2636 | 9.8
HGDP | 758 009 | 847 | 8.8
1000 Genomes | 1 097 415 | 2390 | 4.6
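The percentage of individuals removed follows from the sample sizes before the filter (Table 2) and after it (Table 4). A minimal Python sketch of that arithmetic (names are illustrative):

```python
# (individuals before filter from Table 2, individuals after filter from Table 4)
SAMPLE_SIZES = {
    "Human Origins": (2922, 2636),
    "HGDP": (929, 847),
    "1000 Genomes": (2504, 2390),
}


def percent_removed(n_before, n_after):
    """Percentage of individuals dropped by the relatedness filter."""
    return 100 * (n_before - n_after) / n_before


for dataset, (n_before, n_after) in SAMPLE_SIZES.items():
    print(dataset, round(percent_removed(n_before, n_after), 1))
```

Rounded to one decimal, these reproduce the 9.8, 8.8 and 4.6 in the last column.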
Table 5
Overview of PCA and LMM evaluations for low heritability simulations.
LMM r=0 vs best r | PCA vs LMM r=0
Dataset | Metric | Trait* | Cal. | Best r | p-value § | Best r | Cal. | p-value § | Best model
Admix. Large sim.|SRMSDp|FESTrue0162True0.00012*LMM
Admix. Small sim.|SRMSDp|FESTrue013True0.27Tie
Admix. Family sim.|SRMSDp|FESTrue0190False3.9e-10*LMM
Human Origins|SRMSDp|FESTrue0181True3.9e-10*LMM
HGDP|SRMSDp|FESTrue0137True6.2e-09*LMM
1000 Genomes|SRMSDp|FESTrue0184True3.9e-10*LMM
Admix. Large sim.|SRMSDp|RCTrue0135True0.00094Tie
Admix. Small sim.|SRMSDp|RCTrue013True0.087Tie
Admix. Family sim.|SRMSDp|RCTrue0190False4.1e-10*LMM
Human Origins|SRMSDp|RCTrue0175True0.00016*LMM
HGDP|SRMSDp|RCTrue0123True1.7e-05*LMM
1000 Genomes|SRMSDp|RCTrue0141True6.7e-10*LMM
Admix. Large sim.AUCPRFES0130.11Tie
Admix. Small sim.AUCPRFES0100.58Tie
Admix. Family sim.AUCPRFES0172.2e-06*LMM
Human OriginsAUCPRFES01168e-10*LMM
HGDPAUCPRFES110.6860.0043Tie
1000 GenomesAUCPRFES60.3442.3e-07*LMM
Admix. Large sim.AUCPRRC0130.14Tie
Admix. Small sim.AUCPRRC0100.1Tie
Admix. Family sim.AUCPRRC0151.9e-06*LMM
Human OriginsAUCPRRC40.16120.003Tie
HGDPAUCPRRC20.1450.14Tie
1000 GenomesAUCPRRC0140.078Tie
  1. * FES: Fixed Effect Sizes, RC: Random Coefficients.
  2. Calibrated: whether mean |SRMSDp|<0.01 over 50 replicates.
  3. Value of r (number of PCs) with minimum mean |SRMSDp| or maximum mean AUCPR.
  4. § Wilcoxon paired 1-tailed test of distributions (|SRMSDp| or AUCPR) between models in header. Asterisk marks significant value using Bonferroni threshold (p<α/ntests, where α=0.01 and ntests=48 is the number of tests in this table).
  5. Tie if no significant difference using Bonferroni threshold.

Table 6
Overview of PCA and LMM evaluations for environment simulations.
LMM r=0 vs best r | PCA vs LMM r=0 | LMM lab. r=0 vs PCA/LMM
Dataset | Metric | Trait* | Cal. | r | p-value § | r | Cal. | p-value § | Best | Cal. | p-value § | Best
Admix. Large sim.|SRMSDp|FESTrue0183True0.38TieTrue1.8e-14*PCA/LMM
Admix. Small sim.|SRMSDp|FESTrue0190True0.001TieFalse1.4e-14*PCA/LMM
Admix. Family sim.|SRMSDp|FESTrue40.1890False3.9e-10*LMMTrue0.066LMM/LMM lab.
Human Origins|SRMSDp|FESTrue93.9e-05*90False1.4e-08*LMMFalse3.9e-10*LMM
HGDP|SRMSDp|FESTrue0190True0.0037TieFalse2.1e-09*PCA/LMM
1000 Genomes|SRMSDp|FESFalse88.8e-08*85True0.053TieTrue3.9e-10*LMM lab.
Admix. Large sim.|SRMSDp|RCTrue0160True0.033TieTrue6.3e-10*PCA/LMM
Admix. Small sim.|SRMSDp|RCTrue019True0.85TieFalse1.4e-14*PCA/LMM
Admix. Family sim.|SRMSDp|RCTrue50.1490False3.9e-10*LMMTrue0.011LMM/LMM lab.
Human Origins|SRMSDp|RCFalse91.1e-08*90True2.3e-07*PCAFalse3.9e-10*PCA
HGDP|SRMSDp|RCTrue0189True6.5e-09*PCAFalse3.9e-10*PCA
1000 Genomes|SRMSDp|RCFalse81.6e-08*88True4.9e-09*PCATrue0.09PCA/LMM lab.
Admix. Large sim.AUCPRFES42.4e-06*60.0021Tie1.8e-15*LMM lab.
Admix. Small sim.AUCPRFES30.05540.033Tie0.28Tie
Admix. Family sim.AUCPRFES127e-04633.9e-10*LMM3.9e-10*LMM lab.
Human OriginsAUCPRFES203.7e-06*901.4e-05*LMM3.9e-10*LMM lab.
HGDPAUCPRFES124.3e-06*450.0044Tie3.9e-10*LMM lab.
1000 GenomesAUCPRFES91.9e-08*550.028Tie3.9e-10*LMM lab.
Admix. Large sim.AUCPRRC40.0008550.0018Tie5e-10*LMM lab.
Admix. Small sim.AUCPRRC20.1350.093Tie0.0028Tie
Admix. Family sim.AUCPRRC90.01861.7e-09*LMM3.9e-10*LMM lab.
Human OriginsAUCPRRC220.0039901e-06*PCA3.9e-10*LMM lab.
HGDPAUCPRRC190.0057642.8e-05*PCA3e-07*LMM lab.
1000 GenomesAUCPRRC98.7e-05*871.2e-09*PCA4.4e-10*LMM lab.
  1. * FES: Fixed Effect Sizes, RC: Random Coefficients.
  2. Calibrated: whether mean |SRMSDp|<0.01 over 50 replicates.
  3. Value of r (number of PCs) with minimum mean |SRMSDp| or maximum mean AUCPR.
  4. § Wilcoxon paired 1-tailed test of distributions (|SRMSDp| or AUCPR) between models in header. Asterisk marks significant value using Bonferroni threshold (p<α/ntests, where α=0.01 and ntests=72 is the number of tests in this table).
  5. Tie if no significant difference using Bonferroni threshold; in last column, pairwise ties are specified and “Tie” is three-way tie.

Table 7
Variance parameters of trait simulations.
Trait variance type | h² | σ²_η | σ²_ε
High heritability | 0.8 | 0.0 | 0.2
Low heritability | 0.3 | 0.0 | 0.7
Environment | 0.3 | 0.5 | 0.2
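In every row of Table 7 the three variance components sum to one, which suggests the trait is scaled to unit total variance. The Python sketch below checks that reading; the additive decomposition is an assumption made for illustration, and the names are not from the article.

```python
import math

# (h2, sigma2_eta, sigma2_eps) for each trait variance type, copied from Table 7
VARIANCE_PARAMS = {
    "High heritability": (0.8, 0.0, 0.2),
    "Low heritability": (0.3, 0.0, 0.7),
    "Environment": (0.3, 0.5, 0.2),
}


def total_variance(h2, sigma2_eta, sigma2_eps):
    """Sum of genetic, environment and residual variance components (assumed additive)."""
    return h2 + sigma2_eta + sigma2_eps


for name, params in VARIANCE_PARAMS.items():
    # Each row should describe a trait with unit total variance
    assert math.isclose(total_variance(*params), 1.0), name
```

Under this reading, the environment simulations keep the low heritability value (0.3) and move 0.5 of the variance from the residual to the environment component.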


Cite this article

Yiqi Yao, Alejandro Ochoa (2023) Limitations of principal components in quantitative genetic association models for human studies. eLife 12:e79238. https://doi.org/10.7554/eLife.79238