Limitations of principal components in quantitative genetic association models for human studies

  1. Yiqi Yao
  2. Alejandro Ochoa  Is a corresponding author
  1. Duke University, United States

Abstract

Principal Component Analysis (PCA) and the Linear Mixed-effects Model (LMM), sometimes in combination, are the most common genetic association models. Previous PCA-LMM comparisons give mixed results, unclear guidance, and have several limitations, including not varying the number of principal components (PCs), simulating simple population structures, and inconsistent use of real data and power evaluations. We evaluate PCA and LMM both varying number of PCs in realistic genotype and complex trait simulations including admixed families, subpopulation trees, and real multiethnic human datasets with simulated traits. We find that LMM without PCs usually performs best, with the largest effects in family simulations and real human datasets and traits without environment effects. Poor PCA performance on human datasets is driven by large numbers of distant relatives more than the smaller number of closer relatives. While PCA was known to fail on family data, we report strong effects of family relatedness in genetically diverse human datasets, not avoided by pruning close relatives. Environment effects driven by geography and ethnicity are better modeled with LMM including those labels instead of PCs. This work better characterizes the severe limitations of PCA compared to LMM in modeling the complex relatedness structures of multiethnic human data for association studies.

Data availability

The current manuscript is a computational study, so no data have been generated for this manuscript. Code is available at https://github.com/OchoaLab/pca-assoc-paper

The following previously published data sets were used

Article and author information

Author details

  1. Yiqi Yao

    Department of Biostatistics and Bioinformatics, Duke University, Durham, United States
    Competing interests
    Yiqi Yao, is affiliated with BenHealth Consulting. The author has no financial interests to declare..
  2. Alejandro Ochoa

    Department of Biostatistics and Bioinformatics, Duke University, Durham, United States
    For correspondence
    alejandro.ochoa@duke.edu
    Competing interests
    No competing interests declared.
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0003-4928-3403

Funding

Whitehead Foundation

  • Alejandro Ochoa

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Copyright

© 2023, Yao & Ochoa

This article is distributed under the terms of the Creative Commons Attribution License permitting unrestricted use and redistribution provided that the original author and source are credited.

Metrics

  • 1,483
    views
  • 141
    downloads
  • 7
    citations

Views, downloads and citations are aggregated across all versions of this paper published by eLife.

Download links

A two-part list of links to download the article, or parts of the article, in various formats.

Downloads (link to download the article as PDF)

Open citations (links to open the citations from this article in various online reference manager services)

Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)

  1. Yiqi Yao
  2. Alejandro Ochoa
(2023)
Limitations of principal components in quantitative genetic association models for human studies
eLife 12:e79238.
https://doi.org/10.7554/eLife.79238

Share this article

https://doi.org/10.7554/eLife.79238

Further reading

    1. Developmental Biology
    2. Genetics and Genomics
    Anne-Sophie Pepin, Patrycja A Jazwiec ... Sarah Kimmins
    Research Article Updated

    Paternal obesity has been implicated in adult-onset metabolic disease in offspring. However, the molecular mechanisms driving these paternal effects and the developmental processes involved remain poorly understood. One underexplored possibility is the role of paternally induced effects on placenta development and function. To address this, we investigated paternal high-fat diet-induced obesity in relation to sperm histone H3 lysine 4 tri-methylation signatures, the placenta transcriptome, and cellular composition. C57BL6/J male mice were fed either a control or high-fat diet for 10 weeks beginning at 6 weeks of age. Males were timed-mated with control-fed C57BL6/J females to generate pregnancies, followed by collection of sperm, and placentas at embryonic day (E)14.5. Chromatin immunoprecipitation targeting histone H3 lysine 4 tri-methylation (H3K4me3) followed by sequencing (ChIP-seq) was performed on sperm to define obesity-associated changes in enrichment. Paternal obesity corresponded with altered sperm H3K4me3 at promoters of genes involved in metabolism and development. Notably, altered sperm H3K4me3 was also localized at placental enhancers. Bulk RNA-sequencing on placentas revealed paternal obesity-associated sex-specific changes in expression of genes involved in hypoxic processes such as angiogenesis, nutrient transport, and imprinted genes, with a subset of de-regulated genes showing changes in H3K4me3 in sperm at corresponding promoters. Paternal obesity was also linked to impaired placenta development; specifically, a deconvolution analysis revealed altered trophoblast cell lineage specification. These findings implicate paternal obesity effects on placenta development and function as one potential developmental route to offspring metabolic disease.

    1. Biochemistry and Chemical Biology
    2. Genetics and Genomics
    Federico A Vignale, Andrea Hernandez Garcia ... Adrian G Turjanski
    Research Article

    Yerba mate (YM, Ilex paraguariensis) is an economically important crop marketed for the elaboration of mate, the third-most widely consumed caffeine-containing infusion worldwide. Here, we report the first genome assembly of this species, which has a total length of 1.06 Gb and contains 53,390 protein-coding genes. Comparative analyses revealed that the large YM genome size is partly due to a whole-genome duplication (Ip-α) during the early evolutionary history of Ilex, in addition to the hexaploidization event (γ) shared by core eudicots. Characterization of the genome allowed us to clone the genes encoding methyltransferase enzymes that catalyse multiple reactions required for caffeine production. To our surprise, this species has converged upon a different biochemical pathway compared to that of coffee and tea. In order to gain insight into the structural basis for the convergent enzyme activities, we obtained a crystal structure for the terminal enzyme in the pathway that forms caffeine. The structure reveals that convergent solutions have evolved for substrate positioning because different amino acid residues facilitate a different substrate orientation such that efficient methylation occurs in the independently evolved enzymes in YM and coffee. While our results show phylogenomic constraint limits the genes coopted for convergence of caffeine biosynthesis, the X-ray diffraction data suggest structural constraints are minimal for the convergent evolution of individual reactions.