A sex-specific evolutionary interaction between ADCY9 and CETP

  1. Isabel Gamache
  2. Marc-André Legault
  3. Jean-Christophe Grenier
  4. Rocio Sanchez
  5. Eric Rhéaume
  6. Samira Asgari
  7. Amina Barhdadi
  8. Yassamin Feroz Zada
  9. Holly Trochet
  10. Yang Luo
  11. Leonid Lecca
  12. Megan Murray
  13. Soumya Raychaudhuri
  14. Jean-Claude Tardif
  15. Marie-Pierre Dubé
  16. Julie Hussin  Is a corresponding author
  1. Université de Montréal, Canada
  2. Montreal Heart Institute, Canada
  3. Université de Montréal Beaulieu-Saucier Pharmacogenomics Centre, Canada
  4. Center for Data Sciences, Brigham and Women’s Hospital, Harvard Medical School, United States
  5. Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, United States
  6. Socios En Salud, Peru
  7. Harvard Medical School, United States
  8. Centre for Genetics and Genomics Versus Arthritis, Manchester Academic Health Science Centre, University of Manchester, United Kingdom
  9. Department of Biomedical Informatics, Harvard Medical School, United States
  10. Department of Medicine, Brigham and Women’s Hospital and Harvard Medical School, United States

Abstract

Pharmacogenomic studies have revealed associations between rs1967309 in the adenylyl cyclase type 9 (ADCY9) gene and clinical responses to the cholesteryl ester transfer protein (CETP) modulator dalcetrapib, however, the mechanism behind this interaction is still unknown. Here, we characterized selective signals at the locus associated with the pharmacogenomic response in human populations and we show that rs1967309 region exhibits signatures of positive selection in several human populations. Furthermore, we identified a variant in CETP, rs158477, which is in long-range linkage disequilibrium with rs1967309 in the Peruvian population. The signal is mainly seen in males, a sex-specific result that is replicated in the LIMAA cohort of over 3400 Peruvians. Analyses of RNA-seq data further suggest an epistatic interaction on CETP expression levels between the two SNPs in multiple tissues, which also differs between males and females. We also detected interaction effects of the two SNPs with sex on cardiovascular phenotypes in the UK Biobank, in line with the sex-specific genotype associations found in Peruvians at these loci. We propose that ADCY9 and CETP coevolved during recent human evolution due to sex-specific selection, which points toward a biological link between dalcetrapib’s pharmacogene ADCY9 and its therapeutic target CETP.

Introduction

Coronary artery disease (CAD) is the leading cause of mortality worldwide. It is a complex disease caused by the accumulation of cholesterol-loaded plaques that block blood flow in the coronary arteries. The cholesteryl ester transfer protein (CETP) mediates the exchange of cholesterol esters and triglycerides between high-density lipoproteins (HDL) and lower density lipoproteins (Lagrost, 1994; Shinkai, 2012). Dalcetrapib is a CETP modulator that did not reduce cardiovascular event rates in the overall dal-OUTCOMES trial of patients with recent acute coronary syndrome (Schwartz et al., 2012). However, pharmacogenomic analyses revealed that genotypes at rs1967309 in the ADCY9 gene, coding for the ninth isoform of adenylate cyclase, modulated clinical responses to dalcetrapib (Tardif et al., 2015). Individuals who carried the AA genotype at rs1967309 in ADCY9 had less cardiovascular events, reduced atherosclerosis progression, and enhanced cholesterol efflux from macrophages when treated with dalcetrapib compared to placebo (Tardif et al., 2015; Tardif et al., 2016). In contrast, those with the GG genotype had the opposite effects from dalcetrapib. Furthermore, a protective effect against the formation of atherosclerotic lesions was seen only in the absence of both Adcy9 and CETP in mice (Rautureau et al., 2018), suggesting an interaction between the two genes. However, the underlying mechanisms linking CETP and ADCY9, located 50 Mb apart on chromosome 16, as well as the relevance of the rs1967309 non-coding genetic variant are still unclear.

Identification of selection pressure on a genetic variant can help shed light on its importance. Adaptation to different environments often leads to a rise in frequency of variants, by favoring survival and/or reproduction fitness. An example is the lactase gene (LCT) (Bersaglieri et al., 2004; Enattah et al., 2007; Gamba et al., 2014; Itan et al., 2009; Poulter et al., 2003), where a positively selected intronic variant in MCM6 leads to an escape from epigenetic inactivation of LCT and facilitates lactase persistence after weaning (Labrie et al., 2016). Results of genomic studies for phenotypes such as adaptation to high-altitude hypoxia in Tibetans (Yi et al., 2010), fatty acid metabolism in Inuits (Fumagalli et al., 2015) or response to pathogens across populations (Hollenbach et al., 2001) have also been confirmed by functional studies (Li et al., 2018; Reynolds et al., 2020; Tashi et al., 2017; Meyer et al., 2018; Blais et al., 2012). Thus, population and regulatory genomics can be leveraged to unveil the effect of genetic mutations at a single non-coding locus and reveal the biological mechanisms of adaptation.

When two or more loci interact during adaptation, a genomic scan will likely be underpowered to pinpoint the genetic determinants. In this study, we took a multi-step approach on the ADCY9 and CETP candidate genes to specifically study their interaction (Figure 1). We used a joint evolutionary analysis to evaluate the potential signatures of selection in these genes (Step 1), which revealed positive selection pressures acting on ADCY9. Sex-specific genetic associations between the two genes are discovered in Peruvians (Step 2), a population in which natural selection for high-altitude was previously found on genes related to cardiovascular health (Crawford et al., 2017). Furthermore, our know-down experiments and analyses of large-scale transcriptomics (Step 3) as well as available phenome-wide resources (Step 4) bring further evidence of a sex-specific epistatic interaction between ADCY9 and CETP.

Flowchart of experimental design and main results.

The four main steps of the analyses conducted in this study are reported along with the datasets used for each step and the genetic loci on which the analyses are performed. Green colored boxes represent analyses for which sex is considered. Abbreviations: KD = Knock down; OX = Overexpression.

Results

Signatures of selection at rs1967309 in ADCY9 in human populations

The genetic variant rs1967309 is located in intron 2 of ADCY9, in a region of high linkage disequilibrium (LD), in all subpopulations in the 1000 Genomes Project (1000G), and harbors heterogeneous genotype frequencies across human populations (Figure 2a). Its intronic location makes it difficult to assess its functional relevance but exploring selective signals around intronic SNPs in human populations can shed light on their importance. In African populations (AFR), the major genotype is AA, which is the homozygous genotype for the ancestral allele, whereas in Europeans (EUR), AA is the minor genotype. The frequency of the AA genotype is slightly higher in Asia (EAS, SAS) and America (AMR) compared to that in Europe, becoming the most frequent genotype in the Peruvian population (PEL). Using the integrated haplotype score (iHS) (Voight et al., 2006) (Step 1 a, Figure 1), a statistic that enables the detection of evidence for recent strong positive selection (typically when |iHS| > 2), we observed that several SNPs in the LD block around rs1967309 exhibit selective signatures in non-African populations (|iHSSAS| = 2.66, |iHSEUR| = 2.31), whereas no signal is seen in this LD block in African populations (Figure 2b, Appendix 1—figure 1, Appendix 1). Our analyses suggest that this locus in ADCY9 has been the target of recent positive selection in several human populations, with multiple, possibly independent, selective signals detectable around rs1967309. However, recent positive selection as measured by iHS does not seem to explain the notable increase in frequency for the A allele in the PEL population (fA = 0.77), compared to the European (fA = 0.41), Asian (fA = 0.44), and other American populations (fA = 0.54 in AMR without PEL).

Natural selection signature at rs1967309 in ADCY9.

(a) Genotype frequency distribution of rs1967309 in populations from the 1000 Genomes (1000G) Project and in Native Americans (NAGD). (b) Significant iHS values (absolute values above 2) for 1000G continental populations and recombination rates from AMR-1000G population-specific genetic maps, in the ADCY9 gene. (c) PBS values in the ADCY9 gene, in CHB (outgroup, left panel), PEL (middle panel), and MXL (right panel). Horizontal lines represent the 95th percentile PBS value genome-wide for each population. Vertical dotted black lines define the LD block around rs1967309 (black circle) from 1000G population-specific genetic maps. Gene plots for ADCY9 showing location of its exons are presented in blue below each plot. Abbreviations: Altaic from Mongolia and Russia: ALT; Uralic Yukaghir from Russia: URY; Chukchi Kamchatkan from Russia: CHK; Northern American from Canada, Guatemala and Mexico: NOA; Central American from Costal Rica and Mexico: CEA; Chibchan Paezan from Argentina, Bolivia, Colombia, Costa Rica, and Mexico: CHP; Equatorial Tucanoan from Argentina, Brazil, Colombia, Gualana and Paraguay: EQT; Andean from Bolivia, Chile, Colombia and Peru: AND. For 1000G populations, abbreviations can be found here https://www.internationalgenome.org/category/population/.

Figure 2—source data 1

Source file for genotype frequency distribution of rs1967309.

This file contains the genotype frequency of rs1967309 for each subpopulation from 1000G and NAGD, with the number of individuals by subpopulation.

https://cdn.elifesciences.org/articles/69198/elife-69198-fig2-data1-v2.txt
Figure 2—source data 2

Source file for iHS plot in the ADCY9 gene.

This file contains the iHS values for each position in the ADCY9 gene for each population of the 1000G dataset.

https://cdn.elifesciences.org/articles/69198/elife-69198-fig2-data2-v2.txt
Figure 2—source data 3

Source file for PBS plots in the ADCY9 gene.

This file contains the PBS value for PEL, MXL, and CHB for each position in the ADCY9 gene.

https://cdn.elifesciences.org/articles/69198/elife-69198-fig2-data3-v2.txt

To test whether the difference between PEL and other AMR allele frequencies at rs1967309 is significant, we used the population branch statistic (PBS) (Step 1b, Figure 1). This statistic has been developed to locate selection signals by summarizing differentiation between populations using a three-way comparison of allele frequencies between a specific group, a closely related population, and an outgroup (Yi et al., 2010). It has been shown to increase power to detect incomplete selective sweeps on standing variation. Applying this statistic to investigate rs1967309 allele frequency in PEL, we used Mexicans (MXL) as a closely related group and a Chinese population (CHB) as the outgroup (Methods). Over the entire genome, the CHB branches are greater than PEL and MXL branches (meanCHB = 0.020, meanMXL = 0.008, meanPEL = 0.009), which reflects the expectation under genetic drift. However, the estimated PEL branch length at rs1967309 (Figure 2c), which reflects differentiation since the split from the MXL population (PBSPEL,rs1967309=0.051, empirical p-value = 0.014), surpasses the CHB branch length (PBSCHB,rs1967309=0.049, empirical p-value > 0.05), which reflects differentiation since the split between Asian and American populations, whereas no such effect is seen in MXL (PBSMXL,rs1967309=0.026, empirical p-value > 0.05), or for any other AMR populations. Furthermore, the PEL branch lengths at several SNPs in this LD block (Figure 2c) are in the top 5 % of all PEL branch lengths across the whole genome (PBSPEL,95th = 0.031), whereas these increased branch lengths are not observed outside of the LD block (Figure 2c). These results are robust to the choice of the outgroup and the closely related AMR population (Materials and methods).

The increase in frequency of the A allele at rs1967309 is also seen in genotype data from Native American populations (Reich et al., 2012), with Andeans showing genotype frequencies highly similar to PEL (fA = 0.77, Figure 2a). The PEL population has a large Andean ancestry (Materials and methods, Appendix 1—figure 2a and b) and almost no African ancestry, strongly suggesting that the increase in AA genotype arose in the Andean population and not from admixture with Africans. The PEL individuals that harbor the AA genotype for rs1967309 do not exhibit a larger genome-wide Andean ancestry than non-AA individuals (p-value = 0.30, Mann-Whitney U test). Overall, these results suggest that the ancestral allele A at rs1967309, after dropping in frequency following the out-of-Africa event, has increased in frequency in the Andean population and has been preferentially retained in the Peruvian population’s genetic makeup, potentially because of natural selection.

Evidence for co-evolution between ADCY9 and CETP in Peru

The pharmacogenetic link between ADCY9 and the CETP modulator dalcetrapib raises the question of whether there is a genetic relationship between rs1967309 in ADCY9 and CETP, both located on chromosome 16. Such a relationship can be revealed by analyzing patterns of long-range linkage disequilibrium (LRLD) (Lewontin and Kojima, 1960; Rohlfs et al., 2010), in order to detect whether specific combinations of alleles (or genotypes) at two loci are particularly overrepresented. To do so, we calculated the genotyped-based linkage disequilibrium (r2) (Step 2 a, Figure 1) between rs1967309 and each SNP in CETP with minor allele frequency (MAF) above 0.05. In the Peruvian population, there are four SNPs, (including two in perfect LD in PEL) that exhibit r2 values with rs1967309 that are in the top 1 % of r2 values (Figure 3a) computed for all 37,802 pairs of SNPs in ADCY9 and CETP genes with MAF >0.05 (Materials and methods). Despite the r2 values themselves being low (r2rs158477=0.080, r2rs158480;rs158617=0.089, r2rs12447620=0.090), these values are highly unexpected for these two genes situated 50 Mb apart (ADCY9/CETP empirical p-value < 0.006, Appendix 1—table 1) and thus correspond to a significant LRLD signal. This signal is not seen in other 1000G populations (Appendix 1—table 1). We also computed r2 between the four identified SNPs’ genotypes and all ADCY9 SNPs with MAF above 0.05 (Figure 3b). The distribution of r2 values for the rs158477 CETP SNP shows a clear bell-shaped pattern around rs1967309 in ADCY9, which strongly suggests the rs1967309-rs158477 genetic association detected is not simply a statistical fluke, while the signal in the region for the other SNPs is less conclusive. The SNP rs158477 in CETP is also the only one that has a PEL branch length value higher than the 95th percentile, also higher than the CHB branch length value (PBSPEL,rs158477=0.062, Appendix 1—figure 3a), in line with the observation at rs1967309. Strikingly, this CETP SNP’s genotype frequency distribution across the 1000G and Native American populations resembles that of rs1967309 in ADCY9 (Figure 3c).

Figure 3 with 2 supplements see all
Long-range linkage disequilibrium between rs1967309 and rs158477 in Peruvians from Lima, Peru.

(a) Genotype correlation (r2) between rs1967309 and all SNPs with MAF >5% in CETP, for the PEL population. (b) Genotype correlation between the three loci identified in (a) to be in the 99th percentile and all SNPs with MAF >5% in ADCY9. The dotted line indicates the position of rs1967309. The horizontal lines in (a,b) represent the threshold for the 99th percentile of all comparisons of SNPs (MAF >5%) between ADCY9 and CETP. Figure 3—figure supplement 1 presents the same plots for Andeans and in the replication cohort (LIMAA) and Figure 3—figure supplement 2 compares the r2 values between PEL and LIMAA (c) Genotype frequency distribution of rs158477 in 1000G and Native American populations. (d) Genomic distribution of r2 values from 3,513 pairs of SNPs separated by between 50 and 60 Mb and 61 ± 10 cM away across all Peruvian chromosomes from the PEL sample, compared to the rs1967309-rs158477 r2 value (dotted gray line) (genome-wide empirical p-value = 0.01). The vertical black line shows the threshold for the 95th percentile threshold of all pairs. Gene plots showing location of exons for CETP (a) and ADCY9 (b) are presented in blue below each plot. Abbreviations: Altaic from Mongolia and Russia: ALT; Uralic Yukaghir from Russia: URY; Chukchi Kamchatkan from Russia: CHK; Northern American from Canada, Guatemala and Mexico: NOA; Central American from Costal Rica and Mexico: CEA; Chibchan Paezan from Argentina, Bolivia, Colombia, Costa Rica and Mexico: CHP; Equatorial Tucanoan from Argentina, Brazil, Colombia, Gualana and Paraguay: EQT; Andean from Bolivia, Chile, Colombia and Peru: AND. For 1000G populations, abbreviations can be found here https://www.internationalgenome.org/category/population/.

Figure 3—source data 1

R2 values of all SNPs between ADCY9 and CETP genes in the PEL population from 1000G.

This file contains the result from the geno-r2 command of the vcftools software for all SNPs (MAF >5%) of the PEL population between ADCY9 and CETP genes. The script to create Figure 3a and b can be found here .

https://cdn.elifesciences.org/articles/69198/elife-69198-fig3-data1-v2.txt
Figure 3—source data 2

Source file for genotype frequency distribution of rs158477.

This file contains the genotype frequency of rs158477 for each subpopulation from 1000G and NAGD, with the number of individuals by subpopulation.

https://cdn.elifesciences.org/articles/69198/elife-69198-fig3-data2-v2.txt
Figure 3—source data 3

R2 values used for the null distribution in the PEL population from 1000G.

3,513 pairs of SNPs on chromosome 1–18 with a MAF between 15% and 30%, separated by between 50 and 60 Mb and 51–71 cM based on the PEL genetic map from 1000G. R2 values were obtained from the geno-r2 command of the vcftools software.

https://cdn.elifesciences.org/articles/69198/elife-69198-fig3-data3-v2.txt

Given that the Peruvian population is admixed (Harris et al., 2018), particular enrichment of genome segments for a specific ancestry, if present, would lead to inflated LRLD between these segments (Li and Nei, 1974; Nei and Li, 1973; Park, 2019; Slatkin, 2008), we thus performed several admixture-related analyses (Step 2b, Figure 1). No significant enrichment is seen at either locus and significant LRLD is also seen in the Andean source population (Figure 3—figure supplement 1a, b, Appendix 1). Furthermore, we see no enrichment of Andean ancestry in individuals harboring the overrepresented combination of genotypes, AA at rs1967309+ GG at rs158477, compared to other combinations (p-value = 0.18, Mann-Whitney U test). These results show that admixture patterns in PEL cannot be solely responsible for the association found between rs1967309 and rs158477. Finally, using a genome-wide null distribution which allows to capture the LRLD distribution expected under the admixture levels present in this sample (Appendix 1), we show that the r2 value between the two SNPs is higher than expected given their allele frequencies and the physical distance between them (genome-wide empirical p-value = 0.01, Figure 3d). Taken together, these findings strongly suggest that the AA/GG combination is being transmitted to the next generation more often (i.e. is likely selectively favored) which reveals a signature of co-evolution between ADCY9 and CETP at these loci.

Still, such a LRLD signal can be due to a small sample size (Park, 2019). To confirm independently the association between genotypes at rs1967309 of ADCY9 and rs158477 of CETP, we used the LIMAA cohort (Asgari et al., 2020; Luo et al., 2019), a large cohort of 3509 Peruvian individuals with genotype information, to replicate our finding. The ancestry distribution, as measured by RFMix (Methods) is similar between the two cohorts (Appendix 1—figure 2a), however, the LIMAA cohort population structure shows additional subgroups compared to the 1000G PEL population sample (Appendix 1—figure 2c-e): to limit confounders, we excluded individuals coming from these subgroups (Appendix 1). In this dataset (N = 3,243), the pair of SNPs rs1967309-rs158477 is the only pairs identified in PEL who shows evidence for LRLD, with an r2 value in the top 1 % of all pairs of SNPs in ADCY9 and CETP (ADCY9/CETP empirical p-value = 0.003, Figure 3—figure supplement 1c, d, Figure 3—figure supplement 2, Appendix 1—table 1). The r2 test used above is powerful to detect allelic associations, but the net association measured will be very small if selection acts on a specific genotype combination rather than on alleles. In that scenario, and when power allows it, the genotypic association is better assessed by with a χ2 distributed test statistic (with four degrees of freedom, χ42) comparing the observed and expected genotype combination counts (Rohlfs et al., 2010). The test confirmed the association in LIMAA (χ42 = 82.0, permutation p-value < 0.001, genome-wide empirical p-value = 0.0003, Appendix 1). The association discovered between rs1967309 and rs158477 is thus generalizable to the Peruvian population and not limited to the 1000G PEL sample.

Sex-specific long-range linkage disequilibrium signal

Because the allele frequencies at rs1967309 were suggestively different between males and females (Figure 4—figure supplement 1), we performed sex-stratified PBS analyses, which suggested that the LD block around rs1967309 is differentiated between sexes in the Peruvians (Figure 4—figure supplement 2, Appendix 1). We therefore explored further the effect of sex on the LRLD association found between rs1967309 and rs158477 and performed sex-stratified LRLD analyses. These analyses revealed that the correlation between rs1967309 and rs158477 is only seen in males in PEL (Figure 4a and b, Appendix 1—figure 4a and b, Appendix 1—table 1): the r2 value rose to 0.348 in males (ADCY9/CETP empirical p-value = 8.23 × 10–5, genome-wide empirical p-value < 2.85 × 10–4, N = 41) and became non-significant in females (ADCY9/CETP empirical p-value = 0.78, genome-wide empirical p-value = 0.80, N = 44). In the Andean population, the association between rs1967309 and rs158477 is not significant when we stratified by sex (Appendix 1—table 1), but we still see significant association signals with rs158477 at other SNPs in ADCY9 LD block in both sexes (Figure 4—figure supplement 3, Appendix 1—figure 5a,b). The LRLD result in PEL cannot be explained by differences of Andean ancestry proportion between males and females (p-value = 0.27, Mann-Whitney U test). A permutation analysis that shuffled the sex labels of samples established that the observed difference between the sexes is larger than what we expect by chance (p-value = 0.002, Appendix 1—figure 4c, Appendix 1). In the LIMAA cohort, we replicate this sex-specific result (Figure 4c and d, Appendix 1—figure 5c,d,e,f, Appendix 1—table 1) where the r2 test is significant in males (ADCY9/CETP empirical p-value = 0.003, N = 1,941) but not in females (ADCY9/CETP empirical p-value = 0.52, N = 1302). The genotypic χ42 test confirms the association between ADCY9 and CETP is present in males (χ42 = 56.6, permutation p-value = 0.001, genome-wide empirical p-value = 0.002, Appendix 1), revealing an excess of rs1967309-AA+ rs158477 GG. This is also the genotype combination driving the LRLD in PEL. In females, the test also shows a weaker but significant effect (χ42 = 37.0, permutation p-value = 0.017, genome-wide empirical p-value = 0.001) driven by an excess of a different genotype combination, rs1967309-AA+ rs158477 AA, which is, however, not replicated in PEL possibly because of lack of power (Appendix 1).

Figure 4 with 3 supplements see all
Sex-specific long-range linkage disequilibrium.

Genotype correlation between the loci identified in CETP in Figure 3a and all SNPs with MAF >5% in ADCY9 for (a,b) the PEL population and (c,d) LIMAA cohort in males (a,c) and in females (b,d). Genotype frequencies per sex are shown in Figure 4—figure supplement 1 and sex-specific PBS values in Figure 4—figure supplement 2. The horizontal line shows the threshold for the 99th percentile of all comparisons of SNPs (MAF >5%) between ADCY9 and CETP. The vertical dotted line represents the position of rs1967309. Blue dots represent the rs158477 SNPs and pink represents the other three SNPs identified in Figure 3a (rs158480, rs158617, and rs12447620), which are in near-perfect LD. Figure 4—figure supplement 3 shows the same analysis in Andeans from NAGD. Gene plots for ADCY9 showing location of its exons are presented in blue below each plot.

Figure 4—source data 1

R2 values of all SNPs between ADCY9 and CETP genes in the PEL population from 1000G and LIMAA cohort in male and female.

This zip archive contains all files of r2 values obtained from the geno-r2 command of the vcftools software for all SNPs (MAF >5%) of the PEL population (files beginning by F4a [male] and F4b [female]) and the LIMAA cohort (files beginning by F4c [male] and F4d [female]) between ADCY9 and CETP genes stratified by sex. Scripts to create those figures can be found here: Gamache, 2021.

https://cdn.elifesciences.org/articles/69198/elife-69198-fig4-data1-v2.zip

Epistatic effects on CETP gene expression

LRLD between variants can suggest the existence of gene-gene interactions, especially if they are functional variants (Park, 2019). In order to be under selection, mutations typically need to modulate a phenotype or an endophenotype, such as gene expression. We have shown previously (Rautureau et al., 2018) that CETP and Adcy9 interact in mice to modulate several phenotypes, including atherosclerotic lesion development. To test whether these genes interact in humans, we knocked down (KD) ADCY9 in hepatocyte HepG2 cells (Step 3 a, Figure 1) and performed RNA sequencing on five KD biological replicates and five control replicates, to evaluate the impact of decreased ADCY9 expression on the transcriptome. We confirmed the KD was successful as ADCY9 expression is reduced in the KD replicates (Figure 5a), which represents a drastic drop in expression compared to the whole transcriptome changes (False Discovery Rate [FDR] = 4.07 x 10–14, Materials and methods). We also observed that CETP expression was increased in ADCY9-KD samples compared to controls (Figure 5a), an increase that is also transcriptome-wide significant (FDR = 1.97 × 10–7, ß = 1.257). This increased expression was validated by qPCR, and western blot also showed increased CETP protein product (Materials and methods, Figure 5—figure supplement 1a, b, Appendix 1), but its overexpression did not significatively modulate CETP expression (Figure 5—figure supplement 1c). Knocking down or overexpressing CETP did not impact ADCY9 expression on qPCR (Figure 5—figure supplement 1d, e). These experiments demonstrate an interaction between ADCY9 and CETP at the gene expression level and raised the hypothesis that ADCY9 potentially modulates the expression of CETP through a genetic effect mediated by rs1967309.

Figure 5 with 2 supplements see all
Effect of ADCY9 on CETP expression.

(a) Normalized expression of ADCY9 or CETP genes depending on wild type (WT) and ADCY9-KD in HepG2 cells from RNA sequencing on five biological replicates in each group. p-Values were obtained from a two-sided Wilcoxon paired test. qPCR and western blot results in HepG2 are presented in Figure 5—figure supplement 1. (b,c,d)CETP expression depending on the combination of rs1967309 and rs158477 genotypes in (b) GEUVADIS (p-value = 0.03, ß = –0.22, N = 287), (c) GTEx-Skin Sun Exposed in males (p-value = 0.0017, ß = –0.32, N = 330) and in (d) GTEx-tibial artery in females (p-value = 0.026, ß = 0.38, N = 156), for individuals of European descent according to principal component analysis. p-Values reported were obtained from a two-way interaction of a linear regression model for the maximum number of PEER/sPEER factors considered. Figure 5—figure supplement 2 show the interaction p-values depending on number of PEER/sPEER factors included in the linear models.

Figure 5—source data 1

Normalized expression of ADCY9 and CETP genes HepG2 cells.

This file contains the normalized expression of ADCY9 (ENSG00000162104) and CETP (ENSG00000087237) for the WT (samples beginning by ‘Scr’) and ADCY9-KD (samples beginning by si-1039) in the HepG2 cell line. Each sample from the WT experiment is paired with the sample in the ADCY9-KD experiment finishing by the same number (from 1 to 5).

https://cdn.elifesciences.org/articles/69198/elife-69198-fig5-data1-v2.txt
Figure 5—source data 2

Residual of CETP expression by genotype.

This zip archive contains all files of CETP expression for correction of all covariables (Materials and methods) in the GEUVADIS (file beginning by F5b) and GTEx (Skin-male: file beginning by F5c; Artery-female: file beginning by F5d) datasets. The number of PEER factors added in the linear regression is written in the title of the file. In each file, the first column represents residual values of CETP expression after correcting for each covariable. The second column is the genotype of rs1967309 (0 = AA, 1 = AG, 2 = GG). The third column is the genotype combination of the rs1967309 (first number, same coding that the second column) and rs158477 (second number, 0 = GG, 1 = GA, 2 = AA).

https://cdn.elifesciences.org/articles/69198/elife-69198-fig5-data2-v2.zip

To test for potential interaction effects between rs1967309 and CETP, we used RNA-seq data from diverse projects in humans: the GEUVADIS project (Lappalainen et al., 2013), the Genotype-tissue Expression (GTEx v8) project (GTEx Consortium, 2013) and CARTaGENE (CaG) (Awadalla et al., 2013) (Step 3b, Figure 1). When looking across tissues in GTEx, ADCY9 and CETP expressions negatively correlate in almost all the tissues (Appendix 1—figure 6, Appendix 1), which is consistent with the effect observed during the ADCY9-KD experiment, showing increased expression of CETP expression when ADCY9 is lowly expressed (Figure 5a, Figure 5—figure supplement 1a, b). We evaluated the effects of the SNPs on expression levels of ADCY9 and CETP by modelling both SNPs as continuous variables (additive model) (Methods). The CETP SNP rs158477 was reported as an expression quantitative trait locus (eQTL) in GTEx v7 and, in our models, shows evidence of being a cis eQTL of CETP in several other tissues (Appendix 1), although not reaching genome-wide significance. To test specifically for an epistatic effect between rs1967309 and rs158477 on CETP expression, we included an interaction term in eQTL models (Materials and methods). We note here that we are testing for association for this specific pair of SNPs only, and that effects across tissues are not independent, such that we set our significance threshold at p-value = 0.05. This analysis revealed a significant interaction effect (p-value = 0.03, ß = −0.22) between the two SNPs on CETP expression in GEUVADIS lymphoblastoid cell lines (Figure 5b, Appendix 1—figure 7a). In rs1967309 AA individuals, copies of the rs158477 A allele increased CETP expression by 0.46 (95% CI 0.26–0.86) on average. In rs1967309 AG individuals, copies of the rs158477 A allele increased CETP expression by 0.24 (95% CI 0.06–0.43) on average and the effect was null in rs1967309 GG individuals (p-valueGG = 0.58). This suggests that the effect of rs158477 on CETP expression changes depending on genotypes of rs1967309. The interaction is also significant in several GTEx tissues, most of which are brain tissues, like hippocampus, hypothalamus, and substantia nigra, but also in skin, although we note that the significance of the interaction depends on the number of PEER factors included in the model (Appendix 1—figure 8). These factors are needed to correct for unknown biases in the data, but also potentially lead to decreased power to detect interaction effects (Brynedal et al., 2017). In CaG whole blood samples, the interaction effect using additive genetic effect at rs1967309 was not significant, similarly to results from GTEx in whole blood samples. However, given the larger size of the dataset, we evaluated a genotypic encoding for the rs1967309 SNP in which the interaction effect is significant (P-value = 0.008, Appendix 1—figure 7b) in whole blood, suggesting that rs1967309 could be modulating rs158477 eQTL effect, in this tissue at least, with a genotype-specific effect. We highlight that the sample sizes of current transcriptomic resources do not allow to detect interaction effects at genome-wide significance, however the likelihood of finding interaction effects between our two SNPs on CETP expression in three independent datasets is unlikely to happen by chance alone, providing evidence for a functional genetic interaction.

Given the sex-specific results reported above, we stratified our interaction eQTL analyses by sex. We observed that the interaction effect on CETP expression in CaG whole blood samples (Nmale = 359) is restricted to male individuals, and, despite low power due to smaller sample size in GEUVADIS, the interaction is also only suggestive in males (Appendix 1—figure 7c and d). In GTEx, most well-powered tissues that showed a significant effect in the sex-combined analyses also harbor male-specific interactions (Appendix 1—figure 9). For instance, GTEx skin male samples (Nmale = 330) show the most significant male-specific interaction effects, with the directions of effects replicating the sex-combined result in GEUVADIS (an increase of CETP expression for each rs158477 A allele in rs1967309 AA individuals) albeit with an observable reversal of the direction in rs1967309 GG individuals (decrease of CETP expression with additional rs158477 A alleles) (Figure 5c, Figure 5—figure supplement 2a). However, significant effects in females are detected in tissues not previously seen as significant for the interaction in the sex-combined analysis, in the tibial artery (Figure 5d, Figure 5—figure supplement 2) and the heart atrial appendage (Appendix 1—figure 9). For tissues with evidence of sex-specific effects in stratified analyses, we also tested the effect of an interaction between sex, rs158477 and rs1967309 (Materials and methods) on CETP expression: the three-way interaction is only significant for tibial artery (Figure 5—figure supplement 2).

Epistatic effects on phenotypes

The interaction effect of rs1967309 and rs158477 on CETP expression in several tissues, found in multiple independent RNA-seq datasets, coupled with the detection of LRLD between these SNPs in the Peruvian population suggest that selection may act jointly on these loci, specifically in Peruvians or Andeans. These populations are well known for their adaptation to life in high altitude, where the oxygen pressure is lower and where the human body is subjected to hypoxia (Beall, 2007; Brutsaert et al., 2005; Julian and Moore, 2019; Moore, 2017a). High altitude hypoxia impacts individuals’ health in many ways, such as increased ventilation, decreased arterial pressure, and alterations of the energy metabolism in cardiac and skeletal muscle (Milledge et al., 2007; Murray, 2016). To test which phenotype(s) may explain the putative coevolution signal discovered (Step 4, Figure 1), we investigated the impact of the interaction between rs1967309 and rs158477 on several physiological traits, energy metabolism and cardiovascular outcomes using the UK Biobank and GTEx cohort (Figure 6—figure supplement 1, Appendix 1—table 2). The UK Biobank has electronic medical records and GTEx has cause of death and variables from medical questionnaires (GTEx Consortium, 2013). The interaction term was found to be nominally significant (p-value < 0.05) for forced vital capacity (FVC), forced expiratory volume in 1 s (FEV1) and whole-body water mass, and suggestive (p-value < 0.10) for the basal metabolic rate, all driven by the effects in females (Figure 6a). For CAD, the interaction is suggestive (p-value < 0.10) and, in this case, driven by males (Figure 6a).

Figure 6 with 2 supplements see all
Epistatic association of rs1967309 and rs158477 on phenotypes in the UK biobank.

(a) Significance of the interaction effect between rs1967309 and rs158477 on several physiological traits, energy metabolism and cardiovascular outcomes overall and stratified by sex in the UK biobank. Horizontal lines represent the p-value thresholds at 0.05 (plain) and 0.10 (dotted). Single-SNP p-values are shown in Figure 6—figure supplement 1. (b,c) Sex-stratified effects of rs158477 on (b) cardiovascular phenotypes and (c) biomarkers depending on the genotype of rs1967309 (genotypic encoding). The p-values pitx reported come from a likelihood ratio test comparing models with and without the three-way interaction term between the two SNPs and sex. Sex-combined results using GTEx cardiovascular phenotype data are shown in Figure 6—figure supplement 2. See Appendix 1—table 2 for the list of abbreviations.

Figure 6—source data 1

Results of the interaction between rs1967309 and rs158477 on phenotypes in the UK biobank.

This file contains the results of the PheWAS for each phenotype in the Figure 6a for the sex-combined and stratified by sex analyses. p-Value are already converted to a -log10(p) scale and sorted by the most significant to the less significant results. Covariables used for the linear or logistic regressions are mentioned in Materials and methods. See Appendix 1—table 2 for the list of abbreviations.

https://cdn.elifesciences.org/articles/69198/elife-69198-fig6-data1-v2.txt
Figure 6—source data 2

Results for the cardiovascular phenotypes and biomarkers by sex and by rs1967309 genotypes in the UK biobank.

This zip archive contains the results for the cardiovascular phenotypes (file beginning by F6b) and biomarkers (file beginning by F6c) analyses. Those files contain the p-value, the estimate (AME) and standard error (SE, to multiply by 1.959964 to get the confidence interval for α = 0.05/2) of the association of rs158477 for each genotype of rs1967309 in male or female. The covariable used are mentioned in the Materials and methods. See Appendix 1—table 2 for the list of abbreviations.

https://cdn.elifesciences.org/articles/69198/elife-69198-fig6-data2-v2.zip

Given this sex-specific result on CAD, the condition targeted by dalcetrapib, we tested the effect of an interaction between sex, rs158477 and rs1967309 (genotypic encoding, see Materials and methods) on binary cardiovascular outcomes including myocardial infarction (MI) and CAD. For CAD, we see a significant three-way interaction effect, meaning that for individuals carrying the AA genotype at rs1967309, the association between rs158477 and CAD is in the opposite direction in males and females. In other words, in rs1967309-AA females, having an extra A allele at rs158477, which is associated with higher CETP expression (Figure 5b), has a protective effect on CAD. Conversely, in rs1967309-AA males, each A allele at rs158477 increases the probability of having an event (Figure 6a). Little effect is seen in either sex for AG or GG at rs1967309, although the heterozygotes AG behave differently in females (which further justifies the genotypic encoding of rs1967309). The beneficial effect of the interaction on CAD thus favors the rs1967309-AA+ rs153477 GG males and the rs1967309-AA+ rs153477 AA females, two genotype combinations which are respectively enriched in a sex-specific manner in the LIMAA cohort (Appendix 1). Again, observing such a result that concords with the direction of effects in the LRLD sex-specific finding is noteworthy. A significant interaction between the SNPs is also seen in the GTEx cohort (p-value = 0.004, Figure 6—figure supplement 2, Appendix 1), using questionnaire phenotypes reporting on MI, but the small number of individuals precludes formally investigating sex effects.

Among the biomarkers studied (Appendix 1—table 2), only lipoprotein(a) [Lp(a)] is suggestive in males (P-value = 0.08) for an interaction between rs1967309 and rs158477, with the same direction of effect as that for CAD (Figure 6). Again, given the differences observed between the sexes, we tested the effect of an interaction between sex, rs158477 and rs1967309 (genotypic coding, Materials and methods) on biomarkers, and only Lp(a) was nominally significant in a three-way interaction (p-value = 0.049). The pattern is similar to the results for CAD, ie. a change in the effect of rs158477 depending on the genotype of rs1967309 in males, with the effect for AA females in the opposite direction compared to males (Figure 6b). These concordant results between CAD and Lp(a) support that the putative interaction effect between the loci under study on phenotypes involves sex as a modifier.

Discussion

In this study, we used population genetics, transcriptomics and interaction analyses in biobanks to study the link between ADCY9 and CETP. Our study revealed selective signatures in ADCY9 and a significant genotypic association between ADCY9 and CETP in two Peruvian cohorts, specifically between rs1967309 and rs158477, which was also seen in the Native population of the Andes. The interaction between the two SNPs was found to be nominally significant for respiratory and cardiovascular disease outcomes (Figure 6, Figure 6—figure supplement 2). Additionally, a nominally significant epistatic interaction was seen on CETP expression in many tissues, including the hippocampus and hypothalamus in the brain. Despite brain tissues not displaying the highest CETP expression levels, CETP that is synthesized and secreted in the brain could play an important role in the transport and the redistribution of lipids within the central nervous system (Albers et al., 1992; Yamada et al., 1995) and has been associated with Alzheimer’s disease risk (Murphy et al., 2012; Oestereich et al., 2020). These findings reinforce the fact that the SNPs are likely functionally interacting, but extrapolating on the specific phenotypes under selection from these results is not straight forward. Identifying the phenotype and environmental pressures that may have caused the selection signal is complicated by the fact that the UK Biobank participants, on which the marginally significant associations have been found, do not live in the same environment as Peruvians. In Andeans from Peru, selection in response to hypoxia at high altitude was proposed to have effects on the cardiovascular system (Crawford et al., 2017). The hippocampus functions are perturbed at high altitude (eg. deterioration of memory Lieberman et al., 2005; Shukitt-Hale et al., 1994), whereas the hypothalamus regulates the autonomic nervous system (ANS) and controls the heart and respiratory rates (Horiuchi et al., 2009; Rahmouni, 2016), phenotypes which are affected by hypoxia at high altitude (Bärtsch and Gibbs, 2007; Hainsworth et al., 2007). Furthermore, high altitude-induced hypoxia (Bigham and Lee, 2014; Moore, 2017b) and cardiovascular system disturbances (Abe et al., 2017; Lee et al., 2019) have been shown to be associated in several studies (Faeh et al., 2009; Naeije, 2010; Ostadal and Kolar, 2007; Riley and Gavin, 2017; Savla et al., 2018), thus potentially sharing common biological pathways. Therefore, our working hypothesis is that selective pressures on our genes of interest in Peru are linked to the physiological response to high-altitude, which might be the environmental driver of coevolution.

The significant interaction effects on CETP expression vary between sexes in amplitude and direction, with most signals driven by male samples, but significant interaction effects observed in females only, despite sample sizes being consistently lower than for males. Notably, in the tibial artery and heart atrial appendage, two tissues directly relevant to the cardiovascular system, the female-specific interaction effect on CETP expression is reversed between rs1967309 genotypes AA and GG, compared to the effects seen in males in skin and brain tissues. Given our ADCY9-KD were done in liver cell lines from male donors, future work to fully understand how rs1967309 and rs158477 interact will focus on additional experiments in cells from both male and female donors in these relevant tissues. In a previous study, we showed that inhibition of both Adcy9 and CETP impacted many phenotypes linked to the ANS in male mice (Rautureau et al., 2018), but in the light of our results, these experiments should be repeated in female mice. The function of ANS is important in a number of pathophysiological states involving the cardiovascular system, like myocardial ischemia and cardiac arrhythmias, with significant sex differences reported (Abhishekh et al., 2013; Dart et al., 2002; Nugent et al., 2011).

The interaction effect between the ADCY9 and CETP SNPs on both respiratory and cardiovascular phenotypes differs between the sexes, with effects on respiratory phenotypes limited to females (Figure 6a) and cardiovascular disease phenotype associations showing significant three-way sex-by-SNPs effects (Figure 6). Furthermore, the LRLD signal is present mainly in males (Figure 4), although the genotype association is also seen in female for a different genotype combination, suggesting the presence of sex-specific selection. This type of selection is very difficult to detect, especially on autosomes, with very few empirical examples found to date in the human genome despite strong theoretical support of their occurrence (Morrow and Connallon, 2013). However, sexual dimorphism in gene expression between males and females on autosomal genes has been linked to evolutionary pressures (Connallon and Clark, 2010; Parsch and Ellegren, 2013; Williams and Carroll, 2009), possibly with a contribution of epistasis. As the source of selection, we favor the hypothesis of differential survival over differential ability to reproduce, because the genetic combination between ADCY9 and CETP has high chances to be broken up by recombination at each generation. Even in the case where recombination is suppressed in males between these loci, they would still have equal chances to pass the favored combination to both male and female offspring, which would not explain the sex-specific LRLD signal. We see an enrichment for the rs1967309-AA+ rs158477 GG in males and rs1967309-AA+ rs158477 AA in females, which are the beneficial combination for CAD in the corresponding sex, possibly pointing to a sexually antagonistic selection pressure, where the fittest genotype combination depends on the sex.

Such two-gene selection signature, where only males show strong LRLD, can happen if a specific genotype combination is beneficial in creating males (through differential gamete fitness or in utero survival, for example) or if survival during adulthood is favored with a specific genotype combination compared to other genotypes. In the case of age-dependent differential survival, the genotypic association is expected to be weaker at younger ages, however the LRLD signal between rs1967309 and rs158477 in the LIMAA cohort did not depend on age neither in males nor in females (Appendix 1). Since very few individuals were younger than 20 years old, it is likely that the age range in this cohort is not appropriate to distinguish between the two possibilities. This age-dependent survival therefore remains to be tested in comparison with pediatric cohorts of Peruvians: if the LRLD signal is absent in newborns for example, it will suggest a strong selective pressure acts early in life on boys. To specifically test the in utero hypothesis, a cohort of stillborn babies with genetic information could allow to evaluate if the genotype combination is more frequent in these. Lastly, it may be that the evolutionary pressure is linked to the sex chromosomes (Cox et al., 2017; McGlothlin et al., 2019), and a three-way interaction between ADCY9, CETP and Y chromosome haplotypes or mitochondrial haplogroups remains to be explored.

Even though we observed the LRLD signal between rs1967309 and rs158477 in two independent Peruvian cohorts, reducing the likelihood that our result is a false positive, one limitation is that the individuals were recruited in the same city (Lima) in both cohorts. However, we show that both populations are heterogeneous with respect to ancestry (Appendix 1—figure 2), suggesting that they likely represent accurately the Peruvian population. As recent admixture and population structure can strongly influence LRLD, we performed several analyses to consider these confounders, in the full cohorts and in the sex-stratified analyses. All analyses were robust to genome-wide and local ancestry patterns, such that our results are unlikely to be explained by these effects alone (Appendix 1). Unfortunately, we could not use expression and phenotypic data from Peruvian individuals, which makes all the links between the selection pressures and the phenotype associations somewhat indirect. Future studies should focus on evaluating the phenotypic impact of the interaction specifically in Peruvians individuals, in cohorts such as the Population Architecture using Genomics and Epidemiology (PAGE) (Wojcik et al., 2019), in order to confirm the marginally significant associations found in European cohorts. Indeed, the Peruvian/Andean genomic background could be of importance for the interaction effect observed in this population, which reduces the power of discovery in individuals of unmatched ancestry. Furthermore, not much is known about the strength of this type of selection, and simulations would help evaluate how strong selection would need to be in a single generation to produce this level of LRLD. Another limitation is the low number of samples per tissue in GTEx and the cell composition heterogeneity per tissue and per sample (Battle et al., 2017), which can be partially captured by PEER factors and can modulate the eQTL effects. Therefore, our power to detect tissue-specific interaction effects is reduced in this dataset, making it quite remarkable that we were able to observe multiple nominally significant interaction effects between the loci.

Despite these limitations, our results support a functional role for the ADCY9 intronic SNP rs1967309, likely involved in a molecular mechanism related to CETP expression, but this mechanism seems to implicate sex as a modulator in a tissue-specific way, which complicates greatly its understanding. In the dal-OUTCOMES clinical trial, the partial inhibitor of CETP, dalcetrapib, did not decrease the risk of cardiovascular outcomes in the overall population, but rs1967309 in the ADCY9 gene was associated to the response to the drug, which benefitted AA individuals (Tardif et al., 2015). Interestingly, rs1967309 AA is found in both the male and female beneficial combinations of genotypes for CAD, the same that are enriched in Peruvians, but without taking rs158477 and sex into account, this association was masked. The modulation of CETP expression by rs1967309 could impact CETP’s functions that are essential for successfully reducing cardiovascular events. The rs158477 locus could be a key player for these functions, and dalcetrapib may be mimicking its impact, hence explaining the pharmacogenomics association. Furthermore, in the light of our results, some of these effects could differ between men and women (Metzinger et al., 2020), which may need to be taken into consideration in the future precision medicine interventions potentially implemented for dalcetrapib.

In conclusion, we discovered a putative epistatic interaction between the pharmacogene ADCY9 and the drug target gene CETP, that appears to be under selection in the Peruvian population. Our approach exemplifies the potential of using evolutionary analyses to help find relationships between pharmacogenes and their drug targets. We characterized the impact of the ADCY9/CETP interaction on a range of phenotypes and tissues. Our gene expression results in brain tissues suggest that the interaction could play a role in protection against challenges to the nervous system caused by stress such as hypoxia. The female-specific eQTL interaction results in arteries and heart tissues further suggest a link with the cardiovascular system, and the phenotype association results support further this hypothesis. In light of the associations between high altitude-induced hypoxia and cardiovascular system changes, the interaction identified in this study could be involved in both systems: for example, ADCY9 and CETP could act in pathways involved in adaptation to high altitude, which could influence cardiovascular risk via their interaction in a sex-specific manner. Finally, our findings of an evolutionary relationship between ADCY9 and CETP during recent human evolution points towards a biological link between dalcetrapib’s pharmacogene ADCY9 and its therapeutic target CETP.

Materials and methods

Key resources table
Reagent type (species) or resourceDesignationSource or referenceIdentifiersAdditional information
Gene (Homo sapiens)CETPGenBankHGNC:1,869
Gene (Homo sapiens)ADCY9GenBankHGNC:240
Cell line (Homo sapiens)HepG2ATCCRRID:CVCL_0027Hepatoblastoma
Recombinant DNA reagentpEZ-M46-AC9 plasmidGeneCopoeiaEX-H0609-M46Methods section
Recombinant DNA reagentpEZ-M50-CETP plasmidGeneCopoeiaEX-C0070-M50Methods section
AntibodyAnti-CETP (rabbit monoclonal)Abcam#ab157183(1:1000) in 3 % BSA, TBS, tween 20 0.5%,
O/N 4 °C
AntibodyGoat anti-rabbit antibody (goat polyclonal)AbcamRRID:AB_955447(1:10 000)
in 3 % BSA
1 h at room
temperature
Sequence-based reagentHuman CETP_FIDT TechnologiesPCR primersCTACCTGT
CTTTCCATAA
Sequence-based reagentHuman CETP_RIDT TechnologiesPCR primersCATGATGT
TAGAGATGAC
Sequence-based reagentHuman ADCY9_FIDT TechnologiesPCR primersCTGAGGTT
CAAGAACATCC
Sequence-based reagentHuman ADCY9_RIDT TechnologiesPCR primersTGATTAATG
GGCGGCTTA
Sequence-based reagentSilencer Select siRNA against human ADCY9AmbionCat. #4390826 ID 1039CCUGAUGA
AAGAUUACUU
Utt
Sequence-based reagentSilencer Select siRNA against human CETPAmbionCat. #4392420 ID 2933GGACAGAUC
UGCAAAGAGAtt
Sequence-based reagentNegative Control siRNAAmbionCat. #4390844
Commercial assay or kitLipofectamine RNAiMAX reagentInvitrogenCat. #13,778
Commercial assay or kitLipofectamine 2000 reagentInvitrogenCat. #11668–019
Commercial assay or kitRNeasy Plus Mini KitQiagenCat. #74,136
Commercial assay or kitHigh-Capacity cDNA Reverse Transcription KitApplied BiosystemsCat. #4368814
Commercial assay or kitAgilent RNA 6000 Nano Kit for Bioanalyzer 2,100 SystemAgilent TechnologiesCat. #5067–1511
Commercial assay or kitSYBR-Green reaction mixBioRadCat. #1725274
Commercial assay or kitAmicon Ultra 0.5 ml 10 kDa cutoff unitsMillipore SigmaCat. #UFC501096
Commercial assay or kitWestern Lightning ECL ProPerkin ElmerCat. #NEL122001EA
Commercial assay or kitTGX Stain-Free FastCast Acrylamide 10%BioRadCat# 1610183
Software, algorithmTrimGalore!DOI:10.14806/ej.17.1.200RRID:SCR_011847
Software, algorithmSTAR (v.2.6.1a)DOI:10.1093/bioinformatics/bts635RRID:SCR_019993
Software, algorithmRSEM (v.1.3.1)DOI:10.1186/1471-2105-12-323RRID:SCR_013027
Software, algorithmR statistical software (v.3.6.0/v.3.6.1)https://www.r-project.org/RRID:SCR_001905
Software, algorithmFlashPCA2DOI:10.1093/bioinformatics/btx299RRID:SCR_021680
Software, algorithmVcftools (v.0.1.17)DOI:10.1093/bioinformatics/btr330RRID:SCR_001235
Software, algorithmRFMix (v.2.03)DOI:10.1016 /j.ajhg.2013.06.020
Software, algorithmPEERDOI:10.1038/nprot.2011.457RRID:SCR_009326
Software, algorithmpyGenClean (v.1.8.3)DOI:10.1093/bioinformatics/btt261
Software, algorithmSAS (v.9.4)https://www.sas.com/en_us/software/stat.htmlRRID:SCR_008567
Software, algorithmEPO pipeline (version e59)DOI:10.1093/database/bav096
Software, algorithmBcftools (v.1.9)DOI:10.1093/bioinformatics/btr509RRID:SCR_005227
Software, algorithmGenotype
Harmonizer (v.1.4.20)
DOI:10.1186/1756-0500-7-901
Software, algorithmHapbin (v.1.3.0)DOI:10.1093/molbev/msv172
Software, algorithmSHAPEIT2 (r.837)DOI:10.1038/nmeth.1785
Software, algorithmPBWTDOI:10.1093/bioinformatics/btu014
Software, algorithmBeacon designer software (v.8) (Premier Biosoft)http://www.premierbiosoft.com/qOligo/Oligo.jsp?PID=1
Other1000 Genomes projectDOI:10.1038/nature15393RRID:SCR_006828
OtherLIMAADOI:10.1038 /s41467-019-11664-1dbGAP:phs002025.
v1.p1
dbgap project #26,882
OtherNative American genetic datasetDOI:10.1038/nature11258
OtherGEUVADISDOI:10.1038/nature12531RRID:SCR_000684
OtherGTEx (v8)DOI:10.1038 /ng.2653RRID:SCR_013042dbgap project #19,088
OtherCARTaGENE biobankDOI:10.1093/ije/dys160RRID:SCR_010614CAG project number 406,713
OtherUK biobankDOI:10.1371/journal.pmed.1001779RRID:SCR_012815UKB project #15,357 and #20,168
OtherSanger Imputation ServerDOI:10.3389/fgene.2019.00034

Population genetics datasets

Request a detailed protocol

The whole-genome sequencing data from the 1000 Genomes project (1000G) Phase III dataset (http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/) was filtered to exclude INDELs and CNVs so that we kept only biallelic SNPs. This database has genomic variants of 2504 individuals across five ancestral populations: Africans (AFR, n = 661), Europeans (EUR, n = 503), East Asians (EAS, n = 504), South Asians (SAS, n = 489), and Americans (AMR, n = 347) (Auton et al., 2015). The replication dataset, LIMAA, has been previously published (Asgari et al., 2020; Luo et al., 2019) and was accessed through dbGaP [phs002025.v1.p1, dbgap project #26,882]. This cohort was genotyped with a customized Affymetric LIMAAray containing markers optimized for Peruvian-specific rare and coding variants. We excluded related individuals as reported previously (Asgari et al., 2020), resulting in a final dataset of 3,509 Peruvians. We also identified fine-scale population structure in this cohort and a more homogeneous subsample of 3243 individuals (1302 females and 1941 males) in this cohort was kept for analysis (Table 1, Appendix 1). The Native American genetic dataset (NAGD) contains 2351 individuals from Native descendants from the data from a previously published study (Reich et al., 2012). Individuals were separated by their linguistic families identified by Reich and colleagues (Reich et al., 2012). NAGD came under the Hg18 coordinates, so a lift over was performed to transfer to the Hg19 genome coordinates. Pre-processing details for these datasets are described in Appendix 1.

Table 1
Cohort information.

Sample sizes are reported after quality control steps.

Cohort/SubpopulationAbbreviationEthnicitySample size(% female)AgeReference
1000 G – PeruvianPEL*Peruvian85 (52%)NAAuton et al., 2015
LIMAA/PeruvianLIMAAPeruvian3,243 (40%)29.6 ± 13.8Asgari et al., 2020; Luo et al., 2019
Native Amerind/AndeanNAGD/ANDAmerind/Peruvian88 (40%)NAReich et al., 2012
GEUVADISGEUVADIS*European descent287 (54%)NALappalainen et al., 2013
CARTaGENECaGEuropean descent728 (51%)53.6 ± 8.7Awadalla et al., 2013
GTExGTExEuropean descent699 (34%)52.6 ± 13.1GTEx Consortium, 2013
UK biobankUkb*European descent413,138 (54%)56.8 ± 8.0Sudlow et al., 2015
  1. *

    indicates a discovery cohort.

  2. NA: not available.

eQTL datasets

Request a detailed protocol

We used several datasets (Table 1) for which we had both RNA-seq data and genotyping. First, the GEUVADIS dataset (Lappalainen et al., 2013) for 1000 G individuals was used (available at https://www.internationalgenome.org/data-portal/data-collection/geuvadis). A total of 287 non-duplicated European samples (CEU, GBR, FIN, TSI) were kept for analysis. Second, the Genotype-tissue Expression v8 (GTEx) (GTEx Consortium, 2013) was accessed through dbGaP (phs000424.v8.p2, dbgap project #19088) and contains gene expression across 54 tissues and 948 donors, genetic and phenotypic information. Phenotype analyses are described in Appendix 1. The cohort contains mainly of European descent (84.6%), aged between 20 and 79 years old. Analyses were done on 699 individuals, 66 % of males and 34 % of females (Appendix 1—figure 10a). Third, we used the data from the CARTaGENE biobank (Awadalla et al., 2013) (CAG project number 406713) which includes 728 RNA-seq whole-blood samples with genotype data, from individuals from Quebec (Canada) aged between 36 and 72 years old (Appendix 1—figure 10b). Genotyping and RNA-seq data processing pipelines for these datasets are detailed in Appendix 1. To quantify ADCY9 gene expression, we removed the isoform transcript ENST00000574721.1 (ADCY9-205 from the Hg38) from the Gene Transfer Format (GTF) file because it is a “retained intron” and accumulates genomic noise (Appendix 1), masking true signals for ADCY9. To take into account hidden factors, we calculated PEER factors (Stegle et al., 2012) on the normalized expressions, on all samples and stratified by sex (sPEER factors). To detect eQTL effects, we performed a two-sided linear regression on ADCY9 and CETP expressions using R (v.3.6.0) (https://www.r-project.org/) with the formula lmprs1967309*rs158477+Covariates for evaluating the interaction effect, lmprs1967309+rs158477+Covariates for the main effect of the SNPs and lmprs1967309*rs158477*sex+Covariates for evaluating the three-way interaction effect. Under the additive model, each SNP is coded by the number of non-reference alleles (G for rs1967309 and A for rs158477), under the genotypic model, dummy coding is used with homozygous reference genotype set as reference. The covariates include the first 5 Principal Components (PCs), age (except for GEUVADIS, information not available), sex, as well as PEER factors. We tested the robustness of our results to the inclusion of different numbers of PEER factors in the models and we report them all for GEUVADIS, CARTaGENE and GTEx (Appendix 1—figures 79). Reported values in the text are for 5 PEER factors in GEUVADIS, 10 PEER factors in CARTaGENE, 25 sPEER for skin sun exposed in male and 10 sPEER for artery tibial in female in GTEx. Covariates specific to each cohort are reported in Appendix 1.

UK biobank processing and selected phenotypes

Request a detailed protocol

The UK biobank (Sudlow et al., 2015) contains 487,392 genotyped individuals from the UK still enrolled as of August 20th 2020, imputed using the Haplotype Reference Consortium as the main reference panel, and accessed through project #15,357 and UKB project #20,168. Additional genetic quality control was done using pyGenClean (v.1.8.3) (Lemieux Perreault et al., 2013). Variants or individuals with more than 2 % missing genotypes were filtered out. Individuals with discrepancies between the self-reported and genetic sex or with aneuploidies were removed from the analysis. We considered only individuals of European ancestry based on PCs, as it is the largest population in the UK Biobank, and because ancestry can be a confounder of the genetic effect on phenotypes. We used the PCs from UK Biobank to define a region in PC space using individuals identified as ‘white British ancestry’ as a reference population. Using the kinship estimates from the UK Biobank, we randomly removed individuals from kinship pairs where the coefficient was higher than 0.0884 (corresponding to a third-degree relationship). The resulting post QC dataset included 413,138 individuals. For the reported phenotypes, the date of baseline visit was between 2006 and 2010. The latest available hospitalization records discharge date was June 30th 2020 and the latest date in the death registries was February 14th 2018. We used algorithmically defined cardiovascular outcomes based on combinations of operation procedure codes (OPCS) and hospitalization or death record codes (ICD9/ICD10). A description of the tested continuous variables can be found in Appendix 1—table 2. We used age at recruitment defined in variable #21,022 and sex in variable #31. We ignored self-reported events for cardiovascular outcomes as preliminary analyses suggested they were less precise than hospitalization and death records.

In association models, each SNP analyzed is coded by the number of non-reference alleles, G for rs1967309 and A for rs158477. SNP rs1967309 was also coded as a genotypic variable, to allow for non-additive effects. For continuous traits (Appendix 1—table 2) in the UK Biobank, general two-sided linear models (GLM) were performed using SAS software (v.9.4). A GLM model was first performed using the covariates age, sex and PCs 1–10. The externally studentized residuals were used to determine the outliers, which were removed. The normality assumption was confirmed by visual inspection of residuals for most of the outcomes, except birthwt and sleep. For biomarkers and cardiovascular endpoints, regression analyses were done in R (v.3.6.1). Linear regression analyses were conducted on standardized outcomes and logistic regression was used for cardiovascular outcomes. Marginal effects were calculated using margins package in R. In both cases, models were adjusted for age at baseline and top 10 PCs, as well as sex when not stratified. In models assessing two-way (rs1967309 by rs158477) or three-way (rs1967309 by rs158477 by sex) interactions, we used a 2 d.f. likelihood ratio test for the genotypic dummy variables’ interaction terms (genotypic model) (Appendix 1).

RNA-sequencing of ADCY9-knocked-down Hepg2 cell line

Request a detailed protocol

The human liver hepatocellular HepG2 cell line was obtained from ATCC, a cell line derived from the liver tissue of a 15-year-old male donor (López-Terrada et al., 2009). Our cells tested negatively for mycoplasma contamination and have a morphology and expression profile concordant with this cell type. Cells were cultured in EMEM Minimum essential Medium Eagle’s, supplemented with 10 % fetal bovine serum (Wisent Inc). A total of 250,000 cells in 2 ml of medium in a six-well plate were transfected using 12.5 pmol of Silencer Select siRNA against human ADCY9 (Ambion cat # 4390826 ID 1039), Silencer Select siRNA against CETP (Ambion cat 4392420 ID 2933) or Negative Control siRNA (Ambion cat #4390844) with 5 μl of Lipofectamine RNAiMAX reagent (Invitrogen cat #13778) in 500 μl Opti-MEM I reduced serum medium (Invitrogen cat # 31985) for 72 hr (Appendix 1—table 3, Appendix 1). The experiment was repeated five times at different cell culture passages. Total RNA was extracted from transfected HepG2 cells using RNeasy Plus Mini Kit (Qiagen cat #74136) in accordance with the manufacturer’s recommendation. Preparation of sequencing library and sequencing was performed at the McGill University Innovation Center. Briefly, ribosomal RNA was depleted using NEBNext rRNA depletion kit. Sequencing was performed using Illumina NovaSeq 6000 S2 paired end 100 bp sequencing lanes. Basic QC analysis of the 10 samples was performed by the Canadian Centre for Computational Genomics (C3G). To process the RNA-seq samples, we first performed read trimming and quality clipping using TrimGalore! (Martin, 2011; Krueger et al., 2021), we aligned the trimmed reads on the Hg38 reference genome using STAR (v.2.6.1a) and we ran RSEM (v.1.3.1) on the transcriptome aligned libraries. Prior to normalization with limma and voom, we filtered out genes which had less than six reads in more than 5 samples. For ADCY9 and CETP gene-level differential expression analyses, we compared the mean of each group of replicates with a t-test for paired samples. The transcriptome-wide differential expression analysis was done using limma, on all genes having an average of at least 10 reads across samples from a condition. Samples were paired in the experiment design. The multiple testing was taken into account by correcting the p-values with the qvalue R package (v.4.0.0) (Storey, 2002), to obtain transcriptome-wide FDR values.

Overexpression of ADCY9 and CETP genes in HepG2 cell line

Request a detailed protocol

For ADCY9 and CETP overexpression experiments, 500,000 cells in 2 ml of medium in a six-well plate were transfected using 1 µg of pEZ-M46-AC9 or pEZ-M50-CETP plasmids (GeneCopoeia) with 5 µl of Lipofectamine 2000 reagent (Invitrogen cat # 11668–019) for 72 hr. Total RNA was extracted from transfected HepG2 cells using RNeasy Plus Mini Kit (Qiagen cat #74136) in accordance with the manufacturer’s recommendation (Appendix 1—table 3, Appendix 1).

Natural selection analyses

Request a detailed protocol

We used the integrated Haplotype Statistic (iHS) (Voight et al., 2006) and the population branch statistic (PBS) (Auton et al., 2015) to look for selective signatures. The iHS values were computed for the each 1000G population. An absolute value of iHS above two is considered to be a genome wide significant signal (Voight et al., 2006). Prior to iHS computation, ancestral alleles were retrieved from six primates using the EPO pipeline (version e59) (Herrero et al., 2016) and the filtered 1000 Genomes vcf files were converted to change the reference allele as ancestral allele using bcftools (Li et al., 2009) with the fixref plugin. The hapbin program (v.1.3.0) (Maclean et al., 2015) was then used to compute iHS using per population-specific genetic maps computed by Adam Auton on the 1000G OMNI dataset (http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20130507_omni_recombination_rates/). When the genetic map was not available for a subpopulation, the genetic map from the closest sub-population was selected according to their global FST value computed on the phase three dataset.

We scanned the ADCY9 and CETP genes using the population branch statistic (PBS), using 1000G sub-populations data. PBS summarizes a three-way comparison of allele frequencies between two closely related populations, and an outgroup. The grouping we focused on was PEL/MXL/CHB, with PEL being the focal population to test if allele frequencies are especially differentiated from those in the other populations. The CHB population was chosen as an outgroup to represent a Eurasian population that share common ancestors in the past with the American populations, after the out-of-Africa event. Using PJL (South Asia) and CEU (Europe) as an outgroup, or CLM as a closely related population (instead of MXL) yield highly similar results. To calculate FST for each pair of population in our tree, we used vcftools (Danecek et al., 2011) by subpopulation. We calculated normalized PBS values as in Crawford et al., 2017, which adjusts values for positions with large branches in all populations, for the whole genome. We use this distribution to define an empirical threshold for significance based on the 95th percentile of all PBS values genome-wide for each of the three populations.

Long-range linkage disequilibrium

Request a detailed protocol

Long-range linkage disequilibrium (LRLD) was calculated using the function geno-r2 of vcftools (v.0.1.17) which uses the genotype frequencies. LRLD was evaluated in all subpopulations from 1000 Genomes Project Phase III, in LIMAA and NAGD, for all biallelic SNPs in ADCY9 (chr16:4,012,650–4,166,186 in Hg19 genome reference) and CETP (chr16:56,995,835–57,017,756 in Hg19 genome reference). We analyzed loci from the phased VCF files that had a MAF of at least 5 % and a missing genotype of at most 10%, in order to retain a maximum of SNPs in NAGD which has higher missing rates than the others. We extracted the 99th percentile of all pairs of comparisons between ADCY9 and CETP genes to use as a threshold for empirical significance and we refer to these as ADCY9/CETP empirical p-values. In LIMAA, we also evaluated the genotypic association using a χ2 test with four degrees of freedom (χ42) using a permutation test, as reported in Rohlfs et al., 2010 (Appendix 1).

Furthermore, for both cohorts, we created a distribution of LRLD values for random pairs of SNPs across the genome to obtain a genome-wide null distribution of LRLD to evaluate how unusual the genotypic association between our candidate SNPs (rs1967309-rs158477) is while taking into account the cohort-specific background genomic noise/admixture and allele frequencies. We extracted 3513 pairs of SNPs that match rs1967309 and rs158477 in terms of MAF, physical distance (in base pairs) and genetic distance (in centiMorgans (cM), based on the PEL genetic map) between them in both cohorts (Appendix 1), and report genome-wide empirical p-values based on this distribution. For the analyses of LRLD between ADCY9 and CETP stratified by sex, we considered the same set of SNP pairs that we used for the full cohorts, but separated the dataset by sex before calculating the LRLD values. To evaluate how likely the differences observed in LRLD between sex are, we also performed permutations of the sex labels across individuals to create a null distribution of sex-specific effects (Appendix 1).

Local ancestry inference

Request a detailed protocol

To evaluate local ancestry in the PEL subpopulation and in the LIMAA cohort, we constructed a reference panel using the phased haplotypes from 1000 Genomes (YRI, CEU, CHB) and the phased haplotypes of NAGD (Northern American, Central American and Andean) (Appendix 1). We kept overlapping positions between all datasets, and when SNPs had the exact same genetic position, we kept the SNP with the highest variance in allele frequencies across all reference populations (Appendix 1). We ran RFMix (v.2.03) (Maples et al., 2013) (with the option ‘reanalyze-reference’ and for 25 iterations) on all phased chromosomes. We estimated the whole genome average proportion of each ancestry using a weighted mean of the chromosome specific proportions given by RFMix based on the chromosome size in cM. For comparing the overall Andean enrichment inferred by RFMix between rs1967309/rs158477 genotype categories, we used a two-sided Wilcoxon-t-test. To evaluate the Andean local ancestry enrichment specifically at ADCY9 and CETP, we computed the genome-wide 95th percentile for proportion of Andean attribution for all intervals given by RFMix.

Code and source data

Request a detailed protocol

Numerical summary data represented as a graph in main figures, as well as the code to reproduce figures and analyses, can be found here: Gamache, 2021. Raw RNA sequencing data for knocked down experiments in hepatocyte HepG2 cells are deposited the data on NCBI Gene Expression Omnibus, accession number GSE174640.

Appendix 1

Data pre-processing

Pre-processing of native american

The genetic data was obtained following correspondence with Reich et al., 2012 co-authors. The Native American Genetic Dataset (NAGD) dataset being quite sparse and samples coming from many different populations, no missing data threshold nor minor allele frequency or Hardy-Weinberg equilibrium filters were applied prior to the imputation. Harmonization to the hg19 reference genome has been done using GenotypeHarmonizer v.1.4.20 (Deelen et al., 2014) and bcftools v.1.9 (Li et al., 2009) with the fixref plugin (-m flip option). Imputation was done using the Sanger Imputation Server (McCarthy et al., 2016) using Haplotype Reference Consortium (r1.1) reference panel, with a pre-phasing using SHAPEIT2 r.837 (Delaneau et al., 2013) and imputation using PBWT (Durbin, 2014). Post-imputation quality control was done by keeping sites with an INFO score over 0.8 and keeping genotypes having a posterior probability over 0.9. SHAPEIT2 was run to get phased genotypes (parameters: effective size of 10,000, burn of 10, prune of 10, main of 25, states of 400). The obtained VCF was used in the RFMix analysis (see below). SNPs with missing genotypes higher than 90 % after imputation were removed for LRLD analysis.

Pre-processing of the LIMAA cohort

A pre-imputation step was conducted keeping only positions passing minor allele frequency (MAF) of 1%, 1 % of missing data per site and HWE P-value > 1e-5 using PLINK v.1.9 (Purcell et al., 2007). Harmonization to the hg19 reference genome has then been done using GenotypeHarmonizer and bcftools with the fixref plugin (-m flip option). Imputation was done using the Sanger Imputation Server, using Haplotype Reference Consortium (r1.1) reference panel, with a pre-phasing using SHAPEIT2 and imputation using PBWT. Post-imputation quality control was done by keeping sites with an INFO score over 0.8 and keeping genotypes having a posterior probability over 0.9. Furthermore, positions having less than 5 % missing rate after the genotyping recoding step were kept and duplicated positions were removed. SHAPEIT2 was run to get phased genotypes (parameters: effective size of 10,000, burn of 10, prune of 10, main of 25, states of 400). Another dataset was built to recover one of our SNPs of interest (rs1967309), which was excluded from our previous pipeline because of their INFO score (0.79). In this new dataset, the INFO score threshold was put to 0.7 and the post-imputation position missing data threshold was set to 35%, being less stringent, but recovering our positions. To make sure imputation quality did not impact our results because of incorrectly imputed genotypes, we redid the imputation of LIMAA with the TOPMED reference panel (https://imputation.biodatacatalyst.nhlbi.nih.gov/#!). The imputation r2 score with TOPMED is higher than 0.9 for both, and only very limited differences in imputed genotypes are seen (only 5% and 2% of individual allele mismatches in LIMAA for rs1967309 and rs158477, respectively for the 3,243 individuals).

Pre-processing of GTEx genetic data

Starting from the imputed genotyping dataset, we kept bi-allelic SNPs and removed positions with more than 5 % missing genotype, remaining 100,986 SNPs to calculate PCA using flashPCA2. To remove the Hispanic group, we reduced the dimensionality of the top 10 Principal Components (PCs) using the R package UMAP (McInnes et al., 2020) (default parameters) to obtain a two dimensional representation of the genetic information contained within those PCs. We identified the largest homogeneous group (self-reported ‘white’) and excluded outlier groups (Appendix 1—figure 10a), used only these individuals for the rest of the analyses. We did our all subsequent analyses with 699 individuals.

Pre-processing of CARTaGENE

CARTaGENE biobank (Awadalla et al., 2013) includes 40 K individuals from Quebec (Canada) having between 36 and 72 years old. 12,056 individuals were genotyped and among these 911 had RNAseq performed on whole blood (Hussin et al., 2015; Favé et al., 2018) The genotypes are coming from five different genotyping arrays on which imputation was processed independently. A pre-imputation step was conducted keeping only genotypes passing maf of 1%, 1 % of missing data per site and HWE P-value > 1e-5 using PLINK. Harmonization to the Hg19 reference genome has then been done using GenotypeHarmonizer and bcftools with the fixref plugin (-m flip option). Imputation was done using the Sanger Imputation Server, using Haplotype Reference Consortium (r1.1) reference panel, with a pre-phasing using SHAPEIT2 and imputation using PBWT. Post-imputation quality control was done by keeping sites with an INFO score over 0.8 and keeping individual genotypes having a posterior probability over 0.9.

To extract only white European, we used the same filter as for GTEx, except that we removed SNPs having any missing genotypes which could create bias by different chips, then followed the recommendation from flashPCA2, remaining 8,869 SNPs to calculate PCA. We reduced the dimensionality of the top 10 PCs using the R package UMAP (default parameters) to obtain a two dimensional representation of the genetic information contained within those PCs. We identified the largest homogeneous group (Appendix 1—figure 10b), which contains a majority of individuals from European descent (self-reported ‘white’), and used only those individuals for the rest of the analysis. We kept 11,362 individuals at the end and among these, 911 individuals for which we had RNAseq. For these individuals, we merged samples from different batches, we removed samples who had less than 10 millions of reads, remaining 790 individuals with expression. After filtering out individuals missing the genotype of either rs1967309 or rs158477 SNPs, we did our interaction analysis on 728 individuals.

Population genetics iHS analyses

We computed the integrated haplotype score (iHS) (Voight et al., 2006) for each subpopulation in the 1000 Genomes project (Methods), a statistics that allows us to detect evidence for recent strong positive selection on derived alleles. The SNP rs1967309 is located in a region of high linkage disequilibrium (LD), delimited by recombination hotspots present in all populations. Several SNPs in this LD block exhibit absolute iHS values higher than two in non-African populations (Figure 2b, Appendix 1—figure 1), specifically in CEU and GBR (highest signal is a 15 Kb away from rs1967309), CHB, CHS, CDX, KHV, and in all SAS sub-populations, all of which showing signals in several SNPs in less than 200 base pairs from rs1967309. Of note, however, rs1967309 itself does not show value over two in any population. In African populations, no signal is seen in this LD block (Appendix 1—figure 1). Other SNPs in ADCY9 are found to have absolute iHS values higher than 2, especially in the long intron one and around the last exon, but characterizing these signals is beyond the scope of this study.

Sex-specific differentiation at rs1967309 in ADCY9

We first used FST to evaluate differences in genotype frequencies between males and females. In the PEL from 1000G, we saw suggestive differences between males and females around rs1967309, but did not replicate in the LIMAA cohort, which suggests it was due to small sample size (Kasimatis et al., 2019). Another approach we took was to investigate the impact of sex on our PBS results, by splitting the sample between males and females, and recomputing all PBS values using PEL, MXL and CHB for SNPs on chromosome 16 in each subsample. We report result on chromosome 16 that account for chromosome specific population history, as in our analyses of the full cohort, tests on chromosome 16 were more conservative than on the whole genome (i.e. p-values were slightly larger with chromosome 16 alone). Although over the full chromosome, the distribution was not statistically different between males and females (PBS95th-PEL,male = 0.043; PBS95th-PEL,female = 0.040) as expected, curiously the PEL branch length for all SNPs around rs1967309 increases for males compared to the full-sample results: at rs1967309, the PBS value became 0.096 in males (chromosome 16 empirical p-value = 0.004). On the other hand, for females the value dropped to 0.017 (chromosome 16 empirical p-value = 0.20). No such male-female difference is seen in CETP, with the PEL PBS value for rs158477 remaining significantly elevated in both sexes (chromosome 16 empirical p-valuers158477,male = 0.04, chromosome 16 empirical p-valuers158477,female = 0.01, Appendix 1—figure 3b and c). This suggests that the LD block around rs1967309 is differentiated between males and females in the Peruvians from 1000G. However, we note that the null model for the FST statistic underlying PBS assumes no difference in genotype frequencies between sex (i.e. may not be the appropriate tool to address this specific question), and we cannot exclude the possibility of random sampling noise.

Admixture analyses

Recent admixture and migration events can influence LRLD. If segments of the genome are particularly enriched for a specific ancestry, this could lead to inflated LRLD between these segments. Given that the Peruvian is an admixed population between individuals of Native American ancestry (mainly Andean) as well as of European ancestry (Appendix 1—figure 2), we ran several analyses to establish whether our results at ADCY9/CETP can be explained by admixture patterns.

Local ancestry inference pre-processing

The reference populations used to run RFMix were YRI for the African ancestry, CEU for the European, CHB for the Asian from 1000G, subpopulations in NAGD (Northern American, Central American and Andean) for the Native American ancestry. We estimated local ancestry with RFMix on PEL from 1000G and LIMAA individuals.

For all 1000G populations (YRI, CEU, CHB, PEL), NAGD (Northern American, Central American and Andean) and LIMAA cohort, from the pre-processed datasets (see above) we kept only biallelic SNPs positions, removed SNPs with a MAF under 1 % for each subpopulation, with more than 1 % of missing individuals, with Hardy-Weinberg equilibrium p-value < 10–4 with mid-adjustment using PLINK. We kept overlapping positions between all datasets and extracted the minor allele frequencies for each reference group. To avoid overlapping positions on the genetic maps, when SNPs had the exact same genetic position, we selected the SNP with the higher variance in allele frequencies (using var in R) between the reference groups (all subpopulations except PEL and LIMAA), keeping between 6742 and 57,238 SNPs per chromosome for RFMix analysis.

Assessing proportions of global Andean ancestry

To see if there could be a potential enrichment or depletion of Andean ancestry at CETP and ADCY9 loci compare to the rest of the genome, we looked at the proportion of attribution of Andean at those loci compared to the overall distribution of all chromosomes. From the 584,797 positions used for RFMix on all chromosomes, 4,476 position intervals were given, and we calculated the proportion of Andean attribution for each interval, then calculated the 95 % confidence interval (CI) for all chromosomes which is [0.43–0.75]. The proportion at ADCY9 and CETP loci were 0.58 and 0.66 respectively, which suggests that the correlation between ADCY9 and CETP loci is unlikely to be due to an enrichment or depletion of Andean ancestry at both loci. Results are similar when only considering chromosome 16 to calculate the 95% CI.

LRLD in the Andean population from NAGD

Another question is to assess if the association was already present in the non-admixed ancestral Andean population. If this is the case, the association cannot be explained by the random distribution of Andean segments across the Peruvian genome. We computed LRLD as described in Peruvians in the Andean population from NAGD and we found that the association between rs1967309 and rs158477 is also significant (ADCY9/CETP empirical pvalue = 0.04, Figure 3—figure supplement 1a, b, Appendix 1—table 1). We note that, in this population, strong association signals with rs158477 are also seen at other SNPs in the ADCY9 LD block region. This result provides convincing evidence that the results in PEL and LIMAA are not due to random distribution of admixed segments but rather might have been inherited from the Andean population, where it was already present, and is maintained since then by selection.

In the Andean population, the association between rs1967309 and rs158477 is not significant when we stratified by sex (Appendix 1—table 1), but we still see significant association signals with rs158477 at other SNPs in ADCY9 LD block in both sexes (Figure 4—figure supplement 3)

Comparison between Peruvian cohorts

To evaluate the genetic difference between Peruvian from 1000G and LIMAA, we performed a PCA starting from the phased data files. We kept only biallelic SNPs with a MAF higher than 5 % in each cohort and kept only positions with no missing genotype. We followed the suggestion given by flashPCA2 (Abraham and Inouye, 2014), remaining 18,345 SNPs for the PCA. We then did a UMAP on 50 PCs given by flashPCA2 using the UMAP package on R (default parameters) (Appendix 1—figure 2c,d,e). As seen in the UMAP analysis, population structure exists in LIMAA, and PEL samples are mainly part of the largest subgroup observed in Appendix 1—figure 2e, which was kept for LIMAA analyses to remove any confounders linked to population subdivision (see below). Also, the LIMAA cohort was initially recruited as part of a tuberculosis study (Luo et al., 2019), but our PCA and UMAP analysis showed no separation according to disease state.

Null distributions of LRLD

To evaluate how likely it is to observe, specifically in the admixed Peruvian population, a genotype correlation of r2 = 0.08 between SNPs that are approximately 53 Mb apart on the same chromosome like between rs1967309 and rs158477, we have used two approaches. The first one was specific to the two genes under study, ADCY9 and CETP, and therefore controls for all genomic factors specific to these regions. We selected all SNPs with MAF >0.05 in the two genes, and computed r2 values for all 37,802 pairs (461 SNPs in ADCY9 and 82 SNPs in CETP), yielding a null distribution for the expected genetic correlation between these genes. We then compared our r2 value for rs1967309 and rs158477 to this distribution, with its rank being reported as an empirical p-value. This is referred to in the Results section as ‘ADCY9/CETP empirical p-value’.

This approach is appropriate to correct for the genomic context specific to our genes of interest, but does not account neither for allele frequencies (most SNPs in the null will be at lower frequencies than our two SNPs) nor for overall admixture levels in the genome of this sample, thus we used a second empirical approach to account for these important confounders. For this genome-wide null distribution of the LRLD matching our SNPs, we generated one set of pairs of SNPs and evaluated LRLD between these random pairs in both LIMAA cohort and PEL from 1000G. Since frequencies in the LIMAA cohort are likely better estimates of allele frequencies in the Peruvian population because of the size of the sample, we started our selection based on SNPs’ characteristic in this cohort: we extracted pairs of biallelic SNPs from chromosome 1–18, (the other chromosomes being too small) with a MAF between 15% and 30%, separated by between 50 and 60 Mb and 61 ± 10 cM based on the PEL genetic map from 1000G. If SNPs in a pair shared coordinates on the genetic map (in cM) with another SNP from another pair, we kept only one of these pairs. We ended up with 3,576 non-overlapping SNP pairs for calculating the LRLD null distribution matching our rs1967309-rs158477 pair obtained from the LIMAA cohort. For analysis in PEL from 1000G, we added an extra frequency filtering step to remove pairs for which one or both SNPs had a MAF below 5 % in PEL, leaving 3513 pairs for analysis for PEL of 1000G. To calculate an empirical p-value in PEL, we evaluated the number of pairs which had a LRLD value larger to the observed value for rs1967309-rs158477 and divided this number by the total number pairs (n = 3513). This is referred to in the Results section as ‘genome-wide empirical p-value’.

From the 3,513 pairs of SNPs sampled to create the genome-wide null distribution in both sexes in PEL, we stratified by sex and recomputed null distributions for males and females in the same way as for the full cohorts, also with a MAF filter at 5%, leaving 3,505 pairs in males and 3,512 in females in PEL. In males, the r2 value between rs1967309 and rs158477 was the highest of the distribution (genome-wide empirical p-value < 2.85 x 10–4), but for females, it was in the 20th percentile (genome-wide empirical p-value = 0.80).

Permutation analysis of sex-specific LRLD at the positions rs1967309 and rs158477

A second null distribution was derived for evaluating if the LRLD difference between sex for the rs1967309-rs158477 pair was significant, given the significant LRLD observed at these loci. We permuted the sex labels within the cohort and split them into two random groups of 42 pseudo-males and pseudo-females, while making sure an equal number of real males and females (21 of each) are found in each random group, yielding a total of 919 unique random splits that respected these conditions for the 85 PEL individuals. For each iteration, we calculated LRLD between rs1967309-rs158477 for each group and computed the absolute difference between them. To calculate a p-value, we evaluated the number of iterations that had a LRLD difference of more than or equal to the observed difference for the rs1967309-rs158477 pair between true males and females. The true absolute difference in r2 values between rs1967309 and rs158477 (0.346) is the third highest value in this null distribution (p-value = 0.002) (Appendix 1—figure 4c).

Genotype association between rs1967309 and rs158477 in LIMAA

In the LIMAA cohort, we performed a genotype association test using a χ2 test with four degrees of freedom (χ42) with a permutation scheme to obtain the p-values, as reported in Rohlfs et al., 2010, to control for the marginal one-locus genotype counts. To avoid the potential effects of population subdivision on LRLD (Nei and Li, 1973; Slatkin, 2008), we only kept individuals in the largest, likely more homogeneous, group seen in the UMAP performed on the first 50 PCs with PEL from 1000G (Appendix 1—figure 2e). Two smaller distinct groups were identified in the UMAP analysis and these individuals were excluded from our analysis (cross shaped individuals in Appendix 1—figure 2e), leaving 3243 individuals for analysis. The permutation scheme consists in permuting the rs1967309 values 1,000 times and computing the number χ42 values obtained by permutation that are higher than the observed value for the rs1967309/rs158477 pair. For LIMAA, the χ42 value is 82.0 (permutation p-value < 0.001). We then performed the same analysis by stratifying by sex, and obtained a χ42 value of 56.6 (permutation p-value = 0.001) in males and a χ42 value of 37.0 (permutation p-value = 0.017) in females. We note that performing this analysis in the full cohort of 3509 individuals (without excluding individuals from subpopulations shown in Appendix 1—figure 2e) yield very similar results (full cohort χ42 = 77.6, male χ42 = 56.5, female χ42 = 34.5). To assess which combination is driving the effect, we used an empirical combination-specific test: the p-value is obtained by breaking the real rs1967309-rs158477 genotype combinations by permuting rs1967309 genotypes and evaluating how many permutated samples show an enrichment of a specific genotype combination as large as in the real data. Interestingly, the combination driving the highly significant male effect is an excess of rs1967309-AA+ rs158477 GG (combination-specific permutation p-value < 0.001), whereas in female, the result seems to be driven by rs1967309-AA+ rs158477 AA (combination-specific permutation p-value = 0.014). These sex-specific genotypic effects could not be captured by a linear model and can explain why the r2 value in LIMAA is smaller than in PEL. Additionally, we note that in both sexes (but mainly in males), the low-frequency rs1967309-GG+ rs158477 GA combination is enriched in LIMAA (observed counts is 112 whereas expected according to allele frequencies at both loci is 59.7).

Finally, to evaluate the effect genome-wide, we calculated the χ42 for all 3576 pairs from the above described genome-wide null distribution for LRLD, then compared these with the value obtained for the rs1967309/rs158477 pair, in all individuals, males and females of LIMAA. In all groups, the rs1967309/rs158477 χ42 values were in the top values (genome-wide empirical p-valueall = 0.0003; genome-wide empirical p-valuemales = 0.002; genome-wide empirical p-valuefemales = 0.001), meaning that the association is significant genome-wide, as found in PEL using r2 (Figure 3d).

Despite lower power in 1000G PEL sample, we replicated the rs1967309-AA+ rs158477 GG enrichment in males in PEL using a 2 × 2 χ2 test, comparing specifically the rs1967309-AA+ rs158477 GG to the three other combinations (rs1967309-nonAA+ rs158477 GG; rs1967309-AA+ rs158477-nonGG; rs1967309-nonAA+ rs158477-nonGG, permutation p-value = 0.018). The rs1967309-AA+ rs158477 AA association seen in females does not replicate (permutation p-value = 0.51) possibly due to low sample size (observed counts is 1, expected counts is 0.66).

Age was available in the LIMAA cohort, enabling us to test whether the LRLD pattern is associated with age, which could suggest a survival benefit if the association is not seen at younger ages. No correlation was seen between genotype and age for rs1967309 and rs158477, and age distributions between males rs1967309-AA+ rs158477 GG and females rs1967309-AA+ rs158477 GG were not significantly different. Because sample size was large enough in this cohort to perform a stratified analysis, we further split the cohort into nearly balanced age categories in males (0–19 years old: 435; 20–25: 464, 26–35: 523; over 35: 519) to establish if the LRLD is present in a specific sub-group. To test if the enrichment of rs1967309-AA+ rs158477 GG in males varies between age group, we calculated the expected frequencies using the frequencies in all age combined in males only (using the whole sample allele frequencies did not change the results). First, we generated a 2 × 2 contingency table comparing rs1967309-AA+ rs158477 GG versus the three others (see above), then calculated a χ2, then we used a permutation test permuting rs1967309 genotypes 1,000 times to assess statistical significance. The empirical p-values suggest differences between age groups (permutation p-value0-19 = 0.12; permutation p-value20-25 = 0.21; permutation p-value26-35 = 0.01; permutation p-value>35 = 0.04), with significant p-values in older groups only. We thus more formally tested if the association differs between age groups between 0 and 25 years old (n = 899) and above 25 (n = 1042): we performed a χ2 test based on a 2 × 2 contingency table using the chisq.test() function in R, comparing the rs1967309-AA+ rs158477 GG versus all other genotypes for the two age groups, permutating age values 1,000 times to estimate an empirical p-value. There was no significant difference between age group (χ2 = 0.02, permutation p-value = 0.88). Similar results were obtained when the number of individuals were balanced across the two groups (cut off of 26 years old) or when the four initial age groups were used.

Finally, we also considered how the imputation quality in LIMAA could affect the main result of LRLD, because of imputation from non-representative reference panel populations is known to be problematic. We recomputed the genotype correlation (r2) in LIMAA with our two SNPs imputed with the TOPMED panel, a more representative panel than the Haplotype Reference Consortium initially used. The value obtained is 0.0047 compared to 0.0046 before, showing that imputation quality is unlikely to have affected our results.

Expression data

ADCY9 and CETP expression quantification from RNAseq data

By analysing more in depth the ADCY9 gene and its isoforms, we noticed that a considerable proportion of reads were assigned to a specific isoform, ADCY9-205 (ENST00000574721.1), a 2.4 Kb long retained intron that does not have a validated status. It was removed from the gene definition file (GTF) to remove any noise from spurious transcription. All GTEx data was therefore reprocessed for ADCY9 and CETP by the same pipeline (see below) to obtain transcription levels per sample at the gene level, consistent for the two genes across cohorts.

For each eQTL dataset (GEUVADIS, GTEx, CARTaGENE), we recalculated the top 5 PCs using genotype data with flashPCA2. For duplicated samples, we kept the sample with the highest read count in the library and removed samples which had a total of less than 10 million of reads. We trimmed the sequencing reads from Illumina adaptors and bad quality ends (BQ >20) using TrimGalore!. We mapped the alignment files on Hg38 human genome reference using STAR v2.6.1a (Dobin et al., 2013) with the Ensembl 87 genome annotation, then estimated count for each gene using RSEM v1.3.1 (Li and Dewey, 2011). For GTEx, we separated each tissue at this step, then removed tissues with less than 50 samples, leaving samples from 49 different tissues to avoid over-interpretation due to low sample size while maximizing the number of tissues to be tested. We kept the genes which had more than six reads in at least 20 % of the sample. We then normalized expression data using limma (TMM normalization) (Ritchie et al., 2015) and voom (Law et al., 2014). We calculated PEER factors (Stegle et al., 2012) on the normalized expressions. For all sex-stratified analysis, we kept sex-stratified tissues that had at least 50 samples, and recomputed PEER factors with samples from only one sex (which we term sPEER factors).

To test if ADCY9 and CETP expression is correlated across tissues in humans, we used data GTEx and performed a linear regression correcting for the first 5 PCs, age, sex, the collection site (SMCENTER), the sequencing platform (SMGEBTCHT) and total ischemic time (TRISCHD). We find that ADCY9 and CETP gene expression levels are negatively correlated (and significantly(p < 0.05) so for Adipose-Subcutaneous, Adrenal Gland, Artery Coronary, Artery Tibial, Brain-Cerebellar Hemisphere/Cerebellum/Cortex/Frontal Cortex/Putamen (basal ganglia), Breast-Mammary Tissue, Esophagus-Gastro esophageal Junction/Muscularis, Heart-Left Ventricle, Lung, Prostate, Small Intestine-Terminal, Spleen, Stomach, Uterus, Whole blood), except in skin tissues and cells cultured fibroblast, for which a significant positive correlation is found (Appendix 1—figure 6).

Expression quantitative trait loci (eQTL) analysis for rs1967309 and rs158477

We first looked at the effects of the SNPs independently on their respective genes. The covariates include the first 5 PCs, age (except for GEUVADIS, information not available), sex, as well as PEER factors, calculated to take into account hidden factors. In GTEx, we added additional covariates: the collection site (SMCENTER), the sequencing platform (SMGEBTCHT) and total ischemic time (TRISCHD). One limitation is there is no standardized way of deciding how many PEER factors to include. We tested the robustness of results to the inclusion of different numbers of PEER factors in the models and we report them all for GEUVADIS, CARTaGENE (CaG) and GTEx for transparency (Appendix 1—figures 79). The maximum number of PEER factors considered follows recommendation by GTEx based on sample size for each tissue.

SNP rs1967309 is a cis eQTL of ADCY9 in whole blood in CARTaGENE (p-value = 4.46 × 10–13, ß = –0.10, N = 728, 10 PEER factors) with AA individuals having increased ADCY9 expression compared to GG individuals. This effect is replicated in whole blood samples from GTEx (N = 559), and several other tissues in GTEx (with esophagus being the most significant), but some tissues show an inversion of the direction of the effect, such as lung and thyroid. These results may differ from GTEx reported eQTL results, because of the removal of ADCY9-205 isoform (see above), the different expression normalization method and filters by ethnicity applied here. SNP rs158477 is found as a cis eQTL of CETP in GEUVADIS (p-value = 1.69 × 10–4, ß = 0.26) lymphoblastoid cell lines, replicated in GTEx (EBV transformed lymphocytes) as well as in many other GTEx tissues. Tissues with p-value below 0.05 across most PEER factors values (if not all) are: adipose tissues, hippocampus, liver, lung, small intestine, muscle, stomach, thyroid, with GG genotype having consistently less CETP expression than AA.

We next tested whether the SNPs are trans eQTL for the genes. We found nominally significant associations between rs1967309 and CETP expression in the ovary (P-value = 0.0017, N = 138, max PEER factors = 15) and hippocampus (p-value = 0.049, N = 150, max PEER factors = 30), results that are stable across PEER factors values, two tissues for which rs1967309 is not significantly associated with ADCY9 expression (P-value > 0.05). We found nominally significant associations between rs158477 and ADCY9 expression in the brain-cerebellar hemisphere and liver.

We next evaluated the interaction effect between rs1967309 and rs158477 on gene expression levels, despite somewhat low statistical power, especially given the small number of samples by tissue. Because the appropriate value of the number of PEER factors to be included on the model is not obvious, we report all values of PEER until the maximum suggested by GTEx for the GTEx tissues (15 for N < 150, 30 for 150≤ N < 250, 45 for 250≤ N < 350, and 60 for N ≥ 350). We also required that the interaction term rs1967309*rs158477 for a tissue had p-values under 0.1 for a majority of values of the number of PEER factors included, to qualify a result to be suggestive of an interaction effect. In the GEUVADIS dataset, which has 287 samples, the interaction was significant on CETP expression and stable across PEER factors (Appendix 1—figure 7a), which could mean that the effect of a SNP could be modulated by the other SNP. To evaluate this effect further and make sure this is not due to outlier effects or other statistical flukes, we stratified by genotypes of each SNP and investigated the effect of the other SNP on CETP expression. We used a linear regression with the same covariates mentioned above. We first stratified by the genotype of rs1967309, then evaluated the effect of rs158477 on CETP expression. SNP rs158477 is significant in the AA of rs1967309 (p-value = 0.03, ß = 0.45, OR=[1.05–2.36], n = 46) and AG (p-value = 0.009, ß = 0.24, OR=[1.07–1.53], n = 143), but not for GG (p-value = 0.58, ß = 0.07, OR=[0.83–1.40], n = 96) (Figure 5b), potentially showing a mitigation of the eQTL effect of rs158477 for each alternative allele of rs1967309 on CETP expression. We also evaluated the effect of rs1967309 when we stratified by rs158477 on CETP expression. SNP rs1967309 is significant only for GG of rs158477 (p-value = 0.05, ß = 0.3, OR=[1.00–1.83], n = 72), and not for GA (p-value = 0.51, ß = −0.06, OR=[0.77–1.14], n = 139) nor AA (p-value = 0.68, ß = −0.07, OR=[0.66–1.30], n = 74). The second dataset that we used was the GTEx, in which we evaluated 49 tissues. Since the effects across tissues are likely not independent, we did not correct for multiple testing, keeping a suggestive threshold at 0.10 and a significant threshold at 0.05, but those values need to be reached for a majority values of the number of PEER factors included to be convincing. Among the 49 tissues (Appendix 1—figure 8), those with p-values under 0.10 for several numbers of PEER factors are hippocampus (N = 150), hypothalamus (N = 156), brain spinal cord (cervical c-1) (N = 114), substantia nigra (N = 100) and skin sun-exposed (N = 507). Among those tissues, rs1967309 is only a cis-eQTL of ADCY9 in the substantia nigra.

Since the selective pressure differ between sexes, we stratified our expression analysis by sex. For CETP expression analysis, there are no consistent signals for GEUVADIS, possibly reflecting lack of power or that there is no sex-specific effect in lymphoblastoid cell lines (Appendix 1—figure 7c). However, in CaG, the significant interaction found is present only in male (again for the genotypic coding, Appendix 1—figure 7d). In GTEx, we see significant interaction effects in males in tissues that had signals with sex-combined, such as brain hippocampus (Nmale = 105), hypothalamus (Nmale = 112) and spinal cord cervical (Nmale = 72), skin sun-exposed for CETP (Nmale = 330) (Figure 5d, Appendix 1—figure 9). We note that for most brain tissues, the low sample size in females does not allow to conclude on the presence of an interaction effect in that sex. In these tissues, the direction of the effect in males is reversed compared to what is observed in GEUVADIS with sexes combined (Figure 5b), whereas the highly significant result in skin shows an effect consistent with the sex-combined GEUVADIS result (p-value = 0.0017, ß = −0.32). More specifically, in the sun-exposed skin samples, in rs1967309 AA males, copies of the rs158477 A allele increase CETP expression by 0.49 (95% CI 0.12–0.87) on average. In rs1967309 AG males, the effect of rs158477 is null (p-valueAG = 0.33) and the effect of the rs158477 A allele is suggestive in rs1967309 GG individual (p-valueGG = 0.10) with a decrease of the CETP expression. Conversely, if we look at the effect of rs1967309 on CETP expression in skin sun-exposed when we stratified by rs158477 in males, SNP rs1967309 is neither a significant eQTL of CETP in GG of rs158477 in males (p = 0.11, n = 89) nor for GA (p = 0.65, n = 164), but is a significant eQTL for AA males (p = 0.026, ß = −0.46, OR=[-0.87,–0.06], n = 76).

We also identified new tissues where the interaction is either suggestive or significant in females, in artery tibial (Nfemale = 156), heart atrial appendage (Nfemale = 97), spleen (Nfemale = 69), and stomach (Nfemale = 105) (Appendix 1—figure 9), with an effect reversed compared to the initial GEUVADIS result (Appendix 1—figure 7a). For the pituitary tissue, it is significant in both sex (additive coding in female and genotypic coding in male, Nmale = 156, Nfemale = 63), but the direction of the additive coding is reversed between sexes, possibly explaining why the sex-combined analysis did not show any signal. We note that the newly discovered signals are mainly for females, indicating that the signal was hidden by the male effects (or absence of effects), likely because of higher sample sizes.

Experiments

Real-time PCR quantification

Reverse transcription was performed from 500 ng total RNA in a 20 ml reaction using High-Capacity cDNA Reverse Transcription Kit (Applied Biosystems cat #4368814). RNA quantification was assessed using Agilent RNA 6000 Nano Kit for Bioanalyzer 2,100 System (Agilent Technologies). Primers were designed using the Beacon designer software v.8 (Premier Biosoft) (Appendix 1—table 3). The real-time PCR was carried out with SYBR-Green reaction mix (BioRad cat #1725274). The thermal cycling program was 3 min at 95 °C for initial denaturation followed by 40 cycles of denaturation for 10 s at 95 °C, 30 sec annealing at 60 °C and 30 sec extension at 72 °C. qPCR assay was normalized with PGK1 and HBS1L genes.

Western blot analysis

A total of 200 ml of cell media from HepG2 transfected cells were concentrated using Amicon Ultra 0.5 ml 10 kDa cutoff units (cat #UFC501096) to 25 ml. Proteins were separated on 10 % TGX-acrylamide gel. After O/N electrotransfer at 10 volts to PVDF membranes, CETP protein was determined using a primary anti-CETP rabbit monoclonal antibody (Abcam cat #ab157183) 1:1000 in 3 % BSA, TBS, tween 20 0.5%, O/N 4 °C, followed by HRP-conjugated secondary antibody goat anti-rabbit 1:10,000 in 3 % BSA 1 h at room temperature. Detection was performed using Western Lightning ECL Pro (Perkin Elmer cat #NEL122001EA). Proteins levels were normalized with total proteins loaded.

Phenotype associations

Two-way and three-way interaction models in UK biobank

With rs1967309 coded under the genotypic model, allowing to capture non-additive effects, we tested if the effect of the interaction term was significant for a phenotype Y using a likelihood ratio test (LRT) by comparing the following models:

Yrs158477+rs1967309+sex+age+PC1-5m1
Yrs158477*rs1967309+sex+age+PC1-5m2
Yrs158477*rs1967309*sex+age+PC1-5m3

We used the R function glm with family = "binomial" and compared models using the following: anova(a, b, test = "LRT"), with a = m1 and b = m2 to test for two-way interaction effects, and a = m2 and b = m3 to test for three-way interaction effects.

Individually, SNP rs1967309 (fA = 39%) is nominally associated with heart rate, and rs158477 (fG = 47%) with the systolic blood pressure (Figure 6—figure supplement 1), both results being mainly driven by association in females. Both SNPs are nominally associated with waist-hip ratio, rs1967309 in females, rs158477 in males. None of these effects are genome-wide significant.

Phenotype associations in GTEx

In GTEx, we had the variable MHHRTATT (phv00169162.v8.p2) for cardiovascular disease. This variable is defined as Heart attack, acute myocardial infarction, acute coronary syndrome. GTEx also has phenotypes, including cardiovascular traits. The variable DTHFUCOD (First Underlying Cause Of Death) was used to identify individuals whose cause of death included Heart Attack/Stroke, Heart Disease, Acute Myocardial Infarction, Possible MI, who were considered as cases. From the 699 samples kept, we excluded six for which the phenotype was missing or unknown for MHHRTATT variable and for which the cause of death was unrelated to heart disease, yielding a total of 130 cases and 563 controls. We added as covariates: sex, age and top 5 PCs. For this phenotype, neither rs1967309 nor rs158477 are associated with those phenotypes when taken alone. However, the interaction and both SNPs in the equation are significant (p-value ≤ 0.05) or close to be significant (p-value ≤ 0.10) for the phenotype (p-valueMHHRATT = 0.01, EstimateMHHRATT = −0.54) (Figure 6—figure supplement 1). Like for CETP expression, this means that for each G allele for rs1967309, there is a decrease of the effect of rs158477 on cardiovascular outcome. A difference with CETP’s expression is that there is an inversion of the direction of effect. In other word, for AA of rs1967309, directions of the effect of rs158477 are positive, with GG having less probability to have an event than AA (EstimateMHHRATT = 0.47), but for GG of rs1967309, estimates of rs158477 are negative (EstimateMHHRATT = −0.79). Those results are consistent with the direction of dalcetrapib pharmacogenomic analysis. Considering that the GG genotype of rs158477, with less CETP’s expression, is a proxy for dalcetrapib, which is an inhibition of CETP, the same gradient is present for rs1967309. In AA of rs1967309, there is less heart disease with dalcetrapib (Tardif et al., 2015). In GG of rs1967309, there is more heart disease with dalcetrapib. More study of this interaction is needed to understand the mechanism. However, this could lead to new insights into the potential biological mechanism behind the pharmacogenomic association involving the gene ADCY9 with cardiovascular outcome of dalcetrapib.

Appendix 1—figure 1
Selection signature in ADCY9.

iHS values and recombination for all populations in the ADCY9 gene. Vertical black lines represent the highest recombination rates around rs1967309 from 1000G population-specific genetic maps. Horizontal line represents the value at 2 and –2. Different colors represent one super population. In order of color: African, European, East Asia, South Asia and America. Abbreviations for the subpopulation of 1000G can be found here https://www.internationalgenome.org/category/population/.

Appendix 1—figure 2
Population structure of Peruvian from LIMAA and Peruvian from 1000G.

Ancestry distribution on all chromosomes in the Peruvian from 1000G (a) and LIMAA cohort (b). Overall weighted proportion given by RFMix using reference populations from 1000G and Native American Genetic Dataset (NAGD) for the Peruvian population from 1000G (a) and from LIMAA cohort (b). 1000G populations YRI, CEU, and CHB were chosen to represent African, European, and Asian ancestry, respectively. (c,d) Principal Component Analysis using flashPCA on Peruvian from 1000G and LIMAA cohort. The top three PCs is shown. (e) UMAP analysis on the top 50 PCs. To limit confounders due to population structure, we excluded individuals in LIMAA coming from the two small groups identified by the UMAP (cross shaped light green symbols in (e)). Abbreviations for 1000G can be found here: https://www.internationalgenome.org/category/population/. Abbreviations for the Native American (NAGD): NOA: northern American; CEA: central American; AND: Andean.

Appendix 1—figure 3
Populational differentiation of CETP gene using PBS statistic.

PBS values in the CETP gene, comparing the CHB (outgroup), MXL and PEL identified by different colors, overall (a), in males (b) and in females (c). Horizontal lines represent the 95th percentile PBS value genome-wide (a) or the chromosome 16 (b,c) for each population. Position with r2 higher than the 99th percentile in the Peruvian population from the 1000G are represented by colored shape.

Appendix 1—figure 4
Long-range linkage disequilibrium shown in CETP for the PEL population from 1000G, stratified by sex.

Genotype correlation (r2) between the three loci identified in CETP (see Figure 2a) to be higher than the 99th percentile and all SNPs with MAF >5% in ADCY9, in males (a) and females (b). The horizontal black line is the 99th of all those comparisons between ADCY9 and CETP by sex. (c) Distribution of absolute difference of genotype correlation values obtained during the permutation analysis that shuffled the sex label for rs1967309 and rs158477 (red), compared to the value obtain with the real sex labels (blue).

Appendix 1—figure 5
Long-range linkage disequilibrium in the Andean population from NAGD (a,b) and LIMAA cohort (c–f).

(a,b,d,f) Genotype correlation (r2) between rs1967309 and all SNPs with MAF >5% in CETP, for the Andean population from NAGD (a,b) and the LIMAA cohort (d,f). (c,e) Genotype correlation between the three loci identified in Figure 3a to be higher than the 99th percentile and all SNPs with MAF >5% in ADCY9 in LIMAA. Males (NAndean = 54, NLIMAA = 1941) (a,c,d) and females (NAndean = 34, NLIMAA = 1302) (b,e,f) are shown separately. The horizontal line is the 95th (a,b) and 99th (c–f) percentile of all comparisons between ADCY9 and CETP genes.

Appendix 1—figure 6
Significance of the correlation between ADCY9 and CETP expression across GTEx tissues.

P-values are presented on a -log10 scale and are obtained from a linear regression on normalized expression with correction for age, sex, top 5 PCs, ischemic time death, sequencing platform, and sequencing center. Regular triangles mean that both gene expression levels are positively correlated, inverted triangles mean that both gene expression levels are inversely correlated. The dashed line represents the p-value at 0.05.

Appendix 1—figure 7
Epistatic effects between rs1967309 and rs158477 on CETP expression in GEUVADIS (LCL, N = 287) and CARTaGENE (Whole blood samples, N = 728).

P-values are presented on a -log10 scale and are reported in function of the number of PEER/sPEER factors in GEUVADIS (LCL) (a,c) and CARTaGENE (b,d) in sex-combined (a,b) and sex-stratified (c,d) analyses. For all models, rs158477 is coded as additive (GG = 0, GA = 1, AA = 2). In the additive model (green triangle), rs1967309 is coded as additive (AA = 0, AG = 1, GG = 2), p-values are obtained using a linear regression in R. In the genotypic model (orange circle), rs1967309 is coded as a genotypic variable and p-values are obtained from a likelihood ratio test comparing models with and without the interaction term between the SNPs. The orange, red, and pink lines represent p-values of 0.1, 0.05, and 0.01 respectively. The sample sizes reported are the number of individuals left after removing participants with missing genotypes for rs1967309 and/or rs158477. In (c,d) the color of the lines represents the sex label.

Appendix 1—figure 8
Sex-combined epistatic effect p-values for the interaction between rs1967309 and rs158477 on CETP expression depending on the number of PEER factors in GTEx by tissue.

P-values are presented on a -log10 scale. For all models, rs158477 is coded as additive (GG = 0, GA = 1, AA = 2). In the additive model (green triangle), rs1967309 is coded as additive (AA = 0, AG = 1, GG = 2), p-values are obtained using a linear regression in R. In the genotypic model (orange circle), rs1967309 is coded as a genotypic variable and p-values are obtained from a likelihood ratio test comparing models with and without the interaction term between the SNPs. The orange, red and pink lines represent p-values of 0.1, 0.05 and 0.01 respectively. The tissue type and the number of samples for each, used in the analysis, are reported in the titles of the subgraphs.

Appendix 1—figure 9
Sex-specific epistatic effects between rs1967309 and rs158477 on CETP expression depending on the number of sPEER factors in GTEx by tissue.

P-values are presented on a -log10 scale. For all models, rs158477 is coded as additive (GG = 0, GA = 1, AA = 2). In the additive model (green triangle), rs1967309 is coded as additive (AA = 0, AG = 1, GG = 2), p-values are obtained using a linear regression in R. In the genotypic model (orange circle), rs1967309 is coded as a genotypic variable and p-values are obtained from a likelihood ratio test comparing models with and without the interaction term between the SNPs. The orange, red and pink lines represent p-values of 0.1, 0.05 and 0.01 respectively. The tissue type and the number of samples for each, used in the analysis, are reported in the titles of the subgraphs. The color of lines represents the sex label. Only tissues with at least one value under 0.10 are showed. Tissues with an asterisk (*) next to their title are tissues showing a the suggestive/significant effect in the sex-combined analysis.

Appendix 1—figure 10
Population structure in datasets analysed.

We estimate population structure using UMAP on the top 10 PCs generated with flashPCA2 on (a) GTEx (N = 699) and (b) CARTaGENE (N = 12,056) biobanks. The self-reported white non-Latino individuals were selected for further analyses.

Appendix 1—table 1
Long-range linkage disequilibrium analysis in three datasets, and in subsets of the cohorts.

Number of individuals (N) in each subset is reported. P-values correspond to the ADCY9/CETP empirical p-values computed as described in Section Long-range linkage disequilibrium in Methods. r2 were obtained from the geno-r2 option of vcftools software. For 1000G populations, abbreviations can be found here https://www.internationalgenome.org/category/population/.

CohortPopulationSexNumberr2p-value ADCY9-CETP
1000GYRIAll1080.02360.11
CEUAll990.00030.86
GBRAll910.01170.28
CHBAll1030.0040.53
MXLAll640.00070.83
PEL*All850.07965.42 × 10–3
Male410.34838.23 × 10–5
Female440.00160.78
LIMAALIMAAAll3,2430.00463.24 × 10–3
Male19410.00973.71 × 10–3
Female1,3020.00030.52
NAGDNorthern Amerind(NOA)All810.00840.44
Male270.06340.16
Female540.06990.07
Central Amerind(CEA)All810.02810.12
Male340.03160.28
Female470.02570.24
Andean(AND)All880.02930.04
Male540.04360.09
Female340.01250.55
  1. *

    Discovery cohort.

Appendix 1—table 2
Details on metabolic and clinical variables extracted from the UK Biobank.
Variable IDUK biobank variable locationNumber of samples used for interaction
Category 100011 - Blood pressure - Physical measures - UK Biobank Assessment Centre
Pulse rate at baseline(Pulse rate)Units: bpmData-Field 102 (automatic entry) or Data-Field 95 (manual entry), to be derived as follows:
  • Pulse rate, automated reading (Data-Field 102) used mean of available measures for instance 0 (baseline) only. If a manual measure is available for an individual (Data Field 95 below) then do not use this automated reading (assumed to be abnormal).

  • Pulse rate (during blood-pressure measurement) (Data-Field 95), use Instance 0 (baseline). Use mean when there are multiple measures for a same individual.

All = 395,319Male = 182,279Female = 213,040
Diastolic blood pressure at baseline(Diastolic BP)Units: mmHgData-Field 4,079 (automatic entry) or Data-Field 94 (manual entry), as follow:
  • Diastolic blood pressure, automated reading: Data-Field 4079,, use mean of available measures for instance 0 (baseline) only. If a manual measure is available for an individual (Data Field 94) then do not use this automated reading (assumed to be abnormal).

  • Diastolic blood pressure, manual reading: Data-Field 94, use mean of available measures for instance 0 (baseline) only.

All = 395,384Male = 182,326Female = 213,058
Systolic blood pressure at baseline(Systolic BP)Units: mmHgData-Field 4,080 (automatic entry) or Data-Field 93 (manual entry), as follow:
  1. Systolic blood pressure, automated reading: Data-Field 4080,, use mean of available measures for instance 0 (baseline) only. If a manual measure is available for an individual (Data Field 93) then do not use this automated reading (assumed to be abnormal).

  2. Systolic blood pressure, manual reading: Data-Field 93, use mean of available measures for instance 0 (baseline) only.

All = 395,353Male = 182,316Female = 213,037
Category 100010 - Body size measures - Anthropometry - Physical measures - UK Biobank Assessment Centre
Waist circumference at baseline (Waist circumference)Units: cmData field 48, use mean of available measures for instance 0 (baseline) only.All = 395,006Male = 182,089Female = 212,917
Hip circumference at baseline (Hip circumference)Units: cmData field 49, use mean of available measures for instance 0 (baseline) only.All = 394,651Male = 181,988Female = 212,663
Waist-hip ratioCompute waist/hipAll = 394,944Male = 182,056Female = 212,888
WeightUnits: KgData-Field 21,002 (automatic entry) or Data-Field 3,160 (manual entry), as follow:(3) Weight: Data-Field 21002,, use mean of available measures for instance 0 (baseline) only.Only if unavailable, then use:(4) Weight, manual reading: Data-Field 3160,, use mean of available measures for instance 0 (baseline) only.All = 394,377Male = 181,732Female = 212,645
HeightUnits: cmData-Field 50 or 12,144.(5) Standing height: Data Field 50, used mean of available measures for instance 0 (baseline) only.Only if unavailable, then use:(6) Height: Data-Field 12144,, used mean of available measures, as this is a singular instance fieldAll = 394,871Male = 181,969Female = 212,902
UK Biobank BMI(BMI)Units: Kg/m2Data field 21001,, used mean of available measures for instance 0 (baseline) only.All = 394,173Male = 181,705Female = 212,468
Category 100009 - Impedance measures - Anthropometry - Physical measures - UK Biobank Assessment Centre
Trunk fat percentage(% Trunk fat)Units: %Data field 23127,, use mean of available measures for instance 0 (baseline) only.All = 388,569Male = 178,837Female = 209,732
Body fat percentage(% Body fat)Units: %Data field 23099,, use mean of available measures for instance 0 (baseline) only.All = 388,600Male = 178,752Female = 209,848
Basal metabolic rateUnits: KJData field 23105,, use mean of available measures for instance 0 (baseline) only.All = 388,585Male = 178,758Female = 209,827
Whole body water massUnites: KgData field 23102,, use mean of available measures for instance 0 (baseline) only.All = 388,719Male = 178,881Female = 209.838
Category 100020 - Spirometry - Physical measures - UK Biobank Assessment Centre
Forced vital capacity(FVC)Units: LData field 20151,, use mean if more than one measure.All = 297,461Male = 138,909Female = 158,552
Forced expiratory volume in 1 second(FEV1)Units: LData field 20150,, use mean if more than one measure.All = 297,499Male = 138,937Female = 158,562
Category 100057 - Sleep - Lifestyle and environment - Touchscreen - UK Biobank Assessment Centre
Sleep durationUnits: hours/dayData field 1160,, use mean of available measures for instance 0 (baseline) only.All = 393,133Male = 181,452Female = 211,681
Category 100072 - Early life factors - Verbal interview - UK Biobank Assessment Centre
Birth weightUnits: KgData field 20022,, use mean if more than one measure.All = 227,244Male = 89,715Female = 137,529
Category 717 - Biomarkers
Apolipoprotein A1(ApoA)Units: g/LData field 30630, use mean of available measures for instance 0 (baseline) only.Standardized using the mean: (x-mean)/sdAll = 413,138Male = 190,454Female = 222,684
High Density Lipoprotein(HDL-c)Units: mmol/LData field 30760, use mean of available measures for instance 0 (baseline) only.Standardized using the mean: (x-mean)/sd
Lipoprotein (a)(Lp(a))Units: nmol/LData field 30780, use mean of available measures for instance 0 (baseline) only.Standardized using the mean: (x-mean)/sd
C-Reactive Protein(CRP)Units: mmol/LData field 30710, use mean of available measures for instance 0 (baseline) only.Ln transformation, then standardized using the mean: (x-mean)/sd
Low Density Lipoprotein(LDL-c)Units: mmol/LData field 30790, use mean of available measures for instance 0 (baseline) only.Standardized using the mean: (x-mean)/sd
Apolipoprotein B(ApoB)Units: g/LData field 30640, use mean of available measures for instance 0 (baseline) only.Standardized using the mean: (x-mean)/sd
Category of operation procedure codes (OPCS) and hospitalization or death record codes(ICD9/ICD10)
Coronary artery disease(CAD)Prevalent or incident(cases/controls)All = 413,138 (44,713/368,425)Male = 190,454 (29,910/160,544)Female = 222,684 (14,803/207,881)
Myocardial Infarction(MI)Prevalent or incident(cases/controls)All = 413,138 (18,559/394,579)Male = 190,454 (13,812/176,642)Female = 222,684 (4,747/217,937)
Appendix 1—table 3
Primers sequence for real-time PCR quantification in HepG2 cells for the KD-ADCY9 and KD-CETP experimentations.
SpeciesGeneStrainSequence
HumanADCY9Forward5’ CTGAGGTTCAAGAACATCC 3’
Reverse5’ TGATTAATGGGCGGCTTA 3’
CETPForward5’ CTACCTGTCTTTCCATAA 3’
Reverse5’ CATGATGTTAGAGATGAC 3’
HBS1LForward5’ ACAAGAATGAGGCAACAG 3’
Reverse5’ AGATACTCCAGGCACTTC 3’
PGK1Forward5’ GTGGAGGAAGAAGGGAAG 3’
Reverse5’ AAGCATCATTGACATAGACAT 3’

Data availability

The 1000 Genomes Project, GEUVADIS is freely available. The Native American genetic dataset was shared to us upon request to the authors of the initial paper and through a data access agreement with Universidad de Antioquia (Prof. Omar Triana Chavez). We contacted bedoya.g@gmail.com and a.ruizlin@ucl.ac.uk to get access to the dataset and we completed a data access application form and signed a data access approval once approved. Applications for access to these data can be submitted at any time. These are considered on a rolling basis and a decision was given within 1 month of receipt. PhD student applicants must include their supervisors as a co-applicant and provide their full contact details. A publication list must be provided for the applicant, co-applicants and PhD supervisors where PhD students have applied to provide proof of competence in handling datasets of this size and nature. The UK Biobank was accessed through data access approval under the project number #15357 and #20168. Information to apply for data access can be found here: https://www.ukbiobank.ac.uk/enable-your-research/apply-for-access. The CARTaGENE biobank was accessed through data access approval under the project number #406713. Information to apply for data access can be found here: https://www.cartagene.qc.ca/en/researchers/access-request. The GTEx v8 dataset was accessed through dbGaP under project number #19088. The LIMAA dataset was accessed through dbGaP under the project number #26882. Information to apply for data access through dbGAP can be found here: https://dbgap.ncbi.nlm.nih.gov. RNA-sequencing of ADCY9-knocked-down HepG2 cell line data has been deposited under GSE174640: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE174640. Source data files and code for all main figures are available here: https://github.com/HussinLab/adcy9_cetp_Gamache_2021.

The following data sets were generated
    1. Gamache I
    (2021) NCBI Gene Expression Omnibus
    ID GSE174640. RNA-sequencing of ADCY9-knocked-down HepG2 cell line (embargo).
The following previously published data sets were used
    1. Lonsdale J
    (2013) dbGaP
    ID phs000424.v8.p2. Common Fund (CF) Genotype-Tissue Expression Project (GTEx).
    1. Luo Y
    (2019) dbGaP
    ID phs002025.v1.p1. Early progression to active tuberculosis is a highly heritable trait driven by 3q23 in Peruvians.
    1. Awadalla P
    (2013) CARTaGENE biobank
    ID 406713. CAG project number #406713.

References

    1. Lieberman P
    2. Morey A
    3. Hochstadt J
    4. Larson M
    5. Mather S
    (2005)
    Mount Everest: a space analogue for speech monitoring of cognitive deficits and stress
    Aviation, Space, and Environmental Medicine 76:B198–B207.

Decision letter

  1. Ellen Leffler
    Reviewing Editor; University of Utah, United States
  2. George H Perry
    Senior Editor; Pennsylvania State University, United States
  3. Eduardo Tarazona-Santos
    Reviewer; Universidade Federal de Minas Gerais, Brazil

Our editorial process produces two outputs: (i) public reviews designed to be posted alongside the preprint for the benefit of readers; (ii) feedback on the manuscript for the authors, including requests for revisions, shown below. We also include an acceptance summary that explains what the editors found interesting or important about the work.

Acceptance summary:

This study provides a range of evidence for population and sex-specific selection and an epistatic interaction in humans between variants in the genes ADCY9 and CETP of pharmacogenetic importance. Few such interactions have been identified making this a novel finding that motivates more research and methods to look for such effects. The relationship uncovered contributes new insight towards understanding why drug response to a CETP-modulator is affected by genotypes at ADCY9.

Decision letter after peer review:

Thank you for submitting your article "A sex-specific evolutionary interaction between ADCY9 and CETP" for consideration by eLife. Your article has been reviewed by 3 peer reviewers, and the evaluation has been overseen by a Reviewing Editor and George Perry as the Senior Editor. The following individual involved in review of your submission has agreed to reveal their identity: Eduardo Tarazona-Santos (Reviewer #1).

The reviewers have discussed their reviews with one another, and the Reviewing Editor has drafted this to help you prepare a revised submission.

Essential revisions:

The reviewers identified a number of issues and questions that should be addressed in a revised manuscript. Here is a list of revisions considered essential, many of which come from the individual reviews. The individual reviews are included below, which have both a public review and recommendations for the authors from each reviewer. Please do also respond to all individual reviewers' points and suggestions in your response and revised manuscript.

1. Please include a figure or flowchart that illustrates the overall experimental design and analyses conducted, and which datasets each were performed with. A clear key for the population datasets with sample sizes and the abbreviations that refer to them would be helpful. Please also carefully define and use consistent terms in how datasets are referred to (e.g., First Nation, LIMAA, NAGD, Andean vs. Peruvian).

2. Please include a summary table of the male / female sample sizes, LD values and p-values for the datasets where LRLD was analyzed.

3. Please include the results showing the effect of CETP knockdowns on ADCY9 expression, in addition to the current results showing ADCY9 knockdowns on CETP expression.

4. Please include an additional analysis that considers the effect of sex on gene expression. Please also clarify which analyses did not consider the effect of sex (and why) and for which it was considered but no effect found.

5. Please consider the imputation quality across the different datasets, and especially how it differs across populations to address the concern that imputation from non-representative reference panel populations could be problematic.

6. Please discuss in more detail the proposed interpretation for sex-specific selection. First, the type of selection is not stated in the abstract (only "natural selection"). Second, in the discussion two possibilities are proposed (lines 353-358) but it is not clear if any are supported under the presented evidence. Does the lack of correlation with age rule out a mechanism affecting survival? How strong would selection need to be in a single generation to produce this level of LRLD?

7. Please improve figure visibility and naming as suggested in the specific recommendations below. Also, where possible indicate on the figures where the annotated genes are located, as is only currently shown in Figure 1B. Use consistent naming, e.g., for populations in Figure 1 and 3 (are 3a and b only PEL although the figure title says 1000G?).

Reviewer #1 (Recommendations for the authors):

Because the experimental design of the authors is complex, I suggest that they provide a Figure representing and the different steps of the rationales as well as their experiments. It may also help if they number the different steps of the experimental design both in the Figure and the Main Text.

Specific comments:

– the authors should explicitly state which array was used for the genotyping of the LIMAA replication cohort.

Reviewer #2 (Recommendations for the authors):

a) I think having a summary figure that goes through and connects the different analyses could potentially be very helpful for the flow of this manuscript. I think the collection of analyses are very impressive, but I find myself continually taking a step back to keep track of the different levels presented. You move from selection analyses, to LD analyses, to gene expression analyses, and finally to phenotype analyses. In addition, at multiple points, there is a further stratification by sex. It may not be immediately intuitive on what a summary figure would be for this, but having something to help readers keep track of how or why these different analyses are connected would be very useful.

b) In the introduction the authors highlight that one of the open questions to be addressed is: "However, the underlying mechanisms linking CETP and ADCY9, located 50 Mb apart on chromosome 16, as well as the relevance of the rs1967309 non-coding genetic variant are still unclear." However, the authors do not return to this specific point regarding rs1967309 later in the discussion. I think, since the authors do indeed identify and explore some of the relevance of rs1967309, it would benefit the manuscript to reconnect to this point later at the end of the paper.

c) At the start of the 'sex-specific long-range linkage disequilibrium' section, the authors state the following: "We explored the effect of sex on the LRLD association found between rs1967309 and rs158477. The allele frequencies at rs1967309 were suggestively different between males and females and sex-stratified PBS analyses suggest that the LD block around rs1967309 is differentiated between sexes in the Peruvians…" It is unclear to me if the second sentence is a result of exploring the effect of sex on LRLD associations, or if it is meant to be the justification for exploring the effect of sex. I think clarifying this would be important since it is not otherwise clear why you would move to doing sex-specific analyses, though it does produce an interesting result.

d) I'm wondering why the data regarding the lack of impact CETP knockdowns have on ADCY9 expression is not shown? I think this is an interesting and important result. It helps demonstrate a directionality to the interaction that continually shows up elsewhere. If possible, showing this data would be helpful.

e) Now that you have identified some more relevant tissue types and phenotypes from your epistasis analyses, I am wondering if there are any links that can be made back to your mouse knockouts or any other ADCY9 or CETP knockouts? It seems like with a new spectrum of traits to consider, previous knockout results may make more sense now. Are there any OMIM conditions that now may have some connection to these results?

f) In regards to the CAD outcome findings, I am wondering if you could make use of some of the CAD polygenic risk scores that exist now to further show a link. Specifically, I am wondering whether you could stratify individuals into high and low risk for CAD, and see whether the protective genotype combinations are overrepresented among the low risk individuals vs. high risk (and between the sexes as well). Some example CAD PRSs can be found in the PGS Catalogue: https://www.pgscatalog.org/trait/EFO_0001645/

Reviewer #3 (Recommendations for the authors):

I have a few comments and questions below on the approach and results.

1. Population/cohort specific effects. Is there any known ascertainment bias of the cohorts? Skimming the source papers, it looks like for one cohort, they all presented to a clinic for tuberculosis? I'm not sure if I missed it – but is the sex-effect replicated in the UK biobank data? And the linkage missing or apparent in the UK biobank data? One result that is glossed over (again apologies if I missed it in the supplement), but the effect looks reversed/stronger in females in the Andean population (sup Figure 4 d/f)? Is this true or just not significant/too noisy? Also, if this is testable with the other datasets available (since it is a much larger cohort), they could be used as negative controls of sorts. Ie the UK biobank and/or the other native cohorts.

2. Speculation/comments on why there would be a male/female difference in these population (or any). This circles back to the UK biobank question I had earlier. Are frequencies linked to Mt or Y chromosome haplotypes? Is there a way to measure or view this? I understand the RFMix analysis was meant to test for admixture, but would it also show something like this? As it is hard to say definitively or exclude all potential counterfactuals for these results, without including additional datasets from the same region, along with geographic and historical accounts. I wouldn't want to speculate on this of course, but perhaps characterizing the samples a little further in the supplement might help (like a table of sorts, with totals, split by male/female, and any other known demographics). I found it a little difficult to follow all the different populations tested; and which is the "discovery" and which is the "replication" cohort. There are also some discrepancies in the numbers in the text, supplement and captions.

PEL (1000G) LIMAA (?) Andean (NAGD) UK Biobank?

Males 41 2076 ? 2078 54 ?

Females 44 1433 ? 1434 34 ?

Total 85 3509 ? 88 413083

3. Data quality and artefacts. Although this is probably unlikely (but a potential concern/caveat) – wouldn't imputing data from non-represented reference genomes lead to false positives or incorrect phasing? Just curious if there was any check using a different reference panel? I understand this might be out of the scope of this particular analysis to delve into more details about it, but maybe as a sanity check? I'm coming to this from looking through this paper that evaluated potential influences/causes for LRLD:

https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0080754

4. Sex-biased gene expression and eQTLs. Was there any sex-biased expression in the assessed data? Noted sample sizes are still limited here, so I agree this might not be powered. But for some tissues, if you repeat the PEER analysis not only by tissue, but by sex? Correct me if I am wrong, but including sex as a covariate would regress out the sex effects? I'd be interested in seeing the split analysis, if there was a modulating effect that was also sex specific.

Is the expectation that ADCY9 and CETP are negatively correlated with a particular genotype but only in males? Or is there no expectation for sex-specific functional differences too (ie it is either genetic or phenotypic/functional). One way to check could be to look to see if these genes correlated in the GTEx/GEUVADIS/CARTaGENE data.

5. Epistatic/phenotypic data associations. The difference in genotype and MI association is a very interesting effect, but because the data is from a different population, I believe the message/point is kind of lost in the narrative. What are the distributions of the genotypes in the UK biobank? And how do these differ by sex? And what are the distributions of the two genotype combinations? Furthermore, a lot of these traits are correlated, I'm curious to see what those correlations are for these selected traits? Those that aren't associated with the traits but are associated when split by genotype, is that also a numbers effect? I'm not sure I caught that in the tests.

Figure suggestions/comments:

Figure 1 and consistency with labels/IDs. Would it be possible to show the LIMAA cohort here? It is not the "First nation" set -which I believe is the NAGD? The AND subpopulation (Andean) is the one carried on in the analyses, correct? It would be good to highlight these/specify in the caption and/or the figure. Is there a way to also show these frequencies by sex in those subpopulations you later work with? Ie for PEL, AND and LIMAA?

In Figure 1b – Is the recombination rate the lines? Would it be possible to only plot the AMR? It is hard to see/interpret the results here. The others could be in a supplementary Figure

In Figure 1 c – the PBS is also hard to see clearly here. One option would be to smooth the lines out or again, only show the PEL population.

The sex specific results are also super interesting (sup Figure 3 b/d) -> I think moving these also to this main figure shows a clearer point on sex effects.

iHS and recombination rates. I'm not very aware of this measure, but I was curious as to whether the other SNPs that have highly significant iHS values are of relevance, and what the interpretation would be here? Ie the one near the start of the gene seems to have the highest iHS value for the AMR pop. Is that also in high LD with the main SNP of interest in the ADCY9 gene?

Figure 2:Figure 2c, also show the LIMAA cohort as suggested for figure 1?

Figure 4: Could you repeat panel b with the GTEx data at all in the well-powered tissues?

Figure 5:Points in panel a are a little small, perhaps increasing the point size as in panel b/c?

https://doi.org/10.7554/eLife.69198.sa1

Author response

Essential revisions:

The reviewers identified a number of issues and questions that should be addressed in a revised manuscript. Here is a list of revisions considered essential, many of which come from the individual reviews. The individual reviews are included below, which have both a public review and recommendations for the authors from each reviewer. Please do also respond to all individual reviewers' points and suggestions in your response and revised manuscript.

1. Please include a figure or flowchart that illustrates the overall experimental design and analyses conducted, and which datasets each were performed with. A clear key for the population datasets with sample sizes and the abbreviations that refer to them would be helpful. Please also carefully define and use consistent terms in how datasets are referred to (e.g., First Nation, LIMAA, NAGD, Andean vs. Peruvian).

We agree that our overall experimental design is complex, and that it can be confusing to the reader, as exemplified by comments from all 3 reviewers.

To improve this aspect, we added main Figure 1 showing for each step the analyses performed, the datasets used, and the key results obtained (needed to move forward to next steps). We also identified the steps where we did consider sex in the analysis (green coloring). We also modified the main text to add references to the different steps throughout (see track changes). To increase clarity about the population datasets, we added main Table 1 containing information about each cohort used, defining clearly the abbreviations used, ethnicity, number of samples, female percentage, age (if information was available) and the citation. We carefully checked the terms and abbreviations in the main text and modified it where appropriate (see track changes).

Finally, in response to reviewer 3 (R3) comment R3.2, we have verified all numbers reported throughout the text, making sure it is clear if the sample sizes are reported before or after quality control (QC) steps (see track changes).

2. Please include a summary table of the male / female sample sizes, LD values and p-values for the datasets where LRLD was analyzed.

We have added Appendix – Table 1 to clarify the nature of cohorts used for the LRLD analyses, along with male/female sample sizes, LD values (r2) and empirical p-values. This is also in response to comment R3.2.

3. Please include the results showing the effect of CETP knockdowns on ADCY9 expression, in addition to the current results showing ADCY9 knockdowns on CETP expression.

Our initial knockdowns (KD) experiments included only one replicate, and in that first step, significant results were only detected for ADCY9 KD, which led us to validate this result further (by adding 5 replicates, performing western blot, and ultimately RNAseq). In the CETP-KD, we did not see any significant ADCY9 expression changes, but we did not want a figure with N=1, hence the “data not shown” mention in the previous version.

In response to your request and to comment R2.4, we have now increased the number of replicates to N=4 for CETP-KD and to N=4 for CETP overexpression, to report these results more appropriately. The results remain unchanged, the differences in ADCY9 expression between CETP-KD/overexpressed and controls are still not significant and are now reported in Figure 5—figure supplement 1. We also repeated with N=4 an experiment for ADCY9 overexpression, which is also not significant (Figure 5—figure supplement 1). The ADCY9 KD experiment was also repeated in this batch and remained significant (p<0.0026).

The following changes to the main text were made:

In Results:

– “This increased expression was validated by qPCR, and western blot also showed increased CETP protein product (Methods, Figure 5—figure supplement 1a,b, Appendix 1), but its overexpression did not significatively modulate CETP expression (Figure 5—figure supplement 1c).”

– L. 307 P. 11, we replaced “data not shown” by Figure 5—figure supplement 1

– in Material and Methods:

– “Silencer Select siRNA against CETP (Ambion 4392420 ID 2933)”

– We added a section: “Overexpression of ADCY9 and CETP genes in HepG2 cell line”

“For ADCY9 and CETP overexpression experiments, 500 000 cells in 2 ml of medium in a six-well plate were transfected using 1 ug of pEZ-M46-AC9 or pEZ-M50-CETP plasmids (GeneCopoeiaTM) with 5 ul of Lipofectamine 2000 reagent (Invitrogen cat # 11668-019) for 72h. Total RNA was extracted from transfected HepG2 cells using RNeasy Plus Mini Kit (Qiagen cat #74136) in accordance with the manufacturer’s recommendation (Appendix – Table 3, Appendix 1).”

– We added a clarification about the sex of donor of the HepG2 cell line:

“a cell line derived from the liver tissue of a 15-year-old male donor (78)”

We also updated the legend of the Figure 5—figure supplement 1:

– “(a) Relative mRNA expression of CETP of HepG2 cells 72h post-transfection with siRNA against human ADCY9 (si1039). qPCR assay was normalized with PGK1 and HBS1L genes, n=5 independent experiments, (p-value=0.0026 from t-test). (b) Quantification of CETP protein by Western blot assay, 200 ml of cell media (concentrated with Amicon ultra 0.5 ml 10 kDA units) from cells transfected with siRNA against human ADCY9 (si1039), were separated on 10% TGX-acrylamide gel and transferred to PVDF membrane. CETP protein expression was determined using a primary antibody rabbit monoclonal anti-CETP (Abcam, ab157183) 1:1000 (3% BSA, TBS, Tween 20 0.5%) O/N 4oC, followed by HRP-conjugated secondary antibody goat anti-rabbit 1:10 000 (3% BSA) 1h RT. Figure b represents densitometry analysis of n=3 experiments, p-value=0.0029 from t-test. (c,e) Relative mRNA expression of (c) CETP and (e) ADCY9 genes in HepG2 cells post-transfection with pEZ-M50-CETP (overexpression of CETP) or pEZ-M46-ADCY9 (overexpression of ADCY9 ) plasmids. qPCR assay was normalized with PGK1 and HBS1L genes, n=4 independent experiments. (d) Relative mRNA expression of ADCY9 of HepG2 cells 72h post-transfection with siRNA against human CETP. qPCR assay was normalized with PGK1 and HBS1L genes, n=4 independent experiments.”

4. Please include an additional analysis that considers the effect of sex on gene expression. Please also clarify which analyses did not consider the effect of sex (and why) and for which it was considered but no effect found.

This is a very important point that we carefully considered in this new version of the manuscript. Initially, we performed a preliminary sex-stratified analysis in our RNAseq discovery cohort (GEUVADIS) which was not very conclusive likely due to low sample size. We did not pursue this in our replication cohorts (GTEx and CaG). Thanks to this request and comment R3.4, we calculated PEER factors by stratifying the datasets by sex (sPEER factors) and performed further analyses in GTEx and CaG that produced very interesting new results, in line with the sex-specific nature of our other results, that we are happy to report.

We see differences between sex in these datasets. In summary, most sex-combined results were driven by males (including the CaG result) but stratifying by sex in GTEx samples revealed additional tissues in which females only show a significant interaction, and very intriguingly, the sign of the interaction effect is reversed in this case. We tested for a three-way interaction in tissues harboring a convincing sex-specific pattern in stratified analyses. We note that some tissues in GTEx did not have enough samples in one sex (<50 samples), so we did not include them in the sex-stratified analyses. This number was selected to maximize the number of tissues for which we tested the interaction while having high chances to have samples in all genotype combinations. Also, with this N, there are still some tissues where the rarest combinations in Europeans (AA-GG or AA-GA) are missing, so going below is not informative. If the reviewers/editors think that a higher threshold would be better, we can always remove the plots for some of the tissues. See Author response table 1 for the list of tissues excluded from our sex-specific analyses.

Author response table 1
OrganSexNumber of samples
Brain-AmygdalaFemale34
Brain-Anterior Cingulate Cortex (BA24)Female39
Brain-Caudate basal gangliaFemale45
Brain-Cerebellar HemisphereFemale46
Brain-Frontal Cortex (BA9)Female44
Brain-HippocampusFemale45
Brain-HypothalamusFemale44
Brain-Nucleus accumbens basal gangliaFemale49
Brain-Putamen (basal ganglia)Female38
Brain-Spinal cord cervical c-1Female42
Brain-Substantia nigraFemale28
Cells-EBV-transformed lymphocytesFemale43
Kidney-CortexMale/Female48/17
Minor Salivary GlandFemale32

We added these additional results to the “Epistatic effects on CETP gene expression” section of the main text and “Expression Quantitative Trait Loci (eQTL) analysis for rs1967309 and rs158477” section of Appendix 1. We modified Figure 5 (previously Figure 4) to add two box-plot panels (c,d), similar to previous Figure 4b, to include the most interesting results (skin in males, tibial artery in females), as well as Figure 5—figure supplement 2 (child) reporting on these analyses. Sex-stratified p-value plots for all tissues based on number of sPEER factors are included in new Appendix 1-figure 9.

– We added new subfigures in the Figure 5

– The new subfigures of Figure 5 (c,d) represent the interaction between our SNPs for CETP expression in Skin-Sun exposed in male (c) and Artery Tibial in female (d) in GTEx. Both have the interaction and enough samples to create a plot like the one of GEUVADIS (b), answering R3 request to repeat panel b with the GTEx data (R3.6)

– The legend was modified: “Figure 5. Effect of ADCY9 on CETP expression. (a) Normalized expression of ADCY9 or CETP genes depending on wild type (WT) and ADCY9-KD in HepG2 cells from RNA sequencing on five biological replicates in each group. P-values were obtained from a two-sided Wilcoxon paired test. qPCR and western blot results in HepG2 are presented in Figure 5—figure supplement 1. (b,c,d) CETP expression depending on the combination of rs1967309 and rs158477 genotypes in (b) GEUVADIS (p-value=0.03, ß=-0.22, N=287), (c) GTEx-Skin Sun Exposed in males (p-value=0.0017, ß=-0.32, N=330) and in (d) GTEx-Tibial artery in females (p-value=0.026, ß=0.38, N=156), for individuals of European descent according to principal component analysis. P-values reported were obtained from a two-way interaction of a linear regression model for the maximum number of PEER/sPEER factors considered. Figure 5—figure supplement 2 show the interaction p-values depending on number of PEER/sPEER factors included in the linear models.”

– We added a sentence in the abstract:

L 32-33, P. 2 “Analyses of RNA-seq data further suggest an epistatic interaction on CETP expression levels between the two SNPs in multiple tissues, which also differs between males and females.”

– We added a section in the main text for the sex-analysis results and discussion:

“Given the sex-specific results reported above, we stratified our interaction eQTL analyses by sex. We observed that the interaction effect on CETP expression in CaG whole blood samples (Nmale=359) is restricted to male individuals, and, despite low power due to smaller sample size in GEUVADIS, the interaction is also only suggestive in males (Appendix 1-figure 7c,d). In GTEx, most well-powered tissues that showed a significant effect in the sex-combined analyses also harbor male-specific interactions (Appendix 1-figure 9). For instance, GTEx skin male samples (Nmale=330) show the most significant male-specific interaction effects, with the directions of effects replicating the sex-combined result in GEUVADIS (an increase of CETP expression for each rs158477 A allele in rs1967309 AA individuals) albeit with an observable reversal of the direction in rs1967309 GG individuals (decrease of CETP expression with additional rs158477 A alleles) (Figure 5c, Figure 5—figure supplement 2a). However, significant effects in females are detected in tissues not previously seen as significant for the interaction in the sex-combined analysis, in the tibial artery (Figure 5d, Figure 5—figure supplement 2) and the heart atrial appendage (Appendix 1-figure 9). For tissues with evidence of sex-specific effects in stratified analyses, we also tested the effect of an interaction between sex, rs158477 and rs1967309 (Methods) on CETP expression: the three-way interaction is only significant for tibial artery (Figure 5—figure supplement 2).”

“The significant interaction effects on CETP expression vary between sexes in amplitude and direction, with most signals driven by male samples, but significant interaction effects observed in females only, despite sample sizes being consistently lower than for males. Notably, in the tibial artery and heart atrial appendage, two tissues directly relevant to the cardiovascular system, the female-specific interaction effect on CETP expression is reversed between rs1967309 genotypes AA and GG, compared to the effects seen in males in skin and brain tissues. Given our ADCY9-KD were done in liver cell lines from male donors, future work to fully understand how rs1967309 and rs158477 interact will focus on additional experiments in cells from both male and female donors in these relevant tissues.”

“The female-specific eQTL interaction results in arteries and heart tissues further suggest a link with the cardiovascular system, and the phenotype association results support further this hypothesis.”

– We added in Methods:

“To take into account hidden factors, we calculated PEER factors (75) on the normalized expressions, on all samples and stratified by sex (sPEER factors). To detect eQTL effects, we performed a two-sided linear regression on ADCY9 and CETP expressions using R (v.3.6.0) (https://www.r-project.org/) with the formula lm(p ∼rs1967309*rs158477+Covariates) for evaluating the interaction effect, lm(p ∼rs1967309+rs158477+Covariates) for the main effect of the SNPs and lm(p ∼rs1967309*rs158477*sex+Covariates) for evaluating the three-way interaction effect.”

“Reported values in the text are for five PEER factors in GEUVADIS, ten PEER factors in CARTaGENE, 25 sPEER for skin sun exposed in male and 10 sPEER for artery tibial in female in GTEx.”

We also added several paragraphs in the Appendix 1 in the sections of “ADCY9 and CETP expression quantification from RNAseq data” and “Expression Quantitative Trait Loci (eQTL) analysis for rs1967309 and rs158477”:

Appendix 1: “For all sex-stratified analysis, we kept sex-stratified tissues that had at least 50 samples, and recomputed PEER factors with samples from only one sex (which we term sPEER factors).”

Appendix 1: “Since the selective pressure differ between sexes, we stratified our expression analysis by sex. For CETP expression analysis, there are no consistent signals for GEUVADIS, possibly reflecting lack of power or that there is no sex-specific effect in lymphoblastoid cell lines (Appendix 1-figure 7c). However, in CaG, the significant interaction found is present only in male (again for the genotypic coding, Appendix 1-figure 7d). In GTEx, we see significant interaction effects in males in tissues that had signals with sex-combined, such as brain hippocampus (Nmale=105), hypothalamus (Nmale=112) and spinal cord cervical (Nmale=72), skin sun-exposed for CETP (Nmale=330) (Figure 5d, Appendix 1-figure 9). We note that for most brain tissues, the low sample size in females does not allow to conclude on the presence of an interaction effect in that sex. In these tissues, the direction of the effect in males is reversed compared to what is observed in GEUVADIS with sexes combined (Figure 5b), whereas the highly significant result in skin shows an effect consistent with the sex-combined GEUVADIS result (p-value=0.0017, ß=-0.32). More specifically, in the sun-exposed skin samples, in rs1967309 AA males, copies of the rs158477 A allele increase CETP expression by 0.49 (95% CI 0.12-0.87) on average. In rs1967309 AG males, the effect of rs158477 is null (p-valueAG=0.33) and the effect of the rs158477 A allele is suggestive in rs1967309 GG individual (p-valueGG=0.10) with a decrease of the CETP expression. Conversely, if we look at the effect of rs1967309 on CETP expression in skin sun-exposed when we stratified by rs158477 in males, SNP rs1967309 is neither a significant eQTL of CETP in GG of rs158477 in males (p=0.11, n=89) nor for GA (p=0.65, n=164), but is a significant eQTL for AA males (p=0.026, ß=-0.46, OR=[-0.87, -0.06], n=76). We also identified new tissues where the interaction is either suggestive or significant in females, in artery tibial (Nfemale=156), heart atrial appendage (Nfemale=97), spleen (Nfemale=69) and stomach (Nfemale=105) (Appendix 1-figure 9), with an effect reversed compared to the initial GEUVADIS result (Appendix 1-figure 7a). For the pituitary tissue, it is significant in both sex (additive coding in female and genotypic coding in male, Nmale=156, Nfemale=63), but the direction of the additive coding is reversed between sexes, possibly explaining why the sex-combined analysis did not show any signal. We note that the newly discovered signals are mainly for females, indicating that the signal was hidden by the male effects (or absence of effects), likely because of higher sample sizes.”

As shown in our new Figure 1, most analyses consider the effect of sex (green boxes), except:

– iHS, this statistic is powerful to detect hard selective sweeps and is not designed to detect sex-specific selection (which will likely look like an incomplete soft sweep) and did not show significant results in the PEL cohort.

– RFMix inference analyses of local ancestry did not consider males and females separately, it would only result in loss of precision if we split the reference panel in this way. However, we did check whether there are significant differences between males and females in global ancestry and did not find significant results. This was reported in Results.

– KD/over-expression experiments, which were done in HepG2 cells from a male donor. We have added this information in the Methods and discussed future steps in cell lines from male and female donors in the Discussion.

“The human liver hepatocellular HepG2 cell line was obtained from ATCC, a cell line derived from the liver tissue of a 15-year-old male donor”

“Given our ADCY9-KD were done in liver cell lines from male donors, future work to fully understand how rs1967309 and rs158477 interact will focus on additional experiments in cells from both male and female donors in these relevant tissues.”

5. Please consider the imputation quality across the different datasets, and especially how it differs across populations to address the concern that imputation from non-representative reference panel populations could be problematic.

Indeed, we had reported that the INFO score for our two SNPs imputed in LIMAA and NAGD with the Haplotype Reference Consortium (HRC) was lower than our 0.8 initial threshold in Appendix, section “Pre-processing of the LIMAA cohort”. This can be expected in an admixed population, when the reference panel does not include a close enough population, and even more for differentiated SNPs as it is the case here, for which the allele frequencies would be misaligned with the reference panel frequencies. Therefore, our approach was to lower the threshold to 0.7 (which is still considered a high score in most studies).

However, based on this comment and comment R3.3, we performed additional analyses to make sure that the lower INFO score was due to low confidence because of the factors described above, and not to incorrect imputed genotypes. We thus redid the imputation of LIMAA and NAGD with the TOPMED reference panel. For our two SNPs, rs1967309 and rs158477, the r2 score for imputation with TOPMED is now higher (r2>0.9 for both), meaning that a more diversified reference panel helped the imputation of this admixed population. Furthermore, when we look at the allele frequencies and genotypes, they are highly similar (see Author response table 2, only 5% and 2% of individual allele mismatches in LIMAA for rs1967309 and rs158477, respectively). We recomputed the genotype correlation (LRLD) with these new data in LIMAA, and the value obtained is r2 = 0.0047 (vs. 0.0046 before).

Author response table 2
DatabaseSNP - VariantTOPMED imputationSanger Imputation Server - Haplotype Reference Consortium (in manuscript)
LIMAArs1967309 A77%76%
rs158477 G79%79%
NAGD-Andeanrs1967309 A77%77%
rs158477 G74%73%

Given these small differences, and the time it would require redoing all figures and null distributions with the new imputed data, we decided not to change our analyses in the manuscript and keep imputation with HRC, while adding these considerations to Appendix 1 in the section “Pre-processing of the LIMAA cohort” and “Genotype association between rs1967309 and rs158477 in LIMAA”.

Appendix 1: “To make sure imputation quality did not impact our results because of incorrectly imputed genotypes, we redid the imputation of LIMAA with the TOPMED reference panel (https://imputation.biodatacatalyst.nhlbi.nih.gov/#!). The imputation r2 score with TOPMED is higher than 0.9 for both, and only very limited differences in imputed genotypes are seen (only 5% and 2% of individual allele mismatches in LIMAA for rs1967309 and rs158477, respectively for the 3,243 individuals)”

Appendix 1: “Finally, we also considered how the imputation quality in LIMAA could affect the main result of LRLD, because of imputation from non-representative reference panel populations is known to be problematic. We recomputed the genotype correlation (r2) in LIMAA with our two SNPs imputed with the TOPMED panel, a more representative panel than the Haplotype Reference Consortium initially used. The value obtained is 0.0047 compared to 0.0046 before, showing that imputation quality is unlikely to have affected our results.”

6. Please discuss in more detail the proposed interpretation for sex-specific selection. First, the type of selection is not stated in the abstract (only "natural selection"). Second, in the discussion two possibilities are proposed (lines 353-358) but it is not clear if any are supported under the presented evidence. Does the lack of correlation with age rule out a mechanism affecting survival? How strong would selection need to be in a single generation to produce this level of LRLD?

We are delighted that the editors and reviewers agree with us about the importance of the sex-specific result we present here. We preferred to be cautious in our first version in our interpretation of this result, in order not to oversell it. But in the light of the newest gene expression results presented in favor of the sex-specific effects, and given the positive feedback received, we are now more comfortable on expanding on this aspect.

We agree that our interpretation of the age-dependant result in the discussion was not clear, as the analysis conducted indeed does not really allowed us to conclude: we did this analysis with what we had at hand in LIMAA, and if we had seen an age-dependent effect, it would have been support for the differential survival hypothesis, but in the case of a non-significant result, it does not necessarily exclude this possibility. Indeed, we may be underpower to detect an age-effect, or simply not looking at the appropriate timeframe, and a pediatric cohort may be needed. To test the in-utero hypothesis, a cohort of Peruvian newborns would allow to evaluate if the genotype correlation is already present at birth, or a cohort of stillborn babies would allow to evaluate if a genotype combination is more frequent in those. We now expanded in the text on how this question could be further resolved in the future.

Finally, we thought a lot about the important question of how strong selection would need to be in a single generation to produce this level of LRLD. Because this is, to the best of our knowledge, the first instance reported of such a phenomenon, there is no theoretical ground on which to base a formal answer to this question and we would need to perform a carefully designed simulation study to answer this question. This is beyond the scope of the present work, but it is an exciting avenue of research. In our opinion however, selection probably needs to be quite strong in order to produce this effect and, despite searching for outstanding causes for infant/children mortality in Peru and looking at male/female ratio statistics for children born in Peru, we did not find any hints on what could be happening. On the other hand, in utero and sperm-specific selection are thought to result from stronger selection, which makes us think that this is the most likely explanation. But why would it differentially impact boys and girls in this case (sex hormones?), and why specifically in Peru (pregnancy in altitude?), remain mysteries.

The changes made in the text are as follow:

– In the abstract:

“rs1967309 region exhibits signatures of positive selection in several human populations”

“We propose that ADCY9 and CETP coevolved during recent human evolution due to sex-specific selection.”

– In the discussion:

“Such two-gene selection signature, where only males show strong LRLD, can happen if a specific genotype combination is beneficial in creating males (through differential gamete fitness or in utero survival, for example) or if survival during adulthood is favored with a specific genotype combination compared to other genotypes. In the case of age-dependent differential survival, the genotypic association is expected to be weaker at younger ages, however the LRLD signal between rs1967309 and rs158477 in the LIMAA cohort did not depend on age neither in males nor in females (Appendix 1). Since very few individuals were younger than 20 years old, it is likely that the age range in this cohort is not appropriate to distinguish between the two possibilities. This age-dependent survival therefore remains to be tested in comparison with pediatric cohorts of Peruvians: if the LRLD signal is absent in newborns for example, it will suggest a strong selective pressure acts early in life on boys. To specifically test the in-utero hypothesis, a cohort of stillborn babies with genetic information could allow to evaluate if the genotype combination is more frequent in these. Lastly, it may be that the evolutionary pressure is linked to the sex chromosomes (69,70), and a three-way interaction between ADCY9, CETP and Y chromosome haplotypes or mitochondrial haplogroups remains to be explored.”

“Furthermore, not much is known about the strength of this type of selection, and simulations would help evaluate how strong selection would need to be in a single generation to produce this level of LRLD.”

7. Please improve figure visibility and naming as suggested in the specific recommendations below. Also, where possible indicate on the figures where the annotated genes are located, as is only currently shown in Figure 1B. Use consistent naming, e.g., for populations in Figure 1 and 3 (are 3a and b only PEL although the figure title says 1000G?).

We improved the quality of all main figures, added gene annotation plots for the several figures (iHS, PBS and LRLD) and used consistent naming throughout. We also switch the wording men/women to males/females to be more consistent in figures and text. We changed all the naming of “Figure” to “Figure” for main and Appendix 1-figures. We also moved several Appendix figures to “child” figures (“figure supplement”).

We also followed the suggestion from reviewers below, and made the following changes:

Figures

– Added a new Figure 1 (see response to E.1 above)

– Figure 1 (now Figure 2)

– Figure 1a (now Figure 2a): We changed the name of the Native Amerind to NAGD to be more consistent in our naming;

– Figure 1b (now Figure 2b) : We kept only the recombination rate of the AMR population and removed the others, since that was confusing (and didn’t give much information since they are all very similar). This rate was plotted mainly to show the LD block around rs1967309;

We added a sentence to specified that the LD block is consistent:

“The genetic variant rs1967309 is located in intron 2 of ADCY9, in a region of high linkage disequilibrium (LD), in all subpopulations in the 1000 Genomes Project (1000G), and harbors heterogeneous genotype frequencies across human populations (Figure 2a).”

– Figure 1c (now Figure 2c): In an attempt to be clearer, we change the representation with lines to a scatter plot (points) and separated by subpopulation. We emphasized rs1967309 values and changed the color, since we added PBS in another figure that used those colors for another meaning. We also removed the different shape of the point for populations other than the Peruvian, and, for the Peruvian, we also removed the special shape to put a black circle to emphases it. (in response to the figure suggestions of R3)

– We also updated the information figure legend (see track changes)

– Figure 2 (now Figure 3):

Figure 2c (now Figure 3c): We changed the name of the Native Amerind to NAGD;

Added the meaning of the black symbols and the blue lines/box under the figures.

We added 2 child figures

– Figure 3 (now Figure 4):

We added the number of individuals in the title

Figure 3a-b (now Figure 4a,b):

We changed the identification 1000G to PEL. We also updated the information in the figure legend of the figure (see track changes)

– Added the meaning of the black symbols and the blue lines/box under the figures.

– We added 3 child figures

– We added the figures of genotype frequency distribution by sex for rs1967309 and rs158477 as a child Figure 4—figure supplement 1 (in response to the figure suggestions of R3)

– Figure 4 (now Figure 5)

We added two plots to this figure (see response to E.4 above).

We added 2 child figures

We added new figures as a child Figure5—figure supplement 2 of a 2-way and 3-way interaction between our SNPs (and sex) for the two tissues that we added in the parent Figure 5.

– Figure 5 (now Figure 6)

Figure 5a (now Figure 6a): We enlarged the size of the plot to make them easier to read.

We added 2 child figures

Tables:

– Added the ‘Key Resources Table’ at the beginning of the Material and Methods section

– Added a table containing information for each cohort as a Table 1 in the main manuscript (in response to E1 and R3.2)

Supplementary figures:

– We removed/modified/added Supplementary Figures:

– To the Appendix 1-figure 2, we added the PCA and UMAP figures to the RFMix result, previously Supplementary Figure 7. For the PCA/UMAP plots, we changed the color to be colorblind friendly.

– From the Appendix 1-figure 3, we moved the subfigure b,d to be a child Figure 4—figure supplement 2 (in response to the figure suggestions of R2). For each PBS plot, we also changed the color to avoid confusion since the initial color were used to identified SNPs in the parent Figure 4.

– To Appendix 1-figure 4, we added gene under LRLD plot and changed labels of the legend (sex True sex labels; Shuffled sex Shuffled sex labels), previously (Supplementary Figure 8)

– From the Supplementary Figure 4 and 6 (LRLD in Andean of NAGD and LIMAA respectively), we moved the subfigure a,b (sex-combined) to become a child Figure 3—figure supplement 1. We also moved d,f of Supplementary Figure 4 (sex-stratified of ADCY9 in Andean of NAGD) to become a child Figure 4—figure supplement 3. We merged the other subfigures to become the new Appendix 1-figure 5.

– We moved the previous Supplementary Figure 9 to a child Figure 5—figure supplement 1. To this figure, we also added the CETP-KD result and CETP and ADCY9 overexpression in response to (E3 and R2.4)

– Appendix 1-figure 6, we added a new figure showing the correlation between ADCY9-CETP across tissues in GTEx (see response to R3.4)

– To the Supplementary Figure 14 (now Appendix 1-figure 7), we added the number of individuals in the title (which represents the number after filtering out individuals missing the genotype of rs1967309 and/or rs158477). We also added results of the analyses of the interaction between SNPs when adding sPEER factors for the stratifying by sex analysis.

– Appendix 1-figure 8: Previously Supplementary Figure 15

– We added a new Appendix 1-figure 9 which are the results from the analysis of the interaction on CETP stratified by sex depending on the number of sPEER added in the regression (to answer E4 and R3.4)

– We moved the previously Supplementary Figure 16 and 17 as child figures Figure 6—figure supplement 1 and 2 respectively.

Appendix – tables:

– We added a new Appendix – Table 1 (see response to E.2 above)

– We moved the previous Supplementary Table 1 to Appendix – table 2, and Supplementary Table 2 to Appendix – Table 3, in concordance with the order mentioned in the main text.

Reviewer #1 (Recommendations for the authors):Because the experimental design of the authors is complex, I suggest that they provide a Figure representing and the different steps of the rationales as well as their experiments. It may also help if they number the different steps of the experimental design both in the Figure and the Main Text.

We thank the reviewer for highlighting this necessity for clarification. We agree with the reviewer that it may be difficult to follow all our steps and so, we added main Figure 1 showing for each step (1 to 4), the analyses performed, the datasets used, and the key results obtained (needed to move forward to next steps). We also identified the steps where we did consider sex in the analysis (green coloring). We also modified the main text to add references to the different steps throughout. This is also in response to essential comment E.1 above.

Specific comments:

– the authors should explicitly state wich array was used for the genotyping of the LIMAA replication cohort.

Indeed, we did not mention this information in our methods. We have now added this text in the Methods section:

– “This cohort was genotyped with a customized Affymetric LIMAAray containing marker optimized for Peruvian-specific rare and coding variants.”

Reviewer #2 (Recommendations for the authors):

a) I think having a summary figure that goes through and connects the different analyses could potentially be very helpful for the flow of this manuscript. I think the collection of analyses are very impressive, but I find myself continually taking a step back to keep track of the different levels presented. You move from selection analyses, to LD analyses, to gene expression analyses, and finally to phenotype analyses. In addition, at multiple points, there is a further stratification by sex. It may not be immediately intuitive on what a summary figure would be for this, but having something to help readers keep track of how or why these different analyses are connected would be very useful.

We thank the reviewer for highlighting this necessity for clarification. We agree with the reviewer that it may be difficult to follow all our steps and so, we added main Figure 1 showing for each step (1 to 4), the analyses performed, the datasets used, and the key results obtained (needed to move forward to next steps). We also identified the steps where we did consider sex in the analysis (green coloring). We also modified the main text to add references to the different steps throughout. This is also in response to essential comment E.1 above.

b) In the introduction the authors highlight that one of the open questions to be addressed is: "However, the underlying mechanisms linking CETP and ADCY9, located 50 Mb apart on chromosome 16, as well as the relevance of the rs1967309 non-coding genetic variant are still unclear." However, the authors do not return to this specific point regarding rs1967309 later in the discussion. I think, since the authors do indeed identify and explore some of the relevance of rs1967309, it would benefit the manuscript to reconnect to this point later at the end of the paper.

We thank the reviewer for this comment. Despite our study demonstrating that rs1967309 is likely involved in a functional mechanism related to CETP expression, and that this mechanism implicates sex as a modulator, the exact mechanism remains unknown. Therefore, we avoided to speculate too much in our first version of the manuscript, but we agree with the reviewer that we should return to this point to wrap up the article. We have now added a paragraph that goes back to the initial discovery of rs1967309 and discusses what our results bring to this important ongoing research.

In Discussion:

“Despite these limitations, our results support a functional role for the intronic SNP rs1967309, likely involved in a molecular mechanism related to CETP expression, but this mechanism seems to implicate sex as a modulator in a tissue-specific way, which complicates greatly its understanding. In the dal-OUTCOMES clinical trial, the partial inhibitor of CETP, dalcetrapib, did not decrease the risk of cardiovascular outcomes in the overall population, but rs1967309 in the ADCY9 gene was associated to the response to the drug, which benefitted AA individuals (4). Interestingly, rs1967309 AA is found in both the male and female beneficial combinations of genotypes for CAD, the same that are enriched in Peruvians, but without taking rs158477 and sex into account, this association was masked. The modulation of CETP expression by rs1967309 could impact CETP’s functions that are essential for successfully reducing cardiovascular events. The rs158477 locus could be a key player for these functions, and dalcetrapib may be mimicking its impact, hence explaining the pharmacogenomics association. Furthermore, in the light of our results, some of these effects could differ between men and women (73), which may need to be taken into consideration in the future precision medicine interventions potentially implemented for dalcetrapib.”

c) At the start of the 'sex-specific long-range linkage disequilibrium' section, the authors state the following: "We explored the effect of sex on the LRLD association found between rs1967309 and rs158477. The allele frequencies at rs1967309 were suggestively different between males and females and sex-stratified PBS analyses suggest that the LD block around rs1967309 is differentiated between sexes in the Peruvians…" It is unclear to me if the second sentence is a result of exploring the effect of sex on LRLD associations, or if it is meant to be the justification for exploring the effect of sex. I think clarifying this would be important since it is not otherwise clear why you would move to doing sex-specific analyses, though it does produce an interesting result.

Thanks for asking to clarify this aspect, we now see how exploring the effect of sex was not well motivated in the text. Yes, the reviewer is right that the reason why we did these analyses was because we observed suggestive differences in allele frequencies at rs1967309 in Peru. We had computed FST between males and females, which was high in the rs1967309 region compared to the rest of the gene (it was in the top 1% of values for PEL, see Author response image 1, but not genome-wide significant after accounting for allele frequency, p=0.11), but the sex-stratified PBS result was striking: the differentiation signal completely disappeared in females. This is what motivated the LRLD sex-stratified analyses that followed. We have now clarified this in the text.

Author response image 1
Weir and Cockerham FST between males and females in PEL.

Horizontal line represents the 99th percentile value for this population (for chromosome 16).

In “Sex-specific long-range linkage disequilibrium signal” section in Results:

“Because the allele frequencies at rs1967309 were suggestively different between males and females (Figure 4—figure supplement 1), we performed sex-stratified PBS analyses, which suggested that the LD block around rs1967309 is differentiated between sexes in the Peruvians (Figure 4—figure supplement 2, Appendix 1). We therefore explored further the effect of sex on the LRLD association found between rs1967309 and rs158477 and performed sex-stratified LRLD analyses.”

d) I'm wondering why the data regarding the lack of impact CETP knockdowns have on ADCY9 expression is not shown? I think this is an interesting and important result. It helps demonstrate a directionality to the interaction that continually shows up elsewhere. If possible, showing this data would be helpful.

The reviewer is right that this is an important result that led us to the hypothesis of modulation of CETP expression through ADCY9 SNP. Our initial knockdowns (KD) experiments included only one replicate, and in that first step, significant results were only detected for ADCY9 KD, which led us to validate this result further (by adding N=5 replicates, performing western blot, and ultimately RNAseq). In the CETP-KD, we did not see any significant ADCY9 expression changes, but as we did not pursue the negative results further, we initially chose not to include a figure with N=1, hence the “data not shown” mention.

We have now increased the number of replicates to N=4 for CETP-KD and to N=4 for CETP overexpression, to report these results more appropriately. The results remain unchanged, the differences in ADCY9 expression between CETP-KD/overexpressed and controls are still not significant and are now reported in Figure 5—figure supplement 1. We also repeated with N=4 an experiment for ADCY9 overexpression, which is also not significant (Figure 5—figure supplement 1). We modified the text accordingly, as reported in essential comment E.3.

e) Now that you have identified some more relevant tissue types and phenotypes from your epistasis analyses, I am wondering if there are any links that can be made back to your mouse knockouts or any other ADCY9 or CETP knockouts? It seems like with a new spectrum of traits to consider, previous knockout results may make more sense now. Are there any OMIM conditions that now may have some connection to these results?

Among tissues showing an interaction between ADCY9 and CETP genes, there is the hypothalamus which regulates the central nervous system, including respiratory and pulse rates. In the mice experiment in Rautureau et al., 2018, we previously saw a modulation of both these phenotypes, suggesting a role of both genes in regulation of the autonomic nervous system, which we will prioritize in future work. We did not find any condition in OMIM linked to tissues where the interaction is present for CETP expression. The only condition linked with CETP is the hypercholesterolemia (mainly HDL), which can be associated with cardiovascular disease, but, in our phenotype association, the interaction is not significant for HDL (Figure 6).

– We added a few sentence in the discussion to link with our previous mice KD:

“In a previous study, we showed that inhibition of both Adcy9 and CETP impacted many phenotypes linked to the ANS in male mice (6), but in the light of our results, these experiments should be repeated in female mice. The function of ANS is important in a number of pathophysiological states involving the cardiovascular system, like myocardial ischemia and cardiac arrhythmias, with significant sex differences reported (62-64).”

f) In regards to the CAD outcome findings, I am wondering if you could make use of some of the CAD polygenic risk scores that exist now to further show a link. Specifically, I am wondering whether you could stratify individuals into high and low risk for CAD, and see whether the protective genotype combinations are overrepresented among the low risk individuals vs. high risk (and between the sexes as well). Some example CAD PRSs can be found in the PGS Catalogue: https://www.pgscatalog.org/trait/EFO_0001645/

Thank you for sharing this idea, this is a very interesting comment that we will consider in our follow-up work. We however feel this is beyond the scope of the present paper.

Reviewer #3 (Recommendations for the authors):

I have a few comments and questions below on the approach and results.

1. Population/cohort specific effects. Is there any known ascertainment bias of the cohorts? Skimming the source papers, it looks like for one cohort, they all presented to a clinic for tuberculosis?

The LIMAA cohort is indeed from a tuberculosis study but includes many individuals who did not have tuberculosis. To make sure there was no ascertainment bias in terms of genetic ancestry, we performed PCA and UMAP analyses as QC and saw no separation according to disease state. If there was signal for genetic predisposition to tuberculosis here, this would affect a restricted number of loci, and as there’s no evidence that our loci contribute to tuberculosis, we do not think this is a problem.

We added the following text to supplementary text, section “Comparison between Peruvian cohorts”

– Appendix 1: “Also, the LIMAA cohort is initially from a tuberculosis study (32), but our PCA and UMAP analysis showed no separation according to disease state.”

I'm not sure if I missed it – but is the sex-effect replicated in the UK biobank data? And the linkage missing or apparent in the UK biobank data? One result that is glossed over (again apologies if I missed it in the supplement), but the effect looks reversed/stronger in females in the Andean population (sup Figure 4 d/f)? Is this true or just not significant/too noisy? Also, if this is testable with the other datasets available (since it is a much larger cohort), they could be used as negative controls of sorts. Ie the UK biobank and/or the other native cohorts.

We did test the LRLD in the UKB (Author response image 2), but no significant signal was seen, similarly to the GBR population (or any European population) in the 1000G. The AA genotype frequency at rs1967309 is in line with GBR in UKB White British participants (see details in response to R3.5). We added an Appendix – table 1 containing the r2 values and their p-values mentioned in the main text, and we also added a few other populations from 1000G and NAGD as negative controls.

Author response image 2
LRLD of rs1967309 with CETP gene and rs158477 in ADCY9 gene in UKB for SNPs with a MAF>5%.

The horizontal line is the 99th percentile of all pairs of SNPs between ADCY9 and CETP genes.

2. Speculation/comments on why there would be a male/female difference in these population (or any). This circles back to the UK biobank question I had earlier. Are frequencies linked to Mt or Y chromosome haplotypes? Is there a way to measure or view this? I understand the RFMix analysis was meant to test for admixture, but would it also show something like this?

Yes, the reviewer is right that Y chromosome and Mt haplotypes are potentially at play here, and this would be presumably testable in the LIMAA cohort, but it is beyond the scope of the present study to evaluate this. It is however something we plan to investigate in follow-up analyses and we added a mention to the discussion:

“Lastly, it may be that the evolutionary pressure is linked to the sex chromosomes (69,70), and a three-way interaction between ADCY9, CETP and Y chromosome haplotypes or mitochondrial haplogroups remains to be explored.”

As for the male/female differences, we now discuss in more detail the proposed interpretation for sex-specific selection in this new version (see response to essential comment E.6). About the sex-effects on UKB phenotypes, we prefer not to speculate in the manuscript since more investigation is needed, but cardiovascular disease is known to be sexually dimorphic, and some of these differences are driven by genetics and interaction with sex hormones, so this constitute a possible hypothesis to be tested.

As it is hard to say definitively or exclude all potential counterfactuals for these results, without including additional datasets from the same region, along with geographic and historical accounts. I wouldn't want to speculate on this of course, but perhaps characterizing the samples a little further in the supplement might help (like a table of sorts, with totals, split by male/female, and any other known demographics). I found it a little difficult to follow all the different populations tested; and which is the "discovery" and which is the "replication" cohort. There are also some discrepancies in the numbers in the text, supplement and captions.

PEL (1000G) LIMAA (?) Andean (NAGD) UK Biobank?

Males 41 2076 ? 2078 54 ?

Females 44 1433 ? 1434 34 ?

Total 85 3509 ? 88 413083

We thank you for pointing out these discrepancies in the numbers in the text. This was due to sometimes reporting sample size before or after QC, and to problems when merging different versions of the initial manuscript. We have clarified whether the numbers are reported pre- or post-QC and verified them multiple times. We hope no more discrepancies remain.

Concerning better characterization of the samples, we added two tables to our manuscript:

1. Table 1 contains the name of each cohort and its abbreviation, the number of samples with percentage of females, the ethnicity, and the age (if this information was available). This is also in response to essential comment E.1 above.

2. Appendix – Table 1 displays information about the number of samples for each LRLD analysis, with their value of r2 and p-values. This is also in response to essential comment E.2 above.

3. Data quality and artefacts. Although this is probably unlikely (but a potential concern/caveat) – wouldn't imputing data from non-represented reference genomes lead to false positives or incorrect phasing? Just curious if there was any check using a different reference panel? I understand this might be out of the scope of this particular analysis to delve into more details about it, but maybe as a sanity check? I'm coming to this from looking through this paper that evaluated potential influences/causes for LRLD:

https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0080754

Indeed, we had reported that the INFO score for our two SNPs imputed in LIMAA and NAGD with the Haplotype Reference Consortium (HRC) was lower than our 0.8 initial threshold in Appendix, section “Pre-processing of the LIMAA cohort”. Based on this comment and essential comment E.5, we performed additional analyses to make sure that the lower INFO score was due to low confidence because of the factors described above, and not to incorrect imputed genotypes. We thus redid the imputation of LIMAA and NAGD with the TOPMED reference panel and we recomputed the genotype correlation (LRLD) with these new data in LIMAA, and the value obtained is r2 = 0.0047 (vs. 0.0046 before). Given these very tiny difference, and the time it would require redoing all figures and null distributions with the new imputed data, we decided not to change our analyses in the manuscript and keep imputation with HRC, while adding these considerations to Appendix 1 in the section “Pre-processing of the LIMAA cohort” and “Genotype association between rs1967309 and rs158477 in LIMAA”.

See response to essential comment E.5 above for details on changes made.

4. Sex-biased gene expression and eQTLs. Was there any sex-biased expression in the assessed data? Noted sample sizes are still limited here, so I agree this might not be powered. But for some tissues, if you repeat the PEER analysis not only by tissue, but by sex? Correct me if I am wrong, but including sex as a covariate would regress out the sex effects? I'd be interested in seeing the split analysis, if there was a modulating effect that was also sex specific.

We thank the reviewer for this comment, as this resulted in the most significant addition to the new version of the manuscript. Initially, we performed a preliminary sex-stratified analysis in our RNAseq discovery cohort (GEUVADIS) which were not very conclusive due to low sample size and did not pursue this in other cohorts (GTEx and CaG). But thanks to this comment, we calculated PEER factors by stratifying the datasets by sex (sPEER factors) as suggested and we performed further analyses in GTEx and CaG that produced very interesting new results, in line with the sex-specific nature of our other results, that we are happy to report.

See response to essential comment E.4 above for summary of results and details on changes made.

Is the expectation that ADCY9 and CETP are negatively correlated with a particular genotype but only in males? Or is there no expectation for sex-specific functional differences too (ie it is either genetic or phenotypic/functional). One way to check could be to look to see if these genes correlated in the GTEx/GEUVADIS/CARTaGENE data.

To answer your question, we looked at the correlation between expression of ADCY9 and CETP across tissues from GTEx. In the Author response image 3 (left panel) we show the p-value of the correlation ADCY9 and CETP expression (correcting for sex, age, PCs, collection site, sequencing platform and total ischemic time) combining males and females. The genes indeed correlate negatively in most of the tissues of GTEx, which concords with the results in ADCY9-KD experiment. We added Appendix 1-figure 6 to report this.

Author response image 3
Correlation between ADCY9 and CETP genes across tissues of GTEx.

(left) sex-combined, the shape represents the direction of the correlation. (right) Comparison of β of the correlation between both genes between male and female. Bars represent the standard error obtained from the summary of the lm() in R.

We then stratified the dataset by sex and present (Author response image 3, right panel) the comparison of the estimated betas in males and females. The direction of effect is generally consistent and most of the tissues show that ADCY9 and CETP expressions are negatively correlated in both sexes. The tissue which is an outlier is the breast mammary tissue. We do not include this sex-stratified comparison of betas in the manuscript as we feel that it will need to be better characterized in the context of how CETP expression correlates with other genes of the transcriptome. Because it is independent of our SNPs, this is beyond the scope of this study.

“When looking across tissues in GTEx, ADCY9 and CETP expressions negatively correlate in almost all the tissues (Appendix 1-figure 6, Appendix), which is consistent with the effect observed during the ADCY9-KD experiment, showing increased expression of CETP expression when ADCY9 is lowly expressed (Figure 5a, Figure 5—figure supplement 1a,b).”

Appendix 1: “To test if ADCY9 and CETP expression is correlated across tissues in humans, we used data GTEx and performed a linear regression correcting for the first 5 PCs, age, sex, the collection site (SMCENTER), the sequencing platform (SMGEBTCHT) and total ischemic time (TRISCHD). We find that ADCY9 and CETP gene expression levels are negatively correlated (and significantly(p<0.05) so for Adipose-Subcutaneous, Adrenal Gland, Artery Coronary, Artery Tibial, Brain-Cerebellar Hemisphere/Cerebellum/Cortex/Frontal Cortex/Putamen (basal ganglia), Breast-Mammary Tissue, Esophagus-Gastro esophageal Junction/Muscularis, Heart-Left Ventricle, Lung, Prostate, Small Intestine-Terminal, Spleen, Stomach, Uterus, Whole blood), except in skin tissues and cells cultured fibroblast, for which a significant positive correlation is found (Appendix 1-figure 6).”

5. Epistatic/phenotypic data associations. The difference in genotype and MI association is a very interesting effect, but because the data is from a different population, I believe the message/point is kind of lost in the narrative. What are the distributions of the genotypes in the UK biobank? And how do these differ by sex? And what are the distributions of the two genotype combinations? Furthermore, a lot of these traits are correlated, I'm curious to see what those correlations are for these selected traits? Those that aren't associated with the traits but are associated when split by genotype, is that also a numbers effect? I'm not sure I caught that in the tests.

The data being from a different population is a clear limitation of our results here, as stated in the Discussion. We looked at the distribution of the genotypes in the UKB for both of our SNPs, for sex-combined and stratified by sex and we present them in the Author response table 3 and report them in the Appendix 1. We also present the number of individuals by genotypes combination (Author response figure 4), showing that the frequencies and distribution of the combination between our SNPs are higher similar in sex-combined and sex-stratified groups.

Author response table 3
AllMaleFemale
rs1967309-A0.3928180.3929350.392719
rs158477-G0.4716280.4717970.471484

– Appendix 1: “Individually, SNP rs1967309 (fA=39%) is nominally associated with heart rate, and rs158477 (fG=47%) with the systolic blood pressure […]”

Author response table 3. Allelic frequencies by sex and sex combined of rs1967309 and rs158477 in the UKb.

Author response image 4
Number of samples by genotype combination, in sex-combined and stratified by sex.

About the correlation between phenotypes, we used a simple cor.test() in R between our selected phenotypes (without adjusting for any covariates, Author response image 5). As expected, many phenotypes are highly correlated, and when we stratified by sex, the correlation is similar for both sexes. We note that the correlation between FEV and FVC and CAD (ischaemic heart disease) is very low (r2 = -0.03 and -0.05, respectively for the sex combined).

Author response image 5
Correlation between phenotypes in the UKB in sex-combined and stratified by sex.

Figure suggestions/comments:

Figure 1 and consistency with labels/IDs. Would it be possible to show the LIMAA cohort here? It is not the "First nation" set -which I believe is the NAGD? The AND subpopulation (Andean) is the one carried on in the analyses, correct? It would be good to highlight these/specify in the caption and/or the figure. Is there a way to also show these frequencies by sex in those subpopulations you later work with? Ie for PEL, AND and LIMAA?

In Figure 1b – Is the recombination rate the lines? Would it be possible to only plot the AMR? It is hard to see/interpret the results here. The others could be in a supplementary Figure

In Figure 1 c – the PBS is also hard to see clearly here. One option would be to smooth the lines out or again, only show the PEL population.

The sex specific results are also super interesting (sup Figure 3 b/d) -> I think moving these also to this main figure shows a clearer point on sex effects.

iHS and recombination rates. I'm not very aware of this measure, but I was curious as to whether the other SNPs that have highly significant iHS values are of relevance, and what the interpretation would be here? Ie the one near the start of the gene seems to have the highest iHS value for the AMR pop. Is that also in high LD with the main SNP of interest in the ADCY9 gene?

Figure 2:

Figure 2c, also show the LIMAA cohort as suggested for figure 1?

Figure 4: Could you repeat panel b with the GTEx data at all in the well-powered tissues?

Figure 5:

Points in panel a are a little small, perhaps increasing the point size as in panel b/c?

We thank the reviewer for these suggestions, we integrated them in the revised version and describe the changes to figures in response to essential comment E.7 above. About the question on iHS and recombination rates, we have the following sentence in Appendix 1:

“Other SNPs in ADCY9 are found to have absolute iHS values higher than 2, especially in the long intron 1 and around the last exon, but characterizing these signals is beyond the scope of this study.”

https://doi.org/10.7554/eLife.69198.sa2

Article and author information

Author details

  1. Isabel Gamache

    1. Université de Montréal, Montréal, Canada
    2. Montreal Heart Institute, Montréal, Canada
    Contribution
    Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Software, Validation, Writing – original draft, Writing – review and editing
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-0613-0979
  2. Marc-André Legault

    1. Université de Montréal, Montréal, Canada
    2. Montreal Heart Institute, Montréal, Canada
    3. Université de Montréal Beaulieu-Saucier Pharmacogenomics Centre, Montréal, Canada
    Contribution
    Data curation, Formal analysis
    Competing interests
    No competing interests declared
  3. Jean-Christophe Grenier

    Montreal Heart Institute, Montréal, Canada
    Contribution
    Data curation, Methodology, Resources, Software, Writing – original draft
    Competing interests
    No competing interests declared
  4. Rocio Sanchez

    Montreal Heart Institute, Montréal, Canada
    Contribution
    Formal analysis, Investigation, Resources, Validation
    Competing interests
    No competing interests declared
  5. Eric Rhéaume

    1. Université de Montréal, Montréal, Canada
    2. Montreal Heart Institute, Montréal, Canada
    Contribution
    Investigation, Project administration, Resources
    Competing interests
    No competing interests declared
  6. Samira Asgari

    1. Center for Data Sciences, Brigham and Women’s Hospital, Harvard Medical School, Boston, United States
    2. Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, United States
    Contribution
    Data curation, Formal analysis
    Competing interests
    No competing interests declared
  7. Amina Barhdadi

    1. Montreal Heart Institute, Montréal, Canada
    2. Université de Montréal Beaulieu-Saucier Pharmacogenomics Centre, Montréal, Canada
    Contribution
    Data curation, Formal analysis
    Competing interests
    No competing interests declared
  8. Yassamin Feroz Zada

    Université de Montréal Beaulieu-Saucier Pharmacogenomics Centre, Montréal, Canada
    Contribution
    Data curation, Formal analysis
    Competing interests
    No competing interests declared
  9. Holly Trochet

    1. Université de Montréal, Montréal, Canada
    2. Montreal Heart Institute, Montréal, Canada
    Contribution
    Software
    Competing interests
    No competing interests declared
  10. Yang Luo

    1. Center for Data Sciences, Brigham and Women’s Hospital, Harvard Medical School, Boston, United States
    2. Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, United States
    Contribution
    Resources
    Competing interests
    No competing interests declared
  11. Leonid Lecca

    1. Socios En Salud, Lima, Peru
    2. Harvard Medical School, Boston, United States
    Contribution
    Resources
    Competing interests
    No competing interests declared
  12. Megan Murray

    Center for Data Sciences, Brigham and Women’s Hospital, Harvard Medical School, Boston, United States
    Contribution
    Resources
    Competing interests
    No competing interests declared
  13. Soumya Raychaudhuri

    1. Center for Data Sciences, Brigham and Women’s Hospital, Harvard Medical School, Boston, United States
    2. Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, United States
    3. Centre for Genetics and Genomics Versus Arthritis, Manchester Academic Health Science Centre, University of Manchester, Manchester, United Kingdom
    4. Department of Biomedical Informatics, Harvard Medical School, Boston, United States
    5. Department of Medicine, Brigham and Women’s Hospital and Harvard Medical School, Boston, United States
    Contribution
    Resources
    Competing interests
    No competing interests declared
  14. Jean-Claude Tardif

    1. Université de Montréal, Montréal, Canada
    2. Montreal Heart Institute, Montréal, Canada
    Contribution
    Conceptualization, Funding acquisition, Resources
    Competing interests
    reports grants from Government of Quebec, National Heart, Lung, and Blood Institute of the U.S. National Institutes of Health (NIH), the MHI Foundation, from Bill and Melinda Gates Foundation, Amarin, Esperion, Ionis, Servier, RegenXBio; personal fees from Astra Zeneca, Sanofi, Servier; and personal fees and minor equity interest from Dalcor. Has a patent (US20190070178A1) Methods for Treating or Preventing Cardiovascular Disorders and Lowering Risk of Cardiovascular Events issued to Dalcor, no royalties received, a patent (US20170233812A1) Genetic Markers for Predicting Responsiveness to Therapy with HDL-Raising or HDL Mimicking Agent issued to Dalcor, no royalties received, and a patent (US Provisional Applications No. 62/935,751 and 62/935,865) Methods for using low dose colchicine after myocardial infarction with royalties paid to Invention assigned to the Montreal Heart Institute
  15. Marie-Pierre Dubé

    1. Université de Montréal, Montréal, Canada
    2. Montreal Heart Institute, Montréal, Canada
    3. Université de Montréal Beaulieu-Saucier Pharmacogenomics Centre, Montréal, Canada
    Contribution
    Conceptualization, Funding acquisition, Project administration, Resources, Supervision
    Competing interests
    has a patent (US20190070178A1) Methods for Treating or Preventing Cardiovascular Disorders and Lowering Risk of Cardiovascular Events issued to Dalcor, no royalties received, a patent (US20170233812A1) Genetic Markers for Predicting Responsiveness to Therapy with HDL-Raising or HDL Mimicking Agent issued to Dalcor, no royalties received, and a patent (US Provisional Applications No. 62/935,751 and 62/935,865) Methods for using low dose colchicine after myocardial infarction with royalties paid to Invention assigned to the Montreal Heart Institute. M.P.D. reports personal fees and other from Dalcor and personal fees from GlaxoSmithKline, other from AstraZeneca, Pfizer, Servier, Sanofi.
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0001-8442-4393
  16. Julie Hussin

    1. Université de Montréal, Montréal, Canada
    2. Montreal Heart Institute, Montréal, Canada
    Contribution
    Conceptualization, Funding acquisition, Methodology, Project administration, Resources, Supervision, Writing – original draft, Writing – review and editing
    For correspondence
    julie.hussin@umontreal.ca
    Competing interests
    has received speaker honoraria from Dalcor and District 3 Innovation Centre
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0003-4295-3339

Funding

Institut de Cardiologie de Montréal

  • Isabel Gamache
  • Marc-André Legault
  • Jean-Christophe Grenier
  • Rocio Sanchez
  • Eric Rhéaume
  • Holly Trochet
  • Jean-Claude Tardif
  • Marie-Pierre Dubé
  • Julie Hussin

Université de Montréal

  • Isabel Gamache

Canadian Institutes of Health Research

  • Marc-André Legault

Canada Research Chairs

  • Jean-Claude Tardif
  • Marie-Pierre Dubé

Fonds de Recherche du Québec - Santé

  • Julie Hussin

Institut de Valorisation des Données IVADO

  • Isabel Gamache
  • Jean-Christophe Grenier
  • Julie Hussin

Ministère de l'Économie, de la Science et de l'Innovation - Québec (Health collaboration acceleration fund)

  • Jean-Claude Tardif
  • Marie-Pierre Dubé

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

We thank all members of the Hussin lab for their constructive comments and feedback throughout this project, as well as the insightful input from three reviewers and reviewing and senior editors at eLife. This work was completed thanks to computational resources provided by Compute Canada clusters Graham and Beluga. This work was funded by the Institut de Valorisation des Données (IVADO), Health Collaboration Acceleration Fund from the Ministère de l’Économie et de l’Innovation of the Government of Quebec and the Montreal Heart Institute (MHI) Foundation. JGH is a Fonds de la Recherche en Santé (FRQS) Junior one fellow. IG receives a PhD scholarship from the MHI Foundation and MAL holds a PhD scholarship from Canadian Institutes of Health Research. MPD holds the Canada Research Chair in Precision Medicine Data Analysis. JCT holds the Canada Research Chair in Personalized Medicine and the Université de Montréal endowed research chair in atherosclerosis.

Ethics

Human subjects: The analyses done in this study were approved by the different cohorts used. Participants in these cohorts gave their general consent for their data to be used for research purposes. All individual-level data was anonymized and no efforts were made by the authors to deanonymize or recontact any of the participants from the cohorts, in keeping with our agreements with the UK Biobank, CARTaGENE, dbGAP, 1000Genomes and the Native American Genetics dataset (Universidad de Antioquia).

Senior Editor

  1. George H Perry, Pennsylvania State University, United States

Reviewing Editor

  1. Ellen Leffler, University of Utah, United States

Reviewer

  1. Eduardo Tarazona-Santos, Universidade Federal de Minas Gerais, Brazil

Publication history

  1. Received: April 7, 2021
  2. Preprint posted: May 14, 2021 (view preprint)
  3. Accepted: October 4, 2021
  4. Accepted Manuscript published: October 5, 2021 (version 1)
  5. Version of Record published: November 16, 2021 (version 2)

Copyright

© 2021, Gamache et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Metrics

  • 626
    Page views
  • 70
    Downloads
  • 1
    Citations

Article citation count generated by polling the highest count across the following sources: Crossref, PubMed Central, Scopus.

Download links

A two-part list of links to download the article, or parts of the article, in various formats.

Downloads (link to download the article as PDF)

Open citations (links to open the citations from this article in various online reference manager services)

Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)

  1. Isabel Gamache
  2. Marc-André Legault
  3. Jean-Christophe Grenier
  4. Rocio Sanchez
  5. Eric Rhéaume
  6. Samira Asgari
  7. Amina Barhdadi
  8. Yassamin Feroz Zada
  9. Holly Trochet
  10. Yang Luo
  11. Leonid Lecca
  12. Megan Murray
  13. Soumya Raychaudhuri
  14. Jean-Claude Tardif
  15. Marie-Pierre Dubé
  16. Julie Hussin
(2021)
A sex-specific evolutionary interaction between ADCY9 and CETP
eLife 10:e69198.
https://doi.org/10.7554/eLife.69198
  1. Further reading

Further reading

    1. Computational and Systems Biology
    2. Neuroscience
    Maryam H Mofrad et al.
    Tools and Resources

    Sleep is generally considered to be a state of large-scale synchrony across thalamus and neocortex; however, recent work has challenged this idea by reporting isolated sleep rhythms such as slow oscillations and spindles. What is the spatial scale of sleep rhythms? To answer this question, we adapted deep learning algorithms initially developed for detecting earthquakes and gravitational waves in high-noise settings for analysis of neural recordings in sleep. We then studied sleep spindles in non-human primate electrocorticography (ECoG), human electroencephalogram (EEG), and clinical intracranial electroencephalogram (iEEG) recordings in the human. Within each recording type, we find widespread spindles occur much more frequently than previously reported. We then analyzed the spatiotemporal patterns of these large-scale, multi-area spindles and, in the EEG recordings, how spindle patterns change following a visual memory task. Our results reveal a potential role for widespread, multi-area spindles in consolidation of memories in networks widely distributed across primate cortex.

    1. Computational and Systems Biology
    2. Stem Cells and Regenerative Medicine
    Genki N Kanda et al.
    Research Article

    Induced differentiation is one of the most experience- and skill-dependent experimental processes in regenerative medicine, and establishing optimal conditions often takes years. We developed a robotic AI system with a batch Bayesian optimization algorithm that autonomously induces the differentiation of induced pluripotent stem cell-derived retinal pigment epithelial (iPSC-RPE) cells. From 200 million possible parameter combinations, the system performed cell culture in 143 different conditions in 111 days, resulting in 88% better iPSC-RPE production than that obtained by the pre-optimized culture in terms of the pigmentation scores. Our work demonstrates that the use of autonomous robotic AI systems drastically accelerates systematic and unbiased exploration of experimental search space, suggesting immense use in medicine and research.