A variant-centric perspective on geographic patterns of human allele frequency variation

Abstract

A key challenge in human genetics is to understand the geographic distribution of human genetic variation. Often genetic variation is described by showing relationships among populations or individuals, drawing inferences over many variants. Here, we introduce an alternative representation of genetic variation that reveals the relative abundance of different allele frequency patterns. This approach allows viewers to easily see several features of human genetic structure: (1) most variants are rare and geographically localized, (2) variants that are common in a single geographic region are more likely to be shared across the globe than to be private to that region, and (3) where two individuals differ, it is most often due to variants that are found globally, regardless of whether the individuals are from the same region or different regions. Our variant-centric visualization clarifies the geographic patterns of human variation and can help address misconceptions about genetic differentiation among populations.

Introduction

Understanding human genetic variation, including its origins and its consequences, is one of the long-standing challenges of human biology. A first step is to learn the fundamental aspects of how human genomes vary within and between populations. For example, how often do variants have an allele at high frequency in one narrow region of the world that is absent everywhere else? For answering many applied questions, we need to know how many variants show any particular geographic pattern in their allele frequencies.

In order to answer such questions, one needs to measure the frequencies of many alleles around the world without the ascertainment biases that affect genotyping arrays and other probe-based technologies (International HapMap Consortium, 2005; Li et al., 2008). Recent whole-genome sequencing studies (Auton et al., 2015; Mallick et al., 2016; Bergström et al., 2019; Fairley et al., 2020) provide these data, and thus present an opportunity for new perspectives on human variation.

However, large genetic data sets present a visualization challenge: how does one show the allele frequency patterns of millions of variants? Plotting a joint site frequency spectrum (SFS) is one approach that efficiently summarizes allele frequencies and can be carried out for data from two or three populations (Gutenkunst et al., 2009). For more than three populations, one must resort to showing multiple combinations of two or three-population SFSs. This representation becomes unwieldy to interpret for more than three populations and cannot represent information about the joint distribution of allele frequencies across all populations. Thus, we need visualizations that intuitively summarize allele frequency variation across several populations.

New visualization techniques also have the potential to improve population genetics education and research. Many commonly used analysis methods, such as principal components analysis (PCA) or admixture analysis, do a poor job of conveying absolute levels of differentiation (McVean, 2009; Lawson et al., 2018). Observing the genetic clustering of individuals into groups can give a misleading impression of ‘deep’ differentiation between populations, even when the signal comes from subtle allele frequency deviations at a large number of loci (Patterson et al., 2006; McVean, 2009; Novembre and Peter, 2016). Related misconceptions can arise from observing how direct-to-consumer genetic ancestry tests apportion ancestry to broad continental regions. One may mistakenly surmise from the output of these methods that most human alleles must be sharply divided among regional groups, such that each allele is common in one continental region and absent in all others. Similarly, one might mistakenly conclude that two humans from different regions of the world differ mainly due to alleles that are restricted to each region. Such misconceptions can impact researchers and the broader public alike. All these misconceptions potentially can be avoided with visualizations of population genetic data that make typical allele frequency patterns more transparent.

Here, we develop a new representation of population genetic data and apply it to the New York Genome Center deep coverage sequencing data of the 1000 Genomes Project (1KGP) samples (Auton et al., 2015). In essence, our approach represents a multi-population joint SFS with coarsely binned allele frequencies. It trades precision in frequency for the ability to show several populations on the same plot. Overall, we aimed to create a visualization that is easily understandable and useful for pedagogy. As we will show, the visualizations reveal with relative ease many known important features of human genetic variation and evolutionary history.

This work follows in the spirit of Rosenberg (2011), who used an earlier dataset of microsatellite variation to create an approachable demonstration of major features in the geographic distribution of human genetic variation (as well as earlier related papers such as Lewontin, 1972; Mountain and Ramakrishnan, 2005; Witherspoon et al., 2007). Our results complement several recent analyses of single-nucleotide variants (SNVs) in whole-genome sequencing data from humans (Auton et al., 2015; Mallick et al., 2016; Bergström et al., 2019). We label the approach taken here a variant-centric view of human genetic variation, in contrast to representations that focus on individuals or populations and their relative levels of similarity.

Materials and methods

To introduce the approach, we begin with considering 100 randomly chosen SNVs sampled from Chromosome 22 of the 1KGP high coverage data (Box 1, Fairley et al., 2020). Figure 1 shows the allele frequency of each variant (rows) in each of the 26 populations of the 1KGP (columns, see Supplementary file 1 for labels). As a convention throughout this paper, we use darker shades of blue to represent higher allele frequency, and we keep track of the globally minor allele, that is, the rarer (<50% frequency) allele within the full sample. The figure shows that variants seem to fall into a few major descriptive categories: variants with alleles that are localized to single populations and rare within them, and variants with alleles that are found across all 26 populations and are common within them.

Figure 1

Allele frequencies at 100 randomly chosen variants from Chromosome 22.

Frequencies of the globally minor allele are shown across 26 populations (columns) from the 1KGP for 100 randomly chosen variants from Chromosome 22. Note that the allele frequency bin spacing is nonlinear to capture variation at low as well as high frequencies. Populations are ordered by broad geographic region (horizontal labels, see Figure 2A for legend). Definitions of abbreviations for the 26 1KGP populations are given in Supplementary file 1.

To investigate whether such patterns hold genome-wide, we devise a scheme that allows us to represent the >90 million SNVs in the genome-wide data (Figure 2). First, we follow the 1KGP study in grouping the samples from the 26 populations into five geographical ancestry groups: African (AFR), European (EUR), South Asian (SAS), East Asian (EAS), and Admixed American (AMR) (Figure 2A, Box 1). For clarity, we modify the original 1KGP groupings slightly for this project (by including several samples from the Americas in the AMR grouping, see Box 1). While human population structure can be dissected at much finer scales than these groups (e.g. Leslie et al., 2015; Novembre and Peter, 2016), the regional groupings we use are a practical and instructive starting point—as we will show, several key features of human evolutionary history become apparent, and many misconceptions about human differentiation can be addressed efficiently with this coarse approach (see Discussion). As any such groupings are necessarily arbitrary, we also show results without using regional groupings to calculate frequencies (see section ‘Finer-scale resolution of variant distributions’ below).

Figure 2

A simple coding system to represent geographic distributions of variants.

(A) Regional groupings of the 26 populations in the 1KGP Project. (B) Legend for minor allele frequency bins. (C) Two examples of how a verbal description of an allele frequency map can be communicated equivalently with a five-letter code (yellow signifies the major allele frequency, blue signifies the minor allele frequency in the pie charts).

To represent the geographic distributions of alleles compactly, we give every variant a five-letter code according to its allele frequencies across regions (Figure 2A). More precisely, for each bi-allelic SNV, we identify the global rarer (minor) allele. Then for each region, we code the allele’s frequency as ‘u’, ‘R’, or ‘C’, based on whether the allele is ‘(u)ndetected,’ ‘(R)are,’ or ‘(C)ommon’ (Figure 2B). To distinguish between ‘rare’ and ‘common’ alleles, we used a threshold of 5% frequency. Finally, we concatenate the allele’s regional frequency codes in the fixed (and arbitrary) order: AFR, EUR, SAS, EAS, and AMR. This procedure generates a ‘geographic distribution code’ for each variant. For example, the code ‘CCCCC’ represents a variant that is common across every region, while ‘uuRuu’ represents a variant that is rare in South Asia and unobserved elsewhere (Figure 2C). To display the relative abundance of codes within a set of variants, we use a vertical stack from the most abundant code at the bottom to the least abundant at the top, with the height of each code proportional to its abundance, so that the cumulative proportions of the rank-ordered codes are easily readable (Figure 3).
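The coding procedure above can be sketched in a few lines of Python. This is a minimal illustration, not the GeoVar package itself: the function names, the input layout (per-region allele counts and chromosome counts), and the choice to treat a frequency of exactly 5% as rare are all assumptions made for the example.

```python
from collections import Counter

REGIONS = ("AFR", "EUR", "SAS", "EAS", "AMR")  # fixed order used in the text
COMMON_THRESHOLD = 0.05  # rare/common cutoff from the text


def geographic_code(alt_counts, total_counts, threshold=COMMON_THRESHOLD):
    """Return the u/R/C code for one bi-allelic SNV.

    alt_counts[i] is the number of copies of one allele in region i and
    total_counts[i] the number of chromosomes sampled there (hypothetical
    input layout, not a GeoVar file format). Assumes total_counts[i] > 0.
    """
    # Track the globally minor allele: flip if the counted allele is the major one.
    if 2 * sum(alt_counts) > sum(total_counts):
        alt_counts = [n - a for a, n in zip(alt_counts, total_counts)]
    letters = []
    for a, n in zip(alt_counts, total_counts):
        if a == 0:
            letters.append("u")  # undetected in this region
        elif a / n > threshold:  # assumption: exactly 5% counts as rare
            letters.append("C")
        else:
            letters.append("R")
    return "".join(letters)


def ranked_code_proportions(codes):
    """Rank codes from most to least abundant, with their proportions."""
    tally = Counter(codes)
    total = sum(tally.values())
    return [(code, count / total) for code, count in tally.most_common()]
```

For example, `geographic_code([0, 0, 3, 0, 0], [1000] * 5)` gives `'uuRuu'`, the code for a variant that is rare in South Asia and unobserved elsewhere. The list returned by `ranked_code_proportions` is in the order needed to stack codes from the bottom (most abundant) to the top (least abundant). Because the function loops over however many groups it is given, the same sketch would also work for more than five groupings.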

Figure 3

A summary of geographic distributions in human SNVs.

(A) We observe variants at ~3.1% of the measurable sites in the reference human genome (GRCh38). A measurable site is one at which it is possible to detect variation with current sequencing technologies (currently approximately 2.9 Gb out of 3.1 Gb in the human genome). (B and C) The relative abundance of different geographic distributions for 1KGP variants, (B) including singletons, and (C) excluding singletons. In panels B and C, the right-hand rectangles show the number and percentage of variants that fall within the corresponding geographic code on the left-hand side; distribution patterns are sorted by their abundance, from bottom to top. See Figure 2 for an explanation of the five-letter ‘u’, ‘R’, ‘C’ codes. The proportion of the genome with variants that have a given geographic distribution code can be calculated from the data above (for example, with the ‘Ruuuu’ code, as 17% × 3.1% = 0.53%). The gray box represents geographic distribution codes whose abundances are too low to display effectively at the given figure resolution.

Box 1.

Dataset descriptions and groupings.

We use bi-allelic single-nucleotide variants from the New York Genome Center high-coverage sequencing of the 1000 Genomes Project (1KGP) Phase 3 samples (Auton et al., 2015) (see key resources table, accessed July 22nd, 2019, only variants with PASS in the VCF variant filter column). Most of the samples are from an ethnic group in an area (e.g. the ‘Yoruba of Ibadan,’ YRI, or the ‘Han Chinese from Beijing,’ CHB), so the sampling necessarily represents a simplification of the diversity present in any locale (e.g. Beijing is home to several ethnic groups beyond the Han Chinese). For each grouping, the 1KGP typically required each individual to have at least three of four grandparents who identified themselves as members of the group being sampled.

The 1KGP further defined five geographical ancestry groups: African (AFR), European (EUR), South Asian (SAS), East Asian (EAS), and Admixed American (AMR). Differing from the 1KGP, we include in the ‘Admixed in the Americas’ (AMR) regional grouping the following populations: ‘Americans of African Ancestry in SW USA’ (ASW), ‘African-Caribbeans in Barbados’ (ACB), and the ‘Utah Residents (CEPH) with Northern and Western European Ancestry’ (CEU). We chose this grouping because it is a more straightforward representation of current human geography. See Supplementary file 1 for a full list of the 26 populations and the grouping into five regions. We note challenges and caveats of these alternate decisions in the Discussion. Also, Figure 6 and Figure 6—figure supplements 1–3 provide a complementary view to Figure 3B, C and Figures 4 and 5, where the analysis is not based on the five groupings, but instead on all 26 populations.

Results

Using the encoding scheme just described, we generated geographic distribution codes for all ~92 million biallelic SNVs in the 1000 Genomes dataset and display their relative proportions (Figure 3). The distribution of codes is heavily concentrated, with 85% of variants falling into just eight codes out of the 242 that are possible (3⁵ − 1: three frequency categories in each of five regional groupings, subtracting the code ‘uuuuu’ as each variant has been observed by definition). Of the top eight codes, the top four codes represent rare variants that are localized in a single region. The fifth most abundant code, ‘RuuuR’, represents rare variants found in Africa and the Admixed Americas (which includes African American individuals, for example). The sixth code is another set of localized rare variants (‘uRuuu’, i.e. variants rare in EUR). The seventh code is ‘CCCCC’ or ‘globally common variants.’ The eighth most abundant category, ‘uRuuR’, represents rare variants found in Europe and the Admixed Americas. Conspicuously infrequent in the distribution are variants that are common in only one region outside of Africa and absent in others (e.g. ‘uCuuu’, ‘uuCuu’, ‘uuuCu’, ‘uuuuC’). Instead, when a variant is found to be common (>5% allele frequency) in one population, the modal pattern (37.3%) is that it is common across the five regions (‘CCCCC’). Further, 63% of variants common in at least one region are also globally widespread, in the sense of being found across all five regions. This number rises to 82% for variants common in at least one region outside of Africa (Figure 3—figure supplements 1 and 2).
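As a quick check on the count of possible codes, and to make the ‘globally widespread’ summary concrete, the following Python sketch enumerates the code space and computes the share of common-somewhere variants that are found in every region. The helper names and the toy tallies in the usage note are hypothetical and are not taken from the 1KGP results.

```python
from itertools import product

LETTERS = "uRC"
N_REGIONS = 5

# Every assignment of u/R/C to five regions, minus the all-'u' code,
# which cannot occur because each variant was observed somewhere.
all_codes = ["".join(p) for p in product(LETTERS, repeat=N_REGIONS)]
possible_codes = [c for c in all_codes if c != "u" * N_REGIONS]


def is_globally_widespread(code):
    """Detected in every region (no 'u' in the code)."""
    return "u" not in code


def is_common_somewhere(code):
    """Common in at least one region."""
    return "C" in code


def share_widespread_among_common(code_counts):
    """Fraction of variants common in >= 1 region that are found in all regions.

    code_counts maps code -> number of variants (hypothetical tallies).
    """
    common = {c: n for c, n in code_counts.items() if is_common_somewhere(c)}
    widespread = sum(n for c, n in common.items() if is_globally_widespread(c))
    return widespread / sum(common.values())
```

Here `len(possible_codes)` is 242, matching 3⁵ − 1. With toy tallies such as `{"CCCCC": 6, "CRRRR": 2, "CuuuR": 2, "Ruuuu": 90}`, the function returns 0.8, because 8 of the 10 variants that are common somewhere carry no ‘u’.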

Singleton variants (alleles found in a single individual) are the most abundant type of variant in human genetic data and are necessarily found in just one geographic region. To focus on the distributions of non-singleton variants, we removed singletons and tallied again the relative abundance of patterns (Figure 3C). Removing singletons reduces the absolute number of variants observed by 51.7% (from 91,784,637 to 44,290,364). Without singletons, we see more clearly the abundance of patterns that have rare variants shared between two or more regions (codes with two ‘R’s and three ‘u’s, such as ‘uuRRu’ or ‘RRuuu’).

The scheme for geographic distribution codes requires a few choices. For comparison, we show results using a 1% minor allele frequency threshold to define ‘common’ variants (Figure 3—figure supplement 3A). We also produced results tracking the derived (younger) rather than the globally minor allele (Figure 3—figure supplement 3C; for 96.6% of variants in the dataset with high-quality ancestral allele calls [Box 1], the globally minor allele is the derived allele). Neither changing the frequency threshold to 1% nor tracking the derived allele meaningfully affects the major patterns observed.

The patterns observed here are interpretable in light of some basic principles of population genetics. Rare variants are typically the result of recent mutations (Mathieson and McVean, 2014; Kiezun et al., 2013; Kimura and Ohta, 1973; Albers and McVean, 2020). Thus, we interpret the localized rare variants (such as ‘Ruuuu’ or ‘uuuRu’) as mostly young mutations that have not had time to spread geographically. The code ‘CCCCC’ (globally common variants), likely comprises mostly older variants that arose in Africa and were spread globally during the Out-of-Africa migration and other dispersal events (see Box 2). The appearance of rare variants shared between two or more regions (codes with two ‘R’s and three ‘u’s, such as ‘uuRRu’ or ‘RRuuu’) is likely the signature of recent gene flow between those regions (Box 2; Platt et al., 2019; Mathieson and McVean, 2014; Gutenkunst et al., 2009). In particular, the abundant ‘RuuuR’ and ‘uRuuR’ codes likely represent young variants that are shared between the Admixed Americas and Africa (‘RuuuR’) or Europe (‘uRuuR’) because of the population movements during the last 500 years that began with European colonization of the Americas and the subsequent slave trade from Africa. We interpret the 10th most abundant code (‘CuuuR’, Figure 3B) as mostly variants that were lost in the Out-of-Africa bottleneck and subsequently carried to the Americas by African ancestors. There is a relative absence of variants that are common in only one region outside of Africa and absent across all others (e.g. ‘uCuuu’, ‘uuCuu’, ‘uuuCu’, ‘uuuuC’). These patterns are consistent with human populations having not diverged deeply, in the sense that there has not been sufficient time for genetic drift to greatly shift allele frequencies among them (Box 2). To help make this clear, consider the alternative scenario—a model with very ancient population splits (Coon, 1962). 
In such a model, one would expect many more variants to be common to one region and absent in others (‘Cuuuu’ or ‘uuuCu’ for example, see Box 2). Overall, these results reflect a timescale of divergence consistent with the Recent-African-Origin model of human evolution as well as subsequent gene flow among regions (Cann et al., 1987; Stringer and Andrews, 1988; Thomson et al., 2000; Ramachandran et al., 2005; Pickrell and Reich, 2014).

Box 2.

Theoretical modeling.

We can use theoretical models to estimate what our visualizations would look like for two populations in simple contrasting cases of 'deep' divergence, 'shallow' divergence, and 'shallow' divergence with gene flow. The shallow case is calibrated to be qualitatively consistent with the Recent-African-Origin model with subsequent gene flow. The deep case mimics inaccurate models of human evolution with very ancient population splits (e.g. Coon, 1962). For each case, we computed the expected abundances of distribution codes in a simple model of population divergence: two modern populations of $N$ individuals each that diverged $T$ generations ago from a common population of $N$ individuals (see Appendix 1 for information about this calculation). We model gene flow by including recent admixture: individuals in Population A derive an average fraction $α$ of their ancestry from Population B and vice versa. This simplified model neglects many of the complications of human population history, including population growth, continuous historical migration, and natural selection, but it captures the key features of common origins, divergence, and subsequent contact (see Figure 3—figure supplement 4 to compare with simulation results from more complex published models of human population history).

In this model, the key control parameter is $T / 2 N$ , the population-scaled divergence time. Human pairwise nucleotide diversity (~1 × 10⁻³) and per-base-pair per-generation mutation rate (~1.25 × 10⁻⁸) imply a Wright-Fisher effective population size of $N$ = 2 × 10⁴ individuals. The Out-of-Africa divergence is estimated to have occurred approximately 60,000 years ago (Nielsen et al., 2017). Assuming a 30-year generation time (Fenner, 2005) gives $T / 2 N$ = 0.05. We compare this scenario with $T / 2 N$ = 0.5, corresponding to a deeper divergence of approximately 600,000 years ago.
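The arithmetic behind these parameter values can be reproduced with a short Python sketch. It assumes the standard neutral expectation that pairwise diversity equals 4Nμ for a diploid population, which is not stated explicitly above; the variable and function names are illustrative.

```python
nucleotide_diversity = 1e-3   # pairwise nucleotide diversity (from the text)
mutation_rate = 1.25e-8       # per base pair per generation (from the text)
years_per_generation = 30     # generation time (from the text)

# Assumption: the neutral expectation diversity = 4 * N * mu for diploids.
N = nucleotide_diversity / (4 * mutation_rate)


def scaled_divergence_time(years_ago):
    """Divergence time T / 2N, with T in generations."""
    generations = years_ago / years_per_generation
    return generations / (2 * N)


shallow = scaled_divergence_time(60_000)    # Out-of-Africa scenario
deep = scaled_divergence_time(600_000)      # deep-divergence scenario
```

Under these inputs `N` comes out at 2 × 10⁴, `shallow` at 0.05, and `deep` at 0.5 (up to floating-point rounding), in agreement with the values quoted above.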

Box 2—figure 1A shows the expected patterns in a sample of 100 individuals from each population for deep divergence ( $T / 2 N$ = 0.5), shallow divergence ( $T / 2 N$ = 0.05) without admixture, and shallow divergence with admixture ( $α$ = 0.02). The shallow divergence model with or without admixture reproduces the preponderance of ‘Ru’ and ‘CC’ mutations seen in the data, while the deep divergence model shows many more ‘Cu’ and many fewer ‘CC’ mutations. The case with admixture shows a slight increase in variant sharing (‘RR’ alleles increase from 1.3% of variants to 4.2%; ‘RC’ and ‘CR’ alleles increase from 6% to 10%; ‘CC’ alleles comprise 23% in both cases).

We can understand the relationship between the split time and geographic distribution abundances heuristically as follows. During an interval of $Δ t$ generations, the frequency of a neutral mutation starting at frequency $f$ changes randomly by a typical amount $Δ f \sim \sqrt{\frac{f (1 - f)}{2 N} Δ t}$ . Consider a mutation that is at 25% frequency, that is, common, in the ancestral population at the time of the split (Box 2—figure 1B). At time $Δ t / 2 N$ = 0.05 after the split, the frequency of the mutation is likely to be in the interval (15%, 35%) in both populations and will be assigned the code ‘CC’. On the other hand, by time $Δ t / 2 N$ = 0.5 after the split, the mutation has a significant chance of going extinct in one or both populations (Box 2—figure 1C). Mutations that go extinct in one population but not the other will typically be assigned a code ‘Cu’ or ‘uC’.
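The heuristic above can be evaluated numerically. The sketch below plugs the two scaled times into the drift formula for a mutation starting at 25% frequency; the function name is illustrative, and the formula is only an order-of-magnitude guide, especially at the longer time where the frequency itself changes substantially.

```python
import math


def typical_drift(f, scaled_time):
    """Typical frequency change sqrt(f (1 - f) * dt / 2N), with scaled_time = dt / 2N."""
    return math.sqrt(f * (1 - f) * scaled_time)


f0 = 0.25  # starting frequency used in the text
for scaled_time in (0.05, 0.5):
    delta = typical_drift(f0, scaled_time)
    # Clip the heuristic range to valid frequencies.
    low, high = max(0.0, f0 - delta), min(1.0, f0 + delta)
    print(f"dt/2N = {scaled_time}: typical change {delta:.3f}, "
          f"range about ({low:.2f}, {high:.2f})")
```

At a scaled time of 0.05 the typical change is about 0.10, which gives the (15%, 35%) interval quoted above. At 0.5 the typical change is about 0.31, larger than the starting frequency itself, which is why loss of the mutation in one population becomes plausible.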

At the same time, new mutations are constantly entering the evolving populations. These new mutations will be private to one population (‘Ru’ or ‘Cu’) and the overwhelming majority will go extinct before reaching detectable frequencies. Conditional on non-extinction, the expected frequency of a neutral mutation increases linearly with time (see Appendix 2). As a result, the frequencies of new mutations since the split time $Δ t$ will mostly be contained in a triangular envelope $f < Δ t / 2 N$ (Box 2—figure 1B). For recent divergence, the new mutations will be assigned code ‘Ru’ or ‘uR’, while in deeply diverged populations they may be categorized as ‘Cu’ or ‘uC’.
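To build intuition for these trajectories, one can simulate neutral drift directly. The following is a minimal Wright-Fisher sketch using only the Python standard library. It is not the calculation described in the appendices, and the small population size in the usage note is an assumption chosen to keep the run time short.

```python
import random


def wright_fisher(f0, two_n, generations, rng):
    """Neutral Wright-Fisher trajectory of an allele frequency.

    two_n is the number of chromosomes; each generation draws two_n
    copies independently, each carrying the allele with probability
    equal to the current frequency.
    """
    freqs = [f0]
    f = f0
    for _ in range(generations):
        copies = sum(1 for _ in range(two_n) if rng.random() < f)
        f = copies / two_n
        freqs.append(f)
    return freqs


def extinction_fraction(f0, two_n, scaled_time, replicates, seed=0):
    """Share of replicate trajectories lost by scaled_time (units of 2N generations)."""
    rng = random.Random(seed)
    generations = round(scaled_time * two_n)
    lost = sum(
        wright_fisher(f0, two_n, generations, rng)[-1] == 0.0
        for _ in range(replicates)
    )
    return lost / replicates
```

For instance, `extinction_fraction(0.25, 200, 0.5, 200)` estimates how often a mutation starting at 25% frequency is lost by a scaled time of 0.5, and the same call with `0.05` does so for the shallow case. A new mutation can be simulated by starting at `f0 = 1 / two_n`.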

Box 2—figure 1

Allele frequency patterns depend on the time since population divergence and levels of admixture.

(A) Expected geographic distribution code abundances in a sample of 100 diploid individuals from each of two populations, for deep divergence ( $T / 2 N$ = 0.5, $α$ = 0), recent divergence without admixture ( $T / 2 N$ = 0.05, $α$ = 0), and recent divergence with admixture ( $T / 2 N$ = 0.05, $α$ = 0.02). (B) Simulated allele frequency time series for mutations starting at 25% frequency (blue) and new mutations entering the population since the split (orange). (C) The probability of extinction of a mutation starting at 25% frequency (see Appendix 2).

The variants that differ between a pair of individuals

While Figure 3 illustrates genetic variants found in a large, global sampling of human diversity, it does not show what to expect for the variants that differ between pairs of individuals. Are the variants that differ between two individuals more often geographically widespread or spatially localized?

To address this question, we considered the variants carried by pairs of individuals from the whole-genome sequencing data of the Simons Genome Diversity Project (SGDP) (Mallick et al., 2016; Figure 4). The SGDP sampled 300 individuals from 142 diverse populations. We use the SGDP data to avoid ascertainment biases that might arise from looking at individuals within the same dataset we use to measure allele frequencies. Figure 4 shows a representative subset with six pairs chosen from three populations (Figure 4—figure supplement 1 shows a larger set of examples). For each pair, we see some variants that were undiscovered in the 1KGP data (denoted $S_{u}$ in the figure). These account for 17–20% of each set of pairwise SNVs and are likely rare variants. We see that the variants that differ between each pair of individuals are typically globally widespread (i.e. codes with no ‘u’s, with proportions out of the total S varying from 54% to 76% for the pairs in Figure 4). The observation of mostly globally common variants in pairwise comparisons may seem counterintuitive considering the abundance of rare, localized variants overall. However, precisely because rare variants are rare, they are not often carried by either individual in a pair. Instead, pairs of individuals mostly differ because one of them carries a common variant that the other does not; and as Figure 3 already showed, common variants in any single location are often common throughout the world (also see Figure 6 and Figure 3—figure supplement 3).

Figure 4

The geographic distributions of SNVs between pairs of individuals.

(A) Definition of a pairwise SNV. (B) The abundance of geographic distribution codes for different pairs of individuals from the SGDP dataset. Above each plot, we show the total number of variants that differ between the two individuals (S) and the number that were completely unobserved in the 1KGP data ($S_{u}$). Across the bottom, we show the proportion of variants with globally widespread alleles for each pair. We calculate this as the fraction of variants with no ‘u’ encodings over the total number of variants (S). (Note: by doing so, we make the assumption that if a variant is not found in the 1KGP data it is not globally widespread.) For this analysis, as in Mallick et al., 2016, we include only autosomal biallelic SNVs for variants that pass ‘filter level 1’.
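The proportion described in this caption can be written as a small Python function. This is an illustrative sketch with hypothetical names and toy counts; it encodes the stated assumption that variants absent from the 1KGP data count toward S but never toward the globally widespread numerator.

```python
def proportion_globally_widespread(code_counts, n_unobserved):
    """Fraction of pairwise SNVs whose geographic code has no 'u'.

    code_counts maps each code to the number of pairwise SNVs with that
    code; n_unobserved is the number of pairwise SNVs absent from the
    reference panel. Unobserved variants are included in the total S
    but are treated as not globally widespread.
    """
    s_total = sum(code_counts.values()) + n_unobserved
    widespread = sum(n for code, n in code_counts.items() if "u" not in code)
    return widespread / s_total
```

With toy counts `{"CCCCC": 60, "RRRRR": 5, "CuuuR": 15}` and 20 unobserved variants, the total is 100 and the function returns 0.65.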

From the example pairwise comparisons (Figure 4, and Figure 4—figure supplement 1), one also observes evidence for higher diversity in Africa, which is typically interpreted in terms of founder effects reducing diversity outside of Africa (Cann et al., 1987; Harpending and Eller, 2000; Harpending and Rogers, 2000; Ramachandran et al., 2005; Prugnolle et al., 2005), although other models, especially ones including substantial subsequent admixture, can also produce this pattern (DeGiorgio et al., 2009; Pickrell and Reich, 2014). For example, the two Yoruba individuals have more pairwise SNVs (S = 4,897,091) than the French/French (S = 3,525,519) and Han/Han (S = 3,358,497) pairs. Pairs involving one or both of the sample Yoruba individuals have more variants with alleles common in Africa and rare or absent elsewhere (e.g. ‘CuuuR’, ‘RuuuR’). Finally, a more subtle, but expected, impact of founder effects is that the sample Yoruba/Yoruba comparison should have a higher number of pairwise variants than the sample Yoruba/Han or Yoruba/French comparison, which is what we observe.

The geographic distributions of variants typed on genotyping arrays

Targeted genotyping arrays are a cost-effective alternative to whole-genome sequencing. In contrast to whole-genome sequencing, genotyping arrays use targeted probes to measure an individual’s genotype only at preselected variant sites. The process of discovering and selecting these target sites typically enriches the probe sets toward common variants (Clark et al., 2005), underrepresents geographically localized variants (Albrechtsen et al., 2010; Lachance and Tishkoff, 2013), and can affect genotype imputation and genetic risk prediction (Howie et al., 2012; Martin et al., 2017).

Figure 5 shows the geographic distributions of bi-allelic SNVs included on five popular array products. In stark contrast with the SNVs identified by whole-genome sequencing (Figure 3B), a large fraction of the variants on genotyping arrays are globally common. This is especially true for the Affy6, Human Origins, and OmniExpress arrays, which were designed using polymorphisms ascertained from a smaller number of sequenced individuals, and primarily capture more common variants due to this ascertainment. The Omni2.5Exome and MEGA arrays in contrast exhibit many more rare variants. In both these arrays, the second and third most abundant codes are ‘CuuuR’ and ‘RuuuR’ variants. The MEGA array was uniquely designed to capture rare variation in undersampled continental groups, including African ancestries (Bien et al., 2016; Bien et al., 2019). Wojcik et al., 2019 found that this design improved African and African American imputation accuracy, leading to greater power to map population-specific disease risk.

Figure 5

Geographic distribution for variants found on genotyping array products.

(A) Genotyping arrays consist of probes for a fixed set of variants chosen during the design of the array product. (B) For each array product, we extracted the genomic position of variants found on the array and kept variants that are also found within the 1KGP to highlight their geographic distributions. The arrays considered are the Affymetrix 6.0 (Affy6) genotyping array, the Affymetrix Human Origins array (HumanOrigins), the Illumina HumanOmniExpress (OmniExpress) array, the Illumina Omni2.5Exome, and the Illumina MEGA array. See Figure 2 for an explanation of the ‘u’, ‘R’, ‘C’ codes.

Finer-scale resolution of variant distributions

While the use of five regional groupings above allows us to describe variant distributions compactly with a five-digit encoding, the basic principle of grouping allele frequencies can be extended to build a 26-digit encoding for the 1KGP variants (Figure 6, Figure 6—figure supplements 1–3). Doing so with the set of ~92 million variants found in the 1KGP project (Figure 6), we find a consistent pattern with Figure 3B, in that the majority of variants are seen to be rare and geographically localized (1 ‘R’, and the remainder ‘u’s), and when a variant is common in any one population, it is typically common across the full set of populations (Figure 6, pattern with all ‘C’s). This view reveals that the five-digit encodings with 1 ‘R’ and 4 ‘u’s are often due to variants that are rare even within a single population. This is not unexpected given many of them are singletons. When we remove singletons (Figure 6—figure supplement 1B), we again see more clearly rare allele sharing indicative of recent gene flow, although at finer-scale resolution.

Figure 6

A finer-scale summary of geographic distributions in human SNVs from the 1KGP.

This plot is analogous to Figure 3B, but rather than calculating frequencies with the five regional groupings, we compute them within each of the 26 1KGP populations. The total number of variants represented is the same as in Figure 3B (S = 91,784,367). See Figure 2 for an explanation of the ‘u’, ‘R’, ‘C’ codes.

Discussion

By encoding the geographic distributions of the ~92 million biallelic SNVs in the 1KGP data and tallying their abundances, we have provided a new visualization of human genetic diversity. We term our figures ‘GeoVar’ plots as they help reveal the geographic distribution of sets of variants. GeoVar plots can complement other methods of visualizing population structure, including: plots of pairwise genetic distance, dimensionality-reduction approaches such as PCA, admixture proportion estimates such as STRUCTURE, and explicitly spatial methods that use the sampling locations of individuals (Guillot et al., 2009; Novembre and Peter, 2016; Bradburd and Ralph, 2019). These previously developed methods help reveal population structure, infer genetic ancestry, and measure historical migration patterns. However, they do a poor job of showing how alleles are distributed geographically. To minimize confusion about levels of differentiation among populations, researchers and educators can consider complementing PCA or STRUCTURE-like outputs with a variant-centric visualization like the ones presented here. To that end, we provide source code to replicate our figures and to generate similar plots for other datasets (the ‘GeoVar’ software package; see key resources table).

A goal of our work was to build a visualization that can help correct common misconceptions about human genetic variation. First, because many existing methods to describe population structure emphasize between-group or between-individual differentiation, they can convey a misleading impression of ‘deep’ divergence between populations when it may not exist. Comparing Figure 1 to outputs of models with ‘deep’ or ‘shallow’ divergence can help teach how patterns of human variation are consistent with shallow divergence and the Recent African Origins model (Box 2). Second, because personal ancestry tests can identify ancestry to broad continental regions, it is possible to incorrectly conclude that human alleles are typically found exclusively in a single region and at high frequency within that region (e.g. patterns such as ‘uuCuu’). As our figures show, this is not the case. It should be kept in mind that most fine-scale personal ancestry tests use genotyping arrays and combine evidence from subtle fluctuations in the allele frequencies of many common variants (Novembre and Peter, 2016). Finally, a related misconception is that two humans from different regions of the world differ mainly due to alleles that are typical of each region. As we show in Figure 4, most of the variants that differ between two individuals are variants with alleles that are globally widespread. (Our awareness of these misconceptions comes from personal experiences in teaching and outreach. However, there is a growing body of formal research on misconceptions regarding human genetic variation, e.g., Bowling et al., 2008; Phelan et al., 2014; Hubbard, 2017; Roth et al., 2020.)

Our method requires computing allele frequencies within predefined groupings. Grouping and labeling strategies vary between genetic studies and are determined by the goals and constraints of a particular study (Race, Ethnicity, and Genetics Working Group, 2005; Panofsky and Bliss, 2017; Mathieson and Scally, 2020). While we chose deliberately coarse grouping schemes to address the misconceptions described above, the key facts we derive about human genetic variation are robust and appear in finer-grained 26-population versions of the plot (Figure 6). We recommend that any application of the GeoVar approach be interpreted with the choice of groupings in mind.

The visualization method developed here is also useful for comparing the geographic distributions of different subsets of variants (e.g. Figure 4, Figure 5). For example, when applied to the list of variants targeted by a genotyping array (Figure 5), the approach quickly reveals the relative balance of common versus rare variants and the geographical patterns of those variants.

Interpreting the results of this visualization approach does have some caveats. First, we estimate the frequency of alleles from samples of local populations. We expect that as sample sizes increase many alleles called as unobserved ‘u’ will be reclassified as rare ‘R’. The average sample size across all of our geographic regions is approximately 500 individuals (AFR: 504, EUR: 404, SAS: 489, EAS: 504, AMR: 603). Assuming regions are internally well-mixed, we have ~80% power to detect alleles with a frequency of ~0.2% in a region (Figure 2—figure supplement 1). For alleles with lower frequencies, we would require larger sample sizes to ensure similar detection power (Figure 2—figure supplement 1). An implication is that in large samples, we should observe more rare variant sharing. Thus, we expect the figures here to underrepresent the levels of rare variant sharing between human populations. In general, one must keep in mind that the GeoVar plot is a visualization of the joint SFS for the sample, rather than for the complete population.
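The detection-power figure quoted above can be checked approximately with a short calculation. The following is a minimal sketch, not the procedure behind Figure 2—figure supplement 1: it assumes diploid individuals, independent sampling of chromosomes from a well-mixed region, and a hypothetical helper named `detection_power`.

```python
def detection_power(freq, n_individuals, ploidy=2):
    """Probability that an allele at population frequency `freq` is seen
    at least once in a sample (assumes independent chromosome draws)."""
    n_chrom = ploidy * n_individuals
    return 1.0 - (1.0 - freq) ** n_chrom


# Regional sample sizes as listed in the text.
region_sizes = {"AFR": 504, "EUR": 404, "SAS": 489, "EAS": 504, "AMR": 603}

for region, n in region_sizes.items():
    power = detection_power(0.002, n)
    print(f"{region}: n = {n}, power to detect a 0.2% allele = {power:.2f}")
```

Under these assumptions, the smallest region (EUR, 404 individuals) gives a power of roughly 80% at a frequency of 0.2%, and the larger regions give somewhat more.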

A second caveat is that our encoding groups a wide range of variants into the ‘(C)ommon’ category (i.e. all variants where the frequency of the globally minor allele is greater than 5%). For some applications, such as population screening for carriers, it may be enough to know that a variant falls in the ‘rare’ or ‘common’ bins we have described, and more detail is inconsequential. For other applications, the detailed fluctuations in allele frequency across populations are relevant—for example, differences in allele frequencies at common variants (Figure 6—figure supplement 4) are regularly used to infer patterns of population structure and relatedness (Li et al., 2008; Pickrell and Pritchard, 2012; Patterson et al., 2012).

Third, one must interpret our results with the sampling design of the 1KGP in mind. In particular, the 1KGP filtered for individuals of a single ethnicity within each locale. However, in our current cosmopolitan world, the genetic diversity in any location or broad-based sampling project will be considerably higher than implied by the geographic groupings above. For example, the UK Biobank, while predominantly of European ancestry, has representation of individuals with ancestry from each of the five regions used here (Bycroft et al., 2018). The 1KGP also sampled South Asian ancestry from multiple locations outside of South Asia, and whether those individuals show excess allele sharing due to recent admixture in those contexts is unclear. While we expect overall similar patterns to those seen here using emerging alternative datasets (Bergström et al., 2019), there may be subtle differences due to sampling and study design considerations.

Prior representations of human genetic variation data similar to the one presented here can be found in Zietkiewicz et al., 1998, who showed patterns of absence/presence/fixation at seven sites in the dys44 locus using a gray-scale, in a manner similar to Figure 1 here. Other previous examples depict the proportion of variants with different geographic distributions resolved at the level of presence/absence (e.g. Rosenberg et al., 2002, Supp Figure 1 [pie chart]; Szpiech et al., 2008, Table 1, [circular bar]; Rosenberg, 2011, Table 2, Figure 4 [pie chart] for microsatellites; and Jakobsson et al., 2008, Figure 1A [Venn diagram] for SNPs, haplotypes and copy number variants). Publications on recent whole-genome sequence data from humans have several related and relevant figures for understanding the geographic distribution of variants (e.g. 1000 Genomes 2012, Figure 2B; Auton et al., 2015, Figures 1A and 3A; Bergström et al., 2019, Figure 3A and Visual Abstract). The GeoVar plots provide a complementary view to these previous figures. Specifically, they provide more fine-grained representation than dichotomizations into private vs. shared variants and assessments of sharing based on presence versus absence. The GeoVar plots also complement plots of doubleton sharing or alternative normalized metrics that lose interpretability in terms of absolute allele frequency patterns and the numbers of variants with particular patterns.

The visualizations provided here help reinforce the conclusions of a long history of empirical studies in human genetics (Lewontin, 1972; Ramachandran et al., 2005; Conrad et al., 2006; Li et al., 2008; Auton et al., 2015; Mallick et al., 2016; Bergström et al., 2019). The results show how the human population has an abundance of localized rare variants and broadly shared common variants, with a paucity of private, locally common variants. Together these are footprints of the recent common ancestry of all human groups. As a consequence, human individuals most often differ from one another due to common variants that are found across the globe. Finally, although not examined explicitly above, the large abundance of rare variants observed here is another key feature of human variation and a consequence of recent human population growth (Slatkin and Hudson, 1991; Di Rienzo and Wilson, 1991; Keinan and Clark, 2012; Nelson et al., 2012; Tennessen et al., 2012).

The well-established introgression of archaic hominids (e.g. Neandertals, Denisovans) into modern human populations (Wolf and Akey, 2018) is not apparent in the GeoVar plots we produced. We believe that there are two broad reasons for this: (1) The clearest signal of archaic introgression will come from sites where archaic hominids differed from modern humans, and we expect that these sites are only a very small fraction of variants found in humans today. The average human–Neandertal and human–Denisovan sequence divergence are both less than 0.16% (using observations from Prüfer et al., 2014), and a recent study estimates that there are fewer than 70 Mb (2.3% of the genome) of Neandertal introgressed segments per individual for all individuals in the 1KGP (Chen et al., 2020). (2) We do not expect SNVs from archaic introgression to be concentrated in a single GeoVar category. For example, introgressed variants occupy a wide range of allele frequencies (Bergström et al., 2019). Archaic introgression events are believed to be old (>30,000 years ago), allowing time for substantial genetic drift and admixture among human populations (Chen et al., 2020). Negative selection (Harris and Nielsen, 2016; Juric et al., 2016) and, in some cases, strong positive selection (Racimo et al., 2015) have also shaped the patterns of introgressed SNVs. For these reasons, we expect low levels of archaic introgression not to create a striking visual deviation in our GeoVar plots from the background patterns of a Recent African Origin model with subsequent migration (Box 2). To highlight the contributions of archaic hominids to human variation, more targeted approaches are needed (e.g. Green et al., 2010; Durand et al., 2011). Future work could also naturally extend the approach here to include archaic sequence data.

The geographic distributions of genetic variants visualized here are relevant for a number of applications, including studying geographically varying selection (Yi et al., 2010; Key et al., 2018), human demographic history (Gutenkunst et al., 2009), and the genetics of disease risk. For instance, due to ascertainment bias in arrays (Figure 5) and power considerations, common variants are often found in genome-wide association studies of disease traits (Manolio et al., 2009). The patterns shown above make it clear that most common variants are shared across geographic regions. Indeed, many common variant associations replicate across populations (Marigorta and Navarro, 2013; though see Martin et al., 2017; Mostafavi et al., 2020 for complications). More recently, due to increasing sample sizes and sequencing-based approaches, disease mapping studies are finding more associations with rare variants (Bomba et al., 2017). As our work here emphasizes, rare variants are likely to be geographically restricted, and so one can expect the rare variants found in one population will not be useful for explaining trait variation in other populations, although they may identify relevant biological pathways that are shared across populations.

A future direction for the work here would be to apply our approach to other classes of genetic variants such as insertions, deletions, microsatellites, and structural variants. We note that in studies with sample sizes similar to or smaller than the 1KGP, nearly all SNVs arise from single mutation events. For other variants that arise from single mutation events (e.g. indels that arise from single mutations), we expect similar patterns to those observed for SNVs here. In contrast, for highly mutable loci we expect independently derived alleles will be distributed in disjoint regions of the world due to multiple mutational origins (Ralph and Coop, 2010).

Another future direction would be to shift from visualizing patterns of allele sharing to the patterns of sharing of ancestral lineages in coalescent genealogies. Recent advances in the inference of genome-wide tree sequences (Kelleher et al., 2019; Speidel et al., 2019) and allele ages (Albers and McVean, 2020) allow for quantitative summaries of ancestral lineage sharing. Such quantities have a close relationship to the multi-population SFS properties that are studied here, yet are more fundamental in a sense and less subject to the stochasticity of the mutation process. That said, the conceptual simplicity of visualizing allele frequency patterns may be an advantage in educational settings.

Most importantly, future applications of the approach to humans will ideally use datasets that include a greater sampling of the world’s genetic diversity (Bustamante et al., 2011; Popejoy and Fullerton, 2016; Martin et al., 2017; Peterson et al., 2019). A related point is that the application of our method to genotyping array variants (Figure 5) reinforces the importance of considering the ancestry of study populations in genotype array design and selection (Peterson et al., 2019).

While we have focused here on human diversity at a global scale, GeoVar plots may be a useful tool for population geneticists working at other scales and with other species. The input to the visualization is simple: a table of allele frequencies in a set of populations. In the GeoVar software package, we provide Python code for generating this table from a VCF file and a table of population labels, but the user could generate the input from other data instead. For studying population structure, it is best to use an unbiased estimate of allele frequencies from, for example, whole-genome or reduced-representation sequencing.
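To make the input and encoding concrete, the following minimal sketch turns a small allele frequency table into geographic distribution codes and tallies them. It does not use the GeoVar package: the function names `encode_variant` and `tally_codes` and the toy frequencies are hypothetical, and the sketch assumes that each row holds the frequency of the globally minor allele in each population, with ‘u’ for unobserved, ‘R’ for frequencies up to 5%, and ‘C’ for frequencies above 5%.

```python
from collections import Counter


def encode_variant(freqs, rare_cutoff=0.05):
    """Map per-population allele frequencies to a code string.

    'u' = unobserved (frequency 0), 'R' = rare (0 < f <= rare_cutoff),
    'C' = common (f > rare_cutoff). The cutoff is an assumption of this sketch.
    """
    code = []
    for f in freqs:
        if f == 0:
            code.append("u")
        elif f <= rare_cutoff:
            code.append("R")
        else:
            code.append("C")
    return "".join(code)


def tally_codes(freq_table, rare_cutoff=0.05):
    """Count how many variants fall into each geographic distribution code."""
    return Counter(encode_variant(row, rare_cutoff) for row in freq_table)


# Toy table (made-up values); one row per variant, one column per population.
toy_table = [
    (0.001, 0.0, 0.0, 0.0, 0.0),
    (0.0, 0.0, 0.002, 0.0, 0.0),
    (0.30, 0.25, 0.40, 0.35, 0.28),
    (0.12, 0.02, 0.01, 0.0, 0.03),
    (0.002, 0.0, 0.0, 0.0, 0.0),
]

for code, count in tally_codes(toy_table).most_common():
    print(code, count)
```

Sorting the tallied codes by abundance gives the ordering that a GeoVar-style plot would display; changing `rare_cutoff` shows how the tallies depend on the chosen frequency bins.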

Applied to new data sets, GeoVar may be used for exploratory data analysis, allowing users to see some important features of population structure without fitting explicit models. For example, hierarchical structure (Figure 6, rare variants shared within regional groupings) and recent admixture (Figure 3, rare variants shared between AFR and AMR) show up as distinctive patterns in the plots. Box 2 shows that when the cutoff frequency separating Rare from Common mutations is close to the population split time (measured in units of 2N), an enrichment of ‘RU’ and ‘CC’ codes is expected. For example, in populations that split 0.1 × N generations ago, mutations at local frequencies below 0.1 will tend to be private and those at higher frequencies will tend to be shared. In spatially distributed populations with limited dispersal, we expect that a similar relationship exists between cutoff frequencies, variant sharing patterns, and the geographic distance between populations. In an exploratory setting, users could generate plots with multiple cutoff frequencies to reveal varying levels of structure among populations. GeoVar plots may also serve as an informal goodness-of-fit check for parametric models of population history (as in Figure 3—figure supplement 2). In such exploratory and model-checking applications, attention to sample sizes and their configuration across sampling units is important, as larger sample sizes will allow the detection of more rare variants (e.g. contrast Figure 3—figure supplement 2, panel A and B). For the application to humans shown here, a preliminary approach to account for varying sample size did not substantially change the results (results not shown); that said, developing such an approach more fully or taking rarefaction approaches (Szpiech et al., 2008) may be essential for future applications with more uneven sample sizes.

Overall, the visualizations produced here provide an interpretable way to depict geographic patterns of human genetic variation. With personal genomic technologies and ancestry testing becoming commonplace, there is increasing importance in fostering the understanding of human population genetics. To this end, human genetics researchers must develop interpretable materials on patterns of genetic variation for use in educational and outreach settings (Donovan et al., 2019). The variant-centric approach detailed here complements existing visualizations of population structure, facilitating a clearer understanding of the major patterns of human genetic diversity.

Appendix 1

Theoretical geographic distribution code abundances

The relative abundances of geographic distribution codes derive from human population history (Box 2). Here, we use a simple population genetic model to develop intuition about the relationship between the divergence time of a pair of populations and the expected two-letter code abundances. To isolate the effect of population divergence from other factors such as population growth, we consider the simplest possible model of divergence: two constant-size populations of $N$ individuals descended from a single $N$-individual source population $T$ generations ago (Box 2—figure 1A). We incorporate recent contact between populations via a symmetric admixture coefficient $\alpha$. Individuals in Population 1 derive a fraction $\alpha$ of their ancestry from Population 2 and vice versa. Human population history is much more complex than our model, but the model captures the essential features of common ancestry, subsequent isolation, and modern admixture.

Python source code implementing the calculation and producing Box 2—figure 1 is available in the project’s Git repository (https://github.com/aabiddanda/geovar_rep_paper; Biddanda, 2020b; copy archived at swh:1:rev:db3ca8faeecf8697973f803bc05c5a3d0a187145).

Wright-Fisher diffusion of allele frequencies

In our model, allele frequencies in the two populations are initially identical because both derive from the same source population. After the populations split, allele frequencies evolve independently according to a Wright-Fisher diffusion with symmetric mutations at rate $\theta$ new mutations per population per generation. At time $t = T / 2N$ after the split (measured in units of $2N$ generations), the joint density of mutations at frequency $x_1$ in Population 1 and $x_2$ in Population 2 is given by

$$f(t; x_1, x_2) = \int_0^1 f(0; x_0)\, p(t; x_0, x_1)\, p(t; x_0, x_2)\, dx_0, \tag{1}$$

where $f(0; x_0)$ is the density of mutations at frequency $x_0$ in the source population and $p(t; \cdot, \cdot)$ is the Wright-Fisher transition density function. We assume that the source population was at mutation-drift equilibrium, so that $f(0; x_0) = \pi(x_0) \propto (x_0 (1 - x_0))^{\theta - 1}$, the stationary measure of the Wright-Fisher diffusion.

We use the spectral decomposition of Song and Steinrücken, 2012 to represent the Wright-Fisher transition density as an infinite sum of modified Jacobi polynomials, $B_i(x)$:

$$p(t; x, y) = \sum_{i=0}^{\infty} e^{-\Lambda_i t}\, \pi(y)\, \frac{B_i(x) B_i(y)}{\langle B_i, B_i \rangle}, \tag{2}$$

where the inner product $\langle g, h \rangle$ is given by $\int_0^1 g(x) h(x) \pi(x)\, dx$. The Jacobi polynomials are orthogonal with respect to this inner product. That is, $\langle B_i, B_j \rangle = 0$ for $i \neq j$. Substituting (2) into (1) and using orthogonality, we have:

$$f(t; x_1, x_2) = \pi(x_1) \pi(x_2) \sum_{i=0}^{\infty} e^{-2 \Lambda_i t}\, \frac{B_i(x_1) B_i(x_2)}{\langle B_i, B_i \rangle}.$$
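The intermediate step in this substitution can be written out explicitly. The following is a sketch that assumes the sums and the integral can be interchanged and uses $f(0; x_0) = \pi(x_0)$; the second summation index $j$ is introduced here only for illustration.

$$\begin{aligned}
f(t; x_1, x_2) &= \int_0^1 \pi(x_0) \left[ \sum_{i=0}^{\infty} e^{-\Lambda_i t}\, \pi(x_1)\, \frac{B_i(x_0) B_i(x_1)}{\langle B_i, B_i \rangle} \right] \left[ \sum_{j=0}^{\infty} e^{-\Lambda_j t}\, \pi(x_2)\, \frac{B_j(x_0) B_j(x_2)}{\langle B_j, B_j \rangle} \right] dx_0 \\
&= \pi(x_1) \pi(x_2) \sum_{i=0}^{\infty} \sum_{j=0}^{\infty} e^{-(\Lambda_i + \Lambda_j) t}\, \frac{B_i(x_1) B_j(x_2)}{\langle B_i, B_i \rangle \langle B_j, B_j \rangle} \int_0^1 B_i(x_0) B_j(x_0) \pi(x_0)\, dx_0 \\
&= \pi(x_1) \pi(x_2) \sum_{i=0}^{\infty} e^{-2 \Lambda_i t}\, \frac{B_i(x_1) B_i(x_2)}{\langle B_i, B_i \rangle}.
\end{aligned}$$

The last line follows because the remaining integral is the inner product $\langle B_i, B_j \rangle$, which vanishes unless $i = j$ and otherwise cancels one factor of $\langle B_i, B_i \rangle$ in the denominator.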

In practice, we can only compute partial sums on the right-hand side, which we can re-write as

$$f(t; x_1, x_2) = \pi(x_1) \pi(x_2) \left( S_m(x_1, x_2) + R_m(x_1, x_2) \right),$$

where $S_m$ is the partial sum of terms up to order $m$ and $R_m$ is the remainder, which represents the error from truncating the series. We can control this error by choosing a large enough $m$ (see Numerical integration).

Sampling probabilities

The abundances of two-population distribution codes are a simple transformation of the cumulative distribution function (CDF) of the joint allele counts $(K_1, K_2)$. Conditioning on allele frequencies at time $t$, but before admixture, the CDF is given by

$$\mathcal{P}\{K_1 \leq k_1, K_2 \leq k_2\} = \int_0^1 \int_0^1 \mathcal{P}\{K_1 \leq k_1 \mid x_1, x_2\}\, \mathcal{P}\{K_2 \leq k_2 \mid x_1, x_2\}\, f(t; x_1, x_2)\, dx_1\, dx_2. \tag{4}$$

For $n$ randomly sampled haploid individuals from each population, and admixture coefficient $\alpha$, we have:

$$\begin{aligned} K_1 \mid x_1, x_2 &\sim \mathrm{Binomial}(n, (1 - \alpha) x_1 + \alpha x_2), \\ K_2 \mid x_1, x_2 &\sim \mathrm{Binomial}(n, (1 - \alpha) x_2 + \alpha x_1). \end{aligned} \tag{5}$$

Writing $P_n^{(k)}(x_1, x_2)$ for the binomial cumulative distribution function $\mathcal{P}\{K_i \leq k \mid x_1, x_2\}$, and substituting (5) into (4) yields:

$$\mathcal{P}\{K_1 \leq k_1, K_2 \leq k_2\} = \langle P_n^{(k_1)} P_n^{(k_2)}, S_m \rangle + \langle P_n^{(k_1)} P_n^{(k_2)}, R_m \rangle, \tag{6}$$

where the inner product now represents the double integral weighted by $\pi(x_1) \pi(x_2)$.
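The conditional sampling step can be illustrated numerically. The sketch below is not the authors' code: the helper names `binom_cdf`, `count_cdf`, and `prob_unobserved_in_both` are hypothetical, and it evaluates only the binomial CDFs for fixed frequencies $(x_1, x_2)$, leaving out the integration over the joint density.

```python
from math import comb


def binom_cdf(k, n, p):
    """P{K <= k} for K ~ Binomial(n, p), by direct summation."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k + 1))


def count_cdf(k, n, x1, x2, alpha, population=1):
    """P{K_i <= k | x1, x2} with symmetric admixture coefficient alpha."""
    if population == 1:
        p = (1 - alpha) * x1 + alpha * x2
    else:
        p = (1 - alpha) * x2 + alpha * x1
    return binom_cdf(k, n, p)


def prob_unobserved_in_both(n, x1, x2, alpha):
    """Probability that both sample counts are zero, given x1 and x2.

    The two counts are treated as independent given the frequencies,
    as in the product inside the double integral.
    """
    return count_cdf(0, n, x1, x2, alpha, 1) * count_cdf(0, n, x1, x2, alpha, 2)


# Made-up frequencies: an allele present only in Population 1 before admixture.
for alpha in (0.0, 0.01, 0.1):
    p_seen_in_2 = 1 - count_cdf(0, 100, 0.05, 0.0, alpha, population=2)
    print(f"alpha = {alpha}: P(observed in Population 2 sample) = {p_seen_in_2:.3f}")
```

With no admixture the allele is never observed in the Population 2 sample, while even a small $\alpha$ makes it likely to appear there as a rare variant, which is the kind of sharing the admixture coefficient is meant to capture.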

Numerical integration

We compute the integrals in (6) by two-dimensional Gauss-Jacobi quadrature. The left argument of the inner product is a polynomial of degree $n$ in both $x_1$ and $x_2$. As a result, we can choose $m = 2n$, so that $\langle P_n^{(k_1)} P_n^{(k_2)}, R_{2n} \rangle = 0$ due to the orthogonality of the Jacobi polynomials. Because $S_{2n}$ is also a polynomial, the integrand is a polynomial of degree $4n$. Thus, fixed-order tensor-product Gauss-Jacobi quadrature is guaranteed to yield the exact integral with $4n^2$ evaluations of the integrand.
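The exactness property can be illustrated in a simplified setting. The sketch below is not the authors' implementation: it assumes the special case $θ = 1$, in which the weight $π(x)$ is constant and Gauss-Jacobi quadrature reduces to Gauss-Legendre quadrature, and it uses a hard-coded three-point rule with a made-up polynomial integrand. A three-point Gauss rule is exact for polynomials of degree up to 5 in each variable, so the tensor-product rule needs only 9 evaluations here.

```python
from math import isclose, sqrt

# Standard three-point Gauss-Legendre nodes and weights on [-1, 1].
_NODES = (-sqrt(3.0 / 5.0), 0.0, sqrt(3.0 / 5.0))
_WEIGHTS = (5.0 / 9.0, 8.0 / 9.0, 5.0 / 9.0)


def gauss_legendre_unit():
    """Return (node, weight) pairs of the three-point rule mapped to [0, 1]."""
    return [((x + 1.0) / 2.0, w / 2.0) for x, w in zip(_NODES, _WEIGHTS)]


def tensor_quadrature(g):
    """Approximate the integral of g(x1, x2) over the unit square."""
    rule = gauss_legendre_unit()
    return sum(w1 * w2 * g(x1, x2) for x1, w1 in rule for x2, w2 in rule)


def integrand(x1, x2):
    """Made-up polynomial of degree at most 5 in each variable."""
    return x1**4 * x2**5 + 3.0 * x1 * x2**2


# Closed-form value: (1/5)(1/6) + 3 (1/2)(1/3).
exact = 1.0 / 30.0 + 0.5
approx = tensor_quadrature(integrand)
print(isclose(approx, exact, rel_tol=1e-12))
```

For general $θ$ the same idea applies with Gauss-Jacobi nodes and weights in place of the Legendre ones, and with the number of nodes chosen according to the polynomial degree of the integrand.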

Appendix 2

Extinction probability and conditional mean frequency

The extinction probability $\wp(p, t)$, the probability that a mutation that was at frequency $p$ at time $t = 0$ is extinct at time $t = T / 2N$, obeys the Kolmogorov backward equation (Ewens, 2004):

$$\frac{\partial}{\partial t} \wp(p, t) = \frac{1}{2}\, p (1 - p)\, \frac{\partial^2}{\partial p^2} \wp(p, t) \tag{7}$$

with boundary conditions

$$\wp(p, 0) = \begin{cases} 1 & \text{if } p = 0 \\ 0 & \text{otherwise,} \end{cases} \tag{8}$$

$$\wp(0, t) = 1, \tag{9}$$

$$\wp(1, t) = 0. \tag{10}$$

For short times and rare alleles (i.e. $t, p \ll 1$), we can use the approximation $p (1 - p) \approx p$ to get a simpler diffusion equation:

$$\frac{\partial}{\partial t} \wp = \frac{1}{2}\, p\, \frac{\partial^2}{\partial p^2} \wp \tag{11}$$

with modified boundary conditions

$$\wp(p, 0) = \begin{cases} 1 & \text{if } p = 0 \\ 0 & \text{otherwise,} \end{cases} \tag{12}$$

$$\wp(0, t) = 1, \tag{13}$$

$$\lim_{p \to \infty} \wp(p, t) = 0. \tag{14}$$

Because we are neglecting the $(1 - p)$ term, fixation is not possible in this approximation, and it is natural to move the upper boundary condition from $p = 1$ to $p \to \infty$. (This approximation is equivalent to replacing the Wright-Fisher diffusion with a continuous-state critical branching process, which is guaranteed to go extinct for all finite sizes.) Accordingly, we expect the approximation to break down when the minor allele has a substantial probability of fixation.

We can solve (11) in closed form to find the time-dependent extinction probability,

$$\wp(p, t) \approx \exp\left(-\frac{2p}{t}\right). \tag{15}$$

For $t \ll 2p$, this probability is exponentially small, while for $t \gg 2p$ it behaves like $1 - 2p/t$ (Box 2—figure 1C).

We can use (15) to find the expected frequency of a new mutation conditional on its survival to time $t$. By the law of total expectation, we have

$$\mathbb{E}[X(t) \mid X(t) > 0] = \frac{\mathbb{E}[X(t)]}{\mathbb{P}[X(t) > 0]} = \frac{1/2N}{1 - \wp(1/2N, t)},$$

where in the last equality we used the fact that for a new neutral mutation $\mathbb{E}[X(t)] = p = 1/2N$. Thus, to leading order in $1/N$, we have $\mathbb{E}[X(t) \mid X(t) > 0] \sim t/2$.
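The two approximations in this appendix are easy to check numerically. The following minimal sketch uses hypothetical function names (`extinction_prob`, `conditional_mean_freq`) and made-up values of $N$, $p$, and $t$; it evaluates the closed-form expressions only and does not simulate the Wright-Fisher process.

```python
from math import exp, expm1


def extinction_prob(p, t):
    """Approximate extinction probability exp(-2p/t) for small p and t."""
    return exp(-2.0 * p / t)


def conditional_mean_freq(N, t):
    """Expected frequency of a new mutation (initial frequency 1/2N),
    conditional on survival to time t (in units of 2N generations)."""
    p0 = 1.0 / (2 * N)
    survival = -expm1(-2.0 * p0 / t)  # 1 - exp(-2 p0 / t), computed stably
    return p0 / survival


# For t much smaller than 2p, extinction is exponentially unlikely.
print(extinction_prob(0.5, 0.01))

# For t much larger than 2p, the probability is close to 1 - 2p/t.
print(extinction_prob(0.001, 1.0), 1 - 2 * 0.001 / 1.0)

# The conditional mean frequency is close to t/2 when N is large.
print(conditional_mean_freq(10_000, 0.1), 0.1 / 2)
```

The last comparison shows the leading-order result: for large $N$ the conditional mean depends on $t$ alone and is approximately $t/2$.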

Appendix 3

Appendix 3—key resources table

Reagent type (species) or resource	Designation	Source or reference	Identifiers	Additional information
Other	1000 Genomes High-Coverage Data (1 KG)	https://doi.org/10.1093/nar/gkz836	RRID:SCR_006828	http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/20190425_NYGC_GATK/
Other	Simons Genome Diversity Project Data (SGDP)	https://doi.org/10.1038/nature18964		https://reichdata.hms.harvard.edu/pub/datasets/sgdp/
Other	Ancestral allele calls	https://doi.org/10.1093/nar/gkz966	RRID:SCR_002344	ftp.ensembl.org/pub/release-90/fasta/ancestral_alleles/homo_sapiens_ancestor_GRCh38_e86.tar.gz
Other	GRCh38 Genome Masks	https://doi.org/10.1093/nar/gkz836	RRID:SCR_006828	http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000_genomes_project/working/20160622_genome_mask_GRCh38/
Commercial assay or kit	Human Origins Array; Human Origins	other		https://sec-assets.thermofisher.com/TFS-Assets/LSG/Support-Files/Axiom_GW_%20HuOrigin.na35.annot.csv.zip
Commercial assay or kit	Affymetrix GenomeWide 6.0 Array (Affy6)	other		http://www.affymetrix.com/Auth/analysis/downloads/na35/genotyping/GenomeWideSNP_6.na35.annot.csv.zip
Commercial assay or kit	Illumina MEGA Array (MEGA)	other		https://support.illumina.com/array/array_kits/infinium-multi-ethnic-amr-afr-8-kit/downloads.html
Commercial assay or kit	Illumina Human Omni Express Array (OmniExpress)	other		ftp://ussd-ftp.illumina.com/Downloads/ProductFiles/HumanOmniExpress-24/v1-0/HumanOmniExpress-24-v1-0-B.csv
Commercial assay or kit	Illumina Omni2.5Exome Array (Omni2.5Exome)	other		ftp://ussd-ftp.illumina.com/Downloads/ProductFiles/HumanOmni2-5Exome-8/Product_Files_v1-1/HumanOmni2-5Exome-8-v1-1-A.csv
Other	Reproducible analysis pipeline for this paper	This paper		https://github.com/aabiddanda/geovar_rep_paper; Biddanda, 2020a (copy archived at swh:1:rev:db3ca8faeecf8697973f803bc05c5a3d0a187145)
Software, algorithm	GeoVar software	This paper		https://aabiddanda.github.io/geovar/

Data availability

The GeoVar assignments for each variant have been deposited to Dryad (https://doi.org/10.5061/dryad.rjdfn2z7v). The code for replicating the analyses is available at: https://github.com/aabiddanda/geovar_rep_paper (copy archived at https://archive.softwareheritage.org/swh:1:rev:db3ca8faeecf8697973f803bc05c5a3d0a187145/). A Python package (https://aabiddanda.github.io/geovar/) allows users to make GeoVar plots from frequency tables or VCF files.

The following data sets were generated

(2020) Dryad Digital Repository
Geographic allele frequency variation in the 1000 Genomes hg38 NYGC dataset.

https://doi.org/10.5061/dryad.rjdfn2z7v

References

Adrion JR, Cole CB, Dukler N, Galloway JG, Gladstein AL, Gower G, Kyriazis CC, Ragsdale AP, Tsambos G, Baumdicker F, Carlson J, Cartwright RA, Durvasula A, Gronau I, Kim BY, McKenzie P, Messer PW, Noskova E, Ortega-Del Vecchyo D, Racimo F, Struck TJ, Gravel S, Gutenkunst RN, Lohmueller KE, Ralph PL, Schrider DR, Siepel A, Kelleher J, Kern AD (2020) A community-maintained standard library of population genetic models. eLife 9:e54967. https://doi.org/10.7554/eLife.54967
Albers PK, McVean G (2020) Dating genomic variants and shared ancestry in population-scale sequencing data. PLOS Biology 18:e3000586. https://doi.org/10.1371/journal.pbio.3000586
(2010) Ascertainment biases in SNP chips affect measures of population divergence. Molecular Biology and Evolution 27:2534–2547. https://doi.org/10.1093/molbev/msq148
(2015) A global reference for human genetic variation. Nature 526:68–74. https://doi.org/10.1038/nature15393
Preprint
Bergström A, McCarthy SA, Hui R, Almarri MA, Ayub Q, Danecek P, Chen Y, Felkel S, Hallast P, Kamm J, Blanché H, Deleuze J-F, Cann H, Mallick S, Reich D, Sandhu MS, Skoglund P, Scally A, Xue Y, Durbin R, Tyler-Smith C (2019) Insights into human genetic variation and population history from 929 diverse genomes. bioRxiv. https://doi.org/10.1101/674986
Software
Biddanda A (2020a) geovar_rep_paper, version swh:1:rev:db3ca8faeecf8697973f803bc05c5a3d0a187145. Software Heritage. https://archive.softwareheritage.org/swh:1:rev:db3ca8faeecf8697973f803bc05c5a3d0a187145/
Software
Biddanda A (2020b) geovar_rep_paper, version swh:1:rev:db3ca8faeecf8697973f803bc05c5a3d0a187145. Software Heritage. https://archive.softwareheritage.org/swh:1:dir:eb7458b7e7697b1c86c8ae0dd228796778171e57/
Bien SA, Wojcik GL, Zubair N, Gignoux CR, Martin AR, Kocarnik JM, Martin LW, Buyske S, Haessler J, Walker RW, Cheng I, Graff M, Xia L, Franceschini N, Matise T, James R, Hindorff L, Le Marchand L, North KE, Haiman CA, Peters U, Loos RJ, Kooperberg CL, Bustamante CD, Kenny EE, Carlson CS, PAGE Study (2016) Strategies for enriching variant coverage in candidate disease loci on a multiethnic genotyping array. PLOS ONE 11:e0167758. https://doi.org/10.1371/journal.pone.0167758
Bien SA, Wojcik GL, Hodonsky CJ, Gignoux CR, Cheng I, Matise TC, Peters U, Kenny EE, North KE (2019) The future of genomic studies must be globally representative: perspectives from PAGE. Annual Review of Genomics and Human Genetics 20:181–200. https://doi.org/10.1146/annurev-genom-091416-035517
(2017) The impact of rare and low-frequency genetic variants in common disease. Genome Biology 18:77. https://doi.org/10.1186/s13059-017-1212-4
Bowling BV, Acra EE, Wang L, Myers MF, Dean GE, Markle GC, Moskalik CL, Huether CA (2008) Development and evaluation of a genetics literacy assessment instrument for undergraduates. Genetics 178:15–22. https://doi.org/10.1534/genetics.107.079533
Bradburd GS, Ralph PL (2019) Spatial population genetics: it's about time. Annual Review of Ecology, Evolution, and Systematics 50:427–449. https://doi.org/10.1146/annurev-ecolsys-110316-022659
(2011) Genomics for the world. Nature 475:163–165. https://doi.org/10.1038/475163a
Bycroft C, Freeman C, Petkova D, Band G, Elliott LT, Sharp K, Motyer A, Vukcevic D, Delaneau O, O'Connell J, Cortes A, Welsh S, Young A, Effingham M, McVean G, Leslie S, Allen N, Donnelly P, Marchini J (2018) The UK Biobank resource with deep phenotyping and genomic data. Nature 562:203–209. https://doi.org/10.1038/s41586-018-0579-z
(1987) Mitochondrial DNA and human evolution. Nature 325:31–36. https://doi.org/10.1038/325031a0
Chen L, Wolf AB, Fu W, Li L, Akey JM (2020) Identifying and interpreting apparent Neanderthal ancestry in African individuals. Cell 180:677–687. https://doi.org/10.1016/j.cell.2020.01.012
(2005) Ascertainment bias in studies of human genome-wide polymorphism. Genome Research 15:1496–1502. https://doi.org/10.1101/gr.4107905
Conrad DF, Jakobsson M, Coop G, Wen X, Wall JD, Rosenberg NA, Pritchard JK (2006) A worldwide survey of haplotype variation and linkage disequilibrium in the human genome. Nature Genetics 38:1251–1260. https://doi.org/10.1038/ng1911
Book
Coon CS (1962) The Origin of Races. New York: Alfred K Knopf.
(2009) Out of Africa: modern human origins special feature: explaining worldwide patterns of human genetic variation using a coalescent-based serial founder model of migration outward from Africa. PNAS 106:16057–16062. https://doi.org/10.1073/pnas.0903341106
Di Rienzo A, Wilson AC (1991) Branching pattern in the evolutionary tree for human mitochondrial DNA. PNAS 88:1597–1601. https://doi.org/10.1073/pnas.88.5.1597
Donovan BM, Semmens R, Keck P, Brimhall E, Busch KC, Weindling M, Duncan A, Stuhlsatz M, Bracey ZB, Bloom M, Kowalski S, Salazar B (2019) Toward a more humane genetics education: learning about the social and quantitative complexities of human genetic variation research could reduce racial bias in adolescent and adult populations. Science Education 103:529–560. https://doi.org/10.1002/sce.21506
(2011) Testing for ancient admixture between closely related populations. Molecular Biology and Evolution 28:2239–2252. https://doi.org/10.1093/molbev/msr048
Book
1. Ewens WJ
(2004) Applications of Diffusion Theory
In: Ewens W. J, editors. Mathematical Population Genetics: I. Theoretical Introduction. Interdisciplinary Applied Mathematics. Springer. pp. 156–200.

https://doi.org/10.1007/978-0-387-21822-9_5
(2020) The international genome sample resource (IGSR) collection of open human genomic variation resources
Nucleic Acids Research 48:D941–D947.

https://doi.org/10.1093/nar/gkz836
1. Fenner JN
(2005) Cross-cultural estimation of the human generation interval for use in genetics-based population divergence studies
American Journal of Physical Anthropology 128:415–423.

https://doi.org/10.1002/ajpa.20188
1. Green RE
2. Krause J
3. Briggs AW
4. Maricic T
5. Stenzel U
6. Kircher M
7. Patterson N
8. Li H
9. Zhai W
10. Fritz MH
11. Hansen NF
12. Durand EY
13. Malaspinas AS
14. Jensen JD
15. Marques-Bonet T
16. Alkan C
17. Prüfer K
18. Meyer M
19. Burbano HA
20. Good JM
21. Schultz R
22. Aximu-Petri A
23. Butthof A
24. Höber B
25. Höffner B
26. Siegemund M
27. Weihmann A
28. Nusbaum C
29. Lander ES
30. Russ C
31. Novod N
32. Affourtit J
33. Egholm M
34. Verna C
35. Rudan P
36. Brajkovic D
37. Kucan Ž
38. Gušic I
39. Doronichev VB
40. Golovanova LV
41. Lalueza-Fox C
42. de la Rasilla M
43. Fortea J
44. Rosas A
45. Schmitz RW
46. Johnson PLF
47. Eichler EE
48. Falush D
49. Birney E
50. Mullikin JC
51. Slatkin M
52. Nielsen R
53. Kelso J
54. Lachmann M
55. Reich D
56. Pääbo S
(2010) A draft sequence of the Neandertal genome
Science 328:710–722.

https://doi.org/10.1126/science.1188021
(2009) Statistical methods in spatial genetics
Molecular Ecology 18:4734–4756.

https://doi.org/10.1111/j.1365-294X.2009.04410.x
(2009) Inferring the joint demographic history of multiple populations from multidimensional SNP frequency data
PLOS Genetics 5:e1000695.

https://doi.org/10.1371/journal.pgen.1000695
1. Harpending HC
2. Eller E
(2000) Human diversity and its history
The Biology of Biodiversity 1:301–314.

https://doi.org/10.1007/978-4-431-65930-3_20
1. Harpending H
2. Rogers A
(2000) Genetic perspectives on human origins and differentiation
Annual Review of Genomics and Human Genetics 1:361–385.

https://doi.org/10.1146/annurev.genom.1.1.361
1. Harris K
2. Nielsen R
(2016) The genetic cost of Neanderthal introgression
Genetics 203:881–891.

https://doi.org/10.1534/genetics.116.186890
(2012) Fast and accurate genotype imputation in genome-wide association studies through pre-phasing
Nature Genetics 44:955–959.

https://doi.org/10.1038/ng.2354
1. Hubbard AR
(2017) Testing common misconceptions about the nature of human racial variation
The American Biology Teacher 79:538–543.

https://doi.org/10.1525/abt.2017.79.7.538
1. International HapMap Consortium
(2005) A haplotype map of the human genome
Nature 437:1299–1320.

https://doi.org/10.1038/nature04226
1. Jakobsson M
2. Scholz SW
3. Scheet P
4. Gibbs JR
5. VanLiere JM
6. Fung HC
7. Szpiech ZA
8. Degnan JH
9. Wang K
10. Guerreiro R
11. Bras JM
12. Schymick JC
13. Hernandez DG
14. Traynor BJ
15. Simon-Sanchez J
16. Matarin M
17. Britton A
18. van de Leemput J
19. Rafferty I
20. Bucan M
21. Cann HM
22. Hardy JA
23. Rosenberg NA
24. Singleton AB
(2008) Genotype, haplotype and copy-number variation in worldwide human populations
Nature 451:998–1003.

https://doi.org/10.1038/nature06742
(2016) The strength of selection against Neanderthal introgression
PLOS Genetics 12:e1006340.

https://doi.org/10.1371/journal.pgen.1006340
1. Keinan A
2. Clark AG
(2012) Recent explosive human population growth has resulted in an excess of rare genetic variants
Science 336:740–743.

https://doi.org/10.1126/science.1217283
1. Kelleher J
2. Wong Y
3. Wohns AW
4. Fadil C
5. Albers PK
6. McVean G
(2019) Inferring whole-genome histories in large population datasets
Nature Genetics 51:1330–1338.

https://doi.org/10.1038/s41588-019-0483-y
1. Key FM
2. Abdul-Aziz MA
3. Mundry R
4. Peter BM
5. Sekar A
6. D'Amato M
7. Dennis MY
8. Schmidt JM
9. Andrés AM
(2018) Human local adaptation of the TRPM8 cold receptor along a latitudinal cline
PLOS Genetics 14:e1007298.

https://doi.org/10.1371/journal.pgen.1007298
(2013) Deleterious alleles in the human genome are on average younger than neutral alleles of the same frequency
PLOS Genetics 9:e1003301.

https://doi.org/10.1371/journal.pgen.1003301
1. Kimura M
2. Ohta T
(1973) The age of a neutral mutant persisting in a finite population
Genetics 75:199–212.
1. Lachance J
2. Tishkoff SA
(2013) SNP ascertainment bias in population genetic analyses: why it is important, and how to correct it
BioEssays 35:780–786.

https://doi.org/10.1002/bies.201300014
(2018) A tutorial on how not to over-interpret STRUCTURE and ADMIXTURE bar plots
Nature Communications 9:3258.

https://doi.org/10.1038/s41467-018-05257-7
(2015) The fine-scale genetic structure of the British population
Nature 519:309–314.

https://doi.org/10.1038/nature14230
Book
1. Lewontin RC
(1972) The Apportionment of Human Diversity
In: Dobzhansky T, Hecht M, Steere W, editors. Evolutionary Biology. Springer. pp. 381–398.

https://doi.org/10.1007/978-1-4684-9063-3_14
1. Li JZ
2. Absher DM
3. Tang H
4. Southwick AM
5. Casto AM
6. Ramachandran S
7. Cann HM
8. Barsh GS
9. Feldman M
10. Cavalli-Sforza LL
11. Myers RM
(2008) Worldwide human relationships inferred from genome-wide patterns of variation
Science 319:1100–1104.

https://doi.org/10.1126/science.1153717
1. Mallick S
2. Li H
3. Lipson M
4. Mathieson I
5. Gymrek M
6. Racimo F
7. Zhao M
8. Chennagiri N
9. Nordenfelt S
10. Tandon A
11. Skoglund P
12. Lazaridis I
13. Sankararaman S
14. Fu Q
15. Rohland N
16. Renaud G
17. Erlich Y
18. Willems T
19. Gallo C
20. Spence JP
21. Song YS
22. Poletti G
23. Balloux F
24. van Driem G
25. de Knijff P
26. Romero IG
27. Jha AR
28. Behar DM
29. Bravi CM
30. Capelli C
31. Hervig T
32. Moreno-Estrada A
33. Posukh OL
34. Balanovska E
35. Balanovsky O
36. Karachanak-Yankova S
37. Sahakyan H
38. Toncheva D
39. Yepiskoposyan L
40. Tyler-Smith C
41. Xue Y
42. Abdullah MS
43. Ruiz-Linares A
44. Beall CM
45. Di Rienzo A
46. Jeong C
47. Starikovskaya EB
48. Metspalu E
49. Parik J
50. Villems R
51. Henn BM
52. Hodoglugil U
53. Mahley R
54. Sajantila A
55. Stamatoyannopoulos G
56. Wee JT
57. Khusainova R
58. Khusnutdinova E
59. Litvinov S
60. Ayodo G
61. Comas D
62. Hammer MF
63. Kivisild T
64. Klitz W
65. Winkler CA
66. Labuda D
67. Bamshad M
68. Jorde LB
69. Tishkoff SA
70. Watkins WS
71. Metspalu M
72. Dryomov S
73. Sukernik R
74. Singh L
75. Thangaraj K
76. Pääbo S
77. Kelso J
78. Patterson N
79. Reich D
(2016) The Simons Genome Diversity Project: 300 genomes from 142 diverse populations
Nature 538:201–206.

https://doi.org/10.1038/nature18964
1. Manolio TA
2. Collins FS
3. Cox NJ
4. Goldstein DB
5. Hindorff LA
6. Hunter DJ
7. McCarthy MI
8. Ramos EM
9. Cardon LR
10. Chakravarti A
11. Cho JH
12. Guttmacher AE
13. Kong A
14. Kruglyak L
15. Mardis E
16. Rotimi CN
17. Slatkin M
18. Valle D
19. Whittemore AS
20. Boehnke M
21. Clark AG
22. Eichler EE
23. Gibson G
24. Haines JL
25. Mackay TF
26. McCarroll SA
27. Visscher PM
(2009) Finding the missing heritability of complex diseases
Nature 461:747–753.

https://doi.org/10.1038/nature08494
1. Marigorta UM
2. Navarro A
(2013) High trans-ethnic replicability of GWAS results implies common causal variants
PLOS Genetics 9:e1003566.

https://doi.org/10.1371/journal.pgen.1003566
1. Martin AR
2. Gignoux CR
3. Walters RK
4. Wojcik GL
5. Neale BM
6. Gravel S
7. Daly MJ
8. Bustamante CD
9. Kenny EE
(2017) Human demographic history impacts genetic risk prediction across diverse populations
The American Journal of Human Genetics 100:635–649.

https://doi.org/10.1016/j.ajhg.2017.03.004
1. Mathieson I
2. McVean G
(2014) Demography and the age of rare variants
PLOS Genetics 10:e1004528.

https://doi.org/10.1371/journal.pgen.1004528
1. Mathieson I
2. Scally A
(2020) What is ancestry?
PLOS Genetics 16:e1008624.

https://doi.org/10.1371/journal.pgen.1008624
1. McVean G
(2009) A genealogical interpretation of principal components analysis
PLOS Genetics 5:e1000686.

https://doi.org/10.1371/journal.pgen.1000686
(2020) Variable prediction accuracy of polygenic scores within an ancestry group
eLife 9:e48376.

https://doi.org/10.7554/eLife.48376
1. Mountain JL
2. Ramakrishnan U
(2005) Impact of human population history on distributions of individual-level genetic distance
Human Genomics 2:4–19.

https://doi.org/10.1186/1479-7364-2-1-4
1. Nelson MR
2. Wegmann D
3. Ehm MG
4. Kessner D
5. St Jean P
6. Verzilli C
7. Shen J
8. Tang Z
9. Bacanu SA
10. Fraser D
11. Warren L
12. Aponte J
13. Zawistowski M
14. Liu X
15. Zhang H
16. Zhang Y
17. Li J
18. Li Y
19. Li L
20. Woollard P
21. Topp S
22. Hall MD
23. Nangle K
24. Wang J
25. Abecasis G
26. Cardon LR
27. Zöllner S
28. Whittaker JC
29. Chissoe SL
30. Novembre J
31. Mooser V
(2012) An abundance of rare functional variants in 202 drug target genes sequenced in 14,002 people
Science 337:100–104.

https://doi.org/10.1126/science.1217876
(2017) Tracing the peopling of the world through genomics
Nature 541:302–310.

https://doi.org/10.1038/nature21347
1. Novembre J
2. Peter BM
(2016) Recent advances in the study of fine-scale population structure in humans
Current Opinion in Genetics & Development 41:98–105.

https://doi.org/10.1016/j.gde.2016.08.007
1. Panofsky A
2. Bliss C
(2017) Ambiguity and scientific authority: population classification in genomic science
American Sociological Review 82:59–87.

https://doi.org/10.1177/0003122416685812
(2006) Population structure and eigenanalysis
PLOS Genetics 2:e190.

https://doi.org/10.1371/journal.pgen.0020190
1. Patterson N
2. Moorjani P
3. Luo Y
4. Mallick S
5. Rohland N
6. Zhan Y
7. Genschoreck T
8. Webster T
9. Reich D
(2012) Ancient admixture in human history
Genetics 192:1065–1093.

https://doi.org/10.1534/genetics.112.145037
1. Peterson RE
2. Kuchenbaecker K
3. Walters RK
4. Chen CY
5. Popejoy AB
6. Periyasamy S
7. Lam M
8. Iyegbe C
9. Strawbridge RJ
10. Brick L
11. Carey CE
12. Martin AR
13. Meyers JL
14. Su J
15. Chen J
16. Edwards AC
17. Kalungi A
18. Koen N
19. Majara L
20. Schwarz E
21. Smoller JW
22. Stahl EA
23. Sullivan PF
24. Vassos E
25. Mowry B
26. Prieto ML
27. Cuellar-Barboza A
28. Bigdeli TB
29. Edenberg HJ
30. Huang H
31. Duncan LE
(2019) Genome-wide association studies in ancestrally diverse populations: opportunities, methods, pitfalls, and recommendations
Cell 179:589–603.

https://doi.org/10.1016/j.cell.2019.08.051
1. Phelan JC
2. Link BG
3. Zelner S
4. Yang LH
(2014) Direct-to-consumer racial admixture tests and beliefs about essential racial differences
Social Psychology Quarterly 77:296–318.

https://doi.org/10.1177/0190272514529439
1. Pickrell JK
2. Pritchard JK
(2012) Inference of population splits and mixtures from genome-wide allele frequency data
PLOS Genetics 8:e1002967.

https://doi.org/10.1371/journal.pgen.1002967
1. Pickrell JK
2. Reich D
(2014) Toward a new history and geography of human genes informed by ancient DNA
Trends in Genetics 30:377–389.

https://doi.org/10.1016/j.tig.2014.07.007
1. Platt A
2. Pivirotto A
3. Knoblauch J
4. Hey J
(2019) An estimator of first coalescent time reveals selection on young variants and large heterogeneity in rare allele ages among human populations
PLOS Genetics 15:e1008340.

https://doi.org/10.1371/journal.pgen.1008340
1. Popejoy AB
2. Fullerton SM
(2016) Genomics is failing on diversity
Nature 538:161–164.

https://doi.org/10.1038/538161a
1. Prüfer K
2. Racimo F
3. Patterson N
4. Jay F
5. Sankararaman S
6. Sawyer S
7. Heinze A
8. Renaud G
9. Sudmant PH
10. de Filippo C
11. Li H
12. Mallick S
13. Dannemann M
14. Fu Q
15. Kircher M
16. Kuhlwilm M
17. Lachmann M
18. Meyer M
19. Ongyerth M
20. Siebauer M
21. Theunert C
22. Tandon A
23. Moorjani P
24. Pickrell J
25. Mullikin JC
26. Vohr SH
27. Green RE
28. Hellmann I
29. Johnson PL
30. Blanche H
31. Cann H
32. Kitzman JO
33. Shendure J
34. Eichler EE
35. Lein ES
36. Bakken TE
37. Golovanova LV
38. Doronichev VB
39. Shunkov MV
40. Derevianko AP
41. Viola B
42. Slatkin M
43. Reich D
44. Kelso J
45. Pääbo S
(2014) The complete genome sequence of a Neanderthal from the Altai Mountains
Nature 505:43–49.

https://doi.org/10.1038/nature12886
(2005) Geography predicts neutral genetic diversity of human populations
Current Biology 15:R159–R160.

https://doi.org/10.1016/j.cub.2005.02.038
1. Race, Ethnicity, and Genetics Working Group
(2005) The use of racial, ethnic, and ancestral categories in human genetics research
The American Journal of Human Genetics 77:519–532.

https://doi.org/10.1086/491747
(2015) Evidence for archaic adaptive introgression in humans
Nature Reviews Genetics 16:359–371.

https://doi.org/10.1038/nrg3936
1. Ralph P
2. Coop G
(2010) Parallel adaptation: one or many waves of advance of an advantageous allele?
Genetics 186:647–668.

https://doi.org/10.1534/genetics.110.119594
(2005) Support from the relationship of genetic and geographic distance in human populations for a serial founder effect originating in Africa
PNAS 102:15942–15947.

https://doi.org/10.1073/pnas.0507611102
(2002) Genetic structure of human populations
Science 298:2381–2385.

https://doi.org/10.1126/science.1078311
1. Rosenberg NA
(2011) A population-genetic perspective on the similarities and differences among worldwide human populations
Human Biology 83:659–684.

https://doi.org/10.3378/027.083.0601
(2020) Do genetic ancestry tests increase racial essentialism? Findings from a randomized controlled trial
PLOS ONE 15:e0227399.

https://doi.org/10.1371/journal.pone.0227399
1. Slatkin M
2. Hudson RR
(1991) Pairwise comparisons of mitochondrial DNA sequences in stable and exponentially growing populations
Genetics 129:555–562.
1. Song YS
2. Steinrücken M
(2012) A simple method for finding explicit analytic transition densities of diffusion processes with general diploid selection
Genetics 190:1117–1129.

https://doi.org/10.1534/genetics.111.136929
1. Speidel L
2. Forest M
3. Shi S
4. Myers SR
(2019) A method for genome-wide genealogy estimation for thousands of samples
Nature Genetics 51:1321–1329.

https://doi.org/10.1038/s41588-019-0484-x
1. Stringer CB
2. Andrews P
(1988) Genetic and fossil evidence for the origin of modern humans
Science 239:1263–1268.

https://doi.org/10.1126/science.3125610
(2008) ADZE: a rarefaction approach for counting alleles private to combinations of populations
Bioinformatics 24:2498–2504.

https://doi.org/10.1093/bioinformatics/btn478
1. Tennessen JA
2. Bigham AW
3. O'Connor TD
4. Fu W
5. Kenny EE
6. Gravel S
7. McGee S
8. Do R
9. Liu X
10. Jun G
11. Kang HM
12. Jordan D
13. Leal SM
14. Gabriel S
15. Rieder MJ
16. Abecasis G
17. Altshuler D
18. Nickerson DA
19. Boerwinkle E
20. Sunyaev S
21. Bustamante CD
22. Bamshad MJ
23. Akey JM
24. Broad GO
25. Seattle GO
26. NHLBI Exome Sequencing Project
(2012) Evolution and functional impact of rare coding variation from deep sequencing of human exomes
Science 337:64–69.

https://doi.org/10.1126/science.1219240
(2000) Recent common ancestry of human Y chromosomes: evidence from DNA sequence data
PNAS 97:7360–7365.

https://doi.org/10.1073/pnas.97.13.7360
(2007) Genetic similarities within and between human populations
Genetics 176:351–359.

https://doi.org/10.1534/genetics.106.067355
1. Wojcik GL
2. Graff M
3. Nishimura KK
4. Tao R
5. Haessler J
6. Gignoux CR
7. Highland HM
8. Patel YM
9. Sorokin EP
10. Avery CL
11. Belbin GM
12. Bien SA
13. Cheng I
14. Cullina S
15. Hodonsky CJ
16. Hu Y
17. Huckins LM
18. Jeff J
19. Justice AE
20. Kocarnik JM
21. Lim U
22. Lin BM
23. Lu Y
24. Nelson SC
25. Park SL
26. Poisner H
27. Preuss MH
28. Richard MA
29. Schurmann C
30. Setiawan VW
31. Sockell A
32. Vahi K
33. Verbanck M
34. Vishnu A
35. Walker RW
36. Young KL
37. Zubair N
38. Acuña-Alonso V
39. Ambite JL
40. Barnes KC
41. Boerwinkle E
42. Bottinger EP
43. Bustamante CD
44. Caberto C
45. Canizales-Quinteros S
46. Conomos MP
47. Deelman E
48. Do R
49. Doheny K
50. Fernández-Rhodes L
51. Fornage M
52. Hailu B
53. Heiss G
54. Henn BM
55. Hindorff LA
56. Jackson RD
57. Laurie CA
58. Laurie CC
59. Li Y
60. Lin DY
61. Moreno-Estrada A
62. Nadkarni G
63. Norman PJ
64. Pooler LC
65. Reiner AP
66. Romm J
67. Sabatti C
68. Sandoval K
69. Sheng X
70. Stahl EA
71. Stram DO
72. Thornton TA
73. Wassel CL
74. Wilkens LR
75. Winkler CA
76. Yoneyama S
77. Buyske S
78. Haiman CA
79. Kooperberg C
80. Le Marchand L
81. Loos RJF
82. Matise TC
83. North KE
84. Peters U
85. Kenny EE
86. Carlson CS
(2019) Genetic analyses of diverse populations improves discovery for complex traits
Nature 570:514–518.

https://doi.org/10.1038/s41586-019-1310-4
1. Wolf AB
2. Akey JM
(2018) Outstanding questions in the study of archaic hominin admixture
PLOS Genetics 14:e1007349.

https://doi.org/10.1371/journal.pgen.1007349
1. Yi X
2. Liang Y
3. Huerta-Sanchez E
4. Jin X
5. Cuo ZX
6. Pool JE
7. Xu X
8. Jiang H
9. Vinckenbosch N
10. Korneliussen TS
11. Zheng H
12. Liu T
13. He W
14. Li K
15. Luo R
16. Nie X
17. Wu H
18. Zhao M
19. Cao H
20. Zou J
21. Shan Y
22. Li S
23. Yang Q
24. Asan
25. Ni P
26. Tian G
27. Xu J
28. Liu X
29. Jiang T
30. Wu R
31. Zhou G
32. Tang M
33. Qin J
34. Wang T
35. Feng S
36. Li G
37. Huasang
38. Luosang J
39. Wang W
40. Chen F
41. Wang Y
42. Zheng X
43. Li Z
44. Bianba Z
45. Yang G
46. Wang X
47. Tang S
48. Gao G
49. Chen Y
50. Luo Z
51. Gusang L
52. Cao Z
53. Zhang Q
54. Ouyang W
55. Ren X
56. Liang H
57. Zheng H
58. Huang Y
59. Li J
60. Bolund L
61. Kristiansen K
62. Li Y
63. Zhang Y
64. Zhang X
65. Li R
66. Li S
67. Yang H
68. Nielsen R
69. Wang J
70. Wang J
(2010) Sequencing of 50 human exomes reveals adaptation to high altitude
Science 329:75–78.

https://doi.org/10.1126/science.1190371
1. Zietkiewicz E
2. Yotova V
3. Jarnik M
4. Korab-Laskowska M
5. Kidd KK
6. Modiano D
7. Scozzari R
8. Stoneking M
9. Tishkoff S
10. Batzer M
11. Labuda D
(1998) Genetic structure of the ancestral population of modern humans
Journal of Molecular Evolution 47:146–155.

https://doi.org/10.1007/PL00006371

Article and author information

Author details

Arjun Biddanda

Department of Human Genetics, University of Chicago, Chicago, United States

Contribution
Conceptualization, Data curation, Software, Investigation, Visualization, Methodology, Writing - original draft, Writing - review and editing

Competing interests
No competing interests declared

ORCID iD: 0000-0003-1861-1523

Daniel P Rice

Department of Human Genetics, University of Chicago, Chicago, United States

Contribution
Conceptualization, Software, Formal analysis, Methodology, Writing - original draft, Writing - review and editing

Competing interests
No competing interests declared

ORCID iD: 0000-0002-9509-2694

John Novembre

Department of Human Genetics, University of Chicago, Chicago, United States

Contribution
Conceptualization, Supervision, Funding acquisition, Visualization, Writing - original draft, Project administration, Writing - review and editing

For correspondence
jnovembre@uchicago.edu

Competing interests
No competing interests declared

ORCID iD: 0000-0001-5345-0214

Funding

National Institute of General Medical Sciences (R01 GM132383)

Arjun Biddanda
John Novembre

Chicago Fellows Program of the University of Chicago

Daniel P Rice

National Institute of General Medical Sciences (T32 GM07197)

Arjun Biddanda

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

The 1KGP data downloaded and used here were generated by the New York Genome Center with funds provided by NHGRI Grant 3UM1HG008901-03S1. We thank members of the Novembre Lab, especially as this project was initiated in a group hackathon with contributions from Hussein Al-Asadi, Kushal Dey, Evan Koch, Joe Marcus, Ben Peter, Mark Reppell, and Joel Smith. We also thank Jeremy Berg, Jedidiah Carlson, Anna Di Rienzo, Joe Marcus, Aaron Panofsky, Molly Przeworski, Harald Ringbauer, Noah Rosenberg, Mashaal Sohail, Matthias Steinrücken, Paul Strode, Danny Townsend, and Xin He for comments on the manuscript draft, and Brian Donovan for additional helpful conversations. We thank Chi-Chun Liu and Vivaswat Shastry for comments on the GeoVar software package. This work was completed in part with resources provided by the University of Chicago’s Research Computing Center and was supported by NIH training grant T32 GM07197 (AB), the University of Chicago ‘Chicago Fellows’ program (DPR), and NIH grant R01 GM132383.

Ethics

Human subjects: This work analyzes anonymized publicly available data consented for studies of population genetic variation.

Copyright

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.