Figures and data in Haplotypes spanning centromeric regions reveal persistence of large blocks of archaic DNA

Figures

5 figures and 1 additional file

Figure 1 with 2 supplements

Strong LD across centromeric gaps forms large-scale centromere-spanning haplotypes, or *cenhaps*.

A full resolution version of this figure is available as Figure 1—source data 2. (a) The predicted patterns of the magnitude of linkage disequilibrium (LD) (triangle at top) for a Centromere Proximal Region (CPR) in a metacentric human chromosome (bottom) in a large outbreeding population. Central blue bands represent clustered haplotypes expected if crossing-over declines to zero in and around the highly repeated α-satellite DNA (central assembly gap) and the SNP-rich flanking regions (light blue). (b) Triangle (top) shows the LD between pairs of 17702 SNPs (Left: chrX:55623011–58563685, Right: chrX:61725513–68381787; hg19) flanking the centromere and α-satellite assembly gap (red vertical line) from 1231 human male X chromosomes from the 1000 Genomes Project. The color maps (see adjacent legend) to -*log*₁₀(p), where the p value derives from the 2×2 χ² test for independence of alleles at each pair of SNPs. Below, a broad haplotypic representation of these same data. SNPs were filtered for minor allele count (MAC) ≥ 60, but not by *4gt_dco*. Minor alleles are shown in black. Poorly genotyped SNPs near edges of the gap (red line) were masked. Superpopulation (SP; **AFR**ica, **AM**e**R**icas, **E**ast **AS**ia, **EUR**ope, **S**outh **AS**ia) and scaled estimate of chrX-specific α-satellite array size (AS) are indicated at left side. Approximate position of HuRef chrX is indicated by a black asterisk at right of the tree. Dendrogram represents UPGMA clustering based on the Hamming distance between haplotypes composed of 800 filtered SNPs immediately flanking the centromere (Left: chrX:58374895–58563685, Right: chrX:61725513–61921419; hg19), indicated by red bar at bottom and shown in detail in c. The three most common X cenhaps are highlighted with colored vertical bars. (d) A UPGMA tree based on the synonymous divergence in 21 genes (see Figure 1—source data 1) in the three major chrX cenhaps (indicated in c), assuming the TMRCA of humans and chimpanzees is 6.5 MY. The bars at each node represent ±two standard deviations of distributions of estimated TMRCAs across the genes. Widths of the triangles are proportional to the *log*₁₀ of the number of members of each cenhap, and the height is proportional to the average divergence within each cenhap.
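The LD statistic plotted in panel b can be illustrated with a minimal sketch, assuming haplotypes are coded as 0/1 alleles per SNP and that no continuity correction is applied (the legend does not say either way). The function name `ld_neglog10_p` is hypothetical. For one degree of freedom, the χ² tail probability equals `erfc(sqrt(χ²/2))`, so only the standard library is needed.

```python
import math

def ld_neglog10_p(site_a, site_b):
    """-log10(p) of the 2x2 chi-square test of independence for two biallelic SNPs.

    site_a, site_b: equal-length sequences of 0/1 alleles, one entry per haplotype.
    Assumption: no continuity correction.
    """
    if len(site_a) != len(site_b):
        raise ValueError("both SNPs must be scored on the same haplotypes")
    n = len(site_a)
    # Cell counts of the 2x2 table of allele combinations.
    n11 = sum(1 for a, b in zip(site_a, site_b) if a == 1 and b == 1)
    row1, col1 = sum(site_a), sum(site_b)
    n10, n01 = row1 - n11, col1 - n11
    n00 = n - n11 - n10 - n01
    row0, col0 = n - row1, n - col1
    denom = row1 * row0 * col1 * col0
    if denom == 0:
        return 0.0  # a monomorphic site carries no LD information
    chi2 = n * (n11 * n00 - n10 * n01) ** 2 / denom
    # Survival function of chi-square with 1 degree of freedom.
    p = math.erfc(math.sqrt(chi2 / 2.0))
    return math.inf if p == 0.0 else -math.log10(p)
```

Two SNPs in complete association across 20 haplotypes give χ² = 20 and a value slightly above 5, whereas a perfectly balanced table gives 0.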

https://doi.org/10.7554/eLife.42989.002

Figure 1—source data 1. The 21 chrX coding genes in the CPR (8 left and 13 right of the centromere gap) used in the UPGMA clustering and estimation of TMRCAs. Gene models and alignments from Ensembl release 92 (April 2018). Numbers of sites divergent (human-chimp): div_sites. Numbers of sites polymorphic: polym_sites. Average nonsynonymous divergence: nonsyn_div. Average synonymous divergence: syn_div. Average nonsynonymous diversity: nonsyn_π. Average synonymous diversity: syn_π. https://doi.org/10.7554/eLife.42989.007
Download elife-42989-fig1-data1-v1.tds
Figure 1—source data 2. Full resolution version of Figure 1. https://doi.org/10.7554/eLife.42989.008
Download elife-42989-fig1-data2-v1.pdf

Figure 1—figure supplement 1

X chromosome cenhaps from phased female data align with those from haploid males.

A full resolution version of this figure is available as Figure 1—figure supplement 1—source data 1. (a) Haplotypic representation of 17702 SNPs flanking the gap in the assembly where the centromere typically forms (Left: chrX:55623011–58563685, Right: chrX:61725513–68381787; hg19) in 2542 phased human female X chromosomes (1271 individuals) from the 1000 Genomes Project. SNPs were filtered for minor allele count (MAC) ≥ 60. Minor alleles are shown in black. The assembly gap is indicated by the red line. Poorly genotyped SNPs near edges of the gap were masked (see Materials and methods). Superpopulation (SP; **AFR**ica, **AM**e**R**icas, **E**ast **AS**ia, **EUR**ope, **S**outh **AS**ia) is indicated on the left side. Tree represents UPGMA clustering based on the Hamming distance for haplotypes composed of 800 SNPs immediately flanking the centromere, indicated by red bar at bottom and shown in detail in b.
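The MAC filter and the UPGMA clustering on Hamming distance described above can be sketched as follows. This is a small illustration under stated assumptions, not the authors' pipeline: haplotypes are lists of 0/1 alleles, ties are broken by cluster order, and the helper names (`filter_by_mac`, `hamming`, `upgma`) are hypothetical. The average-linkage step recomputes the mean leaf-to-leaf distance directly, which is simple but far too slow for thousands of haplotypes.

```python
from itertools import combinations

def filter_by_mac(haplotypes, min_mac):
    """Keep only SNP columns whose minor allele count is >= min_mac."""
    n_haps = len(haplotypes)
    keep = []
    for j in range(len(haplotypes[0])):
        ones = sum(h[j] for h in haplotypes)
        if min(ones, n_haps - ones) >= min_mac:
            keep.append(j)
    return [[h[j] for j in keep] for h in haplotypes]

def hamming(h1, h2):
    """Number of SNP positions at which two haplotypes differ."""
    return sum(a != b for a, b in zip(h1, h2))

def upgma(haplotypes):
    """Average-linkage (UPGMA) tree as nested tuples of haplotype indices."""
    n = len(haplotypes)
    leaf_dist = {(i, j): hamming(haplotypes[i], haplotypes[j])
                 for i, j in combinations(range(n), 2)}
    # cluster id -> (subtree, list of member leaf indices)
    clusters = {i: (i, [i]) for i in range(n)}
    next_id = n
    while len(clusters) > 1:
        best = None
        for a, b in combinations(sorted(clusters), 2):
            members_a, members_b = clusters[a][1], clusters[b][1]
            total = sum(leaf_dist[min(i, j), max(i, j)]
                        for i in members_a for j in members_b)
            avg = total / (len(members_a) * len(members_b))
            if best is None or avg < best[0]:
                best = (avg, a, b)
        _, a, b = best
        tree_a, members_a = clusters.pop(a)
        tree_b, members_b = clusters.pop(b)
        clusters[next_id] = ((tree_a, tree_b), members_a + members_b)
        next_id += 1
    (tree, _), = clusters.values()
    return tree
```

For real data sets of the size shown here, a library implementation of average linkage would be the practical choice; the sketch only makes the definition concrete.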

https://doi.org/10.7554/eLife.42989.003

Figure 1—figure supplement 1—source data 1. Full resolution version of Figure 1—figure supplement 1. https://doi.org/10.7554/eLife.42989.004
Download elife-42989-fig1-figsupp1-data1-v1.zip

Figure 1—figure supplement 2

Filtering of chrX CPR recombinants for CDS divergence, expected heterozygosity and TMRCAs.

A full resolution version of this figure is available as Figure 1—figure supplement 2—source data 1. To more reliably infer the average divergence in the CDSs in the region, the male X chromosome haplotypes in Figure 1b with apparent ancestral exchange in the CPR were filtered out, yielding a subset of 620. Haplotypic representation of 12458 SNPs flanking the gap in the assembly (Left: chrX:55623011–58563685, Right: chrX:61725513–68381787; hg19) in these 620 male X chromosomes. Minor alleles are shown in black; the assembly gap is indicated by the red line. The three most common X cenhaps are highlighted with colored vertical bars at right. The tree is based on the UPGMA clustering of the Hamming distance of 800 SNPs immediately flanking the centromere, as indicated by the red bar at bottom.
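One way to read the TMRCA calibration used for these cenhaps is a strict molecular clock: a node's age is the 6.5 MY human-chimpanzee TMRCA scaled by the ratio of synonymous divergence at that node to human-chimpanzee synonymous divergence, computed gene by gene. The sketch below illustrates that reading only; the linear scaling, the per-gene ratio and the function names are assumptions, not a description of the authors' exact procedure.

```python
import statistics

HUMAN_CHIMP_TMRCA_MY = 6.5  # calibration stated in the Figure 1d legend

def tmrca_estimates(node_syn_div, human_chimp_syn_div,
                    calibration_my=HUMAN_CHIMP_TMRCA_MY):
    """Per-gene TMRCA estimates in MY under an assumed strict clock.

    node_syn_div: synonymous divergence between two cenhaps, one value per gene.
    human_chimp_syn_div: human-chimpanzee synonymous divergence for the same genes.
    Genes with zero human-chimpanzee divergence are skipped (assumption).
    """
    return [calibration_my * d / d_hc
            for d, d_hc in zip(node_syn_div, human_chimp_syn_div)
            if d_hc > 0]

def mean_with_two_sd(estimates):
    """Mean and the mean -/+ two standard deviations across genes."""
    mean = statistics.mean(estimates)
    sd = statistics.stdev(estimates) if len(estimates) > 1 else 0.0
    return mean, mean - 2 * sd, mean + 2 * sd
```

With per-gene divergences of 0.001 and 0.002 against a human-chimpanzee divergence of 0.01, the two estimates are 0.65 and 1.3 MY, and the interval returned by `mean_with_two_sd` corresponds to the ±two standard deviation bars described in the Figure 1d legend.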

https://doi.org/10.7554/eLife.42989.005

Figure 1—figure supplement 2—source data 1. Full resolution version of Figure 1—figure supplement 2. https://doi.org/10.7554/eLife.42989.006
Download elife-42989-fig1-figsupp2-data1-v1.pdf

Figure 2

Figure 2—source data 1. Centromere-Proximal Regions examined. The hg19 coordinates (p_begin to p_end and q_begin to q_end) of the CPRs in which SNPs in the 1000 Genomes (Phase 3) were investigated (panel b in Figure 2). Imputed haplotypes were UPGMA clustered based on filtered SNPs in a symmetrical central region immediately flanking the centromeric gap in the assembly (p_c to p_end and q_begin to p_c). https://doi.org/10.7554/eLife.42989.010
Download elife-42989-fig2-data1-v1.tds
Figure 2—source data 2. Full resolution version of Figure 2. https://doi.org/10.7554/eLife.42989.011
Download elife-42989-fig2-data2-v1.pdf

Figure 3 with 3 supplements

Archaic cenhaps are found in AMH populations.

A full resolution version of this figure is available as Figure 3—source data 3. (a) Haplotypic representation of 8816 SNPs from 5008 imputed chr11 genotypes from the 1000 Genomes Project (Left: chr11:50509493–51594084, Right: chr11:54697078–55326684; hg19). SNPs were filtered for MAC ≥ 35 and passing the *4gt_dco* with a tolerance of three (see Materials and methods). Minor alleles are shown in black and the assembly gap is indicated by the red line. Haplotypes were clustered with UPGMA based on the Hamming distance between haplotypes composed of 1000 SNPs surrounding the gap (Left: chr11:51532172–51594084, Right: chr11:54697078–54845667; hg19, indicated by red bar at bottom). Superpopulation and cenhap partitioning are indicated by bars at far left. Log₂ counts of DM (derived in archaic, shared by haplotype), DN (derived in archaic, not shared by haplotype) and AN (ancestral in archaic, not shared by haplotype) for each cenhap relative to Altai Neanderthal (NEA) and Denisovan (DEN) are shown at left. Gray horizontal bar (top) indicates the region included in the analysis of archaic content; black bars indicate SNPs with data for archaic and ancestral states. (b) Bar plots indicating the mean and 95% confidence intervals of DM, DN, AM (ancestral in archaic, shared by cenhap) and AN counts for cenhap groups (as partitioned in a and c) relative to Altai Neanderthal and Denisovan genomes, using chimpanzee as an outgroup (Speidel et al., 2019). (c) Haplotypic representation, as above, of 21950 SNPs from 5008 imputed chr12 genotypes from the 1000 Genomes Project (Left: chr12:33939700–34856380, Right: chr12:37856765–39471374; hg19). SNPs were filtered for MAC ≥ 35. Haplotypes were clustered with UPGMA based on 1000 SNPs surrounding the gap (Left: chr12:34821738–34856670, Right: chr12:37856765–37923684; hg19). Bars at side, top and bottom are the same as in a. (d) A UPGMA tree based on the synonymous divergence for 30 genes in the seven major chr11 cenhaps (see Figure 3—source data 2), assuming the TMRCA of humans and chimpanzees is 6.5 MY (see Materials and methods and legend for Figure 1d). The error bars at each node represent ±two standard deviations of distributions of estimated TMRCAs across the genes.
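The DM, DN, AM and AN categories defined in panels a and b can be made concrete with a short sketch. It assumes one allele per site for the ancestral state (chimpanzee outgroup), the archaic genome and the modern haplotype, and it skips sites where any of the three is missing (coded here as `None`). Both choices, and the function names, are illustrative assumptions.

```python
from collections import Counter

def classify_site(ancestral, archaic, haplotype):
    """Return 'DM', 'DN', 'AM' or 'AN' for one SNP, or None if data are missing.

    D/A: the archaic allele is derived or ancestral relative to the outgroup.
    M/N: the modern haplotype matches or does not match the archaic allele.
    """
    if ancestral is None or archaic is None or haplotype is None:
        return None
    state = "D" if archaic != ancestral else "A"
    match = "M" if haplotype == archaic else "N"
    return state + match

def count_classes(ancestral_alleles, archaic_alleles, haplotype_alleles):
    """Tally the four categories over all sites with complete data."""
    counts = Counter({"DM": 0, "DN": 0, "AM": 0, "AN": 0})
    for site in zip(ancestral_alleles, archaic_alleles, haplotype_alleles):
        label = classify_site(*site)
        if label is not None:
            counts[label] += 1
    return counts
```

Calling `count_classes` once per haplotype against the Neanderthal or Denisovan alleles, then averaging within each cenhap group, would give per-group counts of the kind summarized in panel b.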

https://doi.org/10.7554/eLife.42989.012

Figure 3—source data 1. The 37 chr11 coding genes in the CPR (2 left and 35 right of the centromere gap) used in the UPGMA clustering and estimation of TMRCAs. Gene models and alignments from Ensembl release 92 (April 2018). Numbers of nonsynonymous differences in the two basal cenhaps (1, 2 and both, 1_&_2; see Figure 3) from the other cenhaps of the 5008 imputed chr11 CPR haplotypes (see Materials and methods). Numbers of sites divergent (human-chimp): div_sites. Numbers of sites polymorphic: polym_sites. Average nonsynonymous divergence: nonsyn_div. Average synonymous divergence: syn_div. Average nonsynonymous diversity: nonsyn_π. Average synonymous diversity: syn_π. https://doi.org/10.7554/eLife.42989.019
Download elife-42989-fig3-data1-v1.tds
Figure 3—source data 2. The eight chr8 coding genes in the CPR (8 left and 0 right of the centromere gap) used in the UPGMA clustering and estimation of TMRCAs. Gene models and alignments from Ensembl release 92 (April 2018). Numbers of sites divergent (human-chimp): div_sites. Numbers of sites polymorphic: polym_sites. Average nonsynonymous divergence: nonsyn_div. Average synonymous divergence: syn_div. Average nonsynonymous diversity: nonsyn_π. Average synonymous diversity: syn_π. https://doi.org/10.7554/eLife.42989.020
Download elife-42989-fig3-data2-v1.tds
Figure 3—source data 3. Full resolution version of Figure 3. https://doi.org/10.7554/eLife.42989.021
Download elife-42989-fig3-data3-v1.pdf