Dynamic molecular evolution of a supergene with suppressed recombination in white-throated sparrows

  1. Hyeonsoo Jeong
  2. Nicole M Baran
  3. Dan Sun
  4. Paramita Chatterjee
  5. Thomas S Layman
  6. Christopher N Balakrishnan
  7. Donna L Maney  Is a corresponding author
  8. Soojin V Yi  Is a corresponding author
  1. School of Biological Sciences, Georgia Institute of Technology, United States
  2. Department of Psychology, Emory University, United States
  3. Department of Ecology, Evolution, Marine Biology, University of California, Santa Barbara, United States
  4. Department of Medicine Huddinge, Karolinska Institutet, Sweden
  5. Department of Biology, East Carolina University, United States

Abstract

In white-throated sparrows, two alternative morphs differing in plumage and behavior segregate with a large chromosomal rearrangement. As with sex chromosomes such as the mammalian Y, the rearranged version of chromosome two (ZAL2m) is in a near-constant state of heterozygosity, offering opportunities to investigate both degenerative and selective processes during the early evolutionary stages of ‘supergenes.’ Here, we generated, synthesized, and analyzed extensive genome-scale data to better understand the forces shaping the evolution of the ZAL2 and ZAL2m chromosomes in this species. We found that features of ZAL2m are consistent with substantially reduced recombination and low levels of degeneration. We also found evidence that selective sweeps took place both on ZAL2m and its standard counterpart, ZAL2, after the rearrangement event. Signatures of positive selection were associated with allelic bias in gene expression, suggesting that antagonistic selection has operated on gene regulation. Finally, we discovered a region exhibiting long-range haplotypes inside the rearrangement on ZAL2m. These haplotypes appear to have been maintained by balancing selection, retaining genetic diversity within the supergene. Together, our analyses illuminate mechanisms contributing to the evolution of a young chromosomal polymorphism, revealing complex selective processes acting concurrently with genetic degeneration to drive the evolution of supergenes.

Editor's evaluation

In this important paper, the authors generate and analyze new genome and gene expression data to understand better the evolution of the white-throated sparrow supergene region, which contains 1000 genes and determines whether a bird has a tan or a white stripe. The study convincingly illustrates how the cessation of recombination that results from a chromosomal inversion can become a source of evolutionary novelty. The lack of recombination can result in the accumulation of deleterious variation leading to degeneration, but it can also (as here) facilitate genomic diversification and adaptation. The results will be of interest to a broad array of researchers studying genome architecture and phenotypic diversity and evolution.

https://doi.org/10.7554/eLife.79387.sa0

Introduction

Supergenes comprise closely linked genetic variants that are maintained due to suppressed recombination (Charlesworth, 2016; Thompson and Jiggins, 2014). Their evolution presents an interesting paradox, in that the suppression of recombination that occurs inside supergenes reduces the efficacy of natural selection, leading to genetic degeneration. At the same time, supergenes are associated with dramatically divergent, adaptive phenotypes. These divergent phenotypes, which include classic examples of Batesian mimicry and self-incompatibility in flowering plants (Charlesworth, 2016; Thompson and Jiggins, 2014; Otto and Lenormand, 2002) and striking polymorphisms in social behavior (Wang et al., 2013; Huang et al., 2018; Yan et al., 2020; Martinez-Ruiz et al., 2020; Farrell et al., 2013; Küpper et al., 2016; Lamichhaney et al., 2016), have long inspired both theoretical and empirical studies of their evolution. Recent genome-scale studies have illuminated wide-ranging impacts of supergene evolution on complex phenotypes across diverse taxa (e.g. Schwander et al., 2014; Pearse et al., 2019; Hager et al., 2022; Joron et al., 2011; Kunte et al., 2014; Kess et al., 2019; Lundberg et al., 2017; Roberts et al., 2009; Sanchez-Donoso et al., 2022; Funk et al., 2021). Currently, the mechanisms by which functionally divergent supergene haplotypes evolve in the face of multiple evolutionary forces remain poorly understood, presenting a critical gap in knowledge.

One notable example of a supergene associated with social behavior is found in white-throated sparrows (Zonotrichia albicollis), in which a large supergene co-segregates with parental behavior and aggression (Tuttle, 2003; Maney et al., 2015; Horton et al., 2013; Tuttle et al., 2016; Sun et al., 2018; Merritt et al., 2020). White-throated sparrows occur in two alternative plumage morphs, white- and tan-striped (Lowther, 1961). These morphs differ not only in their plumage coloration, but also in their social behavior, with white-striped birds exhibiting increased aggression and more frequent extra-pair copulations, and tan-striped birds engaging in more parental care compared with birds of the white-striped morph (Tuttle, 2003; Maney et al., 2015; Maney, 2008; Horton et al., 2012). These alternative morphs are linked to a large (~100 Mbp, >1 k genes) rearrangement on the second largest chromosome, called ZAL2m, so named because the rearranged chromosome is metacentric. White-striped birds are heterozygous for ZAL2m and the sub-metacentric chromosomal arrangement, ZAL2, whereas tan-striped birds are homozygous for ZAL2 (30, 31).

In addition to highly divergent social behavior, this relatively young supergene (estimated to have arisen 2–3 million years ago Tuttle et al., 2016; Thomas et al., 2008; Huynh et al., 2010), is also associated with a remarkable disassortative mating system that maintains the ‘balanced’ morph frequencies in the population. Almost all breeding pairs consist of one bird of each morph, earning the species the moniker ‘the bird with four sexes’ (Campagna, 2016). Breeding pairs consisting of two individuals of the same morph are estimated to occur less than 1% of the time (Tuttle et al., 2016; Thorneycroft, 1966) and only six ZAL2m homozygotes (i.e. ‘super-white’ birds) have ever been identified (Horton et al., 2013; Tuttle et al., 2016; Thorneycroft, 1975; Falls and Kopachena, 2020) out of thousands of birds karyotyped or genotyped. Given that ZAL2m exists in a near-constant state of heterozygosity, it is in a state of suppressed recombination, similar to the Y and W sex chromosomes in mammals and birds, respectively. The suppression of recombination on ZAL2m is expected to reduce the efficacy of natural selection, leading to reduced genetic diversity and the degeneration of the chromosome (Barton and Charlesworth, 1998; Charlesworth, 2012). On the other hand, the tight linkage of alleles within the ZAL2m supergene may contribute to adaptive phenotypes (Tuttle et al., 2016; Sun et al., 2018; Maney et al., 2020). Therefore, this system provides a unique opportunity to investigate the evolution of a supergene underlying social and mating behavior (Tuttle et al., 2016; Sun et al., 2018; Merritt et al., 2020; Sun et al., 2021).

Here we aim to better understand the evolutionary forces shaping the ZAL2 and ZAL2m chromosomes. Our goal was to address two unanswered questions. First, to what extent has ZAL2m degenerated? Early analyses of the rearrangement (Davis et al., 2011) did not show signals of degeneration, such as pseudogenization or the accumulation of repetitive sequences. However, Tuttle et al., 2016 found a weak signal of excess non-synonymous polymorphism for genes inside the rearranged region on ZAL2m and reduced allelic expression for ZAL2m genes, which could be consistent with functional degradation of ZAL2m (Tuttle et al., 2016). (Sun et al., 2018) similarly found a slightly higher number of non-synonymous substitutions and an increased ratio of non-synonymous to synonymous substitution rates (dN/dS) on ZAL2m compared with ZAL2. (Sun et al., 2018) also found reduced expression of ZAL2m alleles in brain tissue, perhaps suggesting that the accumulation of deleterious mutations has led to reduced expression of genes from ZAL2m. Their additional finding of reduced accumulation of mutations in functional regions suggested that ZAL2m has, in fact, experienced purifying selection to remove deleterious alleles. Thus, while there is some evidence that ZAL2m has degenerated, these results have been inconsistent and somewhat inconclusive.

Second, what are the selective forces shaping the genomic landscapes of both ZAL2 and the ZAL2m supergene? The signals of both purifying and positive selection have been relatively weak in previous genomic analyses of ZAL2 and ZAL2m (Tuttle et al., 2016; Sun et al., 2018). Yet, by definition, ZAL2m must contain variation that underlies the differences between the white- and tan-striped morphs (Fisher, 1931; Bull, 1983; Charlesworth and Charlesworth, 1980; Rice, 1987b; Rice, 1987a). There is already some evidence that this variation affects behavior; allelic differences in the promoter region of the gene encoding estrogen-receptor alpha (ESR1) are likely to alter expression (Merritt et al., 2020), and the expression of this gene was shown to be necessary for aggressive behavior typical of the white-striped morph (Merritt et al., 2020). ZAL2m is also associated with differential expression of a key neuromodulator, vasoactive intestinal peptide (Horton et al., 2020), known to be causal for aggression in songbirds (Goodson et al., 2012).

Investigations of young heteromorphic sex chromosomes suggest that the accumulation of sexually antagonistic genes (i.e. genes that are beneficial to one sex and harmful to the other) may in fact drive the evolution of sex chromosomes (Bachtrog, 2004; Bachtrog, 2006; Zhou and Bachtrog, 2012). For example, positive selection at a small number of antagonistic alleles was shown to be a potent force shaping evolution of the young Y chromosomes in Drosophila miranda (Bachtrog, 2004) even in the face of degeneration of other genes elsewhere on the chromosome. In white-throated sparrows, evidence of positive selection on both ZAL2 and ZAL2m has been quite limited (Tuttle et al., 2016; Sun et al., 2018). Nonetheless, the discovery of ZAL2- and ZAL2m-specific alleles that benefit the tan- and white-striped morphs, respectively (Merritt et al., 2020; Horton et al., 2020), suggests that antagonistic selection likewise contributes to the evolution of both ZAL2 and ZAL2m.

In addition to antagonistic selection, balancing selection may be implicated in the evolution of ZAL2m. The negative assortative mating system in white-throated sparrows, which maintains the chromosomal polymorphism, is a canonical example of balancing selection (Huynh et al., 2010). However, balancing selection is also a way of maintaining advantageous genetic diversity in populations, which may be especially critical in the context of a non-recombining chromosome. Indeed, balancing selection appears to be more common in self-fertilizing (selfing) vs non-selfing species, which are likewise characterized by reduced genetic diversity, increased linkage disequilibrium, and reduced efficacy of selection (Glémin, 2021; Glémin et al., 2019; Delph and Kelly, 2014; Gaut et al., 2015). Therefore, balancing selection may maintain multiple alleles inside non-recombining regions of chromosomes like ZAL2m.

Previous studies have been limited in the extent to which they could test directly for degeneration, adaptive changes on ZAL2m, and selection at the genome level. These limitations stemmed from low sample sizes of sequencing data, the reduced intraspecies variability, and a low-quality ZAL2m assembly that prevented detection of long-range haplotypes (Tuttle et al., 2016; Sun et al., 2018; Thomas et al., 2008). Here, we overcome these challenges by analyzing extensive genomic, transcriptomic, and population data, providing insight into the evolution of young supergenes.

Results

Novel and extensive genomic and population data from white-throated sparrows

To better understand the evolutionary history of the ZAL2m chromosomal rearrangement, we generated additional sequence data from a rare, ‘super-white’ (ZAL2m homozygote) bird (Horton et al., 2013; Sun et al., 2018). We generated variable fragment size libraries consisting of 150 bp paired-end reads (insert size of 300 bp and 500 bp) and 125 bp mate pair reads (insert size of 1 kb, 4–7 kb, 7–10 kb, and 10–15 kb). We performed whole-genome sequencing of an additional 62 birds (49 white-striped birds and 13 tan-striped birds sampled from a variety of locations around the U.S.) (Materials and methods, Supplementary file 1). White-striped birds, which are heterozygous for the rearrangement (ZAL2/2m), were sequenced at higher coverage than tan-striped birds (ZAL2 homozygotes) so that we could obtain sufficient reads to separate ZAL2 and ZAL2m alleles in white-striped individuals (average mean depth coverages were 41.5 × vs 28.4 × for white- and tan-striped birds, respectively, Supplementary file 2). Genomic variants were called according to the guidelines of Genome Analysis Toolkit (GATK) (ver. 4.1) (Materials and methods), leading to the discovery of a total of 11,382,994 single nucleotide polymorphisms (SNPs). None of the samples showed evidence of family relationships when we computed relatedness estimates between individuals. Consequently, we used all samples in the subsequent analyses. We found a significantly higher number of polymorphic sites within white-striped birds than tan-striped birds exclusively for ZAL2/2m chromosomal regions (Figure 1—figure supplement 1). Nucleotide diversity of the rearranged region of the ZAL2/2m chromosomes was elevated in white-striped birds compared with tan-striped birds, suggesting distinctive patterns between the two plumage morphs (Figure 1a).

Figure 1 with 2 supplements see all
Genomic data from newly sequenced tan- and white-striped birds.

(A) Nucleotide diversity of macro-chromosomes for tan-striped (TS) and white-striped (WS) birds. White-striped birds (ZAL2/2m) show elevated nucleotide diversity for the ZAL2/2m inverted (INV, i.e. rearranged) regions (ZAL2/2m inv), while TS birds (ZAL2/2) show overall reduced nucleotide diversity for the inverted regions compared with other chromosomes. Note that panel (A) shows the comparison across morph. The comparison across the ZAL2 and ZAL2m alleles is shown in Figure 2a. (B) Scatterplots of eigenvector 1 (PC1) and eigenvector 2 (PC2) from principal component analysis of all single-nucleotide variants (left panel). (C) Principal component analysis (PCA) excluding single nucleotide polymorphisms (SNPs) on the ZAL2 chromosomes (right panel). The sex chromosomes and the ZAL3 chromosome (which includes an additional chromosomal inversion) were excluded from both PCA analyses. Note that ‘location’ here refers to the site of collection or capture of the bird: Georgia (GA), Illinois (IL), or Maine (ME). Breeding locations for GA and IL birds are unknown.

Figure 1—source data 1

Nucleotide diversity between tan- and white-striped birds.

Figure 1B and C: Supplementary file 1 (PCAs performed using variant call format (vcf) data from whole genome sequencing).

https://cdn.elifesciences.org/articles/79387/elife-79387-fig1-data1-v1.txt

Among the total SNPs identified, 12.6% (N=1,439,991) resided on scaffolds we have previously assigned to the ZAL2/2m chromosome (Sun et al., 2018). Principal component analysis (PCA) of these ZAL2/2m SNPs revealed distinct clusters corresponding to the morphs (Figure 1b). The first principal component (PC1), which explained 6.7% of the variation in the data, clearly separated tan- and white-striped birds, with the lone super-white individual (ZAL2m/2m homozygote) as a clear outlier. In contrast, other available phenotypic information, including sex and geographic origin of samples, did not show meaningful variation with the principal components, and other PCs had little explanatory power (Figure 1b). Tests for admixture also failed to identify significant population substructures by geographical origin of samples (Figure 1—figure supplement 2). This lack of population structure is unsurprising, as 35 of the 63 samples (56%) were from birds that were migrating, and, thus, the breeding location of these birds is unknown.

Features of the ZAL2m chromosome consistent with reduced efficacy of natural selection and low levels of recombination

We examined several genomic features of the ZAL2m chromosome using the additional genomic resources we generated. We first performed a de novo genome assembly of the super-white bird, employing newly generated sequence data, to study the ZAL2m chromosome with an assembly derived entirely from a bird homozygous for the ZAL2m chromosome. The total assembly size was 1058 Mbp (N50 length of 3.1 Mbp, longest scaffold 27 Mbps), comparable to that of the ZAL2/2 reference assembly (1052 Mbp, N50 scaffold length of 4.86 Mbp, longest scaffold 45 Mbp) (see Supplementary file 3 for more details). There were 160 putatively ZAL2m-linked scaffolds (Materials and methods), with a total length (110.99 Mbp) comparable with that of ZAL2-linked scaffolds from the reference assembly (108.5 Mbp Tuttle et al., 2016). Despite this similarity in total length, however, the average length of the individual ZAL2m-linked scaffolds was significantly shorter than scaffolds on other chromosomes in the super-white assembly (p<0.001, Mann-Whitney U-test). It was also shorter than the average scaffold length on the ZAL2 chromosome in the ZAL2/2 reference assembly (Figure 2a). We did not observe such a pattern in the other chromosomes of similar size when comparing between the two assemblies (Figure 2a). This result was consistent with the presence of repetitive DNA sequences on ZAL2m causing more assembly breaks compared with the ZAL2/2 reference genome. We found evidence that the ZAL2m chromosome contained more repeat elements and was especially enriched for long terminal repeat elements (2.4 Mbp vs 2.1 Mbp) and interspersed repeats (5.8 Mbp vs 5.5 Mbp), compared with the ZAL2 chromosome. The number of these repeat elements is likely to be underestimated, given that the ZAL2m assembly is highly fragmented. Additionally, we found that ZAL2 and ZAL2m had accumulated a higher proportion of structural variants (insertions and deletions) compared with other chromosomes (Figure 2b).

Figure 2 with 1 supplement see all
Genetic divergence between ZAL2 and ZAL2m chromosomes.

(A) The scaffolds for the ZAL2m chromosome in the super-white (SWS) assembly tend to be fragmented compared with those for the ZAL2 chromosome in the tan-striped (TS) assembly. ** p<0.001 (Mann-Whitney U-test); ns, not significant (B) Fraction of structural variants (SV), both insertion and deletion events, for the 4 largest chromosomes, using the tan-striped assembly as a reference. The fraction of SV is computed as a total base affected by variants divided by the length of the chromosome. (C) Number of fixed mutations derived in ZAL2 and ZAL2m in protein-coding regions (D) Sliding window (window size of 20 genes with step size of 5 genes) analysis of the ratio of nonsynonymous to synonymous nucleotide diversity (πNS) within the ZAL2 and ZAL2m chromosomes. The ZAL2m outlier region is highlighted (colored background). (E) Site frequency spectrum of polymorphic sites. (F) Decay of linkage disequilibrium. (G) Proportion of the ZAL2m alleles expressed for each tissue set. The proportion of the ZAL2m alleles expressed is less than the null hypothesis of 0.5 for all tissues except nestling AMV using false discovery rate (FDR) correction. Hyp, hypothalamus; AMV, ventromedial arcopallium.

Figure 3 with 5 supplements see all
Genetic diversity and patterns of divergence across the rearranged region of the ZAL2m chromosome and in the ZAL2m outlier region.

(A) Tajima’s D and nucleotide diversity across the ZAL2 and ZAL2m chromosomes. The ZAL2m outlier region is highlighted (colored background). (B) Phylogenetic tree of randomly selected regions (left panel) and the ZAL2m outlier region (right panel). The ZAL2m chromosome shows multiple haplotype structures and has longer branch lengths within the population compared with ZAL2 chromosomes. (C) Single nucleotide polymorphism (SNP) genotype plot of a scaffold inside the ZAL2m outlier region (Scaffold NW_005189516.1, 1900001–1950001). The plot shows two haplogroups. Major allele SNPs (A, same genotype as the super-white ZAL2m/2m genome) are represented in purple, and minor allele SNPs (a, different from the super-white genome) in red. Tan indicates that there were no fixed SNPs to differentiate ZAL2 vs ZAL2m reads, resulting in missing data. (D) Genetic divergence (dXY) for a portion of the rearrangement. dXY between the ZAL2 chromosome and haplogroup 1 (H1) is plotted in light blue, between ZAL2 and haplogroup 2 (H2) in dark blue, and between H1 and H2 in light green.

Figure 3—source data 1

RAxML bipartitions for scaffold 5189516.

Figure 3A: Supplementary file 1 (Tajima’s D and nucleotide diversity plots created from variant call format (vcf) data from whole genome sequencing).

https://cdn.elifesciences.org/articles/79387/elife-79387-fig3-data1-v1.txt
Figure 3—source data 2

RAxML bipartitions for scaffold 5190802.

https://cdn.elifesciences.org/articles/79387/elife-79387-fig3-data2-v1.txt
Figure 3—source data 3

Genotype data for scaffold 5189516.

https://cdn.elifesciences.org/articles/79387/elife-79387-fig3-data3-v1.txt
Figure 3—source data 4

dXY between ZAL2m haplotypes and ZAL2.

https://cdn.elifesciences.org/articles/79387/elife-79387-fig3-data4-v1.txt

Next, using the large amount of newly generated population genomic data, we examined patterns of SNPs on ZAL2m alleles separately from those on ZAL2 via haplotype phasing using fixed differences between the two chromosome types (Materials and methods). We found that the total number of genetic variants was approximately 12-fold reduced on the ZAL2m alleles compared with ZAL2 alleles (29,420 vs 367,466 SNPs and 3,921 vs 32,479 indels on ZAL2m and ZAL2, respectively, after excluding singleton variants). The mean nucleotide diversity was similarly reduced on the ZAL2m chromosome compared with the ZAL2 chromosome (0.0104% vs 0.1141% for ZAL2m vs ZAL2, respectively).

Analyses of the genetic variants on the ZAL2 and ZAL2m alleles showed evidence of only weak genetic degeneration. The ratio of non-synonymous to synonymous fixed differences inside the rearranged region was slightly, but significantly, elevated for ZAL2m-derived compared with ZAL2-derived fixed differences (Figure 2c), which is consistent with either positive selection or inefficient purifying selection on ZAL2m. We found that the ratio of non-synonymous to synonymous nucleotide diversity (πNS) was significantly increased on ZAL2m compared with ZAL2 (p=1.3 × 10–13, Mann-Whitney U-test, Figure 2d). The minor allele site frequency spectrum (SFS) for the ZAL2m synonymous and non-synonymous sites showed a large proportion of singleton variants and an irregular decay of allele frequency as the minor allele count increases (Figure 2e), also suggesting reduced efficacy of purifying selection on ZAL2m.

Assuming the mutation rates of the two chromosomes are similar, the ratio of effective population size (Ne) between the ZAL2m and ZAL2 can be approximated by the ratio of nucleotide diversity of synonymous sites between the ZAL2 and ZAL2m. The proportion of Ne between ZAL2m and ZAL2 is 0.12 (±0.01), which is threefold lower than the expected ratio of 0.33 (because the ZAL2m chromosome is 1/3 as frequent as the ZAL2, given ‘balanced’ morph frequencies observed in the wild). This lower proportion suggests that Ne of ZAL2m has undergone further reduction than expected from the census size in the population, consistent with the effects of reduced recombination. We noted that the linkage disequilibrium between variants on ZAL2m exhibited the classic decay with distance (Figure 2f), indicating at least some level of recombination. Taken together, these results are consistent with reduced, but not entirely eliminated, recombination on the ZAL2m chromosome.

We next examined whether degeneration inside the rearranged region on ZAL2m has resulted in globally reduced expression of the alleles carried by the ZAL2m supergene variant. To do so, we used multiple large RNAseq datasets from a variety of tissues in birds sampled from different geographic locations and times of year (see Materials and methods, Table 1). As predicted and consistent with what was previously reported (Sun et al., 2018), we found evidence of consistently reduced expression of the ZAL2m alleles in 8/10 types of tissue (Figure 2g). We next tested for an association between the number of accumulated mutations (non-synonymous, synonymous, and in the promoter region) on ZAL2m and allelic bias (AB) in expression of the ZAL2m alleles within each tissue, which would link genetic degeneration within or near genes to reduced expression of ZAL2m. We found evidence that allelic bias in gene expression was associated with the rate of non-synonymous fixed differences (X2(1)=9.97, p=0.00159), although the effect size was exceedingly small (marginal r2=0.0020) (Figure 2—figure supplement 1a). Neither the rate of synonymous fixed differences (X2(1)=1.0098, p=0.315, Figure 2—figure supplement 1b) nor the number of fixed differences within 1 kb upstream of the transcription start site (X2(1)=0.8992, p=0.343, Figure 2—figure supplement 1c) were associated with allelic bias. Thus, the overall reduction in expression of the alleles carried by the ZAL2m supergene is weakly associated with an increased number of non-synonymous fixed nucleotide changes within genes. The limited and weak nature of the effect suggests, however, that the pattern of gene expression may have been affected also by other factors, for example ongoing selection (thus manifested as nucleotide polymorphism), selection at more distal sites, and/or epigenomic mechanisms, such as differences between ZAL2 and ZAL2m in DNA methylation or histone modification (see Sun et al., 2021).

Table 1
List of RNA sequencing data sets.
TissueSample size (WS/total)Collection detailsSource
Adult malesBrain (Hyp, AMV)9/20Collected early in the breeding seasonZinzow-Kramer et al., 2015; 
Sun et al., 2018
Accession: GSE77186
Adult femalesBrain (Hyp, AMV)6/11Collected early in the breeding seasonAccession: PRJNA657006
Nestlings (both sexes)Brain (Hyp, AMV)16/32Collected from nests during the breeding seasonAccession: PRJNA657006
Adult males (all white-striped)Heart and Liver20/20Collected during fall migration, then housed in captivity on either long or short days to simulate breeding vs non-breeding; collected at two time points during the dayHorton et al., 2019
Accession: GSE116989

Evidence of regional balancing selection on the ZAL2m chromosome

Although the level of genetic diversity was overall reduced on ZAL2m, it was exceptionally elevated in one region corresponding to ~3 Mbps spanning 5 scaffolds. This region, henceforth referred to as the ZAL2m outlier region (Figures 2 and 3), includes at least 15 protein-coding genes that are well conserved as single copy genes across 13 avian species (Table 2). On average, nucleotide diversity in ZAL2m across this region was 0.001, which is tenfold higher than the mean nucleotide diversity of ZAL2m and even exceeds the nucleotide diversity in the corresponding region within ZAL2 by 3.15-fold.

Table 2
List of protein-coding genes inside the ZAL2m outlier region.
GeneScaffoldStartEndπ ZAL2π ZAL2mTaDZAL2TaD ZAL2mDXY
KCNS3NW_005081621.1970891105122.75E-048.43E-04–1.2953–0.86580.011281
MSGN1NW_005081621.1160375160897NA6.10E-04NA–0.11380.003056
GEN1NW_005081621.11757911982453.52E-042.22E-03–1.13021.17280.011647
SMC6NW_005081621.11984522441363.94E-042.27E-03–1.00880.85480.011901
MYCNNW_005081621.1117949211847612.59E-046.18E-04–0.58980.21160.005173
DDX1NW_005081621.1143269714525353.41E-042.42E-03–1.58190.87520.014699
NBASNW_005081621.1145460116155802.93E-042.17E-03–1.62712.22910.012296
TRIB2NW_005081621.1259617826169503.45E-041.89E-04–1.0009–0.83760.011498
LPIN1NW_005081621.1301215330612032.73E-043.27E-04–1.7799–0.72520.015295
GREB1NW_005081621.1310081431651862.25E-041.72E-04–1.8299–0.73510.01458
E2F6NW_005081582.124475465771.90E-041.59E-04–1.5978–1.02191.4E-02
ROCK2NW_005081582.1509931551702.62E-046.34E-04–1.46–0.23621.4E-02
KCNF1NW_005081582.13314243339919.85E-052.41E-04–0.2519–0.86415.9E-03
PDIA6NW_005081582.14158534319193.74E-041.49E-04–1.1579–1.25211.2E-02
ATP6V1C2NW_005081582.14317664548862.56E-046.94E-04–1.7535–0.23911.4E-02

To first examine whether this region has recently experienced an introgression event that could have caused the observed patterns, we constructed a phylogenetic tree of this region. The resulting tree exhibited the same topology as those from other regions of the ZAL2m, thus providing no support for introgression (Figure 3b). We have not yet resolved the distribution of repetitive elements in the two chromosomes. Thus, to avoid errors in the phylogenetic analysis resulting from the rapid turnover of repetitive elements, we also performed the analysis using only the exons of single-copy orthologous genes, similar to analyses done by Stolle et al., 2022. The results (Figure 3—figure supplement 1) did not provide support for an introgression event and therefore did not change our conclusions. In addition, we calculated the D-statistic (ABBA-BABA test, which tests for genomic regions that are discordant with the species tree Martin et al., 2015) to test for evidence of gene flow from another species from the same genus, Harris’ sparrow (Zonotrichia querula). Sliding window estimates of the D-statistic did not show any differences in patterns between haplotypes (Figure 3—figure supplement 2), suggesting that the multiple haplotype structures in ZAL2m were not introduced by gene flow from this closely related species or from the ZAL2 chromosome. Nonetheless, the ZAL2m outlier region showed longer branch lengths than did both the corresponding region on the ZAL2 chromosome and a randomly selected region of ZAL2m, reflecting the high genetic diversity within this region (Figure 3b). Tajima’s D was significantly higher throughout the ZAL2m outlier region compared with the genomic background (p<0.001, permutation test) (Figure 3a).

The increase of nucleotide diversity, high Tajima’s D, and long branch length all suggest that balancing selection has impacted the ZAL2m outlier region. Concordantly, we identified a signature of balancing selection in the ZAL2m outlier region on the basis of allele frequency spectrum (β statistics) computed using BetaScan, which detects signatures of balancing selection by looking for an excess of genomically proximate SNPs with a similar allele frequency (Siewert and Voight, 2020). Visual examinations of the ZAL2m outlier region also revealed the presence of potential long-range haplotypes that likely reflect long-term balancing selection and several recombination events within the haplotypes (Figure 3c). Notably, we identified two major putative haplotypes, which we refer to here as 1 and 2, consisting of 17 and 30 ZAL2m chromosomes, respectively. A possible third haplogroup was also identified, although the sample size (n=2) was too small to warrant further analyses. The genetic differentiation between the two ZAL2m outlier haplogroups was lower than the differentiation between either haplogroup and ZAL2 (Figure 3d), indicating that the haplogroups have evolved since the split between ZAL2 and ZAL2m. This interpretation is also consistent with the inferred phylogeny (Figure 3b). Note that the sequencing depth for the two haplogroups did not differ significantly (p=0.78, Figure 3—figure supplement 3). The ZAL2m outlier regions exhibited an excess of non-synonymous SNPs compared with synonymous SNPs (Figure 2d), as well as an excess of intermediate frequency minor alleles (Figure 3—figure supplement 4). Neither geographic location of collection nor sex was associated with any distinct patterns by haplogroups in a PCA analysis using genetic variants in the region (Figure 3—figure supplement 5; although note the breeding location of the birds from two of the locations was unknown). Together, these results solidly implicate balancing selection in the evolution of the ZAL2m outlier region. In other words, multiple haplotypes of the ZAL2m chromosome are maintained within the population of white-striped birds via balancing selection.

We looked for phenotypic consequences of the putative ZAL2m haplogroups using the subset of white-striped birds from the sequencing samples for which we had phenotypic data (n=6 for haplogroup 1, n=12 for haplogroup 2, see Zinzow-Kramer et al., 2015; Horton et al., 2014 for details). White-striped birds with the H1 vs the H2 versions did not differ significantly in aggressive behavior (measured during simulated territorial intrusions), gonad size, tarsus length, cloacal protuberance volume, or any other measure when using the Benjamini-Hochberg correction for multiple testing to avoid false positives (more than 50 tests were performed).

We next tested for effects of haplogroup on the expression of genes inside the ZAL2m outlier region. In the subset of white-striped birds for which we had haplogroup information, we examined RNAseq data from two brain regions, the hypothalamus (Hyp) and the ventromedial arcopallium (AMV, the functional homolog of the medial amygdala, formally named the nucleus taeniae of the amygdala) (see Materials and methods and Supplementary file 4 for details). We found that the gene GREB1 (growth regulating estrogen receptor binding 1) was more highly expressed in haplogroup 2 (unadjusted p=0.0195) in AMV. Additionally, there was a trend for the gene KCNS3 to be more highly expressed in both Hyp (unadjusted p=0.0511) and AMV (unadjusted p = 0.1567). This gene encodes a voltage-gated channel subunit that in humans and mice is specific to fast-spiking parvalbumin (inhibitory) neurons (Georgiev et al., 2014; Miyamae et al., 2021). Although no genes were significantly differentially expressed at the genome-wide level, our current findings provide potentially interesting candidates to explore further in the context of behavioral differences between the morphs.

Candidates for positive selection in ZAL2 and ZAL2m

Our work has provided a refined map of genetic differentiation between the ZAL2 and ZAL2m chromosomes (Figure 4—figure supplement 1). Interestingly, three scaffolds near the chromosomal end (i.e. at the beginning of the p-arm, as well as at the end of the q-arm in the genomic coordinate of the ZAL2 chromosome), exhibited both decreased FST values and a reduced rate of fixed genetic differences (Df). One possible explanation for these findings is that these regions represent a younger evolutionary stratum, which is possible because ZAL2m contains a series of nested inversions that occurred at different times (Lahn and Page, 1999; Xu et al., 2019; Bergero et al., 2007). Another possibility is that the inflated within-ZAL2 nucleotide diversity (see Figure 3a) within the sampled population reduced the FST estimate.

By leveraging our resource of a haplotype map of both the ZAL2 and ZAL2m chromosomes, we investigated signatures of positive selection in each chromosome. We identified 216 20 kB regions on ZAL2 exhibiting a significantly elevated H-statistic (empirical p-values <0.05), which is a measure of the average length of the stretches of linkage disequilibrium between the haplotypes (Schlamp et al., 2016; Messer Lab — Resources, 2014, Materials and methods). One notable stretch, which included four top candidate 20 kB regions (Scaffold NW_005081582.1, 480–520 kb and 920–960 kb) showed evidence of positive selection on both the ZAL2 and ZAL2m chromosomes (blue shaded area in Figure 4—figure supplement 2). On the basis of previous literature (Thomas et al., 2008), this region on ZAL2m may be placed on the end of that chromosome. The elevated H-statistic in this region cannot be explained by the recombination rate, as recombination is low across the entire length of ZAL2m (Figure 4—figure supplement 3). On the ZAL2 chromosome, in contrast, recombination on the chromosome ends is increased (Figure 4—figure supplement 3), as has been reported in other species (Nachman and Churchill, 1996; Barton et al., 2008). We did not, however, observe an elevated H-statistic in these regions (Figure 4—figure supplement 2a). Note that a structurally resolved chromosome-level assembly does not yet exist for the white-throated sparrow. Thus, the precise location of these regions on the ZAL2m chromosome still need confirmation.

We also identified a long stretch (~6 Mbp) showing an overall elevated H-statistic in another region of ZAL2 (red shaded area in Figure 4—figure supplement 3a, Supplementary file 5). This region exhibited the lowest estimates of nucleotide diversity within ZAL2 and the highest estimates of FST between the two chromosome types. Overall, there were 68 genes located in regions with a significant (p<0.05) H-statistic for ZAL2m and 109 genes in regions with a significant H-statistic for ZAL2 (Supplementary file 5), meaning that these genes are inside regions showing evidence of a selective sweep. These observations suggest that selective sweeps took place on both ZAL2 and ZAL2m chromosomes after the rearrangement events.

Antagonistic selection influences ZAL2m gene expression

Some of the behaviors that differ between morphs, namely aggression, are predicted by the expression of several genes located inside the ZAL2/2m rearrangement, such as ESR1 and VIP (Merritt et al., 2020; Maney et al., 2020; Horton et al., 2020; Zinzow-Kramer et al., 2015; Horton et al., 2014). We hypothesize that the ZAL2m supergene region is enriched for alleles that have been shaped by antagonistic selection. Specifically, we predict alleles within the ZAL2m supergene region that are beneficial for birds of the white-striped morph, but disadvantageous for birds of the tan-striped morph. Similarly, we predict that the ZAL2 chromosome is enriched for genes with alleles that benefit birds of the tan-striped morph (Maney et al., 2020).

To test this hypothesis, we examined whether allelic bias in the expression of genes inside the rearrangement was associated with differential expression of the same genes between the tan- and white-striped morphs (lists of genes showing differential expression or allelic bias can be found in Supplementary file 6). Here, allelic bias is defined as the differential expression of the ZAL2 versus ZAL2m alleles in white-striped birds. Allelic bias in expression would indicate a role of cis-regulatory variants—i.e., variants in non-coding promotor, enhancer, or silencer regions regulating a gene—in functionally altering the expression of that gene. For these tests, we used data on gene expression in the brain only, as heart and liver tissue samples were limited to white-striped birds.

First, we found that among the genes that were differentially expressed between morphs, more of these genes are located inside the rearranged region on ZAL2/2m than expected by chance, pointing toward cis- (as opposed to trans-) regulatory differences underlying morph differences in expression (X2 tests showed highly significant effects for both brain regions in adults and nestlings of both sexes, with false discovery rate, FDR, correction to account for the multiple statistical comparisons performed using the Benjamini-Hochberg correction) (Figure 4a). To explore this further, we next tested whether the differentially expressed genes showed greater allelic bias than genes without differential expression. Consistent with what we previously reported using only males (Sun et al., 2018), we found that differential expression significantly predicted the degree of allelic bias in expression of that gene (X2(2)=664.16, p<2.2 × 10–16, controlling for sequencing batch and brain region, see Materials and methods). In addition, tan-biased genes showed greater ZAL2 allelic bias and white-biased genes showed greater ZAL2m allelic bias than did genes that were not differentially expressed between morphs (post-hoc T>W vs T=W: z=–16.87; post-hoc W>T vs T=W: z=19.63, p<2.2 × 10–16; post-hoc W>T vs T>W: z=26.35, p<2.2 × 10–16) (Figure 4b). These findings suggest that gene expression differences between the morphs are driven, in part, by evolutionary changes in the regulatory regions of genes captured by the rearrangement.

Figure 4 with 3 supplements see all
Evidence for antagonistic selection driving ZAL2 and ZAL2m gene expression in the brain.

(A) shows the percentage of differentially expressed genes that reside inside the rearranged region on ZAL2, vs elsewhere in the genome. The percentage of differentially expressed genes inside vs outside the rearranged region of ZAL2 is higher than expected by chance (padj <2.2 × 10–16 for all comparisons). (B) shows log2 ZAL2m expression ratios for genes that were more highly expressed in white-striped birds (W>T), genes more highly expressed in tan-striped birds (T>W) and those that that do not significantly differ between morphs (T=W). (C) Log2 ZAL2m expression ratios are plotted vs the Log2 ZAL2m H-statistic for each category of sample. Hypothalamus (Hyp), Ventromedial arcopallium (AMV). (D) Log2 ZAL2m expression ratio are plotted vs the Log2 ZAL2 H-statistic.

Figure 4—source data 1

Percent of Differentially Expressed genes on ZAL2 vs rest of genome.

https://cdn.elifesciences.org/articles/79387/elife-79387-fig4-data1-v1.txt
Figure 4—source data 2

RNAseq allele specific expression data for brain in long format merged with morph bias and H-scan values.

https://cdn.elifesciences.org/articles/79387/elife-79387-fig4-data2-v1.txt

We also looked for a relationship between gene expression and signatures of positive selection on both the ZAL2 and ZAL2m chromosomes. We predicted that positive selection on cis-regulatory alleles on ZAL2m would, on average, increase the expression of the ZAL2m alleles and that positive selection on ZAL2 would increase the expression of ZAL2. We therefore asked whether genes located in regions with evidence for a selective sweep on either ZAL2 or ZAL2m (as indicated by the H-statistics for 50 kb windows) were also likely to exhibit bias in gene expression in the brain. For the ZAL2m chromosome, we found that the ZAL2m H-statistic was a significant, positive predictor of bias in brain gene expression toward the ZAL2m allele, controlling for sequencing batch and brain region (X2(2)=40.17, p=1.893 × 10–9; t=5.114, p=3.24 × 10–7) (see Materials and methods) (Figure 4c). Likewise, we found that genes that exhibited evidence of positive selection on ZAL2 showed allelic bias toward the ZAL2 allele (X2(2)=24.231, p=5.475 × 10–6; t=–3.151, p=0.00164) (Figure 4d). Taken together, these results suggest that gene expression on ZAL2 and ZAL2m is driven by selection for cis-regulatory alleles that benefit each morph.

Candidate genes of interest

Our analyses uncovered many candidate genes that may contribute to the phenotypic differences between the tan- and white-striped morphs. To identify candidates, we performed gene ontology analysis on the set of genes that were either (1) within a region that exhibited evidence of positive selection on either ZAL2 (n=109 genes) or ZAL2m (n=68 genes) (See Supplementary file 7), (2) exhibited allelic bias in six or more tissues (of the eight total brain tissue analyses) (n=249 genes), or (3) exhibited consistent allelic bias across all samples from either one brain region, sex, or age group (i.e. tissue-, sex-, or age-specific expression) (See Supplementary file 7, n=117 genes). The rationale for the inclusion of this final group is that genes that show a tissue-specific pattern of allelic bias are strong candidates for genes with adaptive significance, as changes in gene regulation have long been shown to play a role in adaptive evolution (Wray, 2007).

Using this set of 409 unique genes, we found the strongest enrichment for genes in the categories ‘behavior’ (pFDR = 0.0252) and ‘metabolic process’ (pFDR = 0.0252). The behavior category included genes such as GRIK2 (glutamate receptor ionotropic, kainate 2), GRM1 (metabotropic glutamate receptor 1), NRXN1 (Neurexin-1), OPRM1 (µ-opioid receptor), HTR1B (5-hydroxytryptamine receptor 1B, i.e. serotonin receptor), MEIS1 (Meis homeobox 1), and PAK5 (serine/threonine-protein kinase PAK 5). Additionally, there was enrichment for several gene ontology categories related to metabolism, with 205/409 (50.1%) of these candidate genes involved in the broad category ‘metabolic processes’ (pFDR = 0.0210). We also found significant enrichment in the category ‘regulation of multicellular organismal process’ (pFDR = 0.0342). This category included several additional interesting candidate genes, including ESR1 (estrogen receptor alpha), NR2E1 (nuclear receptor subfamily 2 group E member 1), HCRTR2 (orexin receptor type 2), SYNDIG1 (synapse differentiation-inducing gene 1), DISC1 (disrupted in schizophrenia 1), and ESRRG (estrogen-related receptor gamma).

Allelic bias toward the ZAL2 allele could result from decreased expression of the ZAL2m allele due to degeneration. Allelic bias in the other direction, toward ZAL2m, is more consistent with adaptation (Martinez-Ruiz et al., 2020). We therefore examined relationships between allelic bias and positive selection. We found that 97/359 (27.8%) genes exhibiting a signal of allelic bias showed a bias toward the ZAL2m allele. Four of those 97 genes also overlapped regions exhibiting a significant H-statistic on ZAL2m: PDSS2 (all trans-polyprenyl-diphosphate synthase), XRN2 (5'–3' exoribonuclease 2), RPS27A (Ubiquitin-40S ribosomal protein S27a), and TCTE1 (T-complex-associated-testis-expressed 1).This result implicates these four genes in adaptive ZAL2m phenotypes, providing novel candidate genes for further exploration.

Looking at our results from a different perspective, of the 409 genes that exhibited evidence either of allelic bias or positive selection (by the H-statistic), 55 exhibited evidences of both. These genes are especially interesting candidates for mediating phenotypic differences between the morphs. Three of these genes were located in areas of elevated H-statistic on both the ZAL2 and ZAL2m chromosomes: PDSS2, ATP6V1C2 (ATPase H+Transporting V1 Subunit C2), and GHRL1 (Grainyhead Like Transcription Factor 1). Both ATP6V1C2 and GHRL1 were in the shared region highlighted in blue in Figure 4—figure supplement 2, suggesting that this region may be an especially important evolutionary hotspot in the divergence between the morphs. Overall, our analysis revealed a number of candidate genes for which positive selection appears to have affected the regulation of genes on both chromosomes.

Discussion

Understanding evolutionary processes that contribute to the origin and maintenance of supergenes can elucidate the links between genes and complex phenotypes such as behavior. One of the most critical processes in the evolution of supergenes is the suppression of recombination (Charlesworth, 2016), a consequence of which is genetic degeneration. Such processes have been extensively studied in the case of non-recombining sex chromosomes (Barton and Charlesworth, 1998; Charlesworth, 2012; Charlesworth and Charlesworth, 2000), but their role in the early evolution of non-recombining autosomes is not well understood. The ZAL2m supergene captures a snapshot of an early stage of supergene evolution. We and others have previously observed weak signatures of degeneration on the ZAL2m chromosome (Tuttle et al., 2016; Sun et al., 2018). In this study, using a large, newly generated genomic sequencing and population genomic data set, we demonstrate pervasive signatures of genetic degeneration at multiple levels, from genomic contigs to SNP density. Regardless, the degeneration is weak at most, and it appears that recombination has not entirely ceased on ZAL2m. It is likely that this low level of recombination occurs in rare ZAL2m homozygotes (Horton et al., 2013). Therefore, the case of the ZAL2m supergene illustrates that complex phenotypes including alternative plumage morphs and life history strategies, can evolve prior to the cessation of recombination and despite substantial genetic degeneration.

As is evident from research on non-recombining sex chromosomes, both positive and antagonistic selection can further the differentiation between those chromosomes (Bachtrog, 2004; Bachtrog and Charlesworth, 2002; Singh et al., 2014; Rice, 1984; Vicoso and Charlesworth, 2006; Mank, 2012). Previous studies of the ZAL2m supergene revealed no or only weak evidence of positive selection at the molecular level (Tuttle et al., 2016; Sun et al., 2018), despite evidence that allelic biases in expression may contribute causally to phenotypic differences between white- and tan-striped morphs (Merritt et al., 2020). Here, using extensive molecular data, we detected a strong signature of antagonistic selection based on gene expression profiles and identified several regions of positive selection on both ZAL2 and ZAL2m. We also found that allele-specific gene expression in the brain was associated with these signatures of positive selection on both the ZAL2 and ZAL2m chromosomes, suggesting that this selection has acted on regulatory regions of genes. The finding that genes within non-recombining chromosomes experience either positive or antagonistic selection has been reported for Drosophila sex chromosomes, as well as for the social supergene of fire ants (Solenopsis invicta) (Martinez-Ruiz et al., 2020; Chang et al., 2022; Pracana et al., 2017). This phenomenon is not universal, however, signatures of antagonistic selection were not found in the mating-type chromosomes of the anther-smut fungus (Microbotryum lychnidis-dioicae) (Bazzicalupo et al., 2019).

Our analyses identified several candidate genes that merit further exploration. For example, several genes that in other species interact with the actions of estrogen receptor alpha (ERα), an important player in the behavioral differences between white-throated sparrow morphs (Merritt et al., 2020), also emerged as candidate genes in our analyses. These genes included NCOA7 (Nuclear Receptor Coactivator 7), a transcriptional coactivator of ESR1, and ESRRG, a nuclear receptor that interacts with estrogen-responsive elements in the genome (Festuccia et al., 2018; Lazennec et al., 1997). NR2E1, a recently de-orphanized nuclear receptor that binds oleic acid to regulate neurogenesis (Kandel et al., 2022), is another exciting candidate gene. Additionally, we identified several genes–PDSS2, XRN2, RPS27A, TCTE1, ATP6V1C2, and GRHL1–that have not previously been linked to the morph differences, but convergent lines of evidence suggest that they may play a role in the evolution of the ZAL2m supergene. GHRL1, for example, is a transcription factor necessary for the development of epithelial tissues and is enriched in the neonatal ventromedial hypothalamus of mice (Kurrasch et al., 2007). Its function in the brain, however, remains unknown. Interestingly, the ZAL2m breakpoint has been mapped to ~12 kB downstream of GRHL1 (40), perhaps accounting for the observed patterns of divergence.

Indeed, we found evidence of selective sweeps in several regions near the predicted breakpoint on ZAL2m. Positive selection near inversion breakpoints has been found in other species including Drosophila (Evans et al., 2007). Furthermore, we observed signatures of positive selection on large regions of ZAL2, which was reminiscent of the finding that the X chromosome of primates has been targeted by strong positive selection (Nam et al., 2015). This positive selection on ZAL2 could indicate that recessive alleles on that chromosome experience increased selection in birds of the white-striped morph, and are thus in direct conflict with genes on ZAL2m. It is worth noting, however, that allelic bias is not necessarily indicative of selection; bias in expression toward the tan-allele could be caused by degeneration on ZAL2m, for example. Bias toward the ZAL2m allele could be caused by a transposable element in the promoter region, driving aberrant expression. Convergent lines of evidence point toward a role for positive selection throughout the ZAL2m supergene variant, but the mechanisms underlying allelic bias may vary for each individual gene.

Remarkably, we also discovered clear signatures of balancing selection maintaining two long-range haplotypes within the ZAL2m supergene. We were surprised by this finding, as similar long-range haplotypes have not, to our knowledge, been previously reported for non-recombining chromosomal polymorphisms. It has been speculated that balancing selection is less common in sexually reproducing species than in self-fertilizing species (Glémin, 2021), in which the effective recombination rate is lowered due to the extreme degree of inbreeding. Our results suggest that balancing selection may function to maintain genetic variability in the face of reduced recombination, and we might expect to see other instances of balancing selection within supergenes.

The phenotypic effects of these balanced haplotypes remain unknown. Two out of the fifteen genes inside the balanced ZAL2m outlier region showed signatures of differential gene expression between the haplogroups. One of these genes is GREB1, which in mammals is a downstream target of ERα (Gegenhuber et al., 2022). This finding is notable because, in white-throated sparrows, ERα is causally related to the behavioral differences between morphs (Merritt et al., 2020; Horton et al., 2014). Thus, the ZAL2m haplogroup may interact in important ways with ERα-mediated phenotypes in white-striped birds, although further research is needed to test this hypothesis.

Taken together, our results demonstrate that the ZAL2m supergene is far from a degenerating counterpart of a ‘normal’ autosome. Rather, it is an active arena of multiple evolutionary forces. Although most studies of the consequences of suppressed recombination have focused on genetic degeneration, the resulting supergenes also give rise to opportunities for evolutionary innovation (Charlesworth, 2016; Thompson and Jiggins, 2014; Schwander et al., 2014). Our results show that although degeneration is operating on ZAL2m, dynamic selective forces are occurring simultaneously, producing divergent complex phenotypes. These results emphasize that supergenes are an important force in adaptive evolution (Charlesworth, 2016; Schwander et al., 2014; Wellenreuther and Bernatchez, 2018) and open the door for future studies to demonstrate and investigate the functional consequences of dynamic natural selection acting inside recombination-suppressed supergenes.

Materials and methods

Whole genome sequencing and population genetics

Request a detailed protocol

We performed whole genome sequencing of 63 samples (WS: N=49, TS: N=13, and SW: N=1), which included 28 birds captured during the breeding season near Argyle, Maine (ME), USA, 25 birds captured in Atlanta, Georgia (GA), USA during fall migration (November and December), and 10 postmortem samples opportunistically collected from birds who died from building strikes in Chicago, Illinois (IL), USA (see Balakrishnan et al., 2014) during the spring or fall migrations. Sequencing reads from these samples, as well as reads from three previously described samples, were mapped to the reference genome assembly (GCF_000385455.1) using Bowtie2 (ver. 2.3.5) with the very-sensitive-local option. Possible PCR duplicates were removed using Picard tools (ver. 2.19). SNP and INDEL calling were conducted using GATK Haplotypecaller (ver. 4.1.2) with the ERC GVCF option and joint genotyping of all samples were performed with Genotype GVCF option. We filtered out SNPs with any missing information, MAF <0.05, meanDP <5, or meanDP >80. Raw sequencing reads are available on NCBI SRA (BioProject PRJNA818012).

Sequencing and genome assembly of super-white bird

Request a detailed protocol

To complement the available short-read sequencing data from the female super-white bird (Sun et al., 2018), we generated additional paired-end reads from the same individual (150 ×; insert size of 300 bp and 500 bp) to improve the assembly quality. Genomic DNA was extracted from a 200 mg liver sample from the super-white bird using the Qiagen DNEasy Blood and Tissue DNA kit. Additionally, mate-pair libraries of different insert sizes (insert size of 1 kb, 4–7 kb, 7–10 kb, and 10–15 kb) were prepared and sequenced by the Brigham Young University Genome Sequencing Center. Raw sequencing reads are available on NCBI SRA (BioProject PRJNA818012).

Using these data, we constructed a whole genome de novo assembly of Z. albicollis, including the ZAL2m chromosome. Paired-end sequencing reads were trimmed by Trimmomatic v.0.32 (Bolger et al., 2014), and error correction of the trimmed sequencing reads was conducted by Lighter v.1.1.2 (Song et al., 2014). Initial contig assembly and scaffolding was conducted by ABySS v.2.1.5 (Jackman et al., 2017). We used Gapcloser (Xie et al., 2014) to fill the gaps emerging from scaffolding process. The total assembly size was 1058 Mbp (N50 length of 3.1 Mbp), consisting of 160 scaffolds (111.76 Mbp) belonging to the ZAL2m chromosome with the longest scaffold length of 5.1 Mbp (Supplementary file 3).

Identification of ZAL2/2m scaffolds

Request a detailed protocol

To discriminate scaffolds that originate from the ZAL2 vs the ZAL2m chromosome, we mapped the super-white assembly against the genome of the House sparrow (Passer domesticus), which is the most closely related species with chromosome level assembly using LASTZ v1.03. Scaffolds uniquely mapped to the homologous House sparrow chromosomes with >30% coverage and >85% identity were retained. In the case of multi-mapping scaffolds, we used more stringent criteria with >70% coverage and >85% identity. We conducted the same procedure using the tan-striped reference assembly. The list of matched scaffolds was highly consistent with our previous study, with the addition of several new scaffolds (n=15, sum of scaffold lengths = 129.1 kb). To distinguish sequences inside vs outside of the rearranged (i.e. inverted) regions, we computed the average frequency of heterozygous SNPs in sliding windows of 25 kb size with 1 kb step.

Genetic differentiation between the ZAL2 and ZAL2m chromosomes

Request a detailed protocol

Utilizing the large amount of newly generated whole genome sequence data, here we identified putatively fixed genetic differences between the ZAL2 and ZAL2m chromosomes. Briefly, we identified positions at which the genomes of the tan-striped (ZAL2/2) and super-white (ZAL2m/2m) birds are homozygous for different alleles while the white-striped (ZAL2/2m) birds are heterozygous. Note that the probability of this allelic pattern occurring by random chance is 2.6x10–23 given the sample size, according to a binomial test. Following this procedure, we obtained a total of 931,424 SNPs and 97,375 insertions and deletions (InDels) between ZAL2 and ZAL2m chromosomes, increasing the number of putatively fixed SNPs between the two chromosomes beyond previous publications (Tuttle et al., 2016; Sun et al., 2018). Variant call format files of fixed differences (both SNPs and InDels) are available at DOI: 10.6084 /m9.figshare.19395146. As we expected, a vast majority (N=930,588; 99.91%) of the fixed differences reside in the scaffolds that we previously predicted to be inside the rearrangement (Sun et al., 2018). The remaining fixed differences were found in scaffolds that were either too short or that mapped ambiguously to multiple chromosomes. FastEprr was used to estimate the recombination rate for non-overlapping 50 kbp sliding window with default setting (Gao et al., 2016).

Haplotype phasing of whole genome and RNA sequencing data

Request a detailed protocol

We performed read-based haplotype phasing of the sequencing data using reads from white-striped individuals (ZAL2/2m heterozygotes). Briefly, using the 931,424 single nucleotide fixed differences between ZAL2/2m, we assigned sequence reads from the heterozygous birds to the corresponding chromosome of origin (i.e. either ZAL2 or ZAL2m) if the paired-end reads were mapped to a region that overlapped at least one fixed difference. Reads from white-striped birds were mapped to N-masked (masking the putative fixed differences between ZAL2 and ZAL2m) reference assembly to avoid mapping bias and assigned to their chromosome of origin using SNPsplit v.0.3.2 (Krueger and Andrews, 2016). Because the two supergene variants differ in the amount of genetic diversity they carry, and because read mapping algorithms are sensitive to the number of differences between the reference sequence and the read, our approach risks a slight mapping bias. We believe that this slight bias is unlikely to have fundamentally changed our results.

On an average, 8.5% of all reads from white-striped birds were assigned to either the ZAL2 or ZAL2m chromosomes. In comparison, the size of the chromosomal inversion is estimated to be approximately 9.5% of the total genome (Thomas et al., 2008). The ZAL2 and ZAL2m assigned reads were extracted respectively using Bedtools bamtofastq. To call variants for both ZAL2 and ZAL2m chromosomes, the reads were remapped to the reference genome assembly using Bowtie2 v. 2.3.5. Variant calling was conducted using GATK Haplotypecaller v.4.1.2 with the ERC GVCF option. Vcftools was used to filter out SNPs with any missing information, MAF <0.05, meanDP <5, or meanDP >80. Accession information for all raw RNA sequencing data used is available in Table 1.

Population genomic analysis

Request a detailed protocol

Estimates for nucleotide diversity (π), between population divergence (dXY), density of fixed differences (df), between population difference in allele frequency (FST), and Tajima’s D were computed in non-overlapping sliding windows sized 10 kb, 25 kb, and 50 kb, respectively. If a data set has substantial missing points, the estimation of nucleotide diversity is often biased (Korunes and Samuk, 2021). Thus, calculation of nucleotide diversity and dXY from the variant call format (VCF) file can be over-estimated since missing sites are not distinguished from invariant (monomorphic) positions in variants-only VCFs. To account for potential inflation of population summary statistics, we considered an average breadth of coverage across samples for each window when we computed nucleotide diversity, dXY, and df. Nucleotide diversity of protein coding sequence was computed using SNPGenie (Nelson et al., 2015).

For phylogenetic reconstruction of ZAL2m outlier windows, we used RaxML v.8.0.2 for 200 kb windows. The targeted window was selected based on the Tajima’s D values and haplogroup assignments were determined by manually inspecting genotype plots. For the control region, we concatenated 10 randomly selected regions of 20 kb windows (200 kb). The evolutionary model was set to GTRGAMMA with acquisition bias correction using conditional likelihood method (-m ASC_GTRGAMMA –asc-corr=lewis). The medium ground finch (Geospiza fortis) was used as an outgroup species. We conducted bootstrap with 100 replicates.

To examine whether introgression from other species could account for the ZAL2m outlier window, we computed the D-statistic (Martin et al., 2015) using as ingroup genomes the ZAL2 chromosome, the Harris' sparrow chromosome 2, and the ZAL2m chromosome (P1, P2, and P3, respectively). We used the medium ground finch as the outgroup species (P4). The D-statistic was computed for four individual white-striped birds with high sequencing coverage.

The analysis of linkage disequilibrium (LD) decay was performed across the entire chromosomal rearrangement. We calculated pairwise r2 values between variant sites using PopLDdecay (Zhang et al., 2019).

We performed H-scan (Schlamp et al., 2016; Messer Lab — Resources, 2014) to identify soft and hard sweeps, inferred by extended tracts of homozygosity, using phased haplotypes of ZAL2 and ZAL2m. H-scan outputs were separated into 25 kb non-overlapping windows. As a representative summary statistic, we chose the maximum H-statistic value for each window. To calculate the empirical p-value, we first binned genomic windows into 50 SNP increments based on the number of SNPs. When a window included more than 100 SNPs, all the windows were merged into one bin. On the basis of the ranking of summary statistics in each bin, we calculated an empirical p-value for each window. Candidate regions of positive selection were defined as those with an empirical p-value less than 0.05.

Gene expression analyses

Request a detailed protocol

We examined both gene expression divergence and allele-specific expression using three RNAseq data sets from white-throated sparrows (four types of tissue from four separate samples of birds, see Table 1). The first data set consisted of gene expression in male white-throated sparrows collected during the breeding season. In those birds, the samples comprised two brain regions, the Hyp and the AMV (Sun et al., 2018; Zinzow-Kramer et al., 2015), a functional homolog of the medial amygdala (Mello et al., 2019). The second data set consisted of gene expression in the same two brain regions (Hyp and AMV) in adult females collected during the breeding season as well as male and female nestlings (see Sun et al., 2021 for details). The third data set consisted of gene expression in heart and liver tissue in white-striped males collected during migration, then housed in captivity under two lighting conditions (long days and short days) and sampled at two time points during the day (ZT6 and ZT18) (Horton et al., 2019). In statistical analyses, the females and nestlings of each sex were treated as separate ‘batches’ of RNAseq such that there were five total batches: brain samples from adult males, adult females, nestling males, nestling females, and liver/heart samples from adult males. Each tissue was nested within batch to account for repeated sampling from the same individuals.

RNAseq reads were aligned to a reference genome N-masked for putative fixed differences for both chromosomal polymorphisms present in this species (i.e. ZAL2/2, ZAL3a/3a,) using HiSat2 (Kim et al., 2015). To examine allele-specific expression, we used SNPsplit v.0.3.2 to assign mapped reads to ZAL2 or ZAL2m for the white-striped samples. Transcripts were quantified using StringTie (Pertea et al., 2016) and differential expression and allelic-specific expression analyses were performed using DESeq2 (Love et al., 2014). To test for differential expression between the morphs, we used the following model in DESeq2: ‘design = ~morph’. To test for allelic bias (AB), we used the size factors generated for each sample in the previous step, and then used the following model: ‘design = ~individual + allele’. To perform hypothesis testing, we used linear mixed models with tissue nested in sequencing batch as random effects using the R package lme4. To test for the effect of interest, we then performed a likelihood ratio test to compare a full model to a reduced model with the factor of interest removed using the anova function. Where applicable, we used the summary function to perform post-hoc comparisons between multiple groups.

To test for significant allelic bias in expression within each sample type, we used a Wilcoxon rank sum test to test whether the ratio of ZAL2m to ZAL2 expression differed significantly from 0.5, applying an FDR correction for multiple testing. To test for an association between allelic bias and the number of mutations within a gene, we computed the per-base number of mutations by dividing the number of mutations by the gene length and then sorted the genes into deciles. We also computed the number of fixed mutations within 1 kB upstream of the transcription start site (TSS). We then tested whether there was an association between allelic bias and the decile rank or number of mutations within 1 kB upstream of the TSS using the following linear model: Log2 AB ~Variable + (1|Batch:Tissue). To test for overrepresentation of morph-biased genes located inside the rearrangement, we performed a two-sample test for equality of proportions for each brain sample type only. For this test, we compared the proportion of differentially expressed genes, out of the 1007 genes inside the rearranged region on ZAL2, to the proportion out of the 13,369 genes located elsewhere in the genome. For this comparison, we used the prop.test function in R to perform a two-proportions Z-test, applying an FDR correction for multiple testing. To test whether the morph bias in expression of a gene significantly predicted allelic bias for that gene, we grouped genes into three categories: those that were more highly expressed in the white-striped morph (W>T), those more highly expressed in the tan-striped morph (T>W) and those that that did not differ significantly between morphs (T=W). Using the allelic bias for each gene, we then tested whether allelic bias differed among these categories using the following linear mixed model: Log2 AB ~MorphBias_Category + (1|Batch:Tissue), followed by Bonferroni-corrected pairwise post-hoc tests using the ‘ghlt’ function in multcomp package. Note that heart and liver tissue samples were obtained from white-striped birds; because we did not have data on differential expression by morph for these tissues, they were excluded from this analysis, as well as the tests for overrepresentation described above.

To test whether haplogroup affects gene expression, we merged the gene count matrices for the male (H1: n=7, H2: n=3) and female (H1: n=3, H2: n=3) white-striped birds that were also included in the whole genome sequencing dataset and tested for an effect of haplogroup using the following model: design ~sex + haplogroup. Only one gene in Hyp (Geranylgeranyl Diphosphate Synthase 1, GGPS1) and one gene in AMV (RUN And FYVE Domain Containing 4, RUFY4) were differentially expressed at the genome-wide level. Neither of these genes are located in the ZAL2m outlier region and only GGPS1 is located on ZAL2. Thus, we report only unadjusted p-values for genes inside the ZAL2m outlier region that were differentially expressed (both unadjusted and adjusted p-values for all genes in the ZAL2m outlier region are reported in Supplementary file 4).

To examine the relationship between the H-statistic and allelic bias in expression, we computed the average H-statistic for each gene. Each gene was assigned the H-statistic value of the 20 kb bin overlapping the gene (or the average of multiple 20 kb bins, if a gene overlapped two bins). We then used a multiple linear regression model to examine the relationship between the Log2ASE and the log2H-statistic of a gene for both ZAL2 and ZAL2m separately, using the following model: Log2 AB ~ Log2H-statistic + (1|Batch:Tissue). Additionally, we performed functional enrichment analyses for the genes in regions with a significant H-statistic using ToppFun with human homolog HGNC gene names (Chen et al., 2009).

Data availability

Whole genome sequencing data have been deposited in NCBI BioProject under accession code PRJNA818012. RNA sequencing data from adult females and nestling brain tissue have been deposited in NCBI BioProject under accession code PRJNA657006. Previously published sequencing data also used are listed in Table 1. Variant call format files of fixed differences between ZAL2 and ZAL2m are available at DOI: https://doi.org/10.6084/m9.figshare.19395146.

The following data sets were generated
    1. Yi SV
    (2022) NCBI BioProject
    ID PRJNA818012. Dynamic molecular evolution of a supergene with suppressed recombination in white-throated sparrows.
The following previously published data sets were used
    1. Maney DL
    2. Horton B
    3. Grogan KE
    (2020) NCBI BioProject
    ID PRJNA657006. Gene expression differences by morph, sex, and age in adult and nestling white-throated sparrows.
    1. Zinzow-Kramer WM
    2. Horton BM
    3. McKee CD
    4. Michaud JM
    5. Tharp GK
    6. Thomas JW
    7. Tuttle EM
    8. Yi SV
    9. Maney DL
    (2017) NCBI Gene Expression Omnibus
    ID GSE77186. Genes located in a chromosomal inversion are correlated with territorial song in white-throated sparrows.
    1. Horton WJ
    2. Jensen M
    3. Sebastian A
    4. Praul CA
    5. Albert I
    6. Bartell PA
    (2018) NCBI Gene Expression Omnibus
    ID GSE116989. Transcriptome Analyses of Heart and Liver Reveal Novel Pathways for Regulating Songbird Migration.

References

  1. Book
    1. Bull JJ
    (1983)
    Evolution of Sex Determining Mechanisms
    Benjamin-Cummings Pub Co.
    1. Charlesworth B
    2. Charlesworth D
    (2000) The degeneration of Y chromosomes
    Philosophical Transactions of the Royal Society of London. Series B, Biological Sciences 355:1563–1572.
    https://doi.org/10.1098/rstb.2000.0717
  2. Book
    1. Falls JB
    2. Kopachena JG
    (2020) White-throated sparrow (Zonotrichia albicollis)
    In: Poole JF, editors. Birds of the World. Cornell Lab of Ornithology. pp. 1–12.
    https://doi.org/10.2173/bow.whtspa.01
  3. Book
    1. Glémin S
    2. François CM
    3. Galtier N
    (2019) Genome evolution in outcrossing vs. selfing vs. asexual species
    In: Anisimova M, editors. Evolutionary Genomics: Statistical and Computational Methods. Springer. pp. 331–369.
    https://doi.org/10.1007/978-1-4939-9074-0_11)
  4. Website
    1. Messer Lab — Resources
    (2014) Messer Lab
    Accessed February 2, 2022.

Decision letter

  1. George H Perry
    Senior and Reviewing Editor; Pennsylvania State University, United States
  2. Yannick Wurm
    Reviewer; Queen Mary University of London, United Kingdom

Our editorial process produces two outputs: (i) public reviews designed to be posted alongside the preprint for the benefit of readers; (ii) feedback on the manuscript for the authors, including requests for revisions, shown below. We also include an acceptance summary that explains what the editors found interesting or important about the work.

Decision letter after peer review:

Thank you for submitting your article "Dynamic molecular evolution of a supergene with suppressed recombination in white-throated sparrows" for consideration by eLife. Your article has been reviewed by 2 peer reviewers, and the evaluation has been overseen by a Reviewing Editor and George Perry as the Senior Editor. The following individual involved in the review of your submission has agreed to reveal their identity: Yannick Wurm (Reviewer #2).

The reviewers have discussed their reviews with one another, and the Reviewing Editor has drafted this to help you prepare a revised submission.

Essential revisions:

The reviewers are largely positive about your paper, as am I. They have multiple excellent suggestions for your consideration below; please address each comment in your point-by-point response to accompany your revised manuscript. Essential revisions include expanded analytical consideration and discussion concerning the many genes being evaluated here and consideration of the potential role of repetitive element variation in the phenotypic outcomes. I'm personally not one to advocate for the strict use of α=0.05 cutoff for determining statistical 'significance', but I share reviewer #1s view that the breadth of multiple testing corrections should be carefully considered. (And given the breadth of the audience for this paper, you could briefly explain false discovery rate estimation and use it in the main text when it is initially used).

Reviewer #1 (Recommendations for the authors):

– Add a few words describing tests and the meaning of the statistics for people less habituated to their use (H-statistics, D-statistic, …). The results of the ABBA-BABA and, particularly, Supplemental Figure 4 are difficult to understand and additional explanations are needed.

Regarding the study of balancing selection:

– lines 311-312. Looking at the figure it would be more reasonable to consider 3 haplogroups within the outlier region (one of them with just very few samples) or even more. The use of just two haplogroups should be justified.

– Is this outlier region heterozygous for all individuals? The text does not offer information about this

– l. 322-327. When looking for phenotype changes associated with the haplogroups, how is it possible to study the phenotypic effect of the haplogroups if they are expected to be in the heterozygous state as a result of balancing selection? Info needed.

– If the test if Supp Figure 7 was not significant after correcting for multiple testing, it does not seem correct to infer that the differences are significant. Remove. (Also, explain FDR testing. I do not know what this means).

– l. 332-338. Corrections for multiple testing, have they been applied? Differences for KCNS3 can not be considered significant for either tissue if p=0.0511 and p = 0.1567 (especially if no correction for multiple testing is applied).

Regarding the study of positive selection:

– l. 345-347. The possible explanations for the reduced differentiation toward the end of the inversion need further explanation

a) Represents a younger evolutionary stratum? If it is part of the same inversion, how would a younger date be possible?

b) How would genetic diversity be increased in this area?

Could it just be the result of an event of recombination in the past, introducing diversity from one lineage into the other? Or the result of the selective sweep (Supp Figure 9)?

– Regions with evidence of positive selection: The text reads (l.352), "four top candidate regions (NW_005081582.1, 480-520kb and 920-960kb)". The legend of the Supplemental Figure 9 reads "A candidate region showing a positive selection on both ZAL2 and ZAL2m (NW_005081582.1, 480-520kb, and 920-960kb)". Is it one region or four regions? What are the terms in parenthesis?

– The text mentions different regions with evidence of selective sweeps, but it is difficult to relate those regions to results: l. 354, peaks in chromosome end in ZAL2; which ones are those?; l. 357 the long stretch of elevated H-statistics in ZAL2, could this be associated with low nucleotide diversity and high Fst? In general, this paragraph is quite confusing.

– I miss some info about the content of the regions with evidence for a selective sweep. It would be nice to know what genes are inside the region where selective sweeps have taken place for both chromosomes. The authors search for ontological or functional enrichment, but it is not clear if differences should be expected at this level and I would like to see info about specific candidate genes that could have been proposed considering previous studies and phenotypic analyses

– Antagonistic selection implies that changes in one chromosome counteract changes in the other one. Not clear how the approach used allows the investigation of the existence of counteracting changes. The text in this section is hard to follow. For example, in line 379 it is difficult to understand what allelic bias means. Or, in line 383, it seems that tan-biased genes can only show greater ZAL2 allelic bias because ZAL2m is not present in those individuals. Please, revise this entire section (starting in l. 364).

Other comments:

– l. 243 Sentence hard to follow "We found evidence that allelic bias in gene expression was associated with the decile rank of the per-base number of non-synonymous fixed differences". Could this be assessed without using deciles: bias in gene expression vs. per-base number of non-synonymous fixed differences.

– l. 288. Please, provide an intuitive explanation of the D-statistic and what it means here. Difficult to see anything in Figure 3a.

– l. 425 Reference 63 corresponds to a web page with software. It is not clear what this reference means here. Please, check references to make sure that there are no errors (for example, see author list in reference 64).

Reviewer #2 (Recommendations for the authors):

Line 235: "whether degeneration … resulted in … reduced expression of the ZAL2m allele. " – ZAL2m is a whole supergene (or chromosome) variant. It is not an allele. Probably the following is better: "reduced expression of ZAL2m alleles". Throughout the manuscript, I don't think the authors should say "ZAL2m allele" when referring to the supergene variant. "Alleles carried by the Zal2m supergene variant" could be appropriate.

Figure 4b: How many genes were there with the different types of bias? For genes with a higher expression in Tan than white, the authors suggest that the allelic bias towards the tan allele is a sign of adaptation. But it could simply be that the bias is the result of the ZAL2m allele having degenerated (similar to https://elifesciences.org/articles/55862 ). Conversely, while some higher expression of Zal2m alleles in white birds could be adaptive, some of it can also be that a transposon jumped in front of the promoter and leads to maladaptive high gene expression. We need to be careful with adaptationist interpretations.

Line 354 – finding extended homozygosity runs near the end of the chromosome arm could be expected from lower recombination in this region?

Line 295-298: the authors check for introgression by building a tree from the region of Zal2m that is putatively balancing selection. A large amount of collapsed repetitive DNA in Zal2 can create misleading patterns in phylogenetic reconstruction attempts. This is due to the rapid turnover and jumping of repetitive elements (e.g., what looks like it is here in the reference genome may actually be from somewhere else, or could represent a hybrid genotype of several divergent repeat copies). Focusing on single-copy loci, such as exons of intact single-copy genes, should be preferred (see Stolle Nat Com2022 for one approach).

Examination of allelic expression bias involves mapping to a genome where fixed differences between supergene variants have been masked. This is ok, but because the 2 supergene variants differ in the amount of genetic diversity they carry, and read mapping algorithms are sensitive to the number of differences between the reference sequence and the read, this would lead to a slight mapping bias. That creates a risk here. A more balanced control could have involved masking any genetic variation. However, I think that it is unlikely that doing so would fundamentally change the picture.

Figure 1A shows that ZAL2m has higher overall genetic diversity than ZAL2, yet the text describing genetic diversity also consistently suggests the alternate pattern. For example, Ne derived from synonymous genic diversity is 10x lower in ZAL2m than ZAL2 (from line 222).

Alternatively, the apparent conundrum could come from the use of different filtering criteria and the use of a linear reference sequence where repetitive regions are collapsed. The ZAL2m's genome sequence contains more difficult-to-assemble repetitive DNA, as evidenced by finding smaller scaffolds in Zal2m. This is in line with what is seen in the young fire ant Sb chromosomes (Pracana 2017 Mol Eco finding lower genetic diversity, Stolle 2018 finding more repetitive elements, and more variation in repetitive element content), as well as in some labile young Y chromosomes.

https://doi.org/10.7554/eLife.79387.sa1

Author response

Essential revisions:

The reviewers are largely positive about your paper, as am I. They have multiple excellent suggestions for your consideration below; please address each comment in your point-by-point response to accompany your revised manuscript. Essential revisions include expanded analytical consideration and discussion concerning the many genes being evaluated here and consideration of the potential role of repetitive element variation in the phenotypic outcomes. I'm personally not one to advocate for the strict use of α=0.05 cutoff for determining statistical 'significance', but I share reviewer #1s view that the breadth of multiple testing corrections should be carefully considered. (And given the breadth of the audience for this paper, you could briefly explain false discovery rate estimation and use it in the main text when it is initially used).

Thank you for the careful and thoughtful evaluation of our manuscript. We have added an expanded consideration of candidate genes and their functional implications on lines 450-489 and lines 519-531. Additionally, we have done our best to address concerns related to the role of repetitive element variation, at least where we can address whether our findings are a result of increased structural variation, as opposed to other phenomena. One of the limitations of the current system include the lack of high quality long-range sequence data from the species, thus making it difficult to resolve structural variation and repetitive sequences, so this is not something that we can address in depth in this manuscript. We also have carefully considered our determination of statistical ‘significance’. In one case, a result was removed, but we opted in another instance to keep some borderline findings in the manuscript (see below). We have also made sure to explain false discovery rate estimation, as well as provided other operational definitions for statistics used, in the text.

Reviewer #1 (Recommendations for the authors):

– Add a few words describing tests and the meaning of the statistics for people less habituated to their use (H-statistics, D-statistic, …). The results of the ABBA-BABA and, particularly, Supplemental Figure 4 are difficult to understand and additional explanations are needed.

Additional definitions/explanations for the various statistics (H-statistic, D-statistic, & β-statistic) were provided on lines (302-303, D-statistic; 317-318; β-statistic; 365-366, H-statistic). Additionally, in this process and in response to comments from Reviewer #2, we identified and corrected a labeling error that had been made in the supplementary figure (now Figure 3 – supp. 2), as well as several errors in the text. We updated the figure, and made changes to the figure legend, and to the text on lines 301-306 and in the Materials & Methods on line 667-671.

Regarding the study of balancing selection:

– lines 311-312. Looking at the figure it would be more reasonable to consider 3 haplogroups within the outlier region (one of them with just very few samples) or even more. The use of just two haplogroups should be justified.

Indeed, there appears an additional haplogroup. However, the sample size of the potential third haplogroup is too small (n=2) to enable in-depth analysis. We thus focused on the two major haplogroups that have moderate numbers of chromosomes. Further sampling would be necessary to pursue deeper investigation. We modified the text on lines 320-323 to clarify this point. We also note that different subregions of the outlier region had different topology when drawing phylogenetic trees, although the presence of two major haplogroups was generally consistent. Once again, we feel this issue can be resolved with larger sample sizes.

– Is this outlier region heterozygous for all individuals? The text does not offer information about this

There seems to be some confusion here, so we have attempted to clarify here and in the text (on lines 332-334). White-striped birds are heterozygous for ZAL2 and ZAL2m, so they have only one copy of ZAL2m. Our results suggest that within ZAL2m, there is a region for which balancing selection has maintained multiple (2, or possibly 3) haplotypes. Each white-striped (heterozygous) bird has only one of the two haplotypes of the ZAL2m chromosome and tan-striped birds are unaffected altogether. This is an example of balanced haplogroups within one of the chromosomes maintained by another instance of balancing selection.

– l. 322-327. When looking for phenotype changes associated with the haplogroups, how is it possible to study the phenotypic effect of the haplogroups if they are expected to be in the heterozygous state as a result of balancing selection? Info needed.

Here we are testing whether white-striped birds with H1 version of ZAL2m differ from whitestriped birds with the H2 version of ZAL2m in any phenotypes. We have attempted to clarify this on lines 337-340.

– If the test if Supp Figure 7 was not significant after correcting for multiple testing, it does not seem correct to infer that the differences are significant. Remove. (Also, explain FDR testing. I do not know what this means).

We have removed this claim regarding tarsus length. We have also explained our correction for multiple testing approach on lines 339-340.

– l. 332-338. Corrections for multiple testing, have they been applied? Differences for KCNS3 can not be considered significant for either tissue if p=0.0511 and p = 0.1567 (especially if no correction for multiple testing is applied).

Here we chose not to apply corrections for multiple testing, and we have attempted to be transparent about this choice. We did not find any significantly differentially expressed at the genome-wide level (and, again, there are substantial limitations to the data we had available for these comparisons), so we opted to share these tentative findings in a transparent way. See lines 350-352.

Regarding the study of positive selection:

– l. 345-347. The possible explanations for the reduced differentiation towards the end of the inversion need further explanation

a) Represents a younger evolutionary stratum? If it is part of the same inversion, how would a younger date be possible?

It is hypothesized that the origin of the ZAL2m chromosome involves a series of potentially nested inversions (Thomas et al., 2008 Genetics) and now noted on lines 360-361, not unlike what has been observed, for example, for the mammalian Y chromosome. These multiple inversion events could have taken place at different times, which could lead to a younger evolutionary stratum. We cited this possibility due to the strong precedence in other systems. However, we do not have direct evidence that clearly supports this claim.

b) How would genetic diversity be increased in this area?

Could it just be the result of an event of recombination in the past, introducing diversity from one lineage into the other? Or the result of the selective sweep (Supp Figure 9)?

We present evidence that ZAL2 genetic diversity is elevated at the chromosome ends in Figure 3a. This could be as a result of increased recombination taking place at the chromosome ends, consistent with what has been reported in the literature (e.g. Nachman & Churchill, 1996, Genetics; Barton et al., 2008, Genetics). We tested this explicitly and now include Figure 4 – supp. 3, which provides evidence that supports the idea that increased diversity on the ends of the ZAL2 chromosome is a result of increased recombination there (lines 370-376). We do not, however, find evidence of increased selective sweeps taking place on the chromosome ends of the ZAL2 chromosome (see revised Figure 4 – supp. 2a and our expanded description of this result, lines 370-373). ZAL2m shows several broad regions of selective sweeps (for example scaffold NW_005081553.1, the large red colored scaffold on the righthand side of Figure 4 – supp. 2b), but it is worth noting that this area is not in fact the end of the ZAL2m chromosome. Figure 4 – supp. 2 (and all the graphs) shows the genomic coordinates with respect to the *ZAL2* assembly, where the scaffolds have been ordered using BAC sequencing and FISH studies (Thomas et al. 2008). A structurally resolved chromosome-level assembly does not yet exist for the white throated sparrow. Thus, the precise location of these regions on the ZAL2m chromosome are currently unknown, in absence of long-read sequencing and a high quality ZAL2m assembly (now pointed out on lines 373-375). We therefore cannot make strong claims about the exact locations of these regions along the chromosome, unfortunately.

– Regions with evidence of positive selection: The text reads (l.352), "four top candidate regions (NW_005081582.1, 480-520kb and 920-960kb)". The legend of the Supplemental Figure 9 reads "A candidate region showing a positive selection on both ZAL2 and ZAL2m (NW_005081582.1, 480-520kb, and 920-960kb)". Is it one region or four regions? What are the terms in parenthesis?

There are four 20kB regions that are close to each other with significant evidence of positive selection on both the ZAL2 and ZAL2m chromosomes, so part of a single larger region. We have now highlighted this region in blue in Figure 4 – supp. 2. This has also been clarified in the text on lines 367-368 and in the Supplementary figure legend. The terms in parentheses are the scaffold names and genomic coordinates for the reference genome. We have edited the manuscript to clarify these points.

– The text mentions different regions with evidence of selective sweeps, but it is difficult to relate those regions to results: l. 354, peaks in chromosome end in ZAL2; which ones are those?; l. 357 the long stretch of elevated H-statistics in ZAL2, could this be associated with low nucleotide diversity and high Fst? In general, this paragraph is quite confusing.

We have substantially modified this paragraph (lines 463-483) to avoid confusion. We also added shaded blocks to Figure 4 – supp. 2 to aid our presentation. Indeed, regions with elevated H-statistics also exhibit low nucleotide diversity and high Fst, which could have been due to selective sweeps.

– I miss some info about the content of the regions with evidence for a selective sweep. It would be nice to know what genes are inside the region where selective sweeps have taken place for both chromosomes. The authors search for ontological or functional enrichment, but it is not clear if differences should be expected at this level and I would like to see info about specific candidate genes that could have been proposed considering previous studies and phenotypic analyses

The genes in regions that show significantly elevated H-statistics are listed in Supplementary Table 5. Functional genomic resources and tools from this species are still lacking. Thus, we initially hesitated to over-interpret the results, though we believe our results provide an interesting list of candidate genes for future studies to follow up on. In the revised manuscript, we included additional analyses & discussion connecting the positive selection analyses with differential gene expression (in lines 450-489) and provided a list of candidate genes in the new Supp. Table 7.

– Antagonistic selection implies that changes in one chromosome counteract changes in the other one. Not clear how the approach used allows the investigation of the existence of counteracting changes. The text in this section is hard to follow. For example, in line 379 it is difficult to understand what allelic bias means. Or, in line 383, it seems that tan-biased genes can only show greater ZAL2 allelic bias because ZAL2m is not present in those individuals. Please, revise this entire section (starting in l. 364).

We use antagonistic selection to infer selection for divergent alleles on ZAL2 and ZAL2m that benefit individuals of the tan-striped and white-striped morph, respectively. Although allelic bias is an imperfect test of adaptation, we make the assumption that positive selection on cis-regulatory alleles on ZAL2m would, on average, increase the expression of the ZAL2m allele and visa versa. This framework is similar to what is often used to test for sexually antagonistic selection on the X and Y chromosomes. To test this prediction, for each gene within the rearrangement, we relate differential expression between birds of each morph to allelic bias *within white-striped birds only*. So, to be clear, we are not measuring allelic bias in tan-striped birds (this is, as noted, not possible). Furthermore, it is not a given that genes that are more highly expressed in tan individuals would show ZAL2 bias, per se. Genes that show differential expression, but not allelic bias, are likely showing differences in trans-regulation of the gene. Allelic bias, however, points specifically to cis-regulatory differences driving differential expression of the gene. Figure 4b is confirming with a larger sample size a finding that was presented in Sun et al., 2016. We have attempted to clarify these points in the text (see lines 389-392, 396-399, 402-403, and 437-439).

Other comments:

– l. 243 Sentence hard to follow "We found evidence that allelic bias in gene expression was associated with the decile rank of the per-base number of non-synonymous fixed differences". Could this be assessed without using deciles: bias in gene expression vs. per-base number of non-synonymous fixed differences.

We instead now present the rate of fixed differences versus the decile rank in Figure 2 – supp. 1 and in the Results in the text on lines 245-250. This does not change the conclusions.

– l. 288. Please, provide an intuitive explanation of the D-statistic and what it means here. Difficult to see anything in Figure 3a.

Added “which tests for genomic regions that are discordant with the species tree” on lines 302-304.

– l. 425 Reference 63 corresponds to a web page with software. It is not clear what this reference means here. Please, check references to make sure that there are no errors (for example, see author list in reference 64).

We have added an additional reference for the H-scan software and have corrected the numbering errors for the citations. Other errors in the references have been corrected.

Reviewer #2 (Recommendations for the authors):

Line 235: "whether degeneration … resulted in … reduced expression of the ZAL2m allele. " – ZAL2m is a whole supergene (or chromosome) variant. It is not an allele. Probably the following is better: "reduced expression of ZAL2m alleles". Throughout the manuscript, I don't think the authors should say "ZAL2m allele" when referring to the supergene variant. "Alleles carried by the Zal2m supergene variant" could be appropriate.

Thank you for this suggestion. We have made changes throughout.

Figure 4b: How many genes were there with the different types of bias? For genes with a higher expression in Tan than white, the authors suggest that the allelic bias towards the tan allele is a sign of adaptation. But it could simply be that the bias is the result of the ZAL2m allele having degenerated (similar to https://elifesciences.org/articles/55862 ). Conversely, while some higher expression of Zal2m alleles in white birds could be adaptive, some of it can also be that a transposon jumped in front of the promoter and leads to maladaptive high gene expression. We need to be careful with adaptationist interpretations.

These are good points. Genes showing ZAL2 versus ZAL2m allelic bias are listed in Supp. Table 6. We have added additional caveats and alternative interpretations in the discussion on lines 538541.

Line 354 – finding extended homozygosity runs near the end of the chromosome arm could be expected from lower recombination in this region?

In the revised manuscript we used Fasteprr to infer the population recombination parameter, rho, for each genome assembly contig (Now Figure 4 – supp. 3). This analysis indicates that recombination rates are higher on the chromosome ends of ZAL2, but not on ZAL2m. Instead, recombination is very low throughout ZAL2m. Therefore, this pattern of elevated H-statistic is not caused by especially reduced recombination on the ends of the ZAL2m chromosome. This has been clarified in the manuscript on lines 370-374.

Line 295-298: the authors check for introgression by building a tree from the region of Zal2m that is putatively balancing selection. A large amount of collapsed repetitive DNA in Zal2 can create misleading patterns in phylogenetic reconstruction attempts. This is due to the rapid turnover and jumping of repetitive elements (e.g., what looks like it is here in the reference genome may actually be from somewhere else, or could represent a hybrid genotype of several divergent repeat copies). Focusing on single-copy loci, such as exons of intact single-copy genes, should be preferred (see Stolle Nat Com2022 for one approach).

Unfortunately, our current data (mostly short read sequencing data) are insufficient to investigate structural variation at fine scale. We hope to follow up on this work using long-read sequencing in the future. To address this concern, we re-ran the phylogenetic analysis, this time reconstructing the phylogenetic tree of the ZAL2m outlier region using only exons of the single copy orthologous genes. The results of this analysis, shown below and in Figure 3 – supp. 1, do not indicate an introgression event within ZAL2m. We have updated our Results to include mention of this analysis on lines 297-301.

Examination of allelic expression bias involves mapping to a genome where fixed differences between supergene variants have been masked. This is ok, but because the 2 supergene variants differ in the amount of genetic diversity they carry, and read mapping algorithms are sensitive to the number of differences between the reference sequence and the read, this would lead to a slight mapping bias. That creates a risk here. A more balanced control could have involved masking any genetic variation. However, I think that it is unlikely that doing so would fundamentally change the picture.

We have added this caveat to the Methods section where we describe the N-masking (lines 636-640).

Figure 1A shows that ZAL2m has higher overall genetic diversity than ZAL2, yet the text describing genetic diversity also consistently suggests the alternate pattern. For example, Ne derived from synonymous genic diversity is 10x lower in ZAL2m than ZAL2 (from line 222).

Alternatively, the apparent conundrum could come from the use of different filtering criteria and the use of a linear reference sequence where repetitive regions are collapsed. The ZAL2m's genome sequence contains more difficult-to-assemble repetitive DNA, as evidenced by finding smaller scaffolds in Zal2m. This is in line with what is seen in the young fire ant Sb chromosomes (Pracana 2017 Mol Eco finding lower genetic diversity, Stolle 2018 finding more repetitive elements, and more variation in repetitive element content), as well as in some labile young Y chromosomes.

Figure 1a shows genetic diversity between tan-striped and white-striped birds, not between ZAL2 and ZAL2m (this is actually shown in Figure 3A). Because all white-striped birds are heterozygous and all tan-striped birds are homozygous, there is higher genetic diversity in white-striped birds overall. Only in the haplotype-phased data, where reads are assigned to chromosomes, do we detect the pattern of overall reduced genetic diversity on ZAL2m. We have added a note to the caption of Figure 1 to make clear what is plotted in panel A.

https://doi.org/10.7554/eLife.79387.sa2

Article and author information

Author details

  1. Hyeonsoo Jeong

    School of Biological Sciences, Georgia Institute of Technology, Atlanta, United States
    Contribution
    Data curation, Formal analysis, Validation, Investigation, Visualization, Methodology, Writing – original draft, Writing – review and editing
    Contributed equally with
    Nicole M Baran
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-5565-2685
  2. Nicole M Baran

    1. School of Biological Sciences, Georgia Institute of Technology, Atlanta, United States
    2. Department of Psychology, Emory University, Atlanta, United States
    3. Department of Ecology, Evolution, Marine Biology, University of California, Santa Barbara, Santa Barbara, United States
    Contribution
    Formal analysis, Investigation, Visualization, Methodology, Writing – original draft, Writing – review and editing
    Contributed equally with
    Hyeonsoo Jeong
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0001-6270-0625
  3. Dan Sun

    1. School of Biological Sciences, Georgia Institute of Technology, Atlanta, United States
    2. Department of Medicine Huddinge, Karolinska Institutet, Stockholm, Sweden
    Contribution
    Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Writing – review and editing
    Competing interests
    No competing interests declared
  4. Paramita Chatterjee

    School of Biological Sciences, Georgia Institute of Technology, Atlanta, United States
    Contribution
    Investigation, Methodology, Sample collection, generating sequencing libraries, and performing sequencing
    Competing interests
    No competing interests declared
  5. Thomas S Layman

    School of Biological Sciences, Georgia Institute of Technology, Atlanta, United States
    Contribution
    Investigation, Methodology, Sample collection, generating sequencing libraries, and performing sequencing
    Competing interests
    No competing interests declared
  6. Christopher N Balakrishnan

    Department of Biology, East Carolina University, Greenville, United States
    Contribution
    Conceptualization, Investigation, Methodology, Writing – review and editing, Sample acquisition
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-0788-0659
  7. Donna L Maney

    Department of Psychology, Emory University, Atlanta, United States
    Contribution
    Conceptualization, Supervision, Funding acquisition, Investigation, Methodology, Writing – original draft, Project administration, Writing – review and editing
    For correspondence
    dmaney@emory.edu
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-1006-2358
  8. Soojin V Yi

    1. School of Biological Sciences, Georgia Institute of Technology, Atlanta, United States
    2. Department of Ecology, Evolution, Marine Biology, University of California, Santa Barbara, Santa Barbara, United States
    Contribution
    Conceptualization, Supervision, Funding acquisition, Investigation, Methodology, Writing – original draft, Project administration, Writing – review and editing
    For correspondence
    soojinyi@ucsb.edu
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0003-1497-1871

Funding

National Science Foundation (IOS-0723805)

  • Donna L Maney

National Institutes of Health (1R01MH082833)

  • Donna L Maney
  • Soojin V Yi

National Institutes of Health (R21MH102677)

  • Donna L Maney
  • Soojin V Yi

National Science Foundation (IOS-1656247)

  • Donna L Maney
  • Soojin V Yi

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

We thank the members of the Maney lab who performed field work and collected the samples from GA and ME, as well as provided helpful feedback throughout the preparation of this manuscript. K.E. Grogan performed the RNA-sequencing for the samples from adult females and nestlings. David Willard (Collection Manager—Birds, Field Museum of Natural History, Chicago, IL) collected and provided access to the samples from IL. CNB thanks Dr. Elaina Tuttle for passing these samples on to him, and for introducing him to the white-throated sparrow research. This work was supported by NSF IOS-0723805 to DLM and by NIH 1R01MH082833, NIH R21MH102677, and NSF IOS-1656247 to DLM and SVY.

Ethics

All procedures involving live animals were approved by the Emory University Institutional Animal Care and Use Committee (IACUC Protocols PROTO201800016) and with appropriate state and federal (Maine Inland Fisheries and Wildlife, Federal U.S. Geological Survey, and U.S. Fish and Wildlife Service) permits. All procedures, including banding, blood sampling, behavioral studies, capture, and euthanasia, adhered to the Guidelines to the Use of Wild Birds in Research.

Senior and Reviewing Editor

  1. George H Perry, Pennsylvania State University, United States

Reviewer

  1. Yannick Wurm, Queen Mary University of London, United Kingdom

Publication history

  1. Preprint posted: March 2, 2022 (view preprint)
  2. Received: April 10, 2022
  3. Accepted: August 17, 2022
  4. Version of Record published: August 30, 2022 (version 1)

Copyright

© 2022, Jeong, Baran et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Metrics

  • 185
    Page views
  • 72
    Downloads
  • 0
    Citations

Article citation count generated by polling the highest count across the following sources: Crossref, PubMed Central, Scopus.

Download links

A two-part list of links to download the article, or parts of the article, in various formats.

Downloads (link to download the article as PDF)

Open citations (links to open the citations from this article in various online reference manager services)

Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)

  1. Hyeonsoo Jeong
  2. Nicole M Baran
  3. Dan Sun
  4. Paramita Chatterjee
  5. Thomas S Layman
  6. Christopher N Balakrishnan
  7. Donna L Maney
  8. Soojin V Yi
(2022)
Dynamic molecular evolution of a supergene with suppressed recombination in white-throated sparrows
eLife 11:e79387.
https://doi.org/10.7554/eLife.79387
  1. Further reading

Further reading

    1. Cell Biology
    2. Genetics and Genomics
    Julie Trolle, Ross M McBee ... Harris H Wang
    Short Report

    Major genomic deletions in independent eukaryotic lineages have led to repeated ancestral loss of biosynthesis pathways for nine of the twenty canonical amino acids1. While the evolutionary forces driving these polyphyletic deletion events are not well understood, the consequence is that extant metazoans are unable to produce nine essential amino acids (EAAs). Previous studies have highlighted that EAA biosynthesis tends to be more energetically costly2,3, raising the possibility that these pathways were lost from organisms with access to abundant EAAs in the environment4,5. It is unclear whether present-day metazoans can reaccept these pathways to resurrect biosynthetic capabilities that were lost long ago or whether evolution has rendered EAA pathways incompatible with metazoan metabolism. Here, we report progress on a large-scale synthetic genomics effort to reestablish EAA biosynthetic functionality in mammalian cells. We designed codon-optimized biosynthesis pathways based on genes mined from Escherichia coli. These pathways were de novo synthesized in 3 kilobase chunks, assembled in yeasto and genomically integrated into a Chinese Hamster Ovary (CHO) cell line. One synthetic pathway produced valine at a sufficient level for cell viability and proliferation, and thus represents a successful example of metazoan EAA biosynthesis restoration. This prototrophic CHO line grows in valine-free medium, and metabolomics using labeled precursors verified de novo biosynthesis of valine. RNA-seq profiling of the valine prototrophic CHO line showed that the synthetic pathway minimally disrupted the cellular transcriptome. Furthermore, valine prototrophic cells exhibited transcriptional signatures associated with rescue from nutritional starvation. 13C-tracing revealed build-up of pathway intermediate 2,3-dihydroxy-3-isovalerate in these cells. Increasing the dosage of downstream ilvD boosted pathway performance and allowed for long-term propagation of second-generation cells in valine-free medium at a consistent doubling time of 3.2 days. This work demonstrates that mammalian metabolism is amenable to restoration of ancient core pathways, paving a path for genome-scale efforts to synthetically restore metabolic functions to the metazoan lineage.

    1. Developmental Biology
    2. Genetics and Genomics
    Melanie MY Chan, Omid Sadeghi-Alavijeh ... Daniel P Gale
    Research Article Updated

    Posterior urethral valves (PUV) are the commonest cause of end-stage renal disease in children, but the genetic architecture of this rare disorder remains unknown. We performed a sequencing-based genome-wide association study (seqGWAS) in 132 unrelated male PUV cases and 23,727 controls of diverse ancestry, identifying statistically significant associations with common variants at 12q24.21 (p=7.8 × 10−12; OR 0.4) and rare variants at 6p21.1 (p=2.0 × 10-8; OR 7.2), that were replicated in an independent European cohort of 395 cases and 4151 controls. Fine mapping and functional genomic data mapped these loci to the transcription factor TBX5 and planar cell polarity gene PTK7, respectively, the encoded proteins of which were detected in the developing urinary tract of human embryos. We also observed enrichment of rare structural variation intersecting with candidate cis-regulatory elements, particularly inversions predicted to affect chromatin looping (p=3.1 × 10-5). These findings represent the first robust genetic associations of PUV, providing novel insights into the underlying biology of this poorly understood disorder and demonstrate how a diverse ancestry seqGWAS can be used for disease locus discovery in a rare disease.