Introduction

Gene families are groups of related genes categorized by functional similarity or presumed evolutionary relatedness. Based on clustering of their proteins’ sequences, human genes fall into hundreds to thousands of distinct families (Gu et al., 2002; Li et al., 2001; Demuth et al., 2006; Friedman and Hughes, 2003). Families originate from successive gene duplications, although particular gene copies or entire families can also be lost (Nei et al., 1997; Demuth et al., 2006). For example, there are hundreds of genes that are specific to human or chimpanzee and have no orthologs in the other species (Demuth et al., 2006). This birth-and-death evolution is distinct from evolution at the nucleotide or protein level (Thornton and Desalle, 2000; Hahn et al., 2005). However, phylogenetics can still be applied to understand the relationships within families of genes, providing insight into speciation and specialization (Thornton and Desalle, 2000).

One large gene family is united by a common protein structure called the “MHC fold”. This group of genes is involved in diverse functions including lipid metabolism, iron uptake regulation, and immune system function (Hansen et al., 2007; Kupfermann et al., 1999; Kaufman, 2022; Adams and Luoma, 2013). This family includes the Class I and Class II MHC genes whose protein products present peptides to T-cells (“classical” genes) and/or interact with other immune cell receptors like killer cell immunoglobulin-like receptors (KIRs) or leukocyte immunoglobulin-like receptors (LILRs) (“classical” and “non-classical” genes). The classical genes are conventionally known to be highly polymorphic, have an excess of missense variants, and even share alleles across species, all indicative of balancing selection at the allele level (Maccari et al., 2017, 2020; Robinson et al., 2024; Hughes and Nei, 1988, 1989b; Arden and Klein, 1982; Mayer et al., 1988). In addition to variation within individual genes, the region is also significantly structurally divergent across the primates (Mao et al., 2024). Balancing selection is evident at the haplotype level as well, where haplotypes with drastically different functional gene content are retained in various primate populations (Hans et al., 2017; de Groot et al., 2017b, 2009; Gleimer et al., 2011; Heijmans et al., 2020). This motivates the need to study the MHC holistically as a gene family. Even though species may retain different sets of genes and haplotypes, related genes likely function similarly, facilitating comparisons across species. Thus, by treating the genes as a related set, our understanding improves significantly compared to considering single genes in isolation. Because gene family birth-and-death is important to speciation and the MHC itself is highly relevant to organismal health, this family is an excellent case for studying gene family evolutionary dynamics.

There are two classes of MHC genes within the greater family (Class I and Class II), and each class contains two functionally distinct types of genes: “classical” and “non-classical”. “Classical” MHC molecules perform antigen presentation to T cells—a key part of adaptive immunity—while “non-classical” molecules have niche immune roles. The classical Class I molecules are generally highly polymorphic, ubiquitously expressed, and present short, intracellularly-derived peptides to T cells. Many of them also serve as ligands for other types of immune cell receptors and influence innate immunity (see Appendix 1 for an overview) (Anderson et al., 2023; Parham and Moffett, 2013; Guethlein et al., 2015; Hans et al., 2017; Wroblewski et al., 2019). The non-classical Class I molecules have limited polymorphism, restricted expression, and perform specific tasks such as mediating maternal-fetal interaction and monitoring levels of MHC synthesis. In humans, the classical Class I genes are HLA-A, -B, and -C, and the non-classical Class I genes are HLA-E, -F, and -G (Heijmans et al., 2020). In contrast, the classical Class II molecules are expressed only on professional antigen-presenting cell types and present longer, extracellularly-derived peptides to T cells (Gfeller and Bassani-Sternberg, 2018; Heijmans et al., 2020; Neefjes et al., 2011). The non-classical Class II molecules assist with loading peptides onto the classical Class II molecules before their transport to the cell surface (Dijkstra and Yamaguchi, 2019; Neefjes et al., 2011). In humans, HLA-DP, -DQ, and -DR are the classical Class II molecules, and HLA-DM and -DO are the non-classical molecules (see Appendix 3 for more detail on all of these genes).

However, the landscape of MHC genes differs even across closely-related species. Over evolutionary time, the Class I gene subfamily has been extraordinarily plastic, having undergone repeated expansions, neofunctionalizations, and losses (Hans et al., 2017; Wilming et al., 2013; Otting et al., 2020; Heijmans et al., 2020). Convergent evolution has also occurred; in different primate lineages, the same gene may be inactivated, acquire a new function, or even evolve similar splice variants (Hans et al., 2017; Wilming et al., 2013; Otting et al., 2020; Heijmans et al., 2020; Walter, 2020). As a result, it is often difficult to detect orthologous relationships in Class I even within the primates (Hughes and Nei, 1989a; Piontkivska and Nei, 2003; Go et al., 2003; Flügge et al., 2002; de Groot et al., 2020). Studies that focus only on the highly-polymorphic binding-site-encoding exons are complicated by these phenomena, necessitating a more comprehensive look into MHC evolution across exons and species groups.

In contrast to Class I, the Class II region has been largely stable across the primates, but gene content still varies in other species. For example, the pig has lost the MHC-DP genes while expanding the number of MHC-DR genes, and the cat has lost both the MHC-DQ and -DP genes, relying entirely on MHC-DR (Hammer et al., 2020; Okano et al., 2020). The use of the different Class II molecules appears to be fluid, at least over longer timescales, motivating the need to fill in the gaps in knowledge in the primate tree.

Due to the large volume of existing MHC literature, results are scattered across hundreds of papers, each presenting findings from a limited number of species or genes. Thus, we first performed an extensive literature review to identify the genes and haplotypes known to be present in different primate species. We present a detailed summary of these genes and their functions in Appendix 3. We also performed a BLAST search using a custom IPD-based MHC allele database against several available reference genomes to discover which genes were present on various primate reference haplotypes (Figure 1—figure Supplement 2 and Figure 2—figure Supplement 2). Our BLAST search and our search of NCBI RefSeq confirmed the presence of various genes in several species for the first time. Figure 1 and Figure 2 show the landscape of MHC genes present in different primate species for Class I and Class II, respectively. The inclusion of sequences from dozens of new species across all genes and the often-ignored pseudogenes helps us paint a more detailed picture of MHC evolution in the primates.

Class I MHC genes present in different species.

The primate evolutionary tree (Kuderna et al., 2023) is shown on the left hand side (nonprimate icons are shown in beige). The MHC region has been well characterized in only a handful of species; the rows corresponding to these species are highlighted in gray. Species that are not highlighted have partially characterized or completely uncharacterized MHC regions. Asterisks indicate new information provided by the present study, typically discovery of a gene’s presence in a species. Each column/color indicates an orthologous group of genes, labeled at the top and ordered as they are in the human genome (note that not all genes appear on every haplotype). A point indicates that a given gene is present in a given species; when a species has 3 or more paralogs of a given gene, only 3 points are shown for visualization purposes. Filled points indicate that the gene is fixed in that species, outlined points indicate that the gene is unfixed, and semi-transparent points indicate that the gene’s fixedness is not known. The shape of the point indicates the gene’s role, either a pseudogene, classical MHC gene, non-classical MHC gene, a gene that shares both features (“dual characteristics”), or unknown. The horizontal gray brackets indicate a breakdown of 1:1 orthology, where genes below the bracket are orthologous to 2 or more separate loci above the bracket. The set of two adjacent gray brackets in the top center of the figure show a block duplication. Gene labels in the middle of the plot (“W”, “A”, “G”, “B”, and “I”) clarify genes that are named differently in different species. OWM, Old-World Monkeys; NWM, New World Monkeys.

Class II MHC genes present in different species.

The mammal evolutionary tree is shown on the left hand side, with an emphasis on the primates (Foley et al., 2023; Kuderna et al., 2023). The rest of the figure design follows that of Figure 1, except that we did not need to limit the number of points shown per locus/species due to space constraints. OWM, Old-World Monkeys; NWM, New World Monkeys; Strep., Strepsirrhini.

In this work, we present a large set of densely-sampled Bayesian phylogenetic trees using sequences from a comprehensive set of MHC genes across dozens of primate species. These trees permit us to explore the overall evolution of the gene family and relationships between genes, as well as trace particular allelic lineages over time. Across the trees, we see examples of rapid gene turnover over just a few million years, evidence for long-term balancing selection retaining allelic lineages, and slowly-evolving genes where orthology is retained for long time periods. In this paper, we describe broad-scale differences between classes and discuss some specific results about the relationships between genes. In a companion paper (Fortier and Pritchard, 2024), we explore the patterns of polymorphism within individual genes, finding evidence for deep trans-species polymorphism at multiple genes.

Results

Data

We collected MHC nucleotide sequences for all genes from the IPD-MHC/HLA database, a large repository for MHC alleles from humans, non-human primates, and other vertebrates (Maccari et al., 2017, 2020; Robinson et al., 2024). Although extensive, this database includes few or no sequences from several key lineages including the gibbon, tarsier, and lemur. Thus, we supplemented our set of alleles using sequences from NCBI RefSeq (see asterisks in Figure 1 and Figure 2). Because the MHC genes make up an evolutionarily related family, they can all be aligned. Using MUSCLE (Edgar, 2004), we aligned all Class I sequences together, all Class IIA sequences together, and all Class IIB sequences together. We then constructed trees for various subsets of these sequences using BEAST2, a Bayesian MCMC phylogenetic inference method (see Methods for more detail) (Bouckaert et al., 2014, 2019). One major advantage of BEAST2 over less tuneable methods is that it can allow evolutionary rates to vary across sites, which is important for genes such as these which experience rapid evolution in functional regions (Wu et al., 2013). We also considered each exon separately to minimize the impact of recombination as well as to compare and contrast the binding-site-encoding exons with non-binding-site-encoding exons.

Here, we present these densely-sampled Bayesian phylogenetic trees which include sequences from 106 species and dozens of MHC genes. In this paper, we focus on the Class I, Class IIA, and Class IIB multi-gene trees and discuss overall relationships between genes. Our companion paper (Fortier and Pritchard, 2024) explores individual clades/gene groups within these multi-gene trees to understand allele relationships and assess support for trans-species polymorphism.

The MHC Across the Primates

The MHC is a particularly dynamic example of a gene family due to intense selective pressure driven by host-pathogen co-evolution (Radwan et al., 2020; Ebert and Fields, 2020). Within the family, genes have duplicated, changed function, and been lost many times in different lineages. As a result, even closely-related species can have different sets of MHC genes. Thus, while the MHC has been extensively studied in humans, there is a limit to how much we can learn from a single species. Leveraging information from other species helps us understand the evolution of the entire family and provides key context as to how it currently operates in humans (Thornton and Desalle, 2000; Adams and Parham, 2001b). In Figure 1 and Figure 2, we compare the genes present in different species. In both, each column represents an orthologous gene, while the left-hand-side shows the evolutionary tree for primates and our closest non-primate relatives (Kuderna et al., 2023; Foley et al., 2023). Humans are part of the ape clade (red label), which is most closely related to the old-world monkeys (OWM; blue label). Next, the ape/OWM clade is most closely related to the new-world monkeys (NWM; orange label), and the ape/OWM/NWM clade is collectively known as the Simiiformes. Only species with rows highlighted in gray have had their MHC regions extensively studied. Gene presence in each species is indicated by points in each column, and the points also indicate the function of the gene and whether it is fixed in the species. Points with an asterisk indicate contributions from this work.

Figure 1 shows that not all Class I genes are shared by apes and OWM, and much fewer are shared between apes/OWM and NWM. Genes have also been differently expanded in different lineages. While humans and most other apes have a single copy of each gene, the OWM and NWM have multiple copies of nearly all genes. Additionally, many genes exhibit functional plasticity; for example, MHC-G is a non-classical gene in the apes and a pseudogene in the OWM (it is not 1:1 orthologous to NWM MHC-G). The differences between even closely-related primate groups indicate that the Class I region is evolving very rapidly.

In contrast, the Class II genes are more stable, as the same genes can be found in even distantly-related mammals (Figure 2). The notable exception to this pattern is the MHC-DRB group of genes, indicated by dark blue points in the middle of Figure 2. While some of the individual MHC-DRB genes are orthologous between apes and OWM, indicated by points in the same column, others are limited to the apes alone. Furthermore, no individual MHC-DRB genes (with the possible exception of MHC-DRB9) are shared between apes/OWM and NWM, pointing to their extremely rapid evolution. While the other genes have been relatively stable, there have been expansions in certain lineages, such as separate duplications of the MHC-DQA and -DQB genes in apes/OWM, NWM, and mouse lemur. Thus, both of the MHC Class I and Class I gene subfamilies appear to be subject to birth-and-death evolution, with Class I and MHC-DRB undergoing the process more rapidly than the rest of Class II.

Evolution of a Gene Family

We performed phylogenetic inference using BEAST2 on our aligned MHC allele sequences collected from NCBI RefSeq and the IPD-MHC database. BEAST2 is a Bayesian method, meaning the set of trees it produces represents the posterior space of trees (Bouckaert et al., 2019). For visualization purposes, we collapsed the space of trees into a single summary tree that maximizes the sum of posterior clade probabilities (BEA, 2024). In each tree, the tips represent sequences, either named with their RefSeq identifier or with standard allele nomenclature (see Appendix 2). The summary tree for Class I is shown in Figure 3, while the summary trees for Class IIA and Class IIB are shown in Figure 4.

The Class I exon 4 multi-gene BEAST2 tree.

The Class I multi-gene tree was constructed using exon 4 (non-PBR) sequences from Class I genes spanning the primates. A) For the purposes of visualization, each clade in the multi-gene tree is collapsed and labeled according to the main species group and gene content of the clade. The white labels on colored rectangles indicate the species group of origin, while the colored text to the right of each rectangle indicates the gene name. The abbreviations are defined in the species key to the right. B) The expanded MHC-F clade (corresponding to the clade in panel A marked by a †). C) The expanded NWM MHC-G clade (marked by a < in panel A). In panels B and C, each tip represents a sequence and is labeled with the species of origin (white label on colored rectangle) and the sequence ID or allele name (colored text to the right of each rectangle; see Appendix 2). The species key is on the right hand side of panel A. Dashed branches have been shrunk to 10% of their original length (to clarify detail in the rest of the tree at this scale). OWM: old-world monkeys; NWM: new-world monkeys; Cat.: Catarrhini—apes and OWM; Pri.: Primates—apes, OWM and NWM; Mam.: mammals—primates and other outgroup mammals.

The Class II exon 3 multi-gene BEAST2 trees.

The trees were constructed using all Class IIA and all Class IIB exon 3 (non-PBR) sequences across all available species. The design of this figure follows Figure 3. A) The top tree shows the collapsed Class IIA gene tree, while the bottom tree shows the collapsed Class IIB gene tree. In this case, all collapsed clades are labeled with “Mam.” for mammals, because sequences from primates and mammal outgroups assort together by gene. B) The expanded MHC-DPA clade (corresponding to the clade in panel A marked by a <). C) The expanded MHC-DPB clade (marked by a † in panel A). D) The expanded MHC-DRB clade (marked by a § in panel A). OWM: old-world monkeys; NWM: new-world monkeys; Cat.: Catarrhini—apes and OWM; Mam.: mammals—primates and other outgroup mammals.

Figure 3A shows the Class I multi-gene tree using sequences from exon 4, a non-peptide-binding-region-encoding (non-PBR) exon equal in size to each of the peptide-binding-region-encoding (PBR) exons 2 and 3. This exon is the least likely to be affected by convergent evolution, making its tree’s structure easier to interpret. This tree—which contains hundreds of tips—has been further simplified by collapsing clades of related tips, although two fully-expanded clades are shown in panels

B and C. Sequences do not always assort by locus. For example, ape MHC-J is separated from OWM MHC-J, which is more closely related to ape/OWM MHC-G. Meanwhile, NWM MHC-G does not group with ape/OWM MHC-G, instead falling outside of the clade containing ape/OWM MHC-A, -G, -J and -K. This supports the fact that the NWM MHC-G genes are broadly orthologous to a large group of genes which expanded within the ape/OWM lineage.

However, some clades/genes do behave in the expected fashion; that is, with their trees matching the overall species tree. One such gene is non-classical MHC-F, shown in Figure 3B. Although the gene has duplicated in the common marmoset (Caja-F), this subtree closely matches the species tree shown in the upper right. This indicates that MHC-F is orthologous across apes, OWM, and NWM. Orthology between apes and OWM is also observed for pseudogenes MHC-L, -K, -J, and -V and non-classical MHC-E and -G (Figure 3—figure Supplement 2 and Figure 5—figure Supplement 1). For the other NWM genes, orthology with apes/OWM is less clear.

Class I α-block-focused multi-gene BEAST2 trees.

The α-block-focused trees use the common backbone sequences as well as additional sequences from our custom BLAST search of available reference genomes. For the purposes of visualization, some clades are collapsed and labeled with the species group and gene content of the clade (colored text to the right of each rectangle). The white labels on colored rectangles indicate the species group of origin, while the colored text to the right of each rectangle indicates the gene or sequence name (see Appendix 2). The species abbreviations are defined in the species key at the bottom. A) Exon 3 α-block-focused BEAST2 tree with expanded MHC-V clade. B) The expanded MHC-A/AL/OKO/U/Y clade from the exon 3 tree (corresponding to the clade in panel A marked by a <), focusing on MHC-U. C) Exon 4 α-block-focused BEAST2 tree with expanded MHC-K/KL clade. D) The expanded MHC-W/WL/P/T/TL/OLI clade from the exon 4 tree (marked by a † in panel C). OWM: old-world monkeys; NWM: new-world monkeys; Cat.: Catarrhini—apes and OWM; Pri.: Primates—apes, OWM and NWM.

Other genes do not at all follow the species tree, such as NWM MHC-G. This gene group is broadly orthologous to a large set of ape/OWM genes and pseudogenes, as its ancestor expanded independently in both lineages. In NWM, the many functional MHC-G genes are classical, and there are also a large number of MHC-G-related pseudogenes. Shown in Figure 3C, NWM MHC-G sequences do not always group by species (colored box with abbreviation), instead forming mixed clades. Thus, while some duplications appear to have occurred prior to speciation events, others are species-specific. Similar expansions are seen among the MHC-A and -B genes of the OWM and the MHC-B genes of the NWM (Figure 3—figure Supplement 2, Figure 3—figure Supplement 3, and Figure 3—figure Supplement 4).

Figure 4 shows summary trees for exon 3 (non-PBR) for the Class IIA and IIB sequence sets. In the Class II genes, exon 3 does not encode the binding site, but is similar in size to binding-site-encoding exon 2. In contrast to Class I (Figure 3), Class II sequences group entirely and unambiguously by gene, shown by the collapsed trees in Figure 4A. However, the subtrees for each gene exhibit varying patterns. As with Class I, non-classical genes tend to evolve in a “typical” fashion with sequences assorting according to the species tree. This is clearly the case for non-classical MHC-DMA, -DMB, -DOA, and -DOB (Figure 4—figure Supplement 1 and Figure 4—figure Supplement 2). Classical genes MHC-DRA and -DPA also follow this pattern (Figure 4B and Figure 4—figure Supplement 1). However, the other classical genes’ subtrees look very different from the species tree.

There are several reasons why these MHC gene trees can differ from the overall species tree. Incomplete lineage sorting can happen purely by chance, especially if species have recently diverged. However, balancing selection can cause alleles to be longer-lived, resulting in incomplete lineage sorting even among deeply-diverged species; this is called trans-species polymorphism (TSP). Figure 4C illustrates this phenomenon for MHC-DPB. Within the OWM clade (shades of green), sequences group by allelic lineage rather than by species. For example, crab-eating macaque allele Mafa-DRB1*09:02:01:01 groups with green monkey allele Chsa-DPB1*09:01 (both members of the DRB1*09 lineage) rather than with the other macaque alleles (Mane-, Mamu-, Math-, and Malo-), despite the fact that these species are 15 million years separated from each other (Kuderna et al., 2023). We see this pattern in many Class II genes and some Class I genes (Figure 3—figure Supplement 3, Figure 3—figure Supplement 4, Figure 3—figure Supplement 5, Figure 4—figure Supplement 8, Figure 4—figure Supplement 9, and Figure 4—figure Supplement 10). In our companion paper, we explore each of these genes further and evaluate the strength of support for TSP in each gene.

Another way to obtain discordant trees is in the case of recent expansions of genes. Such expansions make it difficult to assign sequences to loci, resulting in clades where sequences (ostensibly from the same locus) do not group by species. An example of this is shown in Figure 3C for the NWM Class I gene MHC-G. The Class II MHC-DRB genes have also expanded, although locus assignments are somewhat clearer. Figure 4D shows the Class II subtree for MHC-DRB, where ape sequences (red/orange boxes) are interspersed with OWM sequences (green boxes). The MHC-DRB genes are usually assigned to specific named loci, but in this tree only MHC-DRB5 sequences group by named locus (the collapsed ape/OWM MHC-DRB5 clade is about 1/3 from the bottom of the tree). The failure of the other named loci to group together indicates a lack of 1:1 orthology between apes, OWM, and NWM for these genes and thus rapid evolution. This makes the MHC-DRB genes unique among the Class II genes. We created a “focused” tree with more sequences in order to explore the evolution of the MHC-DRB genes further, which is presented in a later section (Figure 4—figure Supplement 8).

Gene conversion is a third way that gene trees might differ from the overall species tree. Gene conversion is the unidirectional copying of a sequence onto a similar sequence (usually another allele or a related locus), which results in two sequences being unusually similar even if they are not related by descent. We consider this possibility in the next section.

Detection of Gene Conversion

Because the MHC contains many related genes in close proximity, gene conversion—the unidirectional exchange of sequence between two similar sequences—can occur (Chen et al., 2007). We used the program GENECONV (Sawyer, 1999) to infer pairs of sequences of which one has likely been converted by the other (Figure 3—source data 1 and Figure 4—source data 1). We recovered known gene conversion events, such as between human allelic lineages HLA-B*38 and HLA-B*67:02, as well as novel events, such as between gorilla allelic lineages Gogo-B*01 and Gogo-B*03 and ape/OWM lineages MHC-DQA1*01 and MHC-DQA1*05.

However, most of the tracts we detected involved many different groups of species but implicated the same pair of loci. We interpreted these as gene conversion events that must have happened a long time ago in the early history of the two genes, and they are likely to blame for the topological differences from exon to exon among the trees. For example, in exon 2, the Class I pseudogene MHC-K groups with MHC-G, while in exon 3 it groups with MHC-F, and in exon 4 it groups outside of MHC-G, -J, and -A (Figure 3). The uncertain early branching structure we observe in our trees may be due to these ancient gene conversion events.

The Importance of the Pseudogenization Process

Gene birth-and-death drives the evolution of a gene family as a whole. The “death” can include the deletion of all or part of a gene from the genome or pseudogenization by means of inactivating mutations, which can leave gene remnants behind. In Class I, we find many pseudogenes that have been produced in this process; while countless more have undoubtedly already been deleted from primate genomes, many full-length and fragment pseudogenes still remain. Although non-functional, these sequences provide insight into the granular process of birth-and-death as well as improve tree inference.

Full Class I haplotypes including the pseudogenes are known only for human, chimpanzee, gorilla, and macaque, and even so we do not have sequences for all the balanced haplotypes in each species (Anzai et al., 2003; Wilming et al., 2013; Shiina et al., 2017; Karl et al., 2023). From these studies, we know that few functional Class I genes are shared by apes/OWMs and NWMs, and so far no shared pseudogenes have been found (Lugo and Cadavid, 2015; Kono et al., 2014; Cadavid et al., 1996; Maccari et al., 2017, 2020). Therefore, the Class I genes in the two groups have been generated by a largely separate series of duplications, neofunctionalizations, and losses. This means that turnover has occurred on a relatively short timescale, and understanding the pseudogenes within the apes and OWM can thus shed light on the evolution of the region more granularly. These ancient remnants could provide clues as to when genes or whole blocks were duplicated, which regions are more prone to duplication, and how the MHC may have functioned in ancestral species.

The Class I MHC region is further divided into three polymorphic blocks—α, κ, and β —that each contain MHC genes but are separated by well-conserved non-MHC genes. The majority of the Class I genes are located in the α-block, which in humans includes 12 MHC genes and pseudogenes (Shiina et al., 2017). The α-block also contains a large number of repetitive elements and gene fragments belonging to other gene families, and their specific repeating pattern in humans led to the conclusion that the region was formed by successive block duplications (Shiina et al., 1999). Later, comparison of macaque and chimpanzee α-block haplotypes with the sequenced human haplotype bolstered this hypothesis, although the proposed series of events is not always consistent with phylogenetic data (Kulski et al., 2005, 2004; Geraghty et al., 1992; Hughes, 1995; Messer et al., 1992; Alexandrov et al., 2023; Gleimer et al., 2011). Improving existing theories about the evolution of this block is useful for disentangling the global pattern of MHC evolution from locus- and gene-specific influences. This could help us understand how selection on specific genes has affected entire linked regions. We therefore created an α-block-focused tree involving sequences from more species than ever before in order to strengthen and update previous hypotheses about the evolution of the block, shown in Figure 5.

Figure 5A shows the α-block-focused tree for exon 3, with an expanded MHC-V clade. MHC-V is a fragment pseudogene containing exons 1-3 which is located near MHC-F in the α-block. Previous work disagrees on the age of this fragment, with some suggesting it was fixed relatively early while others claiming it arose from one of the more recent block duplications (Shiina et al., 1999; Kulski et al., 2004, 2005). Our tree groups ape and OWM MHC-V together and places them as an out-group to all of the classical and non-classical genes, including those of the NWM. Thus, the MHC-V fragment may be an ancient remnant of one of the ancestral Class I genes. We also dispute the hypothesis that MHC-V and -P (a 3’-end fragment) are since-separated pieces of the same original gene (Horton et al., 2008), as we found that both contain exon 3 and their exon 3 sequences clearly do not group together.

In Figure 5B, we zoom in on a different clade within the exon 3 tree, corresponding to the asterisk in panel A. Here we focus on MHC-U, a previously-uncharacterized fragment pseudogene containing only exon 3. Our tree groups it with a clade of human, chimpanzee, and bonobo MHC-A, suggesting it duplicated from MHC-A in the ancestor of these three species. However, it is present on both chimpanzee haplotypes and nearly all human haplotypes. Since these haplotypes likely diverged earlier in the ancestor of human and gorilla, we presume that MHC-U will be found in the gorilla when more haplotypes are sequenced. Ours is the first work to show that MHC-U is actually an MHC-A-related gene fragment.

The exon 4 α-block-focused tree is shown in Figure 5C. Here, we expand the clade for MHC-K, a full-length pseudogene present in apes and OWM. In humans, only MHC-K is present, but on some chimpanzee haplotypes both MHC-K and its duplicate MHC-KL are present. In gorillas, haplotypes can contain either MHC-K or -KL, and in OWM there are many copies of MHC-K as they are part of one of the basic block duplication units (Figure 5—figure Supplement 2) (Karl et al., 2023). Figure 5C shows that MHC-K and -KL are closely related and OWM MHC-K groups outside of both, indicating that the duplication (which also copied MHC-W, -A, and -T) occurred after the split of apes and OWM. We did not detect MHC-K or -KL sequences in either the gibbon or orangutan reference genomes during our BLAST search, so we cannot date this duplication event more precisely. The pseudogene may have been deleted from both genomes entirely, or it may be present on non-reference haplotypes. Sequencing of more haplotypes may help resolve the timing of this duplication event.

Both the exon 3 and exon 4 trees show a clear separation between the clade of MHC-W, -WL, -P, -T, -TL, and -OLI fragment pseudogenes and the rest of the genes (Figures 5A and C). On the chromosome, members of these two subfamilies are interleaved throughout the α-block, suggesting that both groups are old and a series of block duplications occurred. Previous work on human sequences has shown that HLA-P, -W, -T, and -OLI are related (Alexandrov et al., 2023; Hughes, 1995; Kulski et al., 2005). However, humans do not have orthologs of every single primate gene, so utilizing other primate sequences is critical to understanding this subfamily’s evolution.

The MHC-W/WL/P/T/TL/OLI clade, marked with a † in Figure 5C, is expanded in panel D. We expected OWM MHC-W sequences to form a monophyletic clade either outside of all of the ape genes or with a single ape MHC gene, demonstrating orthology. Surprisingly, OWM MHC-W sequences instead formed four distinct clades, with one grouping with ape MHC-W/WL, one with ape MHC-P, one with ape MHC-T/TL, and one outside of all. Furthermore, based on the alleles present, each of these OWM MHC-W clades corresponds to a type of basic repeat block (as revealed by the published macaque MHC haplotype) (Karl et al., 2023; Kulski et al., 2004). The correspondence between the OWM clades and the genes’ physical location lends further support to the hypothesis that macaque haplotypes were generated by tandem duplications. Furthermore, the fact that the OWM MHC-W clades each group with a different ape pseudogene suggests that there are three ape/OWM orthologous groups. Thus, there must have been three distinct MHC-W-like genes in the ape/OWM ancestor.

We also learned more about HLA-OLI, a recently-discovered MHC pseudogene found on the same insertion segment that carries HLA-Y in a small fraction of the human population. Its discoverers used only human sequences, finding HLA-OLI was most similar (88%) to HLA-P (Alexandrov et al., 2023). Our inclusion of non-human primate genes revealed that HLA-OLI is actually most similar in both structure and sequence to MHC-TL, a gene not found in humans (Figure 5—figure Supplement 1). Furthermore, since MHC-Y and -OLI are fully linked and are located in close proximity, it is likely that they duplicated as a unit. MHC-Y’s similarity to MHC-AL/OKO and HLA-OLI’s similarity to MHC-TL is consistent with them duplicating together from those genes, which are adjacent to each other on non-human haplotypes.

With these findings, in addition to many other observations from our trees and results from past literature (references in Figure 6—figure Supplement 1), we propose a new hypothesis for the evolution of the Class I α-block. Figure 6 shows a possible evolutionary path for α-block haplotypes that could have led to the currently observed haplotypes. Haplotypes found so far in each species are at the bottom of the figure (with additional never-before-reported haplotypes from our BLAST search shown in Figure 1—figure Supplement 2). Notably, our work has revealed that MHC-V is an old fragment, three MHC-W-like genes were already established at the time of the ape/OWM ancestor, MHC-U is closely related to African ape MHC-A, and MHC-OLI is closely related to MHC-TL. Additionally, the OWM MHC-A fragment pseudogene is actually more similar to the ape MHC-A genes than to the other OWM MHC-A genes, supporting the existence of two MHC-A-like genes in the ape/OWM ancestor.

Evolution of the Class I α-block.

The primate evolutionary tree is shown in gray (branches not to scale). The bottom of the tree shows currently known haplotypes in each species or species group. Horizontal gray bars indicate haplotypes shared among the African apes. The history of the genes/haplotypes in the α-block is overlaid on the tree, synthesizing previous work with our own observations (see Methods and Figure 8). Genes are represented by colored rectangles, while haplotypes are shown as horizontal lines containing genes. MHC-F—indicating the telomeric end of the α-block—was fixed early on and is located immediately to the left on all haplotypes shown, but is not pictured due to space constraints. Dashed arrows with descriptive labels represent evolutionary events. In the upper right, the “Symbol Key” explains the icons and labels. The “Gene Relationships” panel shows the relationships between the loci shown on the tree, without the layered complexity of haplotypes and speciation events. The “MHC-A Allelic Lineages” panel shows which MHC-A allele groups are present in human, chimpanzee, and gorilla.

Figure 6—figure supplement 1. A version of Figure 6 with references.

Evolution of the MHC-DRB Region

The Class I vs. Class II division reflects a major functional distinction within the MHC gene family, but even within these subfamilies evolution is not homogeneous. Among the Class II genes, there are few duplicated genes and generally only one way for the protein products to pair, e.g. MHC-DPA1 with MHC-DPB1. However, the MHC-DR genes are a notable exception to the general pattern; MHC-DRA can pair with any of the multiple functional MHC-DRB genes. In addition, MHC-DRB has many more related pseudogenes compared to the rest of the Class II genes, making the region’s pattern of evolution more reminiscent of Class I (see Figure 2 to see the varied landscape of MHC-DRB genes in different species). We explored the evolution of the MHC-DRB region in greater detail by creating focused trees with a larger set of MHC-DRB sequences.

In exon 2, which codes for the binding site, the MHC-DRB genes group mostly by name (e.g. MHC-DRB3) across apes, OWM, and NWM (Figure 4—figure Supplement 8A). The exon 2 tree considered alone thus suggests that the genes are orthologous across apes, OWM, and NWM—which is how the genes were named in the first place. Exon 3 does not encode the binding site and is less likely to be affected by convergent evolution; its tree is shown in Figure 4—figure Supplement 8B. In this tree, all NWM sequences group together (clade with blue boxes about halfway up the tree), suggesting that the genes expanded separately in NWM and apes/OWM. Additionally, OWM MHC-DRB1 and -DRB3 form their own clade (green boxes near the top of the tree) and OWM MHC-DRB4 sequences group outside of several ape and NWM clades. We see that only three MHC-DRB genes/pseudogenes (MHC-DRB5, -DRB2/6, and -DRB9) form monophyletic clades of ape and OWM sequences, indicating that these three are the only orthologous MHC-DRB genes. Further, none are orthologous to any particular NWM gene. Thus, the longevity of individual MHC-DRB genes in the primates appears to be less than 38 million years.

The longevity of MHC-DRB haplotypes is even shorter. Only one haplotype is shared between human and chimpanzee, and none are shared with gorilla (de Groot et al., 2009; Heijmans et al., 2020; Hans et al., 2015). This shows that the region is evolving even more rapidly than Class I (where haplotypes are shared among human, chimpanzee, and gorilla; Figure 6). These haplotypes, combined with past literature (Figure 7—figure Supplement 1) and our trees, allowed us to trace backward and propose a hypothesis for the evolution of the region, shown in Figure 7.

Evolution of MHC-DRB.

The bottom of the tree shows current haplotypes in each species or species group; human, chimpanzee, gorilla, and old-world monkey haplotypes are well characterized, while orangutan, gibbon, and new-world monkey haplotypes are partially known. The history of the genes/haplotypes in the MHC-DRB region is overlaid on the tree, synthesizing previous work with our own observations (see Methods and Figure 8). The rest of the figure design follows that of Figure 6.

Figure 7—figure supplement 1. A version of Figure 7 with references.

Limited data exists for the orangutan, and in our exon 3 tree (Figure 4—figure Supplement 8B) orangutan alleles do not group definitively with any other ape lineages. Furthermore, the orangutan MHC-DRB3 gene groups with orangutan MHC-DRB1, suggesting that it may not be orthologous to the African ape MHC-DRB3 gene. We also found an orangutan sequence that groups with the human HLA-DRB2 pseudogene, suggesting that this gene has an ortholog in the orangutan. Several haplotypes have been previously identified in the gibbon, but since they rely on exon 2 sequence alone, it is unclear how these alleles relate to the known ape lineages (de Groot et al., 2017a). Analysis of more orangutan and gibbon haplotypes will be essential for understanding how the region has evolved in the apes.

Overall, the MHC-DRB genes are not evolving in the same fashion as the rest of the Class II genes, even though they have a shared structure and function. This peculiar case illustrates that there are multiple ways to achieve a functional immune response from the same basic parts.

Diferences between MHC Subfamilies

We explored the evolution of the Class I and Class II genes separately and noticed several differences between the classes. First, sequences group by gene rather than species group in the Class II gene trees (Figure 4, Figure 4—figure Supplement 1, and Figure 4—figure Supplement 2). Our inclusion of RefSeq sequences from distant groups of placental mammals confirms that most of the primate Class II genes have maintained orthology at least since the ancestor of placentals, 105 million years ago (Foley et al., 2023). In contrast, our Class I trees (Figure 3 and Figure 3—figure Supplement 2) showed sequences more often grouping by species group than by gene, indicating that the genes turn over quickly and 1:1 orthology is often lost. Only non-classical MHC-F (and possibly MHC-E) are truly orthologous among the apes, OWM, and NWM, consistent with previous findings (Piontkivska and Nei, 2003; Adams and Parham, 2001b; Sawai et al., 2004). Additionally, our tarsier and Strepsirrhini sequences group outside of all Simiiformes Class I sequences, setting an upper bound on the maintenance of Class I orthology of 58 million years (Kuderna et al., 2023; Flügge et al., 2002).

This turnover of genes at the MHC—rapid for Class I and slower for Class II—is generally believed to be due to host-pathogen co-evolution (Radwan et al., 2020). Although this means the MHC genes are critically important for survival, no single gene is so vital that its role must be preserved. For example, in the apes the MHC-G gene is non-classical, but in the OWM it has been inactivated and its role largely replaced by an MHC-A-related gene called MHC-AG (Heijmans et al., 2020). This process of turnover ultimately results in different sets of MHC genes being used in different lineages. For instance, separate expansions generated the classical Class I genes in NWM (all called MHC-G) and the α-block genes in apes/OWM. Similarly, separate expansions generated the MHC-DRB genes of the NWM and of the apes/OWM. Aside from MHC-DRB, the Class II genes have been largely stable across the mammals, although we do see some lineage-specific expansions and contractions (Figure 2 and Figure 2—figure Supplement 2).

Class I and Class II also differ in their degree of gene conversion. Our GENECONV analysis revealed two types of gene conversion events: 1) specific, more-recent events involving paralogous genes or particular allelic lineages and 2) broad-scale, very-old events involving two dissimilar loci (Figure 3—source data 1 and Figure 4—source data 1). We discovered far more “specific” events in Class I, while “broad-scale” events were predominant in Class II. This could reflect the different age of these gene groups: Class I genes turn over more rapidly and allelic lineages are less diverged from each other, making gene conversion more likely. In contrast, Class II genes have much longer-lived allelic lineages, potentially explaining why we mainly picked up older events in the Class II GENECONV analysis.

Even within a class, evolutionary patterns are not homogeneous. The non-classical vs. classical distinction is one functionally meaningful way to partition the genes. The classical genes perform peptide presentation to T-cells, making them direct targets of host-pathogen co-evolution. In contrast, the non-classical genes are involved in innate immune surveillance or niche roles, and may be less directly affected by this co-evolution. In our trees, sequences from non-classical genes of both classes often group by gene with topology matching the species tree, while sequences from classical genes do neither (Figure 3—figure Supplement 2, Figure 4—figure Supplement 1, and Figure 4—figure Supplement 2). This shows that classical genes experience more turnover and are more often affected by long-term balancing selection or convergent evolution. Ultimately, selection acts upon functional differences between classical and non-classical genes in a manner that is largely independent of whether they belong to Class I or Class II.

Discussion

The MHC proteins serve diverse roles in innate and adaptive immunity (Adams and Luoma, 2013). They are critically important to infection resistance, autoimmune-disease susceptibility, and organ transplantation success, and can provide insight into human evolution, inform disease studies, and improve upon non-human-primate disease models (Kennedy et al., 2017). Despite their varied functions, all Class I and Class II MHC genes are derived from a common ancestor, allowing us to compare genes to learn more about the evolution of the gene family as a whole (Hansen et al., 2007; Kupfermann et al., 1999; Kaufman, 2022; Adams and Luoma, 2013). A few ∼20-year-old studies addressed the overall evolution of the MHC gene family via multi-gene alignment and phylogenetics, but the trees had many polytomies (Adams and Parham, 2001b; Sawai et al., 2004; Cardenas et al., 2005; Piontkivska and Nei, 2003; Takahashi et al., 2000). Since then, most work has focused on particular genes or small sets of species, meaning our knowledge of primate MHC evolution is scattered across hundreds of papers (Urvater et al., 2000; Van Der Wiel et al., 2013; Geller et al., 2002; Hans et al., 2017; Maibach et al., 2017; Wroblewski et al., 2017, 2019; Lafont et al., 2004; Flügge et al., 2002; Go et al., 2003, 2005; Shiina et al., 2017; Abi-Rached et al., 2010; Gleimer et al., 2011; de Groot et al., 2015; De Groot et al., 2012; de Groot et al., 2022; Cao et al., 2015; Otting et al., 2020; Fukami-Kobayashi et al., 2005; Lugo and Cadavid, 2015; Averdam et al., 2011; Figueroa et al., 1994; Doxiadis et al., 2012; Buckner et al., 2021; Doxiadis et al., 2006; Diaz et al., 2000; Gongora et al., 1997; Kasahara et al., 1992; de Groot et al., 2009; Satta et al., 1996). In this project, we revisited primate MHC evolution with more data from a wider range of species and a coherent analysis framework. We confirm and unify past findings, as well as contribute many new insights into the evolution of this complex family.

We found that the Class I genes turn over rapidly, with only the non-classical gene MHC-F being clearly orthologous across the Simiiformes. In the rest of the Class I α-block, genes expanded entirely separately in the ape/OWM and NWM lineages. This process of expansion generated many full-length and fragment pseudogenes, which we found were equally important as the functional genes to understanding the evolution of the region as a whole. Notably, we found that MHC-U is an MHC-A-related pseudogene, MHC-V is not closely related to MHC-P, and that there were at least three genes of the MHC-W/P/T/OLI family present in the ape/OWM ancestor. Generally, Class II genes do not turn over as rapidly, although there were exceptions. The classical MHC-DRB genes were even shorter-lived than the Class I genes, with most human genes lacking 1:1 orthologs beyond the great apes. We also found that the classical MHC-DQA and -DQB genes likely expanded separately in the ape/OWM and NWM lineages. In contrast, the classical MHC-DPA and -DPB genes were orthologous across the Simiiformes, and the non-classical Class II genes were 1:1 orthologous across most of the mammals. In both Class I and Class II, classical genes turned over more rapidly than non-classical genes and their trees exhibited more deviations from the expected species tree. Overall, our treatment of the genes as related entities instead of distinct cases helped us understand shared patterns of evolution across classes and species groups.

One concern when discussing gene families is the relative importance of birth-and-death and concerted evolution by gene conversion (Gu and Nei, 1999a; Nei and Rooney, 2005; Klein et al., 2007; Bergström and Gyllensten, 1995; Gyllensten et al., 1991; Nei et al., 1997). Gene conversion can cause adjacent small sequence tracts to have wildly different evolutionary histories, making it difficult to interpret a tree constructed from larger regions. Our trees reveal different topologies depending on exon and our GENECONV analysis pulled out several different sequence pairs, revealing that gene conversion has played a significant role in the evolution of the MHC genes. With this in mind, comparing trees across exons helps us interpret the overall trees and strengthens our conclusions. Neither birth-and-death nor concerted evolution can be ignored when discussing gene families.

The MHC region is difficult to assemble owing to the large number of related genes, extreme polymorphism, and abundant repetitive regions. Nonetheless, several committed researchers have dedicated effort to sequencing and mapping the region in different species (Wilming et al., 2013; Anzai et al., 2003; Karl et al., 2023; Okano et al., 2020; Hammer et al., 2020; Liao et al., 2023). However, we now know that haplotype variation is just as important as nucleotide variation to maintaining MHC diversity, and haplotypes are surprisingly short-lived (de Groot et al., 2015, 2017a; Gleimer et al., 2011; Hans et al., 2015; Heijmans et al., 2020). Therefore, more research effort should be dedicated to fully characterizing the breadth of MHC haplotypes in different species. This is a difficult problem owing to the plethora of repetitive elements and recently-duplicated genes in the region, and long-read sequencing will be invaluable for parsing these complex haplo-types. Not only will this improve our understanding of health and disease in each species, but it will also help us answer evolutionary questions with better precision.

We hope that our extensive set of trees incorporating classical genes, non-classical genes, pseudogenes, gene fragments, and alleles of medical interest across a wide range of species will provide context to future researchers. This work will provide a jumping-off-point for further exploration of the evolutionary processes affecting different subsets of the gene family as well as necessary context for studies of particular alleles or genes in disease.

Methods and Materials

Data Collection

We downloaded MHC allele nucleotide sequences for all human and non-human genes from the IPD Database (collected January 2023) (Barker et al., 2023; Maccari et al., 2017, 2020; Robinson et al., 2024). To supplement the alleles available in the database, we also collected nucleotide sequences from NCBI using the Entrez E-utilities with query “histocompatibility AND txidX AND alive[prop]”, where X is a taxon of interest. This resulted in a very large collection of sequences from a large number of species. While Class II genes were generally assigned to loci, most Class I sequences had ambiguous or no locus assignments. Therefore, we performed a refined search for additional sequences by running BLAST on the available primate reference genomes.

BLAST Search For the reference genomes, we downloaded human chromosome 6 (GenBank accession CM000668.2), chimpanzee chromosome 5 (CM054439.2), bonobo chromosome 5 (CM055477.2), gorilla chromosome 5 (CM055451.2), Sumatran orangutan chromosome 5 (CM054684.2), Bornean orangutan chromosome 5 (CM054635.2), pileated gibbon linkage group LG22 (CM038537.1), siamang chromosome 23 (CM054531.2), Northern white-cheeked gibbon chromosome 22a (CM016966.1), olive baboon chromosome 6 (CM018185.2), Guinea baboon chromosome 6 (CM053423.1), gelada chromosome 4 (CM009953.1), Tibetan macaque chromosome 4 (CM045091.1), crab-eating macaque chromosome 4 (CP141358.1), Formosan rock macaque chromosome 4 (CM049490.1), mantled guereza chromosome 5 (CM058078.1), snub-nosed monkey chromosome 4 (CM017354.1), cotton-top tamarin chromosome 4 (CM063172.1), golden-handed tamarin linkage group LG04 (CM038394.1), common marmoset chromosome 4 (CM021918.1), coppery titi chromosome 3 (CM080817.1), gray mouse lemur chromosome 6 (CM007666.1), black-and-white ruffed lemur chromosome 6 (CM052441.1), mongoose lemur chromosome 15 (CM052867.1), ring-tailed lemur chromosome 2 (CM036473.1), Bengal slow loris linkage group LG08 (CM043617.1), Sunda slow loris chromosome 9 (CM050145.1), Philippine flying lemur chromosome 5 (CM050031.1), and mouse chromosome 17 (CM001010.3).

To create the BLAST databse, we first compiled all nucleotide MHC sequences from the IPD-MHC and IPD-IMGT/HLA databases into three fasta files: one containing the Class I sequences, one containing the Class II sequences, and one containing MHC-DRB9 sequences. We then constructed three custom databases from these sets of sequences using the makeblastdb command in BLAST version 2.11.0 (Camacho et al., 2009).

We then queried each of the three custom databases using the above reference genomes and screened the hits manually. In most cases, we were able to identify loci unambiguously, resulting in several newly-reported haplotypes (Figure 1—figure Supplement 2 and Figure 2—figure Supplement 2). The discovery of various genes in various species also allowed us to fill in gaps in Figure 1 and Figure 2.

Sequence Selection

Because BEAST2 is computationally limited by the number of sequences, it was necessary to prioritize certain sequences. To do this, we (very roughly) aligned as many exon 2 and 3 sequences as possible (from both NCBI RefSeq and the IPD database) using MUSCLE (Edgar, 2004) with default settings. We then constructed UPGMA trees in R to visualize the sequences. We preferentially selected sequences that were 1) in primate species not represented by the IPD database or 2) grouped with genes not well represented by the IPD database, and which were not similar/identical to other sequences. We also included several non-primate species to provide context and explore orthology beyond the primates. After choosing sequences with this preliminary screening method, we collected the full-length sequences for inclusion in further analyses. We limited sequences to one per species-gene pair for building the Class I, Class IIA, and Class IIB multi-gene trees (lists of alleles provided as Supplementary Files).

For Class I, we then re-aligned all genes together for each exon separately using MUSCLE (Edgar, 2004) with default settings (and manually adjusted). For Class II, alleles for each gene group (MHC-DMA, -DMB, -DOA, -DOB, -DPA, -DPB, -DQA, -DQB, -DRA, and -DRB) were aligned separately for each exon using MUSCLE (Edgar, 2004) with default settings (and manually adjusted). Since some Class II genes are too far diverged from one another to be reliably aligned automatically, the nucleotide alignments were then combined manually based on published amino acid alignments (Radley et al., 1994; Dijkstra et al., 2013; Dijkstra and Yamaguchi, 2019; Cuesta et al., 2006; Chen et al., 2006; Chazara et al., 2011). For Class IIA, exons 4 and 5 were concatenated together before this manual combination process because some analogous sites between genes are located across exons. For the same reason, exons 5 and 6 were concatenated together for Class IIB before combining. This produced three multi-gene alignments: Class I, Class IIA, and Class IIB.

We also aligned a larger set of sequences for each gene group to create our “focused” trees that each zoomed in on a different subtree of the multi-gene trees. Details for this are located in the Methods of our companion paper (Fortier and Pritchard, 2024).

Bayesian Phylogenetic Analysis

We constructed phylogenetic trees using BEAST2 (Bouckaert et al., 2014, 2019) with package substBMA (Wu et al., 2013). SubstBMA implements a spike-and-slab mixture model that simultaneously estimates the phylogenetic tree, the number of site partitions, the assignment of sites to partitions, the nucleotide substitution model, and a rate multiplier for each partition. Since we were chiefly interested in the partitions and their rate multipliers, we used the RDPM model as described by Wu et al. (2013). In the RDPM model, the number of nucleotide substitution model categories is fixed to 1, so that all sites, regardless of rate partition, share the same estimated nucleotide substitution model. This reduces the number of parameters to be estimated and ensures that only evolutionary rates vary across site partitions, reducing overall model complexity. We used an uncorrelated lognormal relaxed molecular clock because we wanted evolutionary rates to be able to vary among branches.

Priors

For the Dirichlet process priors, we used the informative priors constructed by Wu et al. (2013) for their mammal dataset. This is appropriate because they include several of the same species and their mammals span approximately the same evolutionary time that we consider in our study. We also use their same priors on tree height, base rate distribution, and a Yule process coalescent prior. We did not specify a calibration point—a time-based prior on a node—because we did not expect our sequences to group according to the species tree.

Running BEAST2

We ran BEAST2 on various subsets of the three alignments. Considering exons separately helped to minimize the effects of recombination on the tree, while also allowing us to compare and contrast tree topologies for exons encoding the binding site vs. exons encoding the other domains. For Class I, we repeated the analysis for 1) exon 2 only (PBR), 2) exon 3 only (PBR), and 3) exon 4 only (non-PBR). For Class IIA, we used 1) exon 2 only (PBR) and 2) exon 3 only (non-PBR). For Class IIB, we analyzed 1) exon 2 only (PBR) and 2) exon 3 only (non-PBR). In the following, each “analysis” refers to a collection of BEAST2 runs using a particular subset of either the Class I, Class IIA, or Class IIB alignment. The procedure is exactly the same for the “focused” trees, which each focus on a particular gene group within the Class I, Class IIA, or Class IIB alignment. More detail about the generation of the focused trees is located in the Methods of our companion paper (Fortier and Pritchard, 2024).

The xml files we used to run BEAST2 were based closely on those used for the mammal dataset with the RDPM model and uncorrelated relaxed clock in Wu et al. (2013) (https://github.com/jessiewu/substBMA/blob/master/examples/mammal/mammal_rdpm_uc.xml). Running a model with per-site evolutionary rate categories and a relaxed clock means there are many parameters to estimate. Along with the large number of parameters, highly-polymorphic and highly-diverged sequences make it difficult for BEAST2 to explore the state space. Thus, we undertook considerable effort to ensure good mixing and convergence of the chains. First, we employed coupled MCMC for all analyses. Coupled MCMC is essentially the same as the regular MCMC used in BEAST2, except that it uses additional “heated” chains with increased acceptance probabilities that can traverse unfavorable intermediate states and allow the main chain to move away from an inferior local optimum (Müller and Bouckaert, 2020). Using coupled MCMC both speeds up BEAST2 runs and improves mixing and convergence. We used four heated chains for each run with a delta temperature of 0.025. Second, we ran each BEAST2 run for 40,000,000 states, discarding the first 4,000,000 states as burn-in and sampling every 10,000 states. Third, we ran at least eight independent replicates of each analysis. The replicates use the exact same alignment and coupled MCMC settings, but explore state space independently and thus are useful for improving the effective sample size of tricky parameters. As recommended by BEAST2, we examined all replicates in Tracer version 1.7.2 (Rambaut et al., 2018) to ensure that they were sampling from the same parameter distributions and had reached convergence. We excluded replicates for which this was not true, as these chains were probably stuck in suboptimal state space. Additionally, Tracer provides estimates of the effective sample size (ESS) for the combined set of states from all chosen replicates, and we required that the combined ESS be larger than 100 for all parameters. If there were fewer than 4 acceptable replicates or if the ESS was below 100 for any parameter, we re-ran more independent replicates of the analysis until these requirements were satisfied. We obtained between 7 and 14 acceptable replicates (median 8) per analysis for the Class I, Class IIA, and Class IIB runs.

For some analyses, computational limitations prevented BEAST2 from being able to reach 40,000,000 states. In these situations, more replicates (of fewer states) were usually required to achieve good mixing and convergence. Regardless of how far these BEAST2 runs got, the first 4,000,000 states from each run were still discarded as burn-in even though this represented more than 10% of states. The xml files required to run all our analyses are provided as Supplementary Files.

This extremely stringent procedure ensured that all of the replicates were exploring the same parameter space and were converging upon the same global optimum, allowing the g 4 independent runs to be justifiably combined. We combined the acceptable replicates (discarding the first 4,000,000 states as burn-in) using LogCombiner version 2.6.7 (Drummond and Rambaut, 2007), which aggregates the results across all states. We then used the combined results for downstream analyses.

Phylogenetic Trees

After combining acceptable replicates, we obtained 17,927 - 28,384 phylogenies per gene/sequence subset for the Class I, Class IIA, and Class IIB trees (mean 25,154). We used TreeAnnotator version 2.6.3 (Drummond and Rambaut, 2007) to summarize each set of possible trees as a maximum clade credibility tree, which is the tree that maximizes the sum of posterior clade probabilities. Since BEAST2 samples trees from the posterior, one could in principle reduce the large set of trees to a smaller 95% credible set of trees representing the “true” tree (BEA, 2024). However, given the high complexity of the model space, all our posterior trees were unique, meaning this was not possible in practice. Throughout this paper, we rely on summary trees for our observations.

Integration with Literature

Hundreds of authors have contributed to the study of MHC evolution, and their myriad published results played a key role in this project. Figure 8 illustrates our approach to this project, including how we used existing literature and how we divided results among this paper and its companion (Fortier and Pritchard, 2024). We first constructed large multi-gene trees encompassing all Class I, Class IIA, and Class IIB genes. These provided a backbone for us to investigate subtrees in more depth, adding more sequences and more species to construct “focused trees” for each gene group. These, in combination with the literature, allowed us to create hypotheses about the evolution of the Class I α-block (Figure 6) and Class II MHC-DRB region (Figure 7).

BEAST2 trees provide insight into MHC gene and allele relationships.

We first created multi-gene Bayesian phylogenetic trees using sequences from all genes and species, separated into Class I, Class IIA, and Class IIB groups. We then focused on various subtrees of the multi-gene trees by adding more sequences for each subtree and running BEAST2 using only sequences from that group (in addition to the “backbone” sequences common to all trees). Our trees gave us insight into both overall gene relationships (this paper) and allele relationships within gene groups (see our companion paper, Fortier and Pritchard (2024)).

Gene Conversion

We inferred gene conversion fragments using GENECONV version 1.81a (Sawyer, 1999) on each focused alignment. It is generally advisable to use only synonymous sites when running the program on a protein-coding alignment, since silent sites within the same codon position are likely to be correlated. However, the extreme polymorphism in these MHC genes meant there were too few silent sites to use in the analysis. Thus, we considered all sites but caution that this could slightly overestimate the lengths of our inferred conversion tracts. For each alignment, we ran GENECONV with options ListPairs, Allouter, Numsims=10000, and Startseed=310. We collected all inferred “Global Inner” (GI) fragments with sim_pval< 0.05 (this is pre-corrected for multiple comparisons by the program). GI fragments indicate a stretch of similar sequence shared by two otherwise-dissimilar sequences in the alignment. This suggests that a gene conversion event occurred between the ancestors of the two sequences.

Many of the thousands of GI hits were redundant, involving very-closely-related alleles, slightly different fragment bounds, or even a wide range of species all implicating the same gene. We manually grouped and summarized these hits for Figure 3—source data 1 and Figure 4—source data 1. The “start” and “end” columns indicate the smallest start and largest end position (along the alignment) for the group of redundant hits, and the sequences involved are summarized as specifically as possible.

Acknowledgements

We acknowledge support from NIH grants R01 HG011432 and R01 HG008140. This material is based upon work supported by the National Science Foundation Graduate Research Fellowship under Grant No. DGE-1656518. We appreciate helpful comments from Jeffrey Spence, the Pritchard lab, and the reviewers of the previous version of our companion paper, which jumpstarted this project.