GENESPACE tracks regions of interest and gene copy number variation across multiple genomes

  1. John T Lovell  Is a corresponding author
  2. Avinash Sreedasyam
  3. M Eric Schranz
  4. Melissa Wilson
  5. Joseph W Carlson
  6. Alex Harkess
  7. David Emms
  8. David M Goodstein
  9. Jeremy Schmutz
  1. Genome Sequencing Center, HudsonAlpha Institute for Biotechnology, United States
  2. Joint Genome Institute, Lawrence Berkeley National Laboratory, United States
  3. Biosystematics Group, Wageningen University and Research, Netherlands
  4. Center for Evolution and Medicine, School of Life Sciences, Arizona State University, United States
  5. Department of Crop, Soil, and Environmental Sciences, Auburn University, United States
  6. Oxford University, United Kingdom

Abstract

The development of multiple chromosome-scale reference genome sequences in many taxonomic groups has yielded a high-resolution view of the patterns and processes of molecular evolution. Nonetheless, leveraging information across multiple genomes remains a significant challenge in nearly all eukaryotic systems. These challenges range from studying the evolution of chromosome structure, to finding candidate genes for quantitative trait loci, to testing hypotheses about speciation and adaptation. Here, we present GENESPACE, which addresses these challenges by integrating conserved gene order and orthology to define the expected physical position of all genes across multiple genomes. We demonstrate this utility by dissecting presence–absence, copy-number, and structural variation at three levels of biological organization: spanning 300 million years of vertebrate sex chromosome evolution, across the diversity of the Poaceae (grass) plant family, and among 26 maize cultivars. The methods to build and visualize syntenic orthology in the GENESPACE R package offer a significant addition to existing gene family and synteny programs, especially in polyploid, outbred, and other complex genomes.

Editor's evaluation

GENESPACE is a new and straightforward computational tool to include synteny information in the calculation of genome-wide sets of orthologs. The development of this tool is very timely as more and more complete chromosome-scale assembled genomes are becoming available. While the assembly problem has been solved, this is not the case for multiple genome comparisons, and GENESPACE is an important step to help remedy this gap in our comparative genomics toolbox.

https://doi.org/10.7554/eLife.78526.sa0

eLife digest

The genome is the complete DNA sequence of an individual. It is a crucial foundation for many studies in medicine, agriculture, and conservation biology. Advances in genetics have made it possible to rapidly sequence, or read out, the genome of many organisms. For closely related species, scientists can then do detailed comparisons, revealing similar genes with a shared past or a common role, but comparing more distantly related organisms remains difficult.

One major challenge is that genes are often lost or duplicated over evolutionary time. One way to be more confident is to look at ‘synteny’, or how genes are organized or ordered within the genome. In some groups of species, synteny persists across millions of years of evolution. Combining sequence similarity with gene order could make comparisons between distantly related species more robust.

To do this, Lovell et al. developed GENESPACE, a software that links similarities between DNA sequences to the order of genes in a genome. This allows researchers to visualize and explore related DNA sequences and determine whether genes have been lost or duplicated. To demonstrate the value of GENESPACE, Lovell et al. explored evolution in vertebrates and flowering plants. The software was able to highlight the shared sequences between unique sex chromosomes in birds and mammals, and it was able to track the positions of genes important in the evolution of grass crops including maize, wheat, and rice.

Exploring the genetic code in this way could lead to a better understanding of the evolution of important sections of the genome. It might also allow scientists to find target genes for applications like crop improvement. Lovell et al. have designed the GENESPACE software to be easy for other scientists to use, allowing them to make graphics and perform analyses with few programming skills.

Introduction

De novo genome assemblies and gene model annotations represent increasingly common resources that describe the sequence and positions of protein coding and intergenic regions within a single genotype. Evolutionary relationships among these DNA sequences form the foundation of many molecular tools in modern medical, breeding, and evolutionary biology research.

Perhaps the most crucial inference to make when comparing genomes revolves around homologous genes, which share an evolutionary common ancestor and ensuing sequence or protein structure similarity. Analyses of homologs, including comparative gene expression, epigenetics, and sequence evolution, require the distinction between orthologs, which arise from speciation events, and paralogs, which arise from sequence duplications (see Box 1 for definitions). In some systems, this is a simple task where most genes are single copy, and orthologs are synonymous with reciprocal best-scoring protein BLAST hits. Other sequence similarity approaches such as OrthoFinder (Emms and Kelly, 2019; Emms and Kelly, 2015) leverage graphs and gene trees to test for orthology, permitting more robust analyses in systems with gene copy number (CNV) or presence–absence variation (PAV). However, whole-genome duplications (WGDs), chromosomal deletions, and variable rates of sequence evolution, such as subgenome dominance in polyploids, can confound the evidence of orthology from sequence similarity alone.

Box 1

Definitions

Orthogroup — a set of genes across multiple genomes derived from a single ancestral gene.

Ortholog — a pair of orthogroup members in two species derived from a single gene in their most recent common ancestor.

Paralog — orthogroup members derived from a duplication event since speciation.

Homeolog — paralogs derived from a whole-genome duplication.

Tandem array — paralogs in proximity on a chromosome within a genome.

Gene collinearity — retained order of genes across species due to common ancestry.

Synteny — like collinearity but at larger scales, like chromosome arms.

Pan-genome annotation — A set of orthogroups across multiple genomes, placed along the coordinate system of a specified reference genome.

The physical position of homologs offers a second line of evidence that can help overcome challenges posed by WGDs, tandem arrays, heterozygous-duplicated regions, and other genomic complexities (Drillon et al., 2020; Haug-Baltzell et al., 2017; Wang et al., 2012). Synteny, or the conserved order of DNA sequences among chromosomes that share a common ancestor, is a typical feature of eukaryotic genomes. In some taxa, synteny is preserved across hundreds of millions of years of evolution and is retained over multiple WGDs (Jiao et al., 2014; Simakov et al., 2020; Zhao and Schranz, 2019). Like chromosome-scale synteny, conserved gene order collinearity along local regions of chromosomes can provide evidence of homology, and in some cases enable determination of whether two regions diverged as a result of speciation or a large-scale duplication event (Drillon et al., 2020). Combined, evidence of gene collinearity and sequence similarity should improve the ability to classify paralogous and orthologous relationships beyond either approach in isolation.

Integrating synteny and collinearity into comparative genomics pipelines also physically anchors the positions of related gene sequences onto the assemblies of each genome. For example, by exploring only syntenic orthologs it is possible to examine all putatively functional variants within a genomic region of interest, even those in genes that are absent in the focal reference genome (Lovell et al., 2018). Such a ‘pan-genome annotation’ framework (Lovell et al., 2021a) permits access to multi-genome networks of high-confidence orthologs and paralogs, regardless of ploidy or other complicating aspects of genome biology. Here, we present GENESPACE, an analytical pipeline that explicitly links synteny and sequence similarity to provide high-confidence inference about networks of genes that share a common ancestor and represents these networks as a ‘pan-genome annotation’. We then leverage this framework to explore gene family evolution in flowering plants, mammals, and reptiles.

Results and discussion

GENESPACE syntenic orthology methods to compare multiple complex genomes

Comparative genomics across the complex evolutionary histories of eukaryotes typically requires equally varied input and analytical pipelines depending on researchers’ goals and study systems. For example, synteny between closely related haploid assemblies is often inferred by exploring only 1:1 reciprocal best-scoring hits with MCScanX (Wang et al., 2012). Alternatively, polyploid genomes are typically split into subgenomes so that homeologs are viewed by clustering algorithms like OrthoFinder (Emms and Kelly, 2019; Emms and Kelly, 2015) as orthologous and not paralogous. While expert knowledge that informs these analytical decisions can dramatically improve precision, this knowledge is not available in many systems. These issues boil down to a simple circular problem: a priori knowledge of gene copy number is needed to effectively infer orthology and synteny, yet measures of synteny and orthology are needed to infer copy number between a pair of sequences.

GENESPACE resolves this circular problem by operating on a foundational assumption: homologs should be exactly single copy within any syntenic region between a pair of genomes. There are two major violations of this assumption that cause copy number variation within a syntenic block: (1) tandem arrays and (2) gene PAV. GENESPACE addresses these complexities directly in two ways. First, physically proximate multigene families (hereon, ‘tandem arrays’, Box 1) are condensed to the physically most central gene of the array. Gene rank order along the genome is recalculated on these ‘array representative’ genes, effectively masking copy number variation due to tandem arrays. Second, synteny is inferred in a pairwise manner only using ‘potential anchor’ protein BLAST hits (hereon ‘hits’) where both the query and target genes are in the same orthogroup. The rank-order positions of these potential anchor genes are also recalculated prior to synteny inference, which effectively masks orthogroups missing a gene in one genome (i.e., PAV). Thus, GENESPACE operates on OrthoFinder-derived orthogroups with one-to-one relationships for any accurately defined syntenic region, regardless of ploidy or level of sequence conservation.

Given GENESPACE’s reliance on syntenic regions between genomes, errors in syntenic block coordinates can have major effects on downstream estimates. Therefore, we crafted a sensitive pipeline to infer syntenic regions (see pipeline overview in Figure 1). To demonstrate its functionality, we ran GENESPACE on seven pairs of genomes spanning closely related maize cultivars to ancient mammal–bird divergence (~300 M ya; Table 1, see Materials and methods). The three comparisons between vertebrate genomes have no recent history of WGDs, while WGDs predated evolutionary divergence of the four plant contrasts: maize is a 12 M ya paleo-tetraploid, cotton is a 1.6 M ya meso-tetraploid, and the ~70 M ya Rho WGD predated grass diversification. Paralogs derived from these WGDs notoriously obscure contrasts between orthologs in these systems. As such, we treated all genomes as haploid and did not include an outgroup; therefore, orthogroups should not include homeologs since the genomes share a common ancestor that arose after the WGD (see ‘additional considerations’ methods section: Outgroups and the phylogenetic context of orthology inference). In each run, we contrasted GENESPACE-derived orthogroups and syntenic blocks to the defaults from OrthoFinder and MCScanX, respectively (Table 1). Since it is a common practice to refine hits prior to running MCScanX, we also included syntenic block coverage from MCScanX run on orthogroup-constrained hits where both the target and query genes must be in the same orthogroup. To contrast each approach, we calculated the percent of genes in (1) orthogroups and (2) syntenic blocks that were placed on exactly one chromosome per genome (Table 1).

GENESPACE synteny and pan-genome annotation methods.

(A, grey panel) GENESPACE runs and parses OrthoFinder results into a synteny-constrained pan-genome annotation. (B, purple panel) Chromosome, gene rank order, and orthogroup membership are added to BLAST hits, which allows direct integration between estimates of orthology and synteny. The three dotplots present the efficacy of GENESPACE syntenic blocks by exploring a particularly challenging region on human (x-axis) and chimpanzee (y-axis) chr. 6. Each point is a BLAST hit rank-order position, colored by syntenic block; colors are recycled if there are more than eight blocks. (C, green panel) Synteny-constrained orthogroups and optionally non-syntenic orthologs are decomposed into a pan-genome annotation where each orthogroup is placed at its inferred syntenic position.

Table 1
Comparison of synteny and orthogroup methods.

To test the precision of GENESPACE syntenic orthogroups estimates, we contrasted seven pairs of haploid genome assemblies. We present the percent of genes that were found in an orthogroup that hit a single chromosome per genome from the default OrthoFinder and GENESPACE runs. The precision of syntenic block breakpoint estimates was calculated similarly, where the percentage of genes that are placed in a single syntenic block per genome are presented for MCScanX run on all hits, those where both the query and target genes are in the same orthogroup (‘OG’) or via the GENESPACE pipeline.

(a) % genes in single-copy OGs(b) % genes in single-copy syntenic blocks
Age (~M ya)OrthoFinderGENESPACEMCScanXMCScanX OGGENESPACE
B73 vs. B97 maize*<0.0151.573.650.879.093.4
Human Hg38 vs. T2T0–0.187.795.981.195.097.7
Cotton*,+0.535.685.72.714.196.2
HAL2 vs. FIL2 panicgrass*1.174.883.262.389.392.0
Human-chimpanzee781.190.278.691.293.3
Sorghum-Brachypodium*5046.750.249.367.476.3
Human-chicken31066.768.566.471.273.0
  1. *

    The plant genomes all have one or more WGDs that predate divergence of the genomes,.+Cotton species Gossypium barbadense and G. darwinii have the most recent WGD of ~1.6 M ya, which causes a large number of blocks to be included as two copies; to avoid confusion between subgenomes, blkSize, and nGaps parameters were increased from 5 (default) to 10 genes.

For every pair of genomes, GENESPACE produced a greater percentage of single-copy syntenic blocks and genes in single chromosome orthogroups than either OrthoFinder or MCScanX in isolation. GENESPACE also outperformed simple integration between the two methods through MCScanX on orthogroup-constrained hits. The improved performance of GENESPACE synteny-constrained orthogroups was most subtle between highly diverged haploid animal assemblies. For example, in the comparison between human and chicken, GENESPACE resolved 2% and 1.8% more single-copy orthogroups and syntenic blocks than default OrthoFinder and orthogroup-constrained MCScanX, respectively. In contrast, the benefits of GENESPACE were most pronounced in recently diverged genomes with a history of WGDs. Single-copy orthogroups between two meso-tetraploid cotton species that share a WGD that predated speciation by ~1 M ya, were uncommon in the default OrthoFinder run (35.6%) but far more prevalent in GENESPACE syntenic orthogroups (85.7%). Similarly, homeologs derived from the cotton WGD impacted estimates of syntenic blocks: only 14% of the genomes were single-copy syntenic in orthogroup-constrained MCScanX but 96.2% were single copy in GENESPACE syntenic blocks. Combined, these results demonstrate significant flexibility and utility of GENESPACE across a range of evolutionary histories and divergence.

It is important to note that some evolutionary processes, including small-scale translocations, can cause true orthologs to exist outside of syntenic regions. Closely related genomes without a history of WGDs tend to have few non-syntenic orthologs. For example, there are 1096 non-syntenic orthologs (6.3% of all orthologs) between human and chimpanzee. In contrast, the 50 M ya diverged Sorghum and Brachypodium genomes have 9002 non-syntenic orthologs, many of which are the result of over-retained Rho WGD-derived paralogs (see below). Since the non-syntenic orthologs can be important in some situations, GENESPACE embraces this complexity by including and flagging non-syntenic orthologs within the pan-genome annotation (Figure 1).

Synteny-anchored exploration of vertebrate sex chromosomes

GENESPACE facilitates the exploration and analysis of sequence evolution across multiple genomes within regions of interest (e.g., quantitative trait loci [QTL] intervals, see the next section). One particularly instructive example comes from the origin and evolution of the mammalian XY and avian ZW sex chromosome systems. To explore these chromosomes, we ran GENESPACE on 15 haploid avian and mammalian genome assemblies, spanning most major clades of birds, placental mammals, monotremes, and marsupials with available chromosome-scale annotated reference genomes (See Materials and methods). We also included two reptile genomes as outgroups to the avian genomes. The heteromorphic chromosomes (Y and W) are often unassembled, or, where assemblies exist, lack sufficient synteny to provide a useful metric for comparative genomics. As such, we chose to focus on the homomorphic X and Z chromosomes, which have remained surprisingly intact over the >100 M years of independent mammalian (Murphy et al., 1999) and avian evolution (Zhou et al., 2014; Figure 2, Figure 2—figure supplement 1).

Figure 2 with 1 supplement see all
Sex chromosome syntenic network across 17 representative vertebrate genomes.

The plot was generated by the plot_riparian GENESPACE function. Genomes are ordered vertically to minimize the number of translocations between each pairwise combination. Chromosomes are ordered horizontally to maximize synteny with the human chromosomes [X, 1–22]. Regions containing syntenic orthogroup members to the mammalian X (gold) or avian Z (blue) chromosomes are highlighted. All sex chromosomes are represented by red segments while autosomes are white. Chromosome segment sizes are scaled by the total number of genes in syntenic networks and positions of the braids are the gene order along the chromosome sequence. See Figure 2—figure supplement 1 for the full synteny graph including autosomes and chromosome labels.

While the same or similar genomic regions often recurrently evolve into sex chromosomes, perhaps due to ancestral gene functions involved in gonadogenesis, evidence about the nonrandomness of sex chromosome evolution is still contentious (Kratochvíl et al., 2021). Given our analysis, the avian Z chromosome clearly did not evolve from either of the two reptile Z chromosomes sampled here, but instead likely arose from autosomal regions or unsampled ancestral sex chromosomes. The situation in mammals is less clear, in part because both reptile genomes are more closely related to avian than mammalian genomes, which makes ancestral state reconstructions between the two groups less accurate. Nonetheless, the mammalian X and sand lizard Z chromosomes partially share syntenic orthology, an outcome that would be consistent with common descent from a shared ancestral sex chromosome or autosome containing sex-related genes. The shared 91.7 M bp region between the human X and sand lizard Z represents 59.0% of the human X chromosome genic sequence. The remaining 64.0 M bp of human X linked sequences are syntenic with autosomes in the sand lizard and garter snake genomes (Figure 2).

The eutherian mammalian X chromosome is largely composed of two regions, an X-conserved ancestral sex chromosome region that arose in the common ancestor of therian mammals, and an X-added region that arose in the common ancestor of eutherians (Ross et al., 2005). Consistent with this evolutionary history, the X chromosome is syntenic across all five eutherian mammals studied here. Further, a 107.2 M bp (68.8%) segment of the human X, which corresponds with the X-conserved region, is syntenic with 77.8 M bp (93.9%) of the Tasmanian devil X chromosome and represents the entire syntenic region between the human and all three marsupial X chromosomes (Figure 2).

Similarly, the chicken Z chromosome is retained in its entirety across all five avian genomes. The only notable exception being the budgie Z chromosome, which features a partial fusion between the Z and an otherwise autosomal 19.5 M bp segment of chicken chromosome 11 (Figure 2), potentially representing a neo-sex chromosome fusion that has not yet been described.

In contrast to conserved eutherian and avian sex chromosomes, the complex monotreme XnYn sex chromosomes are only partially syntenic between the two sampled genomes. Only the first X chromosomes are ancestral to both echidna and platypus (Rens et al., 2007), and all are unrelated to the mammalian X chromosomes (Figure 2—figure supplement 1), consistent with their independent evolution (Rens et al., 2007). Interestingly, the entirety of the echidna X4 and 47.6 M bp (67.9%) of the genic region of the platypus X5 chromosomes are syntenic with the avian Z chromosome (Figure 2). The phylogenetic scale of the genomes presented here precludes evolutionary inference about the origin of these shared sex chromosome sequences; however, the possibility of parallel evolution of sex chromosomes between such diverged lineages may prove an interesting future line of inquiry.

Exploiting synteny to track candidate genes in grasses

The Poaceae grass plant family is one of the best studied lineages of all multicellular eukaryotes and includes experimental model species (Brachypodium distachyon; Panicum hallii; Setaria viridis) and many of the most productive (Zea mays – maize/corn; Triticum aestivum – wheat, Oryza sativa – rice) and emerging (Sorghum bicolor – sorghum; Panicum virgatum – switchgrass) agricultural crops. Despite the tremendous genetic resources of these and other grasses, genomic comparisons among grasses are difficult, in part because of an ancient polyploid origin (see the next section), and because subsequent WGDs are a feature of most clades of grasses. For example, maize is an 11.4 M ya paleo-polyploid (Gaut and Doebley, 1997), allo-tetraploid switchgrass formed 4–6 M ya (Lovell et al., 2021b), and allo-hexaploid bread wheat arose about 8 k ya (Haas et al., 2019). In some cases, homeologous gene duplications from polyploidy have generated genetic diversity that can be targeted for crop improvement; however, in other cases the genetic basis of trait variation may be restricted to sequences that arose in a single subgenome. Thus, it is crucial to contextualize comparative–quantitative genomics and explicitly explore only the orthologous or homeologous regions of interest when searching for markers or candidate genes underlying heritable trait variation — a significant challenge in the complex and polyploid grass genomes.

To help overcome this challenge and provide tools for grass comparative genomics, we conducted a GENESPACE run and built an interactive viewer hosted on Phytozome (Goodstein et al., 2012). Owing to its use of within-block orthology and synteny constraints (Figure 1), GENESPACE is ideally suited to conduct comparisons across species with diverse polyploidy events. Default parameters produced a largely contiguous map of synteny even across notoriously difficult comparisons like the paleo-homeologs between the maize subgenomes (Figure 3A). Furthermore, the sensitive synteny construction pipeline implemented by GENESPACE effectively masks additional paralogous regions like those from the Rho duplication that gave rise to all extant grasses.

Figure 3 with 1 supplement see all
Comparative–quantitative genomics in the grasses.

(A) The GENESPACE syntenic map (‘riparian plot’) of orthologous regions among eight grass genomes. Chromosomes are ordered horizontally to maximize synteny with rice and ribbons are color coded by synteny to rice chromosomes. Genomes are ordered vertically by general phylogenetic positions. (B) The upper bars display the proportion of maize gene models without syntenic orthologs (‘absent’) in each genome, split by the full background (dark colors) and 86 C3/C4 genes (light colors). (C) The proportion of absent genes is higher in the C3 genomes (green bars), even when controlling for more global gene absences (lower odds ratios). (D) Syntenic orthologs, excluding homeologs among the 26 maize nested association mapping (NAM) founder genomes, with two quantitative trait loci (QTL) intervals highlighted on chromosome 3 (‘Chr3’) and chromosome 6 (’Chr6’). (E) Focal QTL regions that affect productivity in drought where only the genome that drives the QTL effect (middle), the top (B73) and bottom (Tzi8) genomes are presented and the region plotted is restricted to the physical B73 QTL interval and a 25 M bp buffer on either side. Note that the Chr3 QTL disarticulates into two intervals. Due to a larger number of potential candidate genes, the larger Chr3 region, flagged with **, is explored separately in Figure 3—figure supplement 1. (F) Presence–absence and copy number variation are presented for two of the three intervals as heatmaps where each row is a genome (order following panel D), each column is a pan-genome entry (see Figure 1), and the color of each tile indicates absence (gray), single copy (light blue), and multicopy (dark blue). PAV/CNV of the focal genome is outlined. For each interval, the estimated QTL allelic effect relative to B73 of each genome is plotted as bars to the right of the heatmap.

Breeders and molecular biologists can take two general approaches to understand the genetic basis of complex traits: studying variation caused by a priori-defined genes of interest or determining candidate genes from genomic regions of interest. As an example of the exploration of lists of a priori-defined candidate genes, we analyzed PAV of 86 genes shown to be involved in the transition between C3 and C4 photosynthesis (Ding et al., 2015), the latter permitting ecological dominance in arid climates and agricultural productivity under forecasted increased heat load of the next century. To conduct this analysis, we built pan-genome annotations across the seven grasses anchored to C4 maize, which was the genome in which these genes were discovered. This resulted in 159 pan-genome entries: nearly always two placements for each gene in the paleo-tetraploid maize genome. Given that many of these genes were discovered in part because of sequence similarity to genes in Arabidopsis and other diverged plant species, it is not surprising that PAV among C3/C4 genes was lower than the background (9.7% vs. 38.2%, odds ratio = 5.7, p < 1 × 10−16; Figure 3B). However, these ratios were highly variable among genomes, particularly among the C3 species (wheat, rice, B. distachyon), which had far more absences than the C4 species (15.3% vs. 5.5%, odds ratio = 3.1, p = 6.25 × 10−8, Figure 3B). This effect is undoubtedly due in part to the increased evolutionary distance between maize and the C3 species compared to the other C4 species. However, when controlling for the elevated level of absent genes globally in C3 species, the effect was still very strong: the odds of C3 species having more of these C3/C4 genes at syntenic pan-genome positions than the background was always lower than the C4 species (Figure 3C). Despite these interesting patterns, given only a single C3/C4 phylogenetic split in this dataset, it is not possible to test evolutionary hypotheses regarding the causes of such PAV. Nonetheless, this result suggests a possible role of gene loss or gain as an evolutionary mechanism for drought- and heat-adapted photosynthesis.

Like the exploration of a priori-defined sets of genes, finding candidate genes within QTL intervals usually involves querying a single reference genome and extracting genes with promising annotations or putatively functional polymorphism. In the case of a biparental mapping population genotyped against a single reference, this is a trivial process where genes within physical bounds of a QTL are the candidates. However, many genetic mapping populations now have reference genome sequences for all parents; this offers an opportunity to explore variation among functional alleles and PAV, which would be impossible with a single reference genome. GENESPACE is ideally suited for this type of exploration, and indeed was originally designed to solve this problem between the two P. hallii reference genomes and their F2 progeny (Lovell et al., 2018) using synteny to project the positions of genes across multiple genomes onto the physical positions of a reference. To illustrate this approach, we reanalyzed QTL generated from the 26-parent USA maize nested association mapping (NAM) population (Li et al., 2016). Originally, candidates for these QTL were defined by the proximate gene models only in the B73 reference genome (Li et al., 2016); however, with GENESPACE and the recently released NAM parent genomes (Hufford et al., 2021), it is now possible to evaluate candidate genes present in the genomes of other NAM founder lines but either absent or unannotated in the B73 reference genome.

To explore this possibility, we built a single-copy synteny map across all 26 NAM founders anchored to the B73 genome (Figure 3D). We opted to focus on QTL where the allelic effect of a single parental genome was an outlier relative to all other alleles. Such ‘private’ allelic contributions in multi-parent populations offer a powerful opportunity to define high-confidence candidates as genes with parent-specific sequences that match parent-specific allelic contributions to phenotypic trait variation (Abdulkina et al., 2019). Among the 190 QTLs, three displayed outlier effects of one parent: cultivar ‘Mo18W’ contributed an allele for delayed anthesis-silking interval at two adjacent chromosome 3 (‘Chr3’) QTLs and cultivar ‘Ki11’ contributed an allele that reduced plant height under drought at a QTL on chromosome 6 (‘Chr6’) (Li et al., 2016). Given that these QTL were chosen only due to their parental allelic effects, we were surprised to find that the two Mo18W QTL regions exist within a 11.7 M bp derived inversion that is only found in the Mo18W genome (Figure 3D, E). Since inversions reduce recombination, it is possible that multiple Mo18W causal variants have been fixed in linkage disequilibrium in this NAM population.

In addition to this chromosomal mutation and sequence variation between the parents and B73 (Li et al., 2016), we sought to define candidate genes from the patterns of presence–absence and copy number variation, explicitly looking for genes that were private to the focal genome. Two genes in the smaller Chr3 and one gene in the larger Chr3 interval were private to Mo18W, and four genes in three pan-genome entries (one two-member array) were private to Ki11 in the Chr6 interval (Figure 3F, Figure 3—figure supplement 1). While these genes do not have functional annotations relating to drought, this method provides additional candidates that would not have been discovered by B73-only candidate gene exploration.

Studying the WGD that led to the diversification of the grasses

Like most plant families (Barker et al., 2016; Stebbins, 1950; One Thousand Plant Transcriptomes Initiative, 2019), but unlike nearly all animal lineages (Muller, 1925), the grasses radiated following a whole-genome duplication: the ~70 M ya Rho WGD. The resulting gene family redundancy and gene-function subfunctionalization is hypothesized to underlie the tremendous ecological and morphological diversity of grasses (Preston et al., 2009; Preston and Kellogg, 2006; Wu et al., 2008).

To explore sequence variation among Rho-derived paralogs, we used GENESPACE to build a ploidy-aware syntenic pan-genome annotation among eight species, using the built-in functionality that allows the user to mask primary (likely orthologous) syntenic regions and search for secondary hits (likely paralogous, Figure 4A). This method acts similarly to using an outgroup that predated the WGD (see ‘additional considerations’ methods), but using the same OrthoFinder run as in Figure 3A. Overall, the peptide identity between Rho-derived paralogous regions was much lower than orthologs between species (e.g., S. viridis vs. P. hallii: Wilcoxon W = 88,094,632, p < 1 × 10−16), consistent with the previous discovery that the Rho duplication predated the split among most extant grasses (Ma et al., 2021). However, as has been previously observed (Wang et al., 2011), there is significant variation in the relative similarity of Rho-duplicated sequences. As an example, the peptide sequences of single-copy gene hits in primary syntenic regions (median identity = 90.6%) between chromosome 8 of P. hallii and S. viridis, were 26.9% more similar than the secondary Rho-derived regions (Figure 4B; median identity = 71.4%, Wilcoxon W = 87,842, p < 1 × 10−16). However, S. viridis chromosome 8 contained a single over-retained paralogous region. Unlike all other Rho-derived blocks, the P. hallii paralogs to this 2.7 M bp chromosome 8 region were not significantly less conserved than the primary orthologous region (91.6% vs. 91.9%, W = 14,830, p = 0.13). Outside of this region, the peptide identity of paralogs returned to the genome-wide average (Figure 4B). In line with this observation, the GENESPACE run treating the eight genomes as haploid representations could not distinguish between the Rho-derived paralogs in the over-retained region across all grasses (Figures 3A and 4C), except for of all chromosome pairs between B. distachyon, wheat, and blocks connecting Maize chromosome 10 to sorghum chromosome 5.

Analysis of the grass Rho WGD.

(A) BLAST hits between P. hallii and S. viridis where the target and query genes were in the same orthogroup are plotted and color coded by sequence similarity. Two over-retained regions are highlighted in the red and yellow boxes. (B) The protein identity of S. viridis chromosome 8 primary orthologous (blue line) hits against P. hallii chromosome 8 and the secondary hits (orange line) against P. hallii chromosome 3 demonstrate sequence conservation heterogeneity. The region between the two red vertical lines corresponds to the red-boxed over-retained primary block in panel A. (C) The two boxed regions in panel A were tracked from P. hallii chromosomes 3 (red) and 8 (yellow); 50% transparency of the braids means that overlapping regions appear orange.

It is interesting to note that all syntenic over-retained regions were at the extreme termini of the chromosomes outside of maize, B. distachyon and wheat; further, the only genome with complete segregation of the two paralogs, wheat, also retains these regions in the center of all six chromosomes (Figure 4C). These results are consistent with the proposed evolutionary mechanism (Wang et al., 2011) where concerted evolution and ‘illegitimate’ homeologous recombination may have homogenized these paralogous regions. This process would be less effective in pericentromeric regions than the chromosome tails, where a single crossover event would be sufficient to homogenize two paralogous regions.

Conclusions

Combined, the historical abundance of genetic mapping studies and ongoing proliferation of genome resources provides a strong foundation for the integration of comparative and quantitative genomics to accelerate discoveries in evolutionary biology, medicine, and agriculture. The incorporation of synteny and orthology into comparative genomics and quantitative genetics pipelines offers a mechanism to bridge these disparate disciplines. Here, we presented the GENESPACE R package as a framework to help bridge the current gaps between comparative and quantitative genomics, especially in complex evolutionary systems. We hope that the examples presented here will inspire further work to leverage the powerful genome-wide annotations that are coming online, both within and among species.

Materials and methods

GENESPACE pipeline and analysis overview

Request a detailed protocol

GENESPACE operates on gff3-formatted annotation files and accompanying peptide fasta files for primary gene models. There are convenience functions for reformatting the gff and peptide files to simplify the naming scheme and reduce redundant gene models to the primary longest transcript. With these data in hand, GENESPACE calculates BLAST-like hits from DIAMOND2 and runs OrthoFinder (Emms and Kelly, 2019) to infer orthogroups and orthologues. GENESPACE then extracts syntenic regions from the hits using a combination of graph- and cluster-based approaches, producing syntenic orthogroups for each unique (not reciprocal) pair of genomes. Syntenic orthogroups are then collapsed into a pan-genome annotation, which is a matrix of positions against a reference genome (rows) and unique gene models in each syntenic orthogroup for each genome (columns). Detailed step-by-step pipeline methods can be found below.

All analyses were performed in R 4.1.2 on macOS Big Sur 10.16. The following R packages were used for visualization or within GENESPACE v0.9.3 (February 11, 2022 release): data.table v1.14.0 (Dowle and Srinivasan, 2021), dbscan v1.1-8 (Hahsler et al., 2019), igraph v1.2.6 (Csardi and Nepusz, 2006), Biostrings v2.58.0 (Pagès et al., 2020), and rtracklayer v1.50.0 (Lawrence et al., 2009). GENESPACE also calls the following third party software: DIAMOND v2.0.8.146 (Buchfink et al., 2021), OrthoFinder v2.5.4 (Emms and Kelly, 2019), and MCScanX no version installed on October 23, 2021 (Wang et al., 2012). All results were generated programmatically; the accompanying scripts and key output are available on github: jtlovell/GENESPACE_data. Minor adjustments to figures to improve clarity were accomplished in Adobe Illustrator v26.01. A full description of each step in GENESPACE is provided in the documentation that accompanies the package source code on github (jtlovell/GENESPACE).

Description of the vignettes

Request a detailed protocol

Publicly available genome annotations were downloaded on or before October 8, 2021. See Table 2 for data sources, citations, and metadata. All GENESPACE runs used default parameterization, with the following exceptions: (1) the Rho grass run allowed a single secondary hit (default is 0, this is how the paralogs are explicitly searched for) and maximum number of gaps in secondary regions of 10 (default is 5, relaxed to reduce ancient paralogous block splitting), and (2) the maize run used the ‘fast’ OrthoFinder method since all genomes are closely related and haploid. Some maize genomes contained small alternative haplotype scaffolds, which were dropped for all analyses.

Table 2
Raw data sources.

A list of the genomes used in analyses here. Genome version IDs are taken from those posted on the respective data sources and may not reflect the name of the genome in the publication. Where multiple haplotypes are available, only the primary was used for these analyses. All polyploids presented here have only a primary haplotype assembled into chromosomes.

IDSpeciesGenome versionData sourcePloidy*Reference
garter snakeThamnophis elegansrThaEle1.priNCBI1Rhie et al., 2021
sand lizardLacerta_agilisrLacAgi1.priNCBI1Rhie et al., 2021
chickenGallus gallusmat.broiler.GRCg7bNCBI1https://www.ncbi.nlm.nih.gov/grc
hummingbirdCalypte annabCalAnn1_v1.pNCBI1Rhie et al., 2021
budgieMelopsittacus undulatusbMelUnd1.mat.ZNCBI1Unpublished VGP
swanCygnus olorbCygOlo1.pri.v2NCBI1Rhie et al., 2021
zebra finchTaeniopygia guttatabTaeGut1.4.priNCBI1Rhie et al., 2021
echidnaTachyglossus aculeatusmTacAcu1.priNCBI1Zhou et al., 2021
platypusOrnithorhynchus anatinusmOrnAna1.pri.v4NCBI1Zhou et al., 2021
brushtail possumTrichosurus vulpeculammTriVul1.priNCBI1Rhie et al., 2021
opossumMonodelphis domesticaMonDom5NCBI1Mikkelsen et al., 2007
Tasmanian devilSarcophilus harrisiimSarHar1.11NCBI1Rhie et al., 2021
human (Hg38)Homo sapiensGRCh38.p13NCBI1https://www.ncbi.nlm.nih.gov/grc
human (t2t)Homo sapiensCHM13-T2T v2.1NCBI1Nurk et al., 2022
chimpanzeePan troglodytesClint_PTRv2NCBI1Chimpanzee Sequencing and Analysis Consortium, 2005
mouseMus musculusGRCm39NCBI1https://www.ncbi.nlm.nih.gov/grc
dogCanis lupus familiarisDog10K_Boxer_TashaNCBI1Jagannathan et al., 2021
slothCholoepus didactylusmChoDid1.priNCBI1Rhie et al., 2021
horseshoe batRhinolophus ferrumequinummRhiFer1_v1.pNCBI1Rhie et al., 2021
dolphinTursiops truncatusmTurTru1.mat.YNCBI1Unpublished VGP
P. halliiPanicum hallii var. halliiHAL2_v2.1Phytozome1Lovell et al., 2018
P. hallii (FIL)Panicum hallii var. filipesFIL2_v3.1Phytozome1Lovell et al., 2018
switchgrassPanicum virgatumAP13_v5.1Phytozome2Lovell et al., 2021b
S viridisSetaria viridisv2.1Phytozome1Mamidi et al., 2020
SorghumSorghum bicolorBTx623_v3.1Phytozome1Paterson et al., 2009
maizeZea maysB73_refgen_v5NCBI*2Hufford et al., 2021
riceOryza sativa cv ‘kitaake’kitaake_v2.1Phytozome1Jain et al., 2019
BrachypodiumBrachypodium distachyonBd21_v3.1Phytozome1International Brachypodium Initiative, 2010
wheatTriticum aestivumV4 (Chinese Spring)NCBI3Zhu et al., 2021
G barbadenseGossypium barbadensev1.1Phytozome2Chen et al., 2020
G. darwiniiGossypium darwiniiv1.1Phytozome2Chen et al., 2020
26 NAM parentsZea mayssee data on NCBINCBI*1Hufford et al., 2021
  1. *

    Ploidy indicates how the genome was treated in the analyses. All values match the ploidy of the primary assembly haplotype except maize, where the refgen_v5 was treated as diploid (to match both homeologs) in the multispecies run, but as haploid in the nested association mapping (NAM) founder population to track only meiotic homologs across the population. This parameterization is to match the phylogenetic position of the whole-genome duplication (WGD) in the terminal branch of the grass-wide analysis, but ancestral in the 26-NAM analysis.

The publicly available C3/C4 gene lists and QTL intervals were generated against the v2 maize assembly. To make this comparable to the across grass and NAM parent GENESPACE runs, we also accomplished a fast GENESPACE run between v2 and the two v5 versions used here. The orthologs and syntenic mapping between these versions are included as text files in the data repository.

All statistics presented here were calculated within R. To compare nonnormal distributions (e.g., sequence identity), we used the nonparametric signed Wilcoxon ranked sum test. To measure sequence divergence, we calculated ungapped percent peptide sequence identity from pairwise Needleman–Wunsch global alignments, implemented in Biostrings (Pagès et al., 2020). To determine single outliers from a unimodal distribution, we applied the Grubbs test implemented in the outliers R package (Komsta, 2011). Some figures were constructed outside of GENESPACE using base R plotting routines and ggplot2 v3.3.3 (Wickham, 2016). Some color palettes were chosen with RColorBrewer (Neuwirth, 2014) and viridis (Garnier et al., 2021).

GENESPACE pipeline: estimating syntenic orthogroups

Request a detailed protocol

GENESPACE relies on high-confidence homologs, which are typically defined as members of the same orthogroup via OrthoFinder. In addition to the original method, which clusters genes and builds a graph from closely related genes based on BLAST scores (Emms and Kelly, 2015), OrthoFinder can also use gene trees to split orthogroups into pairwise orthologs (Emms and Kelly, 2019), which represent a stricter definition of orthology. GENESPACE attempts to merge the benefits of each of these methods by first only considering orthogroups for synteny, which allows users to optionally include paralogs in the scan, then including non-syntenic gene tree-inferred orthologs into the pan-genome annotation during its final steps (Figure 1, see pan-genome pipeline description below). The initial orthogroups are inferred ‘globally’ using pairwise reciprocal hits, either using the default OrthoFinder specification or ‘fast’ by only looking at unique pairs of genomes. Depending on the ploidy and user specifications, orthogroups can be re-estimated within syntenic regions. These three steps are detailed below.

Method 1: default global orthogroups. The default behavior of GENESPACE is to run OrthoFinder using its default parameters. GENESPACE then builds a vector of OrthoFinder geneIDs and their corresponding orthogroup membership. When synteny is called, the GENESPACE-formatted gff text files are read in and merged with OrthoFinder sequenceIDs.

Method 2: fast orthogroup estimation. GENESPACE offers a ‘fast’ orthofinderMethod (Table 3), which performs only one-way DIAMOND2 searches, where the genome annotation with more gene models serves as the query and the smaller annotation is the target. The hits then are mirrored, and each stored as OrthoFinder-formatted BLAST text files. OrthoFinder is then run to the orthogroup step on the precomputed BLAST files. This method results in significant speed improvements with little loss of fidelity among closely related haploid genomes. The user can also specify the sensitivity of DIAMOND2 via the diamondMode parameter during GENESPACE initialization.

Table 3
Comparison of GENESPACE setting performance.

The mirrored ‘fast’ method significantly speeds up OrthoFinder runs by calling DIAMOND2 on each nonredundant pairwise combination of genomes. However, this approach is less sensitive than the default performance and is suggested for only closely related haploid genomes, as the recall of 2:2:2 OGs is less sensitive than the default specification.

Default OrthoFinderGENESPACE ‘fast’
n.1:1:1 OGs22,05022,444
n.2:2:2 OGs13,79313,511
n.tandem arrays10,597 (4433)10,599 (4426)
*Run time (min)59.9512.45
  1. *

    Run time is for ortholog/orthogroup inference (not the GENESPACE pipeline as a whole) using the three cotton genomes, running on 6 2 Gb cores.

Method 3: orthogroups within syntenic blocks. In addition to the above global OrthoFinder runs, GENESPACE can rerun OrthoFinder (only through the -og step) in syntenic regions between pairs of genomes. This is accomplished through four additional steps in the synteny pipeline: (1) syntenic hits are split by large syntenic regions; (2) hits from each syntenic region are passed to OrthoFinder; (3) the within-block orthogroups assignments are aggregated into a single vector across all genomes; and finally, (4) the synteny pipeline (see below) is rerun using the newly defined combined syntenic and inBlkOG vector. While using significant computational resources, this process can improve the sensitivity of orthogroup discovery in paralogous or homeologous regions; however, it is not clear that this offers much improvement over global orthogroups between purely haploid genomes. As such, the default behavior of GENESPACE is to only use orthofinderInBlk when any genome has a ploidy >1.

GENESPACE pipeline: extracting syntenic blast hits

Request a detailed protocol

The GENESPACE function ‘synteny’ accepts several user-defined parameters, which allow for flexibility; however, the defaults are sufficient for most high-quality genomes and evolutionary scenarios. For example, we used the same default parameters for 300 M years of vertebrate evolution, 50 M years and multiple WGDs of grasses, and 10 k years of Maize divergence. For a full list of parameters, see documentation of the GENESPACE function ‘set_syntenyParams’, but here, we will discuss (1) the minimum number of unique hits within a syntenic block (‘blkSize’, default = 5), (2) the maximum number of gaps within a block alignment (‘nGaps’, default = 5), and (3) the radius around a syntenic anchor for a hit to be considered syntenic (‘synBuff’, default = 100). Synteny begins by processing all gene annotations (steps 1–3), then proceeds to process blast hits for each unique pair of genomes (steps 4–7).

Step 1: flag the gff3-formatted annotation. GENESPACE adds the following information to a data matrix that initially contains the name and physical coordinates of each gene for each genome: (1) OrthoFinder IDs (from the sequenceIDs.txt file), (2) gene rank order, and (3) orthogroup IDs.

Step 2: quality control the annotation. Chromosomes with fewer than blkSize genes are dropped so that they will not be used for synteny inference.

Step 3: find and parse tandem arrays. Tandem arrays are defined through six steps: (1) define potential tandem arrays as orthogroups containing >1 gene for each genome-by-chromosome combination; (2) calculate the maximum gene rank-order gap between each adjacent gene in a potential tandem array; (3) split those with a gap between genes > synBuff into separate arrays using dbscan; (4) flag clusters of >1 genes within synBuff as tandem arrays; (5) choose representative genes for each tandem array as the gene closest to the median position of the array and secondarily as the longest peptide sequence; and (6) recalculate ‘arrayOrd’ as the gene rank order of tandem array representative genes.

Step 4: read and process raw blast files. To reduce unnecessary computation, GENESPACE concatenates both reciprocal BLAST hit files so that all unique combinations of query and target genes are represented in a single matrix. For simplicity, the genome with more gene models is treated as the query and the genome with fewer, the target. For each line (‘hit’) in the BLAST file, the following data and statistics are added for both the query (genome1) and target (genome2) gene: (1) positional information, including gene rank order; (2) tandem array representatives; (3) hits where both the query and target are in the same orthogroup; and (4) the relative strength of a hit, where scrRank = 1 represents the single highest bit score blast hit for each query and target gene.

Step 5: define initial syntenic anchors. The following additional parameters are used to define initial syntenic anchors: nHits1 (the top n hits for each gene in the query genome, default ploidy of target genome), nHits2 (the top n hits for each gene in the target genome, default = ploidy of query genome), onlyOgAnchors (logical specifying whether anchors hits must be isOg, default = TRUE), and maskTheseHits (what hits should be masked in the search, default = none; see below on modifications for secondary hits and polyploid self hits). These parameters are applied to find regions that are large and collinear enough to be classified as syntenic through three steps: (1) potential anchor hits are defined as those where both the query and target are array representatives, the hit is not in maskTheseHits, the query scrRank ≤ nHits1, the target scrRank ≤ nHits2, and, optionally if onlyOgAnchors, both the query and target are in the same orthogroup; (2) gene rank order is recalculated for the potential anchor hits; (3) MCScanX_h (-s = blkSize, -m = nGaps) is called from R using condensed gene rank-order positions as the physical location, and those hits within collinear blocks are flagged as initial anchors.

Step 6: clean up initial syntenic anchors. In some cases, syntenic anchors from step 5 can be broken up or do not extend to the ends of syntenic regions. To resolve this, we run five additional block finalization steps: (1) potential anchor hits within synBuff of the initial anchors are extracted and collinear hits are recalled into ‘cleaned’ anchors; (2) the cleaned anchor hits are clustered into blocks via dbscan using synBuff as the radius and blkSize as minPts arguments, dropping blocks with fewer anchors than blkSize and merging nearby syntenic regions; (3) final syntenic anchors are flagged by rerunning MCScanX within each syntenic region; (4) initial block breakpoints are generated by dbscan first with a large radius of synBuff, then on reranked gene order within regions with a radius of blkSize; and (5) overlapping blocks that are not duplicated are split using run-length equivalent decoding. Since the final coding of blocks is conducted within broad syntenic regions, the parameters passed should be robust to variation in ploidy and sequence divergence between genomes.

Step 7: flag anchors, syntenic region hits, and block coordinates. With the final syntenic anchors from step 6 in hand we finalize and annotate the tab-delimited blast hits. We define two sets of syntenic qualifications: blocks, which are fine grained runs of completely collinear hits, and ‘regions’, which are clustered blocks that average across minor inversions. Blocks are defined from the step 6 anchors. Anchor hits are reclustered via dbscan with a radius of synBuff and each cluster between each pair of genomes and chromosomes is assigned a ‘regID’. All hits within synBuff of an anchor hit are flagged as ‘inBuffer’. The coordinates of blocks and regions are calculated from the bounding anchors for each syntenic block.

Modifications: steps 5–7 are modified depending on whether the BLAST hit file is intra- or inter-genomic, the ploidy of the query and target genome, and the user-defined number of ‘secondary hits’. These modifications are as follows. (1) step 5–7(1): [if necessary] syntenic regions are called for self blast hits within haploid genomes. This is the simplest case where the steps are ignored and syntenic anchors are defined as self hits between tandem array representatives. Block and region IDs are chromosome IDs. (2) Modification step 5–7(2): [if necessary] syntenic regions are called for self blast hits within genomes with ploidy >1. Here, the modified syntenic hits from 5-7(1) are specified in maskTheseHits with a synBuff radius of 500 genes. This excludes potentially problematic large tandem arrays in some genomes. The number of hits after masking is set to ploidy – 1. Then the standard step 5–7 processes are run. (3) Modification step 5–7(3): [if necessary] syntenic regions are called for blast hits where nSecondaryHits >0. Here, the methods for syntenic hit calculation from 5-7(2) (if intragenomic) or 5-7 (as a mask of the rerun if intergenomic) are conducted except that the parameters specified are given as those ending in ‘Second’.

GENESPACE pipeline: constructing pan-genome annotations

Request a detailed protocol

The GENESPACE function ‘pangenome’ decodes pairwise syntenic orthologs into a multigenome pan-annotation. The output is a long-formatted text file, where each gene is given a reference genome syntenic position and chromosome, and flags that are described below. In addition, pangenome also returns a wide-formatted data.table (R object) where each row is a pan-genome entry with positional information and each column is a list of genes by genome (e.g., Figure 1).

Step 1: build a reference-anchored map of syntenic hits. The chromosome and positions of the tandem array representatives from the reference genome annotation form the foundation of mapping across all genomes. Upon building this positional backbone, syntenic hits are pulled for each genome and placed against the anchor genes that are in the same orthogroup.

Step 2: interpolate syntenic reference positions. In most pairwise comparisons between genomes, some syntenic orthogroups will be missing in the reference. Since we want to extract PAV by physical coordinates, all syntenic orthogroups, even those that do not include a reference genome gene model, need to have reference positional information. To fill this gap, GENESPACE interpolates the syntenic reference position of all array representative genes in all genomes through a three step pipeline: (1) subset the syntenic anchor hits to ungapped collinear hits following a 1:1 synteny mapping ratio (i.e., perfect diagonal of hits by gene rank order); (2) cluster the 1:1 syntenic anchors so that any jump of >1 gene is split into its own cluster; (3) fill in missing positions through linear interpolation.

Step 3: determine the reference mapping positions of each syntenic orthogroup. In the case where the reference genome is purely haploid with no segmental duplicates, this is a straightforward step: the orthogroups that contain reference genome genes are placed at that backbone position, and those without a reference gene are clustered into the most likely interpolated placement. However, we do not want to rely on single-copy references, so GENESPACE allows multiple placements not only for each nonreference gene, but also for genes in the reference genome itself. For example, in a polyploid, we would want to know if there is PAV between two homeologous chromosomes. To do this, we take a three-step approach: (1) the interpolated (or actual reference-anchored) positions of genes in each syntenic orthogroup are extracted and split by interpolated chromosome and maximum gap between any two inferred positions >synBuff; (2) unsplit orthogroups are set aside; (3) split orthogroups are checked and retained if they contain ≥ propAssignThresh (default = 0.5) proportion of genes in the syntenic orthogroup with an interpolated position in that cluster; (4) clustered positions are dropped if there are more than maxPlacementsPerRefChr (default = 2) positional placements for that syntenic orthogroup, retaining the top clusters ranked by propAssignThresh; (5) the culled positional clusters are merged with those from step (2) to create the initial pan-genome annotation.

Step 4: add and flag other forms of orthogroups. The reference pan-genome built in step 3 only acts on direct syntenic blast hits (edges), which allows for strict construction of interpolated positions without the potential polluting positional effects of orthogroups in minor or miss-assembled syntenic blocks. GENESPACE fixes these and other complexities of syntenic orthogroups through a four step pipeline to finalize the pan-genome annotation: (1) ‘indirect syntenic’ orthogroup members are added back into the initial pan-genome annotation by parsing the gff-like text file that contains a vector of syntenic orthogroups; (2) syntenic orthogroups that are missing all interpolated positions are added and given NA positions; (3) tandem array members that are not representative genes are added into the pan-genome annotation; (4) if available, orthologues from the initial OrthoFinder run that are not present in the pan-genome annotation are added and flagged. Note that if OrthoFinder was run using the ‘fast’ orthofinderMethod, orthologs will not be produced nor added into the pan-genome annotation.

Additional considerations for comparative genomics parameterization

Request a detailed protocol

There are several factors to consider when constructing your GENESPACE run, generating syntenic block breakpoints, or looking at comparative genomics through protein similarity estimates like OrthoFinder. We detail two of these below:

(1) Outgroups and the phylogenetic context of orthology inference. OrthoFinder defines orthogroups as the set of genes that are descended from a single gene in the last common ancestor of all the species being considered. As such, the scale of the run matters, often significantly. For example, an orthogroup would not be likely to contain homeologs across the two ancient subgenomes for a run that included only two maize genomes. Since the coalescence of any two maize genotypes occurred more recently than the ~12M ya WGD, few homeologs would both be descended from the same common ancestor when considering only maize genotypes. Hence, the within-maize NAM parent run (Figure 3D) excludes homeologs. However, if an outgroup to maize was included in the run, both maize homeologs would be likely to show common ancestry to a single gene in the outgroup, thus connecting the maize homeologs into a single orthogroup. Hence, both maize homeologous regions are present in the across grasses synteny graph (Figure 3A) despite using identical synteny parameters to the maize NAM parent run. Given the potentially significant role of outgroups on the results of the global run, GENESPACE offers an ‘outgroup’ parameter, which specifies the genomes that should be included in the orthofinder run but excluded from all downstream analyses.

(2) Studying homoelogs in polyploids and other paralogs. Non-homeologous paralogs are typically excluded in the default GENESPACE parameterization. This is in part because GENESPACE was originally designed for work with plants, and most plant lineages have undergone one or more WGDs. However, there are two ways to study paralogs by altering GENESPACE parameters: (1) specify an ‘outgroup’ genome (see above), which is only used for the global OrthoFinder run and set the genomes’ ploidy as that expected by the number of duplications relative to the outgroup; (2) if an outgroup is not available (or too distantly related to be of use), specify the synteny parameter ‘nSecondaryHits’ as the number of paralogous copies per genome. In the second case, secondaryHits are inferred after masking out the normal syntenic hits, then synteny is rerun without requiring anchors to be global orthogroups. In both cases, it will be far better to set orthofinderInBlk to TRUE, so that pairs of genomes are considered and genes without a solid hit in the outgroup are not excluded.

Data availability

Raw data were sourced entirely from NCBI and Phytozome. Processed data, intermediate files, scripts, plots, and source data are all available in the data repository: https://github.com/jtlovell/GENESPACE_data (copy archived at swh:1:rev:77612f8c59fbfd43ef3f4c1719933bf0cbca3261). All source code and documentation for the GENESPACE R package can be found at https://github.com/jtlovell/GENESPACE (copy archieved at swh:1:rev:390341499ee1d2ccd5e1a894c4bd7c1bd20a3dda). An interactive viewer for the plant genomes can be found on phytozome at https://phytozome-next.jgi.doe.gov/tools/dotplot/synteny.html.

References

    1. Csardi G
    2. Nepusz T
    (2006)
    The igraph software package for complex network research
    InterJournal, Complex Systems 1695:1–9.
    1. Ross MT
    2. Grafham DV
    3. Coffey AJ
    4. Scherer S
    5. McLay K
    6. Muzny D
    7. Platzer M
    8. Howell GR
    9. Burrows C
    10. Bird CP
    11. Frankish A
    12. Lovell FL
    13. Howe KL
    14. Ashurst JL
    15. Fulton RS
    16. Sudbrak R
    17. Wen G
    18. Jones MC
    19. Hurles ME
    20. Andrews TD
    21. Scott CE
    22. Searle S
    23. Ramser J
    24. Whittaker A
    25. Deadman R
    26. Carter NP
    27. Hunt SE
    28. Chen R
    29. Cree A
    30. Gunaratne P
    31. Havlak P
    32. Hodgson A
    33. Metzker ML
    34. Richards S
    35. Scott G
    36. Steffen D
    37. Sodergren E
    38. Wheeler DA
    39. Worley KC
    40. Ainscough R
    41. Ambrose KD
    42. Ansari-Lari MA
    43. Aradhya S
    44. Ashwell RIS
    45. Babbage AK
    46. Bagguley CL
    47. Ballabio A
    48. Banerjee R
    49. Barker GE
    50. Barlow KF
    51. Barrett IP
    52. Bates KN
    53. Beare DM
    54. Beasley H
    55. Beasley O
    56. Beck A
    57. Bethel G
    58. Blechschmidt K
    59. Brady N
    60. Bray-Allen S
    61. Bridgeman AM
    62. Brown AJ
    63. Brown MJ
    64. Bonnin D
    65. Bruford EA
    66. Buhay C
    67. Burch P
    68. Burford D
    69. Burgess J
    70. Burrill W
    71. Burton J
    72. Bye JM
    73. Carder C
    74. Carrel L
    75. Chako J
    76. Chapman JC
    77. Chavez D
    78. Chen E
    79. Chen G
    80. Chen Y
    81. Chen Z
    82. Chinault C
    83. Ciccodicola A
    84. Clark SY
    85. Clarke G
    86. Clee CM
    87. Clegg S
    88. Clerc-Blankenburg K
    89. Clifford K
    90. Cobley V
    91. Cole CG
    92. Conquer JS
    93. Corby N
    94. Connor RE
    95. David R
    96. Davies J
    97. Davis C
    98. Davis J
    99. Delgado O
    100. Deshazo D
    101. Dhami P
    102. Ding Y
    103. Dinh H
    104. Dodsworth S
    105. Draper H
    106. Dugan-Rocha S
    107. Dunham A
    108. Dunn M
    109. Durbin KJ
    110. Dutta I
    111. Eades T
    112. Ellwood M
    113. Emery-Cohen A
    114. Errington H
    115. Evans KL
    116. Faulkner L
    117. Francis F
    118. Frankland J
    119. Fraser AE
    120. Galgoczy P
    121. Gilbert J
    122. Gill R
    123. Glöckner G
    124. Gregory SG
    125. Gribble S
    126. Griffiths C
    127. Grocock R
    128. Gu Y
    129. Gwilliam R
    130. Hamilton C
    131. Hart EA
    132. Hawes A
    133. Heath PD
    134. Heitmann K
    135. Hennig S
    136. Hernandez J
    137. Hinzmann B
    138. Ho S
    139. Hoffs M
    140. Howden PJ
    141. Huckle EJ
    142. Hume J
    143. Hunt PJ
    144. Hunt AR
    145. Isherwood J
    146. Jacob L
    147. Johnson D
    148. Jones S
    149. de Jong PJ
    150. Joseph SS
    151. Keenan S
    152. Kelly S
    153. Kershaw JK
    154. Khan Z
    155. Kioschis P
    156. Klages S
    157. Knights AJ
    158. Kosiura A
    159. Kovar-Smith C
    160. Laird GK
    161. Langford C
    162. Lawlor S
    163. Leversha M
    164. Lewis L
    165. Liu W
    166. Lloyd C
    167. Lloyd DM
    168. Loulseged H
    169. Loveland JE
    170. Lovell JD
    171. Lozado R
    172. Lu J
    173. Lyne R
    174. Ma J
    175. Maheshwari M
    176. Matthews LH
    177. McDowall J
    178. McLaren S
    179. McMurray A
    180. Meidl P
    181. Meitinger T
    182. Milne S
    183. Miner G
    184. Mistry SL
    185. Morgan M
    186. Morris S
    187. Müller I
    188. Mullikin JC
    189. Nguyen N
    190. Nordsiek G
    191. Nyakatura G
    192. O’Dell CN
    193. Okwuonu G
    194. Palmer S
    195. Pandian R
    196. Parker D
    197. Parrish J
    198. Pasternak S
    199. Patel D
    200. Pearce AV
    201. Pearson DM
    202. Pelan SE
    203. Perez L
    204. Porter KM
    205. Ramsey Y
    206. Reichwald K
    207. Rhodes S
    208. Ridler KA
    209. Schlessinger D
    210. Schueler MG
    211. Sehra HK
    212. Shaw-Smith C
    213. Shen H
    214. Sheridan EM
    215. Shownkeen R
    216. Skuce CD
    217. Smith ML
    218. Sotheran EC
    219. Steingruber HE
    220. Steward CA
    221. Storey R
    222. Swann RM
    223. Swarbreck D
    224. Tabor PE
    225. Taudien S
    226. Taylor T
    227. Teague B
    228. Thomas K
    229. Thorpe A
    230. Timms K
    231. Tracey A
    232. Trevanion S
    233. Tromans AC
    234. d’Urso M
    235. Verduzco D
    236. Villasana D
    237. Waldron L
    238. Wall M
    239. Wang Q
    240. Warren J
    241. Warry GL
    242. Wei X
    243. West A
    244. Whitehead SL
    245. Whiteley MN
    246. Wilkinson JE
    247. Willey DL
    248. Williams G
    249. Williams L
    250. Williamson A
    251. Williamson H
    252. Wilming L
    253. Woodmansey RL
    254. Wray PW
    255. Yen J
    256. Zhang J
    257. Zhou J
    258. Zoghbi H
    259. Zorilla S
    260. Buck D
    261. Reinhardt R
    262. Poustka A
    263. Rosenthal A
    264. Lehrach H
    265. Meindl A
    266. Minx PJ
    267. Hillier LW
    268. Willard HF
    269. Wilson RK
    270. Waterston RH
    271. Rice CM
    272. Vaudin M
    273. Coulson A
    274. Nelson DL
    275. Weinstock G
    276. Sulston JE
    277. Durbin R
    278. Hubbard T
    279. Gibbs RA
    280. Beck S
    281. Rogers J
    282. Bentley DR
    (2005) The DNA sequence of the human X chromosome
    Nature 434:325–337.
    https://doi.org/10.1038/nature03440

Decision letter

  1. Detlef Weigel
    Senior and Reviewing Editor; Max Planck Institute for Biology Tübingen, Germany

Our editorial process produces two outputs: i) public reviews designed to be posted alongside the preprint for the benefit of readers; ii) feedback on the manuscript for the authors, including requests for revisions, shown below. We also include an acceptance summary that explains what the editors found interesting or important about the work.

Decision letter after peer review:

Thank you for submitting your article "GENESPACE: syntenic pan-genome annotations for eukaryotes" for consideration by eLife. Your article has been reviewed by 2 peer reviewers, one of whom is a member of our Board of Reviewing Editors, and the evaluation has been overseen by Detlef Weigel as the Senior Editor. The reviewers have opted to remain anonymous.

The reviewers have discussed their reviews with one another, and the Reviewing Editor has drafted this to help you prepare a revised submission.

Essential revisions:

1) While you will see that the reviewers generally liked the functionality of GENESPACE and support its publication in eLife, there were multiple concerns regarding unclear descriptions of its methodology. It is therefore necessary that you revise the manuscript very carefully and make it accessible to both experts (including details on the methods and more formal definitions) and non-experts (on the general way how GENESPACE works).

Reviewer #1 (Recommendations for the authors):

While I support the publication of the manuscript, I have doubts that the manuscript in its current form is explaining well what GENESPACE is doing and how this is performed. While I do appreciate that the Results section tries to give an overview of the methodology, it is still very hard to follow as it is full of jargon and leaves out some important information, which only can be found in the methods.

How is synteny defined (line 105), which is the basis for dropping graph edges (which I assume are BLAST hits) before running OrthoFinder (while in Figure S1 orthofinder is run before the synteny module)? Once the BLAST hits between non-syntenic regions are deleted OrthoFinder can be run without any adjustments, but still connect genes to OrthoGroup in syntenic regions only. However, the sentence in line 108 is not clear, how can BLAST hits in syntenic regions be merged with synteny-constrained OGs (where BLAST hits in syntenic regions are actually the basis for synteny-constrained OGs)? In this regard, it is not clear what the differences between synteny-constrained OGs and within-block OGs are.

Table 1. It is not clear what this table actually shows and requires guessing. (Why is this analysis done with nine chromosomes only?). It becomes more clear after rereading the later section that explains Table 1, but on the first occurrence of table 1 in the text, I got lost. The GENESPACE run using the tetraploid genome still relies on the synteny between the subgenomes in order to generate OG groups of size of 6? It is not fully clear what the differences between the tetraploid run and the split by subg. runs are when synteny information between the subgenomes is also needed in the tetraploid run. Besides the description of this approach is very hard to follow. I also do not see why this approach reveals sensitivity – for example in the pure OrthoFinder runs some of the OGs of size of 6 might be extended by in-paralogs which is (as far as I understand) the goal of OrthoFinder.

In a second paragraph, GENESPACE is also presented as a tool for the identification of syntenic regions (which presumably are the syntenic regions that the OG analysis are based on, however, this is not clear). GENESPACE goal is to filter BLAST hits in a way that regions are linked to only a single region in the other genome by "subsetting the BLAST hits to those 181 within the same orthogroups". This again is not clear, does this imply that OrthoFinder results are used to filter the BLAST hits? The blast hits are then extended. Though there is more about this in the methods, it is not clear to me, why this leads to non-overlapping syntenic regions, specifically, as connected orthologs do not need to be right next to each other.

I think the expression "concept of a pan-genome annotation" or just "pan-genome annotation" in itself is misleading. GENESPACE is a tool for ortholog identification and not for gene annotation. I see the point why it is called like this (all genes are projected on one reference genome), however, I actually expected that GENESPACE would annotate/correct gene structure annotations, which is the typical context of the term "annotation" in genomics.

It would be helpful to define (the differences of) synteny and collinearity (L49 and rest of manuscript).

How is the exact location defined when projecting non-reference OG to reference positions?

In multiple parts of the text, it is mentioned that tandem arrays are excluded or treated in a different way. Could this be clarified?

It is not clear how the "second proximity search step" in GENESPACE actually works. Where are the significant gains in single copy genes coming from? Why are they not already identified?

Reviewer #2 (Recommendations for the authors):

The code and data are available on GitHub.

The algorithm could be described more clearly. For this, it would probably help to introduce key definitions more formally. Below is a list of questions I was left with after reading the manuscript.

When you use 'syntenic regions', are regions in multiple genomes meant or always only in two genomes? If appropriate you could qualify the term according to the usage.

The riparian plot appears to use only n-1 pairs of genomes out of the n*(n-1) pairs of different genomes for which syntenic regions were computed. What is the formal criterion or algorithm for that choice that was made? The respective comments in the Figure captions are not clear to me.

What does GENESPACE do if a gene or region is missing in reference genome?

How are the syntenic regions extracted and how is can this be adjusted by the user (line 620)?

I do not understand how 'collinear arrays' are defined (line 635). What does it mean for a group of genes to share an orthogroup?

How is the step done from pairwise synteny to synteny of multiple genomes? Is this done at all? The manuscript reads around line 108 as if OrthoFinder is (re)run on syntenic regions. This makes sense if there are in general multiple (>2) regions considered to be mutually syntenic with one another. If that is the case, how do you address the choice of having a few long syntenic regions versus many short syntenic regions when both are plausible and the choices share a subset of genes?

Can there be two genes in an orthogroup that GENESPACE outputs that are not syntenic within their pair of genomes?

Does GENESPACE scale quadratically with the number of genomes?

Why are for two genomes g1 and g2 both (g1, g2) and (g2, g1) contained, e.g. in Supplementary File 1?

(Why) is it not symmetric? What are the roles of first and second genomes?

Table 1: The row labels are hard to make sense of. Are any rows the results of GENESPACE?

Figure 1: The human Y chromosome is in the figure caption but not in the image. The same holds for Supplementary Figure 3.

OrthoFinder is inconsistently spelled as OrthoFinder, Orthofinder and orthofinder.

Description of Supplemental Data 1. There are no columns pgChr or pgOrd in Lovell_09-03-2022-TR-eLife-78526_Supplementary_File_1.txt

Apparently the order of files and descriptions is not consistent with each other.

Line items:

43: grammar

162: What does 'contrasted' mean here?

173: What does 'dosage' mean here?

713: Add 'of dotplots' as the link apparently does not contain all of GENESPACEs results.

[Editors’ note: further revisions were suggested prior to acceptance, as described below.]

Thank you for resubmitting your work entitled "GENESPACE: syntenic pan-genome annotations for eukaryotes" for further consideration by eLife. Your revised article has been evaluated by Detlef Weigel (Senior Editor) and a Reviewing Editor.

The manuscript has been improved but there are some remaining issues that need to be addressed, as outlined below:

Reviewer #1 (Recommendations for the authors):

The authors have addressed many of my concerns by rewriting large parts of the manuscript, which significantly improved the description of what GENESPACE does and how it works. In particular the step-by-step descriptions in the methods worked well for me. I fully support publication of this manuscript, though there are some small suggestions below.

In the first Results section, I would have found it helpful to read one sentence on what OrthoFinder and MCScanX do (resp. what they output).

Line 71: This sentence was still not clear to me. What is the gene rank that is recalculated, and why would a recalculation make genes with PAV?

Table 2: no horizontal bars visible for me as indicated in the legend.

Reviewer #2 (Recommendations for the authors):

In the revision, the authors have made substantial changes. In particular the first two tables were removed from the original manuscript. Lovell et al., have added a new Table 1, in which they compare their tool GENESPACE to OrthoFinder and MCScanX on pairs of genome annotations, and an in-depth description of their implementation steps.

The example plots are convincing, not so the benchmark in its current form. I am missing a measure of sensitivity or coverage as well as a measure of how much the orthogroups respect synteny.

In the very important sentence in line 87 it is grammatically ambiguous to what ‘that’ refers to. It should be reformulated so that it is unambiguous even for readers not understanding the authors intention yet.

I assume ‘that’ in line 87 qualifies ‘orthogroup’. Under that assumption, a trivial and very insensitive bogus orthology finder that outputs in a genome-wide search for synteny a single orthogroup with (any) two genes in two different species would achieve 2/2=100% of what the authors refer to as "accuracy and precision". A little less bogus, a trivial filter method that randomly (and independent of synteny) reduces an orthogroup with multiple genes from different chromosomes of the same species to one with genes on at most one chromosome per species would perform better in this measure than its orthogroup input. The numbers of Table 1 can consequently by themselves not reasonably establish the ”outperformance” claimed in line 89.

Line 88 appears to refer to the same results of Table 1 but formulates “percentage of … orthogroups” as opposed to “percent of genes”, which is of course different. I strongly suggest the authors use a precise definition of their central measures of accuracy, e.g. with a formula in which all terms are themselves unambiguously defined.

The methods are now described at length and include coding details on pipelining, thresholding and implementation flow. Admittedly, this reviewer had requested details on thelgorithmm, but had ideally expected a concise and precise formulation of the formal objectives that GENESPACE achieves. Most readers would arguably appreciate a more succinct formulation that is specific about where the main ideas of the program lie and to which high-level design decisions are made. For example, is the decision which homologous genes are syntenic based on all-versus-all pairwise comparisons of coordinates, on pairwise comparisons of neighboring species in a user-specified order, in an order chosen by the program, are comparisons only done all-against-reference or even are comparisons between two genomes informed by third genomes?

https://doi.org/10.7554/eLife.78526.sa1

Author response

Essential Revisions (for the authors):

1) While you will see that the reviewers generally liked the functionality of GENESPACE and support its publication in eLife, there were multiple concerns regarding unclear descriptions of its methodology. It is therefore necessary that you revise the manuscript very carefully and make it accessible to both experts (including details on the methods and more formal definitions) and non-experts (on the general way how GENESPACE works).

We have expanded the methods and worked to make descriptions clearer. See line-specific comments below.

Reviewer #1 (Recommendations for the authors):

While I support the publication of the manuscript, I have doubts that the manuscript in its current form is explaining well what GENESPACE is doing and how this is performed. While I do appreciate that the Results section tries to give an overview of the methodology, it is still very hard to follow as it is full of jargon and leaves out some important information, which only can be found in the methods.

We fully agree. The methodological descriptions in the results were simultaneously too in-theweeds/jargony and not detailed enough. We opted to address this issue with three big changes: (1) we add box 1 and Figure 1, which provide technical definitions and a visual description of how GENESPACE works, respectively; (2) we re-framed the first section of the results to provide less technical detail but more explicit descriptions of how the pipeline works at a high level, with references to processes detailed in Figure 1; and (3) we restructured the methods to first provide an initial high-level overview of the pipeline at the beginning of the methods section then a complete description of each step in the pipeline later in the methods.

How is synteny defined (line 105), which is the basis for dropping graph edges (which I assume are BLAST hits) before running OrthoFinder (while in Figure S1 orthofinder is run before the synteny module)?

This is an example of jargon that we have dropped. Now the methods are described as blast hits and not in graph terms (except where unavoidable), which hopefully makes it easier to follow. The methods to infer synteny are now fully laid out in Figure 1 and the methods [345-428].

Once the BLAST hits between non-syntenic regions are deleted OrthoFinder can be run without any adjustments, but still connect genes to OrthoGroup in syntenic regions only. However, the sentence in line 108 is not clear, how can BLAST hits in syntenic regions be merged with synteny-constrained Ogs (where BLAST hits in syntenic regions are actually the basis for synteny-constrained Ogs)? In this regard, it is not clear what the differences between synteny-constrained Ogs and within-block Ogs are.

Yes, another case of opaque methods in the previous version. We have hopefully improved this description [331-344]. This step has also been added to Figure 1.

Table 1. It is not clear what this table actually shows and requires guessing. (Why is this analysis done with nine chromosomes only?). It becomes more clear after rereading the later section that explains Table 1, but on the first occurrence of table 1 in the text, I got lost. The GENESPACE run using the tetraploid genome still relies on the synteny between the subgenomes in order to generate OG groups of size of 6? It is not fully clear what the differences between the tetraploid run and the split by subg. Runs are when synteny information between the subgenomes is also needed in the tetraploid run. Besides the description of this approach is very hard to follow. I also do not see why this approach reveals sensitivity – for example in the pure OrthoFinder runs some of the Ogs of size of 6 might be extended by in-paralogs which is (as far as I understand) the goal of OrthoFinder.

Thanks for this – we agree that the cotton subgenome comparison is not the best example. We have now revisited the sensitivity/precision estimates for GENESPACE and have followed reviewer #2’s advice and changed our estimates of sensitivity to pairs of genomes across a range of phylogenetic distances [81108]. We hope that these 1:1 comparisons are easier to follow. The confusing cotton-subgenome analyses have been dropped.

In a second paragraph, GENESPACE is also presented as a tool for the identification of syntenic regions (which presumably are the syntenic regions that the OG analysis are based on, however, this is not clear). GENESPACE goal is to filter BLAST hits in a way that regions are linked to only a single region in the other genome by “subsetting the BLAST hits to those 181 within the same orthogroups”. This again is not clear, does this imply that OrthoFinder results are used to filter the BLAST hits? The blast hits are then extended. Though there is more about this in the methods, it is not clear to me, why this leads to non-overlapping syntenic regions, specifically, as connected orthologs do not need to be right next to each other.

We have now directly addressed this with a specific example of how GENESPACE improves syntenic blocks in Figure 1. We hope this is clearer than the cotton comparisons.

I think the expression “concept of a pan-genome annotation” or just “pan-genome annotation” in itself is misleading. GENESPACE is a tool for ortholog identification and not for gene annotation. I see the point why it is called like this (all genes are projected on one reference genome), however, I actually expected that GENESPACE would annotate/correct gene structure annotations, which is the typical context of the term “annotation” in genomics.

We feel that “pan-genome annotation” is the logical extension: a genome annotation is the set of sequences that form genes in one genome, so a pan-genome annotation would be the set of sequences that form genes across multiple genomes. However, we understand the reviewer’s point and now provide an explicit definition up front. We do have scripts to do the gene model improvement/pseudogene finding that the reviewer suggests. We envision integrating these methods with GENESPACE in the next year or two.

It would be helpful to define (the differences of) synteny and collinearity (L49 and rest of manuscript).

We have added a definitions box that presents this distinction.

How is the exact location defined when projecting non-reference OG to reference positions?

We have hopefully dropped confusing text in the results / introduction. We have also provided extensive details about how this process works in the methods [438-463].

In multiple parts of the text, it is mentioned that tandem arrays are excluded or treated in a different way. Could this be clarified?

Yes, this was confusing. We have now addressed this directly in the main text, including a definition in box1, and expanded the description in the methods [364-370]. The example of the pan-genome annotation in Figure 1 also hopefully helps.

It is not clear how the “second proximity search step” in GENESPACE actually works.

We have excluded this complexity from the pipeline description in the main text and now go into far more detail in the methods [424-428, 499-509].

Where are the significant gains in single copy genes coming from? Why are they not already identified?

We now describe the problem/solution in more detail at the beginning of the Results section [62-78]. The main difference can be observed in Figure 1, showing the difference between a single-step MCScanX and GENESPACE.

Reviewer #2 (Recommendations for the authors):

The code and data are available on GitHub.

The algorithm could be described more clearly. For this, it would probably help to introduce key definitions more formally. Below is a list of questions I was left with after reading the manuscript.

We have added a definitions box and provided a much expanded description of the pipeline in the methods.

When you use ‘syntenic regions’, are regions in multiple genomes meant or always only in two genomes? If appropriate you could qualify the term according to the usage.

This is a good question – synteny itself is always a pairwise pattern, but one that can be chained among multiple genomes. We have clarified this distinction when explaining the methods [354-356].

The riparian plot appears to use only n-1 pairs of genomes out of the n*(n-1) pairs of different genomes for which syntenic regions were computed. What is the formal criterion or algorithm for that choice that was made? The respective comments in the Figure captions are not clear to me.

We are not sure we follow this question – here, we reply to the question: “how does GENESPACE choose the chained order of genomes in the riparian plot”. If this isn’t right, please clarify and we are happy to respond in kind. The riparian plot genome order is a query of all possible pairwise combinations to the chain specified by the user in the genomeIDs parameter (default is the order specified in the init_genespace step). The user can apply any algorithm or criteria they wish to order the genomes. For example, one option would be a TSP solver path to minimize the number of inversions. We have hopefully clarified the figure captions [585-586, 592]

What does GENESPACE do if a gene or region is missing in reference genome?

In terms of the synteny step: PAV is invisible in pairwise synteny methods since we deal with gene rankorder position that is collapsed to genes with solid (usually orthogroup-constrained) hits between the two genomes. That is, if genes 1-3 and 8-10 have direct hits between genome A and B, but genes 4-5 have no hit in genome B and 6-7 have no hits in genome A, GENESPACE synteny() would only see genes 1-3,810 ordered 1,2,3,4,5,6. We now describe this at a high level in Figure 1 and in detail in the methods [390391, 400-405].

In terms of the pangenome step: the syntenic position of all genes (within syntenic block breakpoints) are interpolated against the reference (see new detailed methods [438-463]). For a syntenic orthogroup that lacks a reference genome gene, the median interpolated position is used. If multiple high-confidence interpolated positions exist, that orthogroup is given multiple positions (e.g. in a polyploid).

How are the syntenic regions extracted and how is can this be adjusted by the user (line 620)?

We now provide a full description of the synteny pipeline in the methods [345-428] and specifically address syntenic region extraction from raw blast hits to output.

I do not understand how ‘collinear arrays’ are defined (line 635). What does it mean for a group of genes to share an orthogroup?

This was confusing and hopefully has been remedied by using the existing term “tandem” instead of collinear. While a physically proximate group of genes in the same orthogroup (aka “share an orthogroup” … now changed) is not necessarily a tandem array, all true tandem arrays will fall into this category, so we feel the definition is ok, considering that tandem is a far more well understood term.

How is the step done from pairwise synteny to synteny of multiple genomes? Is this done at all? The manuscript reads around line 108 as if OrthoFinder is (re)run on syntenic regions. This makes sense if there are in general multiple (>2) regions considered to be mutually syntenic with one another. If that is the case, how do you address the choice of having a few long syntenic regions versus many short syntenic regions when both are plausible and the choices share a subset of genes?

All steps related to synteny are done in pairs. Overall synteny is never considered and only concatenated into the pan-genome annotation. This step lets a user query the syntenic region across any number of genomes.

Can there be two genes in an orthogroup that GENESPACE outputs that are not syntenic within their pair of genomes?

Non-syntenic orthologs are added to the pan-genome (see Figure 1 for examples and methods [464-482]). All hits, regardless of whether they are syntenic or in the same orthogroup are returned as well.

Does GENESPACE scale quadratically with the number of genomes?

Default OrthoFinder scales quadratically (it does all v. all blast). GENESPACE and orthofinderMethod = “fast” scales n+(n choose 2) since pairwise synteny is only inferred in one direction.

Why are for two genomes g1 and g2 both (g1, g2) and (g2, g1) contained, e.g. in Supplementary File 1?

I wonder if there was an issue with naming/uploading of the SI data in the previous submission. It looks good on my end, but here and below, there is some confusion – SI data 1 should be the pan-genome annotation for the vertebrates against the human coordinate system. I think the reviewer is probably referring to SI data 3, 4 etc which are syntenic block breakpoints and should be a long-formatted symmetric table. We now do not upload the SI as individual files but as a single zip archive with the file names retained. That way there should be no confusion.

(Why) is it not symmetric?

I have checked and block breakpoint tables are coercible to symmetric arrays [genome1, genome2, block].

What are the roles of first and second genomes?

This is for backwards compatibility with the phytozome visualization tool. Plus, it lets the user query by their genomes of interest without having to know which served as the target and which the query.

Table 1: The row labels are hard to make sense of. Are any rows the results of GENESPACE?

Dropped.

Figure 1: The human Y chromosome is in the figure caption but not in the image. The same holds for Supplementary Figure 3.

Good catch. There is no syntenic to the human Y. We have dropped it from the captions.

OrthoFinder is inconsistently spelled as OrthoFinder, Orthofinder and orthofinder.

Fixed to the correct name throughout.

Description of Supplemental Data 1. There are no columns pgChr or pgOrd in Lovell_09-03-2022-TR-eLife-78526_Supplementary_File_1.txt

Apparently the order of files and descriptions is not consistent with each other.

Yes, there was something broken in how SI tables were specified (see above). We have attempted to fix this by just uploading a single.zip archive with all SI data.

Line items:

43: grammar

162: What does ‘contrasted’ mean here?

173: What does ‘dosage’ mean here?

713: Add ‘of dotplots’ as the link apparently does not contain all of GENESPACEs results.

These have been fixed

[Editors' note: further revisions were suggested prior to acceptance, as described below.]

Reviewer #1 (Recommendations for the authors):

The authors have addressed many of my concerns by rewriting large parts of the manuscript, which significantly improved the description of what GENESPACE does and how it works. In particular the step-by-step descriptions in the methods worked well for me. I fully support publication of this manuscript, though there are some small suggestions below.

In the first Results section, I would have found it helpful to read one sentence on what OrthoFinder and MCScanX do (resp. what they output).

We have re-worked the intros to synteny and orthology to include references to these programs [65-67].

Line 71: This sentence was still not clear to me. What is the gene rank that is recalculated, and why would a recalculation make genes with PAV?

This wasn’t immediately obvious. We have updated the description to hopefully be clearer [76-81].

Table 2: no horizontal bars visible for me as indicated in the legend.

We have removed this statement as it referred to an earlier version of the table. Thanks for the catch.

Reviewer #2 (Recommendations for the authors):

In the revision, the authors have made substantial changes. In particular the first two tables were removed from the original manuscript. Lovell et al., have added a new Table 1, in which they compare their tool GENESPACE to OrthoFinder and MCScanX on pairs of genome annotations, and an in-depth description of their implementation steps.

The example plots are convincing, not so the benchmark in its current form. I am missing a measure of sensitivity or coverage as well as a measure of how much the orthogroups respect synteny.

We are interpreting this as “how many orthologous gene pairs are not syntenic”. GENESPACE provides these as flagged entries in the pan-genome annotation. We mentioned this in the methods and include examples in Figure 1. We have now included statistics about the number of non-syntenic orthologs in a few comparisons [112-119]. Since GENESPACE provides the non-syntenic orthologs to the user, their presence is not problematic and can be interpreted however the user sees fit.

In the very important sentence in line 87 it is grammatically ambiguous to what 'that' refers to. It should be reformulated so that it is unambiguous even for readers not understanding the authors intention yet.

The sentence in line 87 reads: “To estimate accuracy and precision of each approach, we calculated the percent of genes in orthogroups and syntenic blocks that were placed on exactly one chromosome per genome (Table 1).” It seems clear to us that “that” refers to both orthogroups and syntenic blocks, especially since table 1 shows both these attributes. Nonetheless, we have modified this sentence slightly and added (a) and (b) to the table to remove any ambiguity [96-97].

I assume 'that' in line 87 qualifies 'orthogroup'. Under that assumption, a trivial and very insensitive bogus orthology finder that outputs in a genome-wide search for synteny a single orthogroup with (any) two genes in two different species would achieve 2/2=100% of what the authors refer to as "accuracy and precision". A little less bogus, a trivial filter method that randomly (and independent of synteny) reduces an orthogroup with multiple genes from different chromosomes of the same species to one with genes on at most one chromosome per species would perform better in this measure than its orthogroup input. The numbers of Table 1 can consequently by themselves not reasonably establish the "outperformance" claimed in line 89.

This example would be appropriate if GENESPACE made orthogroups de novo. However, it does not. We have clarified this [81]; GENESPACE uses OrthoFinder orthogroups as the starting point. Since comparisons in Table 1 are made solely between the GENESPACE-refined orthogroups and raw OrthoFinder orthogroups, measuring the proportion of orthogroups that match the expected evolutionary patterns is the most obvious and non-trivial measure we can define.

Line 88 appears to refer to the same results of Table 1 but formulates "percentage of … orthogroups" as opposed to "percent of genes", which is of course different. I strongly suggest the authors use a precise definition of their central measures of accuracy, e.g. with a formula in which all terms are themselves unambiguously defined.

We feel that a formal formula is overkill, since it is a simple statistic that we define clearly in the caption: “We present the percent of genes that were found in an orthogroup that hit a single chromosome per genome…”. We have altered the text to clarify that the values are in terms of genes, not orthogroups [98-99]. Thanks for this catch.

The methods are now described at length and include coding details on pipelining, thresholding and implementation flow. Admittedly, this reviewer had requested details on the algorithm, but had ideally expected a concise and precise formulation of the formal objectives that GENESPACE achieves. Most readers would arguably appreciate a more succinct formulation that is specific about where the main ideas of the program lie and to which high-level design decisions are made. For example, is the decision which homologous genes are syntenic based on all-versus-all pairwise comparisons of coordinates, on pairwise comparisons of neighboring species in a user-specified order, in an order chosen by the program, are comparisons only done all-against-reference or even are comparisons between two genomes informed by third genomes?

In the revision, we significantly expanded both the overall description (as previously suggested by reviewer 1) and pipeline details (previously suggested by reviewer 2). The high-level decisions are all outlined in the new figure 1 and the formal objectives both in the second paragraph of the results [72-82] and second paragraph of the methods [280-288]. The example reviewer 2 provides here represents a pipeline detail that is thoroughly described in the methods [e.g. 364-365].

https://doi.org/10.7554/eLife.78526.sa2

Article and author information

Author details

  1. John T Lovell

    1. Genome Sequencing Center, HudsonAlpha Institute for Biotechnology, Huntsville, United States
    2. Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, United States
    Contribution
    Conceptualization, Data curation, Software, Formal analysis, Visualization, Methodology, Writing – original draft, Project administration, Writing – review and editing
    For correspondence
    jlovell@hudsonalpha.org
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-8938-1166
  2. Avinash Sreedasyam

    Genome Sequencing Center, HudsonAlpha Institute for Biotechnology, Huntsville, United States
    Contribution
    Conceptualization, Software, Formal analysis, Visualization, Methodology, Writing – original draft, Writing – review and editing
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0001-7336-7012
  3. M Eric Schranz

    Biosystematics Group, Wageningen University and Research, Wageningen, Netherlands
    Contribution
    Conceptualization, Formal analysis, Methodology, Writing – review and editing
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0001-6777-6565
  4. Melissa Wilson

    Center for Evolution and Medicine, School of Life Sciences, Arizona State University, Tempe, United States
    Contribution
    Formal analysis, Visualization, Methodology, Writing – original draft, Writing – review and editing
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-2614-0285
  5. Joseph W Carlson

    Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, United States
    Contribution
    Data curation, Formal analysis, Methodology, Writing – original draft, Writing – review and editing
    Competing interests
    No competing interests declared
  6. Alex Harkess

    1. Genome Sequencing Center, HudsonAlpha Institute for Biotechnology, Huntsville, United States
    2. Department of Crop, Soil, and Environmental Sciences, Auburn University, Auburn, United States
    Contribution
    Conceptualization, Investigation, Writing – original draft, Writing – review and editing
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-2035-0871
  7. David Emms

    Oxford University, Oxford, United Kingdom
    Contribution
    Software, Formal analysis, Methodology
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-9065-8978
  8. David M Goodstein

    Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, United States
    Contribution
    Conceptualization, Software, Supervision, Funding acquisition, Visualization
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0001-6287-2697
  9. Jeremy Schmutz

    1. Genome Sequencing Center, HudsonAlpha Institute for Biotechnology, Huntsville, United States
    2. Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, United States
    Contribution
    Conceptualization, Software, Funding acquisition, Methodology, Writing – original draft, Project administration, Writing – review and editing
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0001-8062-9172

Funding

U.S. Department of Energy (DE-AC02-05CH1123)

  • John T Lovell
  • Avinash Sreedasyam
  • Joseph W Carlson
  • David M Goodstein
  • Jeremy Schmutz

National Institute of General Medical Sciences (R35GM124827)

  • Melissa Wilson

The funders had no role in study design, data collection, and interpretation, or the decision to submit the work for publication.

Acknowledgements

The GENESPACE pipeline has been improved by advice and testing by A Healey, N Walden, V Scarlett, R Walstead, S Carey, L Smith, J Vogel, J Willis, J Jenkins, T Juenger, and many others. Thanks to J Schnable, J Leebens-Mack, JG Monroe, CH Li, R Tarvin, and M Hufford for help refining the datasets and analyses presented in this manuscript. Thank you to Erich D Jarvis and the Vertebrate Genome Project members for advice and prepublication access to several genomes (budgerigar and dolphin). The work conducted by the US Department of Energy Joint Genome Institute is supported by the Office of Science of the US Department of Energy under Contract No. DE-AC02-05CH1123. Visualization was inspired in part by MCScanX and pairwise ‘river’ plots generated by other software. The use of syntenic orthogroups was originally inspired by work developed by CoGe; similar syntenic homology approaches have been implemented by other software, including pSONIC. MAW’s work on this was supported by the National Institute of General Medical Sciences (NIGMS) of the National Institutes of Health (NIH) grant R35GM124827. JTL would like to thank Ashley Lovell, our friends and family for their support, which allowed him to work on this project during the difficult past 2 years.

Senior and Reviewing Editor

  1. Detlef Weigel, Max Planck Institute for Biology Tübingen, Germany

Publication history

  1. Received: March 10, 2022
  2. Preprint posted: March 11, 2022 (view preprint)
  3. Accepted: July 29, 2022
  4. Version of Record published: September 9, 2022 (version 1)

Copyright

© 2022, Lovell et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Metrics

  • 3,458
    Page views
  • 552
    Downloads
  • 0
    Citations

Article citation count generated by polling the highest count across the following sources: Crossref, PubMed Central, Scopus.

Download links

A two-part list of links to download the article, or parts of the article, in various formats.

Downloads (link to download the article as PDF)

Open citations (links to open the citations from this article in various online reference manager services)

Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)

  1. John T Lovell
  2. Avinash Sreedasyam
  3. M Eric Schranz
  4. Melissa Wilson
  5. Joseph W Carlson
  6. Alex Harkess
  7. David Emms
  8. David M Goodstein
  9. Jeremy Schmutz
(2022)
GENESPACE tracks regions of interest and gene copy number variation across multiple genomes
eLife 11:e78526.
https://doi.org/10.7554/eLife.78526

Further reading

    1. Biochemistry and Chemical Biology
    2. Evolutionary Biology
    Audrey A Burnim, Matthew A Spence ... Nozomi Ando
    Research Article Updated

    Ribonucleotide reductases (RNRs) are used by all free-living organisms and many viruses to catalyze an essential step in the de novo biosynthesis of DNA precursors. RNRs are remarkably diverse by primary sequence and cofactor requirement, while sharing a conserved fold and radical-based mechanism for nucleotide reduction. Here, we structurally aligned the diverse RNR family by the conserved catalytic barrel to reconstruct the first large-scale phylogeny consisting of 6779 sequences that unites all extant classes of the RNR family and performed evo-velocity analysis to independently validate our evolutionary model. With a robust phylogeny in-hand, we uncovered a novel, phylogenetically distinct clade that is placed as ancestral to the classes I and II RNRs, which we have termed clade Ø. We employed small-angle X-ray scattering (SAXS), cryogenic-electron microscopy (cryo-EM), and AlphaFold2 to investigate a member of this clade from Synechococcus phage S-CBP4 and report the most minimal RNR architecture to-date. Based on our analyses, we propose an evolutionary model of diversification in the RNR family and delineate how our phylogeny can be used as a roadmap for targeted future study.

    1. Ecology
    2. Evolutionary Biology
    Julian Melgar, Mads F Schou ... Charlie K Cornwallis
    Research Article

    Cooperative breeding allows the costs of parental care to be shared, but as groups become larger, such benefits often decline as competition increases and group cohesion breaks down. The counteracting forces of cooperation and competition are predicted to select for an optimal group size, but variation in groups is ubiquitous across cooperative breeding animals. Here, we experimentally test if group sizes vary because of sex differences in the costs and benefits of cooperative breeding in captive ostriches, Struthio camelus, and compare this to the distribution of group sizes in the wild. We established 96 groups with different numbers of males (1 or 3) and females (1, 3, 4, or 6) and manipulated opportunities for cooperation over incubation. There was a clear optimal group size for males (one male with four or more females) that was explained by high costs of competition and negligible benefits of cooperation. Conversely, female reproductive success was maximised across a range of group sizes due to the benefits of cooperation with male and female group members. Reproductive success in intermediate sized groups was low for both males and females due to sexual conflict over the timing of mating and incubation. Our experiments show that sex differences in cooperation and competition can explain group size variation in cooperative breeders.