GENESPACE tracks regions of interest and gene copy number variation across multiple genomes

  1. John T Lovell (corresponding author)
  2. Avinash Sreedasyam
  3. M Eric Schranz
  4. Melissa Wilson
  5. Joseph W Carlson
  6. Alex Harkess
  7. David Emms
  8. David M Goodstein
  9. Jeremy Schmutz
  1. Genome Sequencing Center, HudsonAlpha Institute for Biotechnology, United States
  2. Joint Genome Institute, Lawrence Berkeley National Laboratory, United States
  3. Biosystematics Group, Wageningen University and Research, Netherlands
  4. Center for Evolution and Medicine, School of Life Sciences, Arizona State University, United States
  5. Department of Crop, Soil, and Environmental Sciences, Auburn University, United States
  6. Oxford University, United Kingdom
4 figures, 3 tables and 1 additional file

Figures

Figure 1
GENESPACE synteny and pan-genome annotation methods.

(A, grey panel) GENESPACE runs and parses OrthoFinder results into a synteny-constrained pan-genome annotation. (B, purple panel) Chromosome, gene rank order, and orthogroup membership are added to BLAST hits, which allows direct integration between estimates of orthology and synteny. The three dotplots present the efficacy of GENESPACE syntenic blocks by exploring a particularly challenging region on human (x-axis) and chimpanzee (y-axis) chr. 6. Each point is a BLAST hit rank-order position, colored by syntenic block; colors are recycled if there are more than eight blocks. (C, green panel) Synteny-constrained orthogroups and optionally non-syntenic orthologs are decomposed into a pan-genome annotation where each orthogroup is placed at its inferred syntenic position.
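
The annotation step described for panel B can be sketched in a few lines of Python. This is an illustrative sketch only, not GENESPACE's own implementation (GENESPACE is an R package). The function names, the tuple layouts, and the toy gene IDs are all hypothetical, and rank order is assumed to mean the 1-based position of a gene along its chromosome when sorted by start coordinate.

```python
from collections import defaultdict


def gene_rank_order(genes):
    """Map gene ID -> (chromosome, rank).

    `genes` is an iterable of (gene_id, chromosome, start) tuples.
    Rank is the 1-based position along the chromosome, sorted by start.
    """
    by_chr = defaultdict(list)
    for gene_id, chrom, start in genes:
        by_chr[chrom].append((start, gene_id))
    ranks = {}
    for chrom, members in by_chr.items():
        for rank, (_, gene_id) in enumerate(sorted(members), start=1):
            ranks[gene_id] = (chrom, rank)
    return ranks


def annotate_hits(hits, query_ranks, target_ranks, orthogroup):
    """Attach chromosome, rank order, and orthogroup membership to hits.

    `hits` is an iterable of (query_gene, target_gene, bit_score) tuples;
    `orthogroup` maps gene ID -> orthogroup ID (hypothetical layout).
    """
    annotated = []
    for query, target, score in hits:
        q_chr, q_rank = query_ranks[query]
        t_chr, t_rank = target_ranks[target]
        q_og = orthogroup.get(query)
        annotated.append({
            "query": query, "target": target, "score": score,
            "query_chr": q_chr, "query_rank": q_rank,
            "target_chr": t_chr, "target_rank": t_rank,
            # True only when both genes belong to the same orthogroup
            "same_og": q_og is not None and q_og == orthogroup.get(target),
        })
    return annotated


# Toy data: three genes per genome on one chromosome (made-up IDs).
human = [("hA", "chr6", 100), ("hB", "chr6", 500), ("hC", "chr6", 900)]
chimp = [("cA", "chr6", 50), ("cB", "chr6", 400), ("cC", "chr6", 800)]
ogs = {"hA": "OG1", "cA": "OG1", "hB": "OG2", "cB": "OG2",
       "hC": "OG3", "cC": "OG4"}
hits = [("hA", "cA", 250.0), ("hB", "cB", 310.0), ("hC", "cC", 90.0)]

annotated = annotate_hits(hits, gene_rank_order(human),
                          gene_rank_order(chimp), ogs)
# Dotplot coordinates (query rank, target rank) for same-orthogroup hits.
dotplot_points = [(h["query_rank"], h["target_rank"])
                  for h in annotated if h["same_og"]]
print(dotplot_points)
```

Keeping rank order and orthogroup membership on the same record is what lets a downstream synteny step filter hits by orthology before searching for collinear runs.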

Figure 2 with 1 supplement
Sex chromosome syntenic network across 17 representative vertebrate genomes.

The plot was generated by the plot_riparian GENESPACE function. Genomes are ordered vertically to minimize the number of translocations between each pairwise combination. Chromosomes are ordered horizontally to maximize synteny with the human chromosomes [X, 1–22]. Regions containing syntenic orthogroup members to the mammalian X (gold) or avian Z (blue) chromosomes are highlighted. All sex chromosomes are represented by red segments while autosomes are white. Chromosome segment sizes are scaled by the total number of genes in syntenic networks and positions of the braids are the gene order along the chromosome sequence. See Figure 2—figure supplement 1 for the full synteny graph including autosomes and chromosome labels.

Figure 2—figure supplement 1
Complete map of synteny, color-coded by synteny with human chromosomes X, 1–22.

Figure 3 with 1 supplement
Comparative–quantitative genomics in the grasses.

(A) The GENESPACE syntenic map (‘riparian plot’) of orthologous regions among eight grass genomes. Chromosomes are ordered horizontally to maximize synteny with rice, and ribbons are color coded by synteny to rice chromosomes. Genomes are ordered vertically by general phylogenetic position. (B) The upper bars display the proportion of maize gene models without syntenic orthologs (‘absent’) in each genome, split by the full background (dark colors) and 86 C3/C4 genes (light colors). (C) The proportion of absent genes is higher in the C3 genomes (green bars), even when controlling for more global gene absences (lower odds ratios). (D) Syntenic orthologs, excluding homeologs, among the 26 maize nested association mapping (NAM) founder genomes, with two quantitative trait loci (QTL) intervals highlighted on chromosome 3 (‘Chr3’) and chromosome 6 (‘Chr6’). (E) Focal QTL regions that affect productivity under drought. Only the genome that drives the QTL effect (middle) and the top (B73) and bottom (Tzi8) genomes are presented, and the plotted region is restricted to the physical B73 QTL interval plus a 25 Mbp buffer on either side. Note that the Chr3 QTL disarticulates into two intervals. Because it contains a larger number of potential candidate genes, the larger Chr3 region, flagged with **, is explored separately in Figure 3—figure supplement 1. (F) Presence–absence and copy number variation are presented for two of the three intervals as heatmaps where each row is a genome (order following panel D), each column is a pan-genome entry (see Figure 1), and the color of each tile indicates absence (gray), single copy (light blue), or multicopy (dark blue). PAV/CNV of the focal genome is outlined. For each interval, the estimated QTL allelic effect of each genome relative to B73 is plotted as bars to the right of the heatmap.
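
The three tile states in panel F (absent, single copy, multicopy) amount to a simple classification of per-genome copy counts for each pan-genome entry. The sketch below illustrates that classification under an assumed data layout: a dictionary mapping each pan-genome entry to the gene IDs found in each genome. The layout, function names, entry IDs, and gene IDs are hypothetical and are not taken from GENESPACE output.

```python
def classify_copy_number(n_copies):
    """Return the heatmap state for a gene copy count."""
    if n_copies < 0:
        raise ValueError("copy number cannot be negative")
    if n_copies == 0:
        return "absent"
    if n_copies == 1:
        return "single copy"
    return "multicopy"


def pav_cnv_matrix(pangenome, genomes):
    """Build one row of states per genome, one column per pan-genome entry.

    `pangenome` maps entry ID -> {genome: [gene IDs]} (assumed layout).
    Column order follows the insertion order of `pangenome`.
    """
    return {
        genome: [classify_copy_number(len(members.get(genome, [])))
                 for members in pangenome.values()]
        for genome in genomes
    }


# Toy pan-genome with made-up gene IDs for two of the NAM founders.
pangenome = {
    "entry_1": {"B73": ["b1"], "Tzi8": ["t1", "t2"]},
    "entry_2": {"B73": ["b2"]},
    "entry_3": {"B73": ["b3", "b4", "b5"], "Tzi8": ["t3"]},
}
matrix = pav_cnv_matrix(pangenome, ["B73", "Tzi8"])
for genome, row in matrix.items():
    print(genome, row)
```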

Figure 3—figure supplement 1
Map of PAV in the larger MO18W chromosome 3 QTL.

Figure 4
Analysis of the grass Rho WGD.

(A) BLAST hits between P. hallii and S. viridis where the target and query genes were in the same orthogroup are plotted and color coded by sequence similarity. Two over-retained regions are highlighted in the red and yellow boxes. (B) The protein identity of the primary orthologous hits (blue line) of S. viridis chromosome 8 against P. hallii chromosome 8 and of the secondary hits (orange line) against P. hallii chromosome 3 demonstrates heterogeneity in sequence conservation. The region between the two red vertical lines corresponds to the red-boxed over-retained primary block in panel A. (C) The two boxed regions in panel A were tracked from P. hallii chromosomes 3 (red) and 8 (yellow); because the braids are 50% transparent, overlapping regions appear orange.

Tables

Table 1
Comparison of synteny and orthogroup methods.

To test the precision of GENESPACE syntenic orthogroup estimates, we contrasted seven pairs of haploid genome assemblies. We present the percentage of genes that were found in an orthogroup that hit a single chromosome per genome from the default OrthoFinder and GENESPACE runs. The precision of syntenic block breakpoint estimates was calculated similarly: the percentage of genes that are placed in a single syntenic block per genome is presented for MCScanX run on all hits, for MCScanX run only on hits where both the query and target genes are in the same orthogroup (‘OG’), and for the GENESPACE pipeline.

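(a) % genes in single-copy OGs; (b) % genes in single-copy syntenic blocks

| Comparison | Age (~M ya) | (a) OrthoFinder | (a) GENESPACE | (b) MCScanX | (b) MCScanX OG | (b) GENESPACE |
|---|---|---|---|---|---|---|
| B73 vs. B97 maize* | <0.01 | 51.5 | 73.6 | 50.8 | 79.0 | 93.4 |
| Human Hg38 vs. T2T | 0–0.1 | 87.7 | 95.9 | 81.1 | 95.0 | 97.7 |
| Cotton*,+ | 0.5 | 35.6 | 85.7 | 2.7 | 14.1 | 96.2 |
| HAL2 vs. FIL2 panicgrass* | 1.1 | 74.8 | 83.2 | 62.3 | 89.3 | 92.0 |
| Human-chimpanzee | 7 | 81.1 | 90.2 | 78.6 | 91.2 | 93.3 |
| Sorghum-Brachypodium* | 50 | 46.7 | 50.2 | 49.3 | 67.4 | 76.3 |
| Human-chicken | 310 | 66.7 | 68.5 | 66.4 | 71.2 | 73.0 |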
* The plant genomes all have one or more WGDs that predate divergence of the genomes.
+ Cotton species Gossypium barbadense and G. darwinii have the most recent WGD, at ~1.6 M ya, which causes a large number of blocks to be included as two copies; to avoid confusion between subgenomes, the blkSize and nGaps parameters were increased from 5 (default) to 10 genes.
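
One way to read the percentages in Table 1 is sketched below. The sketch assumes that a group (an orthogroup for panel a, or the set of genes linked by syntenic blocks for panel b) counts as single-placement when its members fall on exactly one location (chromosome or block) within every genome, and that the denominator is every gene record passed in. Both assumptions, the record layout, and the function name are hypothetical and may differ from the calculation used for the published values.

```python
from collections import defaultdict


def percent_single_placement(records):
    """Percent of genes in groups that hit one location per genome.

    `records` is an iterable of (gene_id, genome, group_id, location)
    tuples, where location is a chromosome or a syntenic block ID.
    """
    locations = defaultdict(lambda: defaultdict(set))
    group_sizes = defaultdict(int)
    total = 0
    for _gene, genome, group, location in records:
        locations[group][genome].add(location)
        group_sizes[group] += 1
        total += 1
    if total == 0:
        return 0.0
    # Count genes whose group touches exactly one location in each genome.
    n_single = sum(
        size for group, size in group_sizes.items()
        if all(len(locs) == 1 for locs in locations[group].values())
    )
    return 100.0 * n_single / total


# Toy example: OG1 is confined to one chromosome per genome,
# while OG2 hits two chromosomes in genome A.
records = [
    ("g1", "A", "OG1", "chr1"),
    ("g2", "B", "OG1", "chr1"),
    ("g3", "A", "OG2", "chr1"),
    ("g4", "A", "OG2", "chr2"),
    ("g5", "B", "OG2", "chr3"),
]
pct = percent_single_placement(records)
print(pct)
```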

Table 2
Raw data sources.

A list of the genomes used in analyses here. Genome version IDs are taken from those posted on the respective data sources and may not reflect the name of the genome in the publication. Where multiple haplotypes are available, only the primary was used for these analyses. All polyploids presented here have only a primary haplotype assembled into chromosomes.

| ID | Species | Genome version | Data source | Ploidy* | Reference |
|---|---|---|---|---|---|
| garter snake | Thamnophis elegans | rThaEle1.pri | NCBI | 1 | Rhie et al., 2021 |
| sand lizard | Lacerta agilis | rLacAgi1.pri | NCBI | 1 | Rhie et al., 2021 |
| chicken | Gallus gallus | mat.broiler.GRCg7b | NCBI | 1 | https://www.ncbi.nlm.nih.gov/grc |
| hummingbird | Calypte anna | bCalAnn1_v1.p | NCBI | 1 | Rhie et al., 2021 |
| budgie | Melopsittacus undulatus | bMelUnd1.mat.Z | NCBI | 1 | Unpublished VGP |
| swan | Cygnus olor | bCygOlo1.pri.v2 | NCBI | 1 | Rhie et al., 2021 |
| zebra finch | Taeniopygia guttata | bTaeGut1.4.pri | NCBI | 1 | Rhie et al., 2021 |
| echidna | Tachyglossus aculeatus | mTacAcu1.pri | NCBI | 1 | Zhou et al., 2021 |
| platypus | Ornithorhynchus anatinus | mOrnAna1.pri.v4 | NCBI | 1 | Zhou et al., 2021 |
| brushtail possum | Trichosurus vulpecula | mmTriVul1.pri | NCBI | 1 | Rhie et al., 2021 |
| opossum | Monodelphis domestica | MonDom5 | NCBI | 1 | Mikkelsen et al., 2007 |
| Tasmanian devil | Sarcophilus harrisii | mSarHar1.11 | NCBI | 1 | Rhie et al., 2021 |
| human (Hg38) | Homo sapiens | GRCh38.p13 | NCBI | 1 | https://www.ncbi.nlm.nih.gov/grc |
| human (t2t) | Homo sapiens | CHM13-T2T v2.1 | NCBI | 1 | Nurk et al., 2022 |
| chimpanzee | Pan troglodytes | Clint_PTRv2 | NCBI | 1 | Chimpanzee Sequencing and Analysis Consortium, 2005 |
| mouse | Mus musculus | GRCm39 | NCBI | 1 | https://www.ncbi.nlm.nih.gov/grc |
| dog | Canis lupus familiaris | Dog10K_Boxer_Tasha | NCBI | 1 | Jagannathan et al., 2021 |
| sloth | Choloepus didactylus | mChoDid1.pri | NCBI | 1 | Rhie et al., 2021 |
| horseshoe bat | Rhinolophus ferrumequinum | mRhiFer1_v1.p | NCBI | 1 | Rhie et al., 2021 |
| dolphin | Tursiops truncatus | mTurTru1.mat.Y | NCBI | 1 | Unpublished VGP |
| P. hallii | Panicum hallii var. hallii | HAL2_v2.1 | Phytozome | 1 | Lovell et al., 2018 |
| P. hallii (FIL) | Panicum hallii var. filipes | FIL2_v3.1 | Phytozome | 1 | Lovell et al., 2018 |
| switchgrass | Panicum virgatum | AP13_v5.1 | Phytozome | 2 | Lovell et al., 2021b |
| S. viridis | Setaria viridis | v2.1 | Phytozome | 1 | Mamidi et al., 2020 |
| Sorghum | Sorghum bicolor | BTx623_v3.1 | Phytozome | 1 | Paterson et al., 2009 |
| maize | Zea mays | B73_refgen_v5 | NCBI | *2 | Hufford et al., 2021 |
| rice | Oryza sativa cv ‘kitaake’ | kitaake_v2.1 | Phytozome | 1 | Jain et al., 2019 |
| Brachypodium | Brachypodium distachyon | Bd21_v3.1 | Phytozome | 1 | International Brachypodium Initiative, 2010 |
| wheat | Triticum aestivum | V4 (Chinese Spring) | NCBI | 3 | Zhu et al., 2021 |
| G. barbadense | Gossypium barbadense | v1.1 | Phytozome | 2 | Chen et al., 2020 |
| G. darwinii | Gossypium darwinii | v1.1 | Phytozome | 2 | Chen et al., 2020 |
| 26 NAM parents | Zea mays | see data on NCBI | NCBI | *1 | Hufford et al., 2021 |
* Ploidy indicates how the genome was treated in the analyses. All values match the ploidy of the primary assembly haplotype except maize, where refgen_v5 was treated as diploid (to match both homeologs) in the multispecies run, but as haploid in the nested association mapping (NAM) founder population to track only meiotic homologs across the population. This parameterization matches the phylogenetic position of the whole-genome duplication (WGD), which lies on the terminal branch in the grass-wide analysis but is ancestral in the 26-NAM analysis.

Table 3
Comparison of GENESPACE setting performance.

The mirrored ‘fast’ method significantly speeds up OrthoFinder runs by calling DIAMOND2 on each nonredundant pairwise combination of genomes. However, this approach is less sensitive than the default and is suggested only for closely related haploid genomes: it recovers fewer 2:2:2 OGs than the default specification.

| Metric | Default OrthoFinder | GENESPACE ‘fast’ |
|---|---|---|
| n. 1:1:1 OGs | 22,050 | 22,444 |
| n. 2:2:2 OGs | 13,793 | 13,511 |
| n. tandem arrays | 10,597 (4433) | 10,599 (4426) |
| *Run time (min) | 59.95 | 12.45 |
* Run time is for ortholog/orthogroup inference (not the GENESPACE pipeline as a whole) using the three cotton genomes, running on six 2 Gb cores.
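
The saving from searching only nonredundant genome pairs can be illustrated by counting searches. The sketch below assumes that a default all-vs-all run searches every ordered pair of genomes (including self-comparisons) and that the mirrored approach searches each unordered pair once, then reuses the result in the reverse direction by swapping query and target. Whether self-comparisons are handled this way is an assumption, the genome labels are placeholders, and no DIAMOND call is made.

```python
from itertools import combinations_with_replacement, product


def all_vs_all(genomes):
    """Every ordered (query, target) genome pair, including self-pairs."""
    return list(product(genomes, repeat=2))


def nonredundant_pairs(genomes):
    """Each unordered genome pair once, including self-pairs."""
    return list(combinations_with_replacement(genomes, 2))


def mirror(hits):
    """Swap query and target so one search serves both directions.

    `hits` is a list of (query_gene, target_gene, score) tuples.
    """
    return [(target, query, score) for query, target, score in hits]


# Placeholder labels for three genomes; not the actual genome IDs.
genomes = ["cotton_1", "cotton_2", "cotton_3"]
n_default = len(all_vs_all(genomes))
n_fast = len(nonredundant_pairs(genomes))
print(n_default, n_fast)

# One forward search result reused as the reverse-direction result.
forward = [("gene_a", "gene_x", 120.0), ("gene_b", "gene_y", 95.5)]
reverse = mirror(forward)
print(reverse)
```

Under these assumptions, n genomes need n(n + 1)/2 searches instead of n², so the relative saving grows with the number of genomes. A mirrored hit table is not guaranteed to match a true reverse search, which is one plausible reason for reduced sensitivity.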

John T Lovell, Avinash Sreedasyam, M Eric Schranz, Melissa Wilson, Joseph W Carlson, Alex Harkess, David Emms, David M Goodstein, Jeremy Schmutz (2022)
GENESPACE tracks regions of interest and gene copy number variation across multiple genomes
eLife 11:e78526.
https://doi.org/10.7554/eLife.78526