Dynamic turnover of centromeres drives karyotype evolution in Drosophila

  1. Ryan Bracewell
  2. Kamalakar Chatla
  3. Matthew J Nalley
  4. Doris Bachtrog  Is a corresponding author
  1. University of California, Berkeley, United States
14 figures, 9 tables and 11 additional files


Phylogenetic relationships and karyotype evolution in the D. obscura group.

Drosophila subobscura represents the ancestral karyotype condition consisting of five large and one small pair of telocentric chromosomes (termed Muller elements A-F). Phylogeny adapted from Gao et al. (2007). Chromosomal fusions and movement of centromeres along the chromosomes has resulted in different karyotypes in different species groups (Segarra et al., 1995; Schaeffer et al., 2008). Indicated along the tree are transitions of chromosome morphology, and the different subgroups of the obscura species group are indicated by gray shading (the subobscura, obscura, pseudoobscura, and affinis subgroup). Muller elements are color coded, and centromeres are shown as black ovals.

Figure 2 with 3 supplements
Genome organization in Drosophila obscura group flies.

Shown here are the assembled chromosome sizes, scaffolding stitch points, gene density, repeat content (percentage of bases repeat-masked in 100 kb windows) and H3K9me3 enrichment (50 kb windows) across the genome assemblies of D. subobscura, D. athabasca, D. lowei, D. pseudoobscura and D. miranda. Muller elements are color coded, gene density is shown as a black to green heatmap (genes per 100 kb), H3K9me3 enrichment is shown in orange, and repeat density is shown in teal (note that H3K9me3 enrichment and repeat density are plotted semi-transparent). Scaffolding stich points are indicated as vertical lines.

Figure 2—figure supplement 1
Illumina coverage over assembled chromosomes in reference genome assemblies of D. subobscura, D pseudoobscura, D. lowei and D. athabasca.

Male (blue) and female (red) Illumina coverage (log2) shown in 5 kb non-overlapping windows along each chromosome. Scaffolding stitch point locations shown in the outermost track. Note that stitch points often coincide with aberrant patterns in Illumina coverage highlighting difficult regions to assemble.

Figure 2—figure supplement 2
Nanopore or PacBio coverage over reference genome assemblies of D. subobscura, D pseudoobscura, D. lowei and D. athabasca.

Coverage (log2) shown in 50 kb non-overlapping windows along the genome assembly with different genomic partitions highlighted. Contigs < 50 kb not shown. Note, D. lowei data was derived from males, D. subobscura and D. athabasca was derived from females, and D. pseudoobscura was derived from a combination of males and females. Therefore, coverage over A-AD is expected to be lower in D. lowei and D. pseudoobscura relative to the autosomes (Mullers B, C, E, F). Further, in D. pseudoobscura, A-AD and Y coverage is not expected to be similar due to differences in the female/male contribution to the total read pool.

Figure 2—figure supplement 3
Hi-C association heatmaps of genome assemblies for D. subobscura, D pseudoobscura, D. lowei and D. athabasca.

Lines demarcate assembled chromosomes and unplaced or Y contigs. Note that pericentromeric regions typically show weak Hi-C associations with neighboring euchromatic regions due to their repetitiveness, thus producing ‘checkerboard’ patterns in the plot. Also of note, off-diagonal associations are apparent in many euchromatic arms in our hybrid D. athabasca assembly and thus identify chromosomal inversions (see Materials and methods).

Figure 3 with 2 supplements
Chromosome synteny and evolution.

(A) Conservation of Muller elements in the Drosophila genus. Orthologous single copy Drosophila melanogaster (Dmel) BUSCOs plotted on reference genome assemblies. Muller elements are color-coded based on D. melanogaster. (B) Comparisons of synteny between our genome assemblies. Muller elements are color-coded based on the D. subobscura genome. Each line represents a protein-coding gene. Ovals denote the location of the putative centromere (based on the location of centromere-associated satellite sequences, see Figure 4).

Figure 3—figure supplement 1
Whole genome alignments (MUMmer) of our Drosophila subobscura genome assembly (strain 14011–0131.10) with (A) D.subobscura strain ch-cu and (B) D. guanche.

We show the Muller element naming scheme on the X axis and the subobscura group chromosome naming scheme on the Y axis.

Figure 3—figure supplement 2
Whole genome alignments (MUMmer) of draft Drosophila bifasciata genome contigs with (A) both Muller A and D of D. subobscura and (B) highlighting the Muller A centromere seed regions.

Many small contigs in the D. bifasciata assembly map to the D. subobscura seed region which is indicative of a highly repetitive and more poorly assembled region (i.e., pericentromere) in D. bifasciata. (C) Alignments of D. bifasciata contigs with the fused Muller A-AD of D. pseudoobscura show sequence similarity over the large metacentric pericentromere of Muller A-AD.

Figure 4 with 5 supplements
Identification of centromere-associated satellite sequences.

(A) Histograms of most abundant satellites in assembled genomes. Repeat length refer to the size of the repeat unit. For each species apart from D. subobscura, a specific satellite (or higher-order variant of it as indicated by the same colors) is enriched. In D. miranda, a 99mer (in green) and four units of a unrelated 21mer (84 bp; in red) are the most abundant satellites, in D. pseudoobscura, four units of a similar 21mer (84 bp; in red) is most common, in D. lowei, three units of a similar 21mer (63 bp; in red) is most common, in D. athabasca, an unrelated 160mer (in pink) is the most common satellite, and in the D. guanche genome (Puerma et al., 2018), an unrelated 290mer (in black) is most common. No abundant satellite was identified in the assembled genome of D. subobscura. (B) Location of putative centromere-associated repeats (from panel A) in pericentromeric regions. In D. suboobscura a 12mer is highly enriched in raw sequencing reads. Shown is a 5 Mb fragment for each chromosome with the highest density of the satellite sequence (that is the putative centromere), and all unplaced scaffolds. (C) FISH hybridization confirms centromere location of identified satellites (same color coding as in A and B). Probes corresponding to the 21mer (Cy5; red) and 99mer (Cy3; green) were hybridized to both D. miranda and D. pseudoobscura; the 21mer showed a centromere location in both species, while the 99mer hybridized only to the centromeres of D. miranda. The 160mer (6FAM; red) localized to the centromeres of D. athabasca, and the 12mer (TYE665; red) to the centromeres of D. subobscura. Stronger hybridization signal supposedly correspond to higher repeat abundance at a particular genomic location.

Figure 4—figure supplement 1
Genomic distribution of inferred centromeric satellite sequences.
Figure 4—figure supplement 2
Short satellite DNAs in obscura group flies.

Shown is a heatmap of the results from k-Sseek analyses used to identify enriched satellite sequences. Only kmers >1 bp that constitute >10% of the total short satellite sequence in any one species are shown. White = not present, red = highly enriched.

Figure 4—figure supplement 3
Identification of centromere-associated satellite sequences from Nanopore and PacBio reads.

Shown are counts of satellite lengths identified directly from raw sequencing reads for each Drosophila species. Colored lines are drawn as in Figure 4 to highlight overlap between the results from TRF analyses and TideHunter analyses.

Figure 4—figure supplement 4
Additional fluorescent in situ hybridization images.

(A) To help visualize enrichment on all chromosomes in species with variable intensities, (A) shows only the color channel that identifies the 160mer and 12mer in D. athabasca and D. subobscura, respectively. Arrows show low intensity signal on Muller C in D. athabasca and an unknown Muller element in D. subobscura. (B) Shows replicates for each species and satellite placement during cell division.

Figure 4—figure supplement 5
Nanopore and Illumina sequencing coverage over unplaced contigs (Contig_6 and Contig_7) harboring arrays of putative D. subobscura centromeric-associated satellite sequence.
Figure 5 with 3 supplements
Emergence and loss of centromeres.

(A) Shown are homologous genes between D. subobscura (telocentric), D. athabasca (metacentric) and D. pseudoobscura (metacentric and telocentric) with H3K9me3 enrichment plotted along Muller A (red), B (green) and E (purple) in 50 kb windows. Genes identified in the pericentromere of metacentric chromosomes are shown with black lines. Genes identified in pericentromeres of metacentric chromosomes can be traced to two ‘seed regions’ each on the telocentric chromosome of D. suboboscura, and to paleocentromere regions in species that secondarily lost the metacentric centromere. (B) GC-content across D. subobscura Muller A, B and E. Seed regions have significantly lower GC-content compared to genomic background (Supplementary file 7).

Figure 5—figure supplement 1
Alignments of Muller A between D. subobscura (telocentric) with D. athabasca (metacentric) and H3K9me3 enrichment plotted in 50 kb windows above each chromosome.

Pericentromeric genes in D. pseudoobscura shown in black.

Figure 5—figure supplement 2
GC-content, the percentage of bases repeat-masked, and number of genes, in 10 kb non-overlapping windows across Muller A, B and E.

Seed regions are shown bounded by dashed lines. Note the orientation for the Muller elements are shown as in Figure 5B.

Figure 5—figure supplement 3
GC-content of different functional categories in seed and non-seed regions of Muller A, B and E of D. subobscura.
Karyotype and centromere evolution.

(A) Models for transitions between metacentric and telocentric chromosomes, either invoking pericentric inversions (top), or centromere repositioning (bottom) via the birth of a new centromere (lightning bolt) and death of the old centromere (skull and crossbones). The pericentromere is indicated by darker shading, the centromere as a white rectangle. (B) The syntenic location of genes adjacent to the centromere can allow us to distinguish between a simple inversion model vs. centromere relocation. The genes closest to the centromere of the telocentric chromosome (30 genes in panel C) are shown by different shading. (C) Dot plots for homologous genes (semi-transparent points) between telocentric and metacentric Muller elements (orange: Muller A-AD; purple: Muller E; green: Muller B). In 4 out of 5 cases, pericentric genes in the telocentric species are found in the non-pericentric regions of the metacentric species. Only Muller B between D. athabasca and pseudoobscura group flies (D. lowei is pictured) shows that the same genes are pericentric in both species (and thus support a simple inversion model).

Functional consequences of becoming pericentromeric.

(A) Metagene plots showing H3K9me3 enrichment for genes located in different parts of the genome in D. subobscura (top) and D. athabasca (bottom). (B) Patterns of gene expression for homologous genes in D. subobscura and D. athabasca, classified as whether they are part of the ‘seed’ region in D. subobscura that become part of the pericentromeric heterochromatin in D. athabasca or not. Expression patterns were not found to significantly differ between D. subobscura non-pericentromeric genes and seed genes, while seed orthologs located in the pericentromere of D. athabasca showed significantly higher expression than non-pericentromeric genes (Mann-Whitney U, p<0.0001).

Figure 8 with 2 supplements
Transposable element evolution across the genome.

Shown is the fraction of bases masked in 100 kb genomic windows for different transposable element families with the total TE fraction plotted above each chromosome.

Figure 8—figure supplement 1
Genomic distribution of transposable elements by species and Muller element.

The top 10 TE’s per Muller element, per species, are shown in descending order from top to bottom and ranked by their total contribution (bp) to each element. Each point represents a genomic location masked for an element.

Figure 8—figure supplement 2
De-novo estimates (dnaPipeTE) of transposable element frequencies in D. subobscura, D. athabasca, D. lowei and D. pseudoobscura.
Appendix 1—figure 1
Bandage plot of a typical Drosophila genome assembly.

The left panel is a visualization of the genome graph (.gfa file) from a canu assembly with the node name for each contig and Illumina coverage displayed in text overtop each contig. Each contig in the assembly is shaded by the amount of male Illumina whole genome sequencing coverage (see Materials and methods). In this example, red contigs are likely autosomal (~40×) while darker contigs have less coverage and indicate either putative sex chromosome contigs (~20×) or putative contaminant contigs (<<20×). (B) Shown is a zoomed in image of 2 nodes (535 and 511) in the assembly with exceptionally low male (top) and female (bottom) Illumina coverage (<0.1×). By also visualizing the top BLAST hits for these contigs (not shown), we were able to identify these contigs as belonging to an Acetobacter species and were thus contaminants marked for removal from the assembly. Contigs with exceptionally high Illumina coverage were also scrutinized thoroughly but these can arise for multiple reasons, including mtDNA contigs, collapsed regions of the target genome (e.g., rDNA genes or centromeric satellite sequence), or non-target contaminant contigs.

Appendix 1—figure 2
BAC clone sequencing confirms centromere and pealeocentromere assembly.

Several independent BAC clones map to the assembly of our paleocentromeres in D. miranda.

Appendix 1—figure 3
Drosophila athabasca EB metacentric chromosome Hi-C associations and scaffolding.

Our EB assembly (Appendix 1—table 2) was superior to our EA assembly (Appendix 1—table 3) and long contigs from our EB assembly extended at least a megabase into the pericentromere for all metacentric chromosomes. Shown above are Hi-C association heatmaps from Juicebox (Durand et al., 2016a). Green boxes denote contigs. The pericentromeric region is highlighted in purple, and note the clear transition in Hi-C associations between euchromatic and heterochromatic regions. We used EA Hi-C data to scaffold the EB assembly. The EA and EB semispecies harbor inversions that differentiate the semispecies and we identified numerous inversions when mapping EA Hi-C to the EB genome assembly. Thus, the exceptionally long EB contigs that extend into the pericentromeric region allowed us to accurately scaffold chromosomes while simultaneously identifying inversions along the euchromatic arms.

Appendix 1—figure 4
Drosophila athabasca EA assembly Hi-C associations and scaffolding.

(A) Hi-C scaffolding of the EA assembly recovered long blocks of contigs we identified as Muller elements from our EB assembly. Blue boxes bound putative Muller element boundaries; green boxes denote contigs. Black arrows show contigs that span the euchromatic/heterochromatic transition and allow for confident scaffolding into pericentromeric regions. Blue arrows show regions where contigs failed to assemble across the transition making scaffolding based on Hi-C associations more challenging. (B) Shown is a zoomed in image of Muller C scaffolding. Here, a contig spans the euchromatic/heterochromatic transition on Muller C and we used this scaffolded in our Dath_EB_hybrid assembly. Note the lack of evidence for inversions in the Hi-C heatmaps since here we are using EA Hi-C with an EA assembly.

Appendix 1—figure 5
Whole genome alignment of our D. lowei assembly (Y axis) to the published Drosophila miranda genome.
Appendix 1—figure 6
Whole genome alignment of our assembly (Y axis) to the published Drosophila pseudoobscura genome assembly (version 3.04).

Scaffolds in the published assembly that are near chromosome length (i.e., Muller E and Muller C) largely agree with our scaffolds. However, our assembly extends the assembled length of these chromosomes with far less scaffolding. For Muller B and Muller A-AD, our scaffolded chromosomes show large stretches of collinearity with the fragmented published assembly, with the exception of a few inverted regions. Our Hi-C data and association heatmap (see RESULTS) argue that our assembly orientation is likely the correct one and provide orientation to the five large scaffolds of Muller B and 8 scaffolds of Muller A-AD in the reference genome. The paleocentromeric region on Muller E is assembled in the current reference genome, but our assembly contains additional sequence not present in the published assembly (not shown).



Table 1
Total length (bp) of each assembled Muller element in each species, the number of contigs, and estimated length (Mb) of pericentromere sequence.
SpeciesMuller AMuller DMuller A-ADMuller BMuller CMuller EMuller FTotal
D. subobscura
chromosome (bp)24,182,86523,815,339n/a25,941,76920,343,35330,159,1541,505,893125,948,373
pericentromere (Mb)1.91.7n/a2.81.11.1n/a8.6
D. athabasca
chromosome (bp)n/an/a67,112,82252,101,12724,053,77542,973,4901,524,173187,765,387
pericentromere (Mb)n/an/a14.122.82.511.2n/a50.6
D. lowei
chromosome (bp)n/an/a73,251,62331,032,89724,430,08748,132,7061,606,711178,454,024
pericentromere (Mb)n/an/a17.22.23.515.1n/a38
D. miranda
chromosome (bp)n/an/a77,621,84432,539,84125,306,19135,263,3832,366,016173,097,275
pericentromere (Mb)n/an/a20.53.43.42n/a29.3
D. pseudoobscura
chromosome (bp)n/an/a67,434,67430,637,80322,641,56032,023,2971,941,385154,678,719
pericentromere (Mb)n/an/a14.
Table 2
Putative centromeric satellite lengths inferred from Tandem Repeat Finder (Benson, 1999), k-Seek (Wei et al., 2014), and TideHunter (Gao et al., 2019) for each Drosophila species.

HOR = higher order repeat.

Tandem Repeat Finderk-SeekTideHunter
D. subobscurano candidate12 bp12 bp, 107 bp
D. athabasca160 bp11 bp11 bp, 160 bp
D. lowei63 bp(HOR of 21 bp)no candidate21 bp
D. miranda99 bp, 84 bp(HOR of 21 bp)no candidate99 bp, 84 bp(HOR of 21 bp)
D. pseudoobscura84 bp(HOR of 21 bp)no candidate168 bp(HOR of 21 bp)
Table 3
Transposable elements in the D. obscura species group.
SpeciesTETotal bp masked% of genome masked
D. subobscura
total TE's7,572,8066.0%
D. athabasca
total TE's42,382,29622.6%
D. lowei
total TE's45,307,00625.4%
D. pseudoobscura
total TE's29,907,40719.3%
D. miranda
total TE's42,680,23424.7%
Key resources table
Reagent type
or resource
DesignationSource or
Strain, strain background (Drosophila subobscura male and female)14011–0131.10National Drosophila Species Stock Center (Cornell University)stock center number: 14011–0131.10
Strain, strain background (Drosophila pseudoobscura male and female)MV2-25National Drosophila Species Stock Center (Cornell University)stock center number: 14011–0121.94
Biological sample (Drosophila lowei)Jillo6isofemale line (deceased)
Biological sample (Drosophila athabasca EA)PA60isofemale line (deceased)
Biological sample (Drosophila athabasca EB)NJ28isofemale line (deceased)
Commercial assay or kitTruSeq Stranded RNA kitIlluminacat # 20020595
Commercial assay or kitTruSeq DNA Nano Prep kitIlluminacat # 20015965
Commercial assay or kitDNeasy KitQiagencat # 69504
Commercial assay or kitBlood and Cell Culture DNA Midi KitQiagencat # 13343
Commercial assay or kitGentra Puregene Tissue KitQiagencat # 158667
Commercial assay or kitLigation sequencing kitNanoporeSQK-LSK108
Commercial assay or kitRapid sequencing kitNanoporeSQK-RAD004
Commercial assay or kitQuick DNA plus Midi kitZymocat # D4075
Commercial assay or kitLigation sequencing kitNanoporeSQK-LSK109
Commercial assay or kitSMARTer Universal Low Input DNA-seq kitTakara/Rubicon BioR400676
Chemical compound, drugH3K9me3 antibodyDiagenode
Software, algorithmcanuKoren et al., 2017
Software, algorithmMUMmerKurtz et al., 2004
Software, algorithmWTDBG2Ruan and Li, 2019
Software, algorithmminimap2Li, 2018
Software, algorithmBandageWick et al., 2015
Software, algorithmBWALi and Durbin, 2009
Software, algorithmSAMtoolsLi et al., 2009
Software, algorithmbedtoolsQuinlan and Hall, 2010
Software, algorithmQUIVERChin et al., 2013
Software, algorithmPILONWalker et al., 2014
Software, algorithmRACONVaser et al., 2017
Software, algorithmJuiceboxDurand et al., 2016a
Software, algorithmJuicerDurand et al., 2016b
Software, algorithm3D-DNADudchenko et al., 2017
Software, algorithmGATK UnifiedGenotyperDePristo et al., 2011
Software, algorithmREPdenovoChu et al., 2016
Software, algorithmRepeatMaskerSmith et al., 2005
Software, algorithmMAKERCampbell et al., 2014
Software, algorithmHiSat2Kim et al., 2015
Software, algorithmStringTiePertea et al., 2015
Software, algorithmBUSCOSimão et al., 2015
Software, algorithmTandem Repeat FinderBenson, 1999
Software, algorithmk-SeekWei et al., 2014; https://github.com/weikevinhc/k-seek
Software, algorithmTideHunterGao et al., 2019
Appendix 1—table 1
Summary statistics and BUSCO results from the genome assembly process of Drosophila subobscura.
Canu/WTDBG2 onlyCanu/WTDBG2 + Racon (3x)Canu/WTDBG2 + Racon (3x) + Pilon (1x)Canu/WTDBG2 + Racon (3x) + Pilon (2x)Final Hi-C scaffolded Dsub_1.0
Max Contig25,648,09625,831,58225,849,83725,836,392*
Assembly Size128,396,075129,296,034129,434,338129,376,892126,232,139
Number of Contigs62616161*
Complete BUSCOs9871017105710601062
Complete Single Copy BUSCOs9841011104710501054
% BUSCOs complete92.6%95.4%99.2%99.4%99.6%
Appendix 1—table 2
Summary statistics and BUSCO results from genome assembly process of Drosophila athabasca EB
Canu + bandageCanu + Quiver (2x)Canu + Quiver (2x) + Pilon (2x)Final Hi-C scaffolded Dath_EB_1.0Final Hi-C scaffolded Dath_EB_hybrid
Max Contig34,554,38734,873,65134,879,405**
Assembly Size195,713,124192,740,469192,655,157192,660,667192,054,219
Number of Contigs199133133**
Complete BUSCOs10431054105710571060
Complete Single Copy BUSCOs10231046104810481052
% BUSCOs complete97.9%98.9%99.2%99.2%99.5%
Appendix 1—table 3
Summary statistics and BUSCO results from genome assembly process of Drosophila athabasca EA
Canu + bandageCanu + Quiver (2x) + Pilon (2x)Final Hi-C scaffolded Dath_EA_1.0
Max Contig20,031,44220,052,631*
Assembly Size193,369,473193,423,818193,434,778
Number of Contigs348348*
Complete BUSCOs104110551055
Complete Single Copy BUSCOs102110411041
% BUSCOs complete97.7%99.0%99.0%
Appendix 1—table 4
Summary statistics and BUSCO results from genome assembly process of Drosophila lowei.
Canu/WTDBG2 onlyCanu/WTDBG2 + Racon (3x)Canu/WTDBG2 + Racon (3x) + Pilon (1x)Canu/WTDBG2 + Racon (3x) + Pilon (2x)Final Hi-C scaffolded Dlow_1.0
Max Contig20,816,73020,907,87920,946,52020,929,419*
Assembly Size192,748,718191,586,436191,816,863191,620,915184,313,494
Number of Contigs943726726726*
Complete BUSCOs936975103710391036
Complete Single Copy BUSCOs920960101610211022
% BUSCOs complete87.8%91.5%97.3%97.5%97.2%
Appendix 1—table 5
Summary statistics and BUSCO results from the genome assembly process of Drosophila pseudoobscura
Canu + Racon (3x)Canu + Racon (3x) + Pilon (1x)Canu + Racon (3x) + Pilon (2x)Final Hi-C scaffolded Dpse_1.0
Max Contig20,387,16920,334,96520,319,488*
Assembly Size194,744,540194,172,171193,935,436193,980,066
Number of Contigs481481481*
Complete BUSCOs974105210571055
Complete Single Copy BUSCOs968104310481048
% BUSCOs complete91.4%98.6%99.2%99.0%

Additional files

Supplementary file 1

DNA sequence data generated for this study.

Supplementary file 2

Drosophila strains used for genome assembly.

Supplementary file 3

BUSCO results for assembled genomes.

For D. athabasca, BUSCO scores for the EB assembly are shown.

Supplementary file 4

Number of protein coding gene models from MAKER annotations for each genome assembly and Muller element.

Supplementary file 5

Average percentage of bases repeat-masked in each pericentromere.

Supplementary file 6

Inferred centromeric satellite sequence and fluorescent in situ hybridization probes.

Supplementary file 7

Comparison of gene density, repeat density, and GC-content (%), between seed and non-seed regions of Muller A, B and E in D. subobscura.

P-values for comparisons, Mann-Whitney U.

Supplementary file 8

The most common repeat families in each species and amount of masked sequence (bp).

Supplementary file 9

The top 10 TE’s from each Muller element and amount of masked sequence (bp).

Supplementary file 10

Hi-C data summary.

Transparent reporting form

Download links

A two-part list of links to download the article, or parts of the article, in various formats.

Downloads (link to download the article as PDF)

Open citations (links to open the citations from this article in various online reference manager services)

Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)

  1. Ryan Bracewell
  2. Kamalakar Chatla
  3. Matthew J Nalley
  4. Doris Bachtrog
Dynamic turnover of centromeres drives karyotype evolution in Drosophila
eLife 8:e49002.