Highly contiguous assemblies of 101 drosophilid genomes

  1. Bernard Y Kim (corresponding author)
  2. Jeremy R Wang
  3. Danny E Miller
  4. Olga Barmina
  5. Emily Delaney
  6. Ammon Thompson
  7. Aaron A Comeault
  8. David Peede
  9. Emmanuel RR D'Agostino
  10. Julianne Pelaez
  11. Jessica M Aguilar
  12. Diler Haji
  13. Teruyuki Matsunaga
  14. Ellie E Armstrong
  15. Molly Zych
  16. Yoshitaka Ogawa
  17. Marina Stamenković-Radak
  18. Mihailo Jelić
  19. Marija Savić Veselinović
  20. Marija Tanasković
  21. Pavle Erić
  22. Jian-Jun Gao
  23. Takehiro K Katoh
  24. Masanori J Toda
  25. Hideaki Watabe
  26. Masayoshi Watada
  27. Jeremy S Davis
  28. Leonie C Moyle
  29. Giulia Manoli
  30. Enrico Bertolini
  31. Vladimír Košťál
  32. R Scott Hawley
  33. Aya Takahashi
  34. Corbin D Jones
  35. Donald K Price
  36. Noah Whiteman
  37. Artyom Kopp
  38. Daniel R Matute (corresponding author)
  39. Dmitri A Petrov (corresponding author)
  1. Department of Biology, Stanford University, United States
  2. Department of Genetics, University of North Carolina, United States
  3. Department of Pediatrics, Division of Genetic Medicine, University of Washington and Seattle Children’s Hospital, United States
  4. Department of Evolution and Ecology, University of California Davis, United States
  5. School of Natural Sciences, Bangor University, United Kingdom
  6. Biology Department, University of North Carolina, United States
  7. Department of Integrative Biology, University of California, Berkeley, United States
  8. Molecular and Cellular Biology Program, University of Washington, United States
  9. Department of Biological Sciences, Tokyo Metropolitan University, Japan
  10. Faculty of Biology, University of Belgrade, Serbia
  11. University of Belgrade, Institute for Biological Research "Siniša Stanković", National Institute of Republic of Serbia, Serbia
  12. School of Ecology and Environmental Science, Yunnan University, China
  13. Hokkaido University Museum, Hokkaido University, Japan
  14. Biological Laboratory, Sapporo College, Hokkaido University of Education, Japan
  15. Graduate School of Science and Engineering, Ehime University, Japan
  16. Department of Biology, University of Kentucky, United States
  17. Department of Biology, Indiana University, United States
  18. Neurobiology and Genetics, Theodor Boveri Institute, Biocentre, University of Würzburg, Germany
  19. Institute of Entomology, Biology Centre, Academy of Sciences of the Czech Republic, Czech Republic
  20. Department of Molecular and Integrative Physiology, University of Kansas Medical Center, Stowers Institute for Medical Research, United States
  21. School of Life Science, University of Nevada, United States
7 figures, 2 tables and 7 additional files


Figure 1 with 4 supplements
Nanopore-based assemblies are highly contiguous and complete.

(A,B) Assembly contiguity is compared to the D. melanogaster v6.22 reference genome (blue) as well as five recently published, highly contiguous Illumina assemblies (red lines, D. birchii, D. bocki, D. bunnanda, D. kanapiae, D. truncata; Bronski et al., 2020). (A) Nx curves, or the (y-axis) size of each contig when contigs are sorted in descending size order, in relation to the (x-axis) cumulative proportion of the genome assembly that is covered. (B) The distribution of contig N50, the size of the contig at which 50% of the assembly is covered. (C) Assembly completeness assessed by BUSCO v4.0.6 (Seppey et al., 2019). Note, D. equinoxialis was evaluated with BUSCO v4.1.4 due to an issue with v4.0.6. L. stackelbergi has >10% missing BUSCOs. Individual assembly summary statistics are provided in Supplementary file 2.
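The Nx and N50 definitions in panels (A) and (B) can be illustrated with a minimal sketch. The function name `nx` and the toy contig lengths are illustrative assumptions and are not taken from the authors' workflow.

```python
def nx(contig_lengths, x=50):
    """Return Nx: the length of the contig at which x% of the assembly is covered.

    Contigs are sorted in descending size order and accumulated until the
    running total reaches x% of the total assembly length.
    """
    if not contig_lengths:
        raise ValueError("contig_lengths must not be empty")
    if not 0 < x <= 100:
        raise ValueError("x must be in the interval (0, 100]")
    lengths = sorted(contig_lengths, reverse=True)
    target = sum(lengths) * x / 100
    running = 0
    for length in lengths:
        running += length
        if running >= target:
            return length


# Toy assembly (hypothetical lengths, arbitrary units)
toy_contigs = [10, 8, 6, 4, 2]
print(nx(toy_contigs, 50))  # contig N50
print(nx(toy_contigs, 90))  # contig N90
```

Evaluating `nx` over a range of x values gives the points of an Nx curve like the one plotted in panel (A).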

Figure 1—figure supplement 1
Nanopore-based assemblies compare favorably to representative genomes on NCBI.

(A) The contig N50 of the representative genome assembly for 75 different species on NCBI (right) is compared to the contig N50s of our assemblies (left). (B) The BUSCO (Simão et al., 2015) completeness (sum of complete single-copy and complete duplicated) of the NCBI assemblies is compared to the BUSCO completeness of our assemblies. The list of drosophilid genomes, contig N50s, and BUSCO completeness statistics were obtained from Hotaling et al., 2021. Note, BUSCO v4 was used for both genome assessments, but the OrthoDB v10 (Kriventseva et al., 2019) Diptera gene set was used to evaluate our assemblies while the OrthoDB v10 Insecta set was used to evaluate the NCBI assemblies.

Figure 1—figure supplement 2
Large improvements in assembly contiguity from an updated assembly workflow.

Points on the left depict contig N50s from Miller et al., 2018. Points on the right depict contig N50s with our updated assembly workflow. In the updated workflow, ONT raw data are basecalled with Guppy in high-accuracy mode and assembled with Flye v2.6. For D. bipectinata, D. biarmipes, and D. willistoni (depicted with the light orange lines), new ONT sequencing was performed, optimized for longer reads and using a different strain than in Miller et al., 2018. For all other species, the same raw data were used for both assembly workflows.

Figure 1—figure supplement 3
Contiguity metrics standardized by the estimated genome size.

(A) NGx curves, or the (y-axis) size of each contig when contigs are sorted in descending size order, in relation to the (x-axis) cumulative proportion of the estimated genome size that is covered. (B) The distribution of contig NG50, the size of the contig at which 50% of the estimated genome is accounted for.
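As a rough sketch of how the NGx curve differs from Nx, the cumulative length below is divided by an estimated genome size rather than by the assembly length. The function names and the toy numbers are assumptions made for illustration only.

```python
def ngx_curve(contig_lengths, genome_size):
    """Return (fraction of estimated genome covered, contig length) points.

    Contigs are sorted in descending size order; the x value of each point
    is the cumulative contig length divided by the estimated genome size.
    """
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    points = []
    covered = 0
    for length in sorted(contig_lengths, reverse=True):
        covered += length
        points.append((covered / genome_size, length))
    return points


def ng50(contig_lengths, genome_size):
    """Return the contig length at which 50% of the estimated genome is covered.

    Returns None when the assembly covers less than half of the estimate.
    """
    for fraction, length in ngx_curve(contig_lengths, genome_size):
        if fraction >= 0.5:
            return length
    return None


# Hypothetical contigs and genome size estimate (arbitrary units)
print(ng50([10, 8, 6, 4, 2], genome_size=40))
```

Unlike N50, NG50 can be undefined, which is why the sketch returns `None` when the assembly is much smaller than the estimated genome.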

Figure 1—figure supplement 4
Estimated genome size is similar to assembly size.

The genome size estimated from read coverage over known single-copy genes in each assembly (x-axis) is compared to the length of each final assembly (y-axis). The dotted line is the 1:1 line.
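The caption does not spell out the estimator, so the following is only a sketch of one common coverage-based approach: dividing the total number of sequenced bases by the typical read depth over single-copy genes. The use of the median, the function name, and the example numbers are all assumptions rather than the authors' stated procedure.

```python
from statistics import median


def estimate_genome_size(total_read_bases, single_copy_depths):
    """Estimate genome size as total sequenced bases / single-copy depth.

    single_copy_depths holds one mean read depth per single-copy gene;
    the median is used here (an assumption) to limit the effect of outliers.
    """
    if not single_copy_depths:
        raise ValueError("at least one depth value is required")
    depth = median(single_copy_depths)
    if depth <= 0:
        raise ValueError("depth must be positive")
    return total_read_bases / depth


# Hypothetical run: 6 Gb of reads, about 30x depth over single-copy genes
print(estimate_genome_size(6_000_000_000, [28, 30, 32, 30, 31]))
```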

Figure 2 with 1 supplement
Estimated heterozygosity in the data used for genome assembly.

Per-site SNP heterozygosity (number of heterozygous SNPs/number of callable sites) is plotted for each of the 101 assembled lines. Blue dots represent heterozygosity estimates from Nanopore reads with PEPPER-Margin-DeepVariant (Shafin et al., 2021). Orange dots represent heterozygosity estimates from short reads with BCFtools (Li, 2011). The genomes on the right are for species that did not have available short-read data. Numerical values for these estimates are provided in Supplementary file 3.

Figure 2—figure supplement 1
Assembly contiguity is not related to sample heterozygosity.

Per-site estimates of heterozygosity are plotted against the contig N50 for all assemblies. No significant correlation (Pearson’s correlation p=0.30) was observed.

Figure 3 with 2 supplements
Nanopore-based Drosophila assemblies are accurate, particularly in coding regions.

(A) Genome-wide, Phred quality scores estimated with the reference-free, k-mer based approach implemented in Merqury (Rhie et al., 2020). Merqury requires a short-read dataset to perform the evaluation. Filled circles represent QV estimates with short-read data from the same strain used for Nanopore sequencing, and empty circles denote estimates using short-read data from a different strain than used for Nanopore sequencing. (B, C, D) Phred quality score cutoffs for the bottom 10th percentile of 100 kb genomic windows, as evaluated with a reference-based approach, in coding sequences only. Quality scores are capped at 60 for visualization purposes. At least 90% of 100 kb windows are this accurate. Only Nanopore assemblies with an NCBI RefSeq genome counterpart of the same strain were evaluated. Accuracy is shown for SNVs (B), insertions (C), and deletions (D) separately. Additional details on quality score estimates are provided in Figure 3—figure supplement 1 and Supplementary file 4.

Figure 3—figure supplement 1
Variation in sequence accuracy within the genome assemblies.

Phred-scaled quality scores were computed by a reference-based comparison in non-overlapping 100 kb windows. All variants were considered together (accuracy), then SNVs, insertions, and deletions separately. All sequences in each window were considered together (all) then coding sequences, introns, intergenic regions, and repeats separately. All scores above QV50 were set to QV50 for visualization purposes. The cross denotes the mean score, weighted by the bases considered for each window. The dot and both whiskers denote the median, 10th percentile, and 90th percentile scores across all windows, respectively. Only Nanopore assemblies with an NCBI RefSeq genome counterpart of the same strain were evaluated.
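For readers less familiar with Phred scaling, the per-window scores and the base-weighted mean described above can be sketched as follows. The standard definition QV = -10·log10(error rate) is used; the function names, the `(errors, bases)` window representation, and the example counts are illustrative assumptions.

```python
import math


def phred_qv(errors, bases, cap=50.0):
    """Phred-scaled quality for one window, capped for visualization.

    QV = -10 * log10(errors / bases); windows with no errors get the cap.
    """
    if bases <= 0:
        raise ValueError("bases must be positive")
    if errors == 0:
        return cap
    return min(cap, -10 * math.log10(errors / bases))


def weighted_mean_qv(windows, cap=50.0):
    """Mean QV across windows, weighted by the bases considered per window.

    windows is an iterable of (errors, bases) pairs.
    """
    windows = list(windows)
    total_bases = sum(bases for _, bases in windows)
    weighted = sum(phred_qv(errors, bases, cap) * bases for errors, bases in windows)
    return weighted / total_bases


# Hypothetical windows: (number of differences, bases considered)
example_windows = [(1, 1000), (0, 3000)]
print(weighted_mean_qv(example_windows))
```

The `cap` argument stands in for the visualization ceiling; the main Figure 3 caption states a cap of 60, while this supplement sets scores above QV50 to QV50.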

Figure 3—figure supplement 2
Large insertions account for nearly all differences between the Nanopore-based and reference D. melanogaster assembly.

The distribution of indel differences between our Nanopore-based assembly and the reference is shown. Each color represents a unique indel per FlyBase protein-coding gene. Note, the x-axis scale of insertions is much larger than that of deletions. Additional details on each indel are provided in Supplementary file 5.

Figure 4
Gene content of Muller elements is conserved across drosophilids while gene order changes.

Each node in this graph represents an orthologous marker corresponding to single-copy orthologs annotated by BUSCO v4 (Seppey et al., 2019). An edge between two nodes represents the number of times that BUSCO pair is directly connected within an assembly. Each BUSCO is colored by the chromosome arm in D. melanogaster that it is found on. The ForceAtlas2 (Jacomy et al., 2014) graph layout algorithm was used for visualization.
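A minimal sketch of how such edge weights could be tallied is given below. It assumes, purely for illustration, that each assembly is available as a list of contigs, with each contig given as the ordered list of BUSCO identifiers found on it; the function name and the toy identifiers are hypothetical.

```python
from collections import Counter


def adjacency_edge_weights(assemblies):
    """Count how often each pair of BUSCOs is directly adjacent.

    assemblies maps an assembly name to a list of contigs, and each contig
    is the ordered list of BUSCO IDs along that contig. Pairs are stored as
    sorted tuples so that orientation does not matter.
    """
    weights = Counter()
    for contigs in assemblies.values():
        for order in contigs:
            for left, right in zip(order, order[1:]):
                if left != right:
                    weights[tuple(sorted((left, right)))] += 1
    return weights


# Hypothetical toy input with three markers in two assemblies
toy = {
    "assembly_A": [["b1", "b2", "b3"]],
    "assembly_B": [["b3", "b2"], ["b1"]],
}
for pair, weight in sorted(adjacency_edge_weights(toy).items()):
    print(pair, weight)
```

The resulting weighted edge list is the kind of input a force-directed layout works from. The layout step itself is not shown here.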

Figure 5 with 2 supplements
Repeat content varies greatly between drosophilid groups.

For each species, the proportion of each genome annotated with a particular repeat type is depicted. Species relationships were inferred by randomly selecting 250 of the set of BUSCOs (Seppey et al., 2019) that were complete and single-copy in all assemblies. RAxML-NG (Kozlov et al., 2019) was used to build gene trees for each BUSCO, and ASTRAL-MP (Yin et al., 2019) was then used to infer a species tree. Repeat annotation was performed with RepeatMasker (Smit et al., 2013) using the Dfam 3.1 (Hubley et al., 2016) and RepBase RepeatMasker edition (Bao et al., 2015) databases. ASTRAL local posterior probabilities are reported at each node.
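The marker selection step described above (keeping BUSCOs that are complete and single-copy in every assembly, then drawing 250 at random) might look like the following sketch. The input structure, the function name, and the fixed seed are assumptions for illustration; tree building with RAxML-NG and ASTRAL-MP is not shown.

```python
import random


def sample_shared_buscos(single_copy_by_assembly, n=250, seed=0):
    """Randomly pick n BUSCOs that are complete and single-copy everywhere.

    single_copy_by_assembly maps an assembly name to the collection of BUSCO
    IDs scored as complete and single-copy in that assembly.
    """
    if not single_copy_by_assembly:
        raise ValueError("no assemblies provided")
    id_sets = [set(ids) for ids in single_copy_by_assembly.values()]
    shared = set.intersection(*id_sets)
    if len(shared) < n:
        raise ValueError(f"only {len(shared)} shared BUSCOs, cannot sample {n}")
    # Sort first so the draw is reproducible for a given seed.
    return random.Random(seed).sample(sorted(shared), n)


# Hypothetical toy input with made-up marker names
toy = {
    "sp1": ["m1", "m2", "m3", "m4"],
    "sp2": ["m2", "m3", "m4", "m5"],
    "sp3": ["m1", "m2", "m3", "m4"],
}
print(sample_shared_buscos(toy, n=2))
```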

Figure 5—figure supplement 1
Assembly contiguity is not determined by repeat content.

There is no relationship (Spearman’s ρ=0.036, p=0.725) between repeat content (as annotated by RepeatMasker) in a genome and the contiguity of the resulting assembly.

Figure 5—figure supplement 2
The non-repetitive and repetitive portions of the genome both contribute to genome size differences between drosophilids.

Phylogenetically independent contrasts (PICs) are shown for the number of bases in each genome not annotated as repetitive sequence (x-axis) and the number annotated as repeat by RepeatMasker (y-axis). The red dotted line is the best-fitting line through the origin. A positive relationship between the non-repetitive and repetitive portions of the genome is observed (Spearman’s ρ=0.679, p<2.2e-16), suggesting that both play a role in determining the genome size of drosophilids.

Figure 6
Highly contiguous assemblies can be obtained with lower coverage of ultra-long reads.

The NGx curve is shown for Drosophila jambulina assemblies at varying levels of coverage. The length of the assembly with the full data is assumed to be the genome size. Read sets used for each assembly were obtained by randomly downsampling the basecalled reads (read N50 ~27.5 kb) to varying (5× to 30×) depth of coverage. Proportionally, these read sets contain ~55% of total sequenced bases in reads longer than 25 kb, ~25% of bases in reads longer than 50 kb, and ~7% of bases in reads longer than 100 kb. Near-chromosome-scale assemblies (N50 > 20 Mb) were achievable even at 15× to 20× depth with this read length distribution. This corresponds to approximately 8× to 10× depth in reads longer than 25 kb.
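The downsampling and read-length summaries described above can be sketched as follows. This is a simplified illustration that works on read lengths only; the function names, the shuffle-then-accumulate strategy, and the toy numbers are assumptions rather than the authors' procedure.

```python
import random


def downsample_reads(read_lengths, genome_size, depth, seed=0):
    """Randomly keep reads until roughly depth x genome_size bases are reached."""
    order = list(read_lengths)
    random.Random(seed).shuffle(order)
    target = depth * genome_size
    kept, total = [], 0
    for length in order:
        if total >= target:
            break
        kept.append(length)
        total += length
    return kept


def fraction_of_bases_in_long_reads(read_lengths, min_length):
    """Fraction of all sequenced bases found in reads longer than min_length."""
    total = sum(read_lengths)
    if total == 0:
        raise ValueError("no sequenced bases")
    return sum(length for length in read_lengths if length > min_length) / total


# Hypothetical read set: 100 reads of 1 kb on a 10 kb "genome", sampled to 5x
subset = downsample_reads([1000] * 100, genome_size=10_000, depth=5)
print(len(subset), sum(subset))
print(fraction_of_bases_in_long_reads([10, 20, 30, 40], min_length=25))
```

As a check on the caption's arithmetic, 55% of 15× is a little over 8×, which agrees with the stated lower bound for depth in reads longer than 25 kb.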

Figure 7
Flow chart depiction of the assembly pipeline.


Table 1
Species and strain information for all samples assembled for this work.

Note: Species group and subgroup information is taken from the NCBI Taxonomy Browser with slight modifications following O'Grady and DeSalle, 2018. Strain names along with corresponding NDSSC and Kyoto DGRC stock center numbers are provided to the best of our knowledge. See Supplementary file 1 and Supplementary file 6 for detailed information on samples and data. When multiple lines of a species are listed, * denotes the preferred assembly.

| Subgenus | Group | Subgroup | Species | Sex | Strain name | NDSSC | Kyoto DGRC | Additional notes |
SophophoramelanogastermelanogasterD. melanogasterMFISO-1 GENOME14021-0231.36NABDGP reference strain
D. mauritianaFNA14021-0241.01NAMiller et al., 2018
D. simulansFNA14021-0251.006NAMiller et al., 2018
D. sechelliaFNA14021-0248.01NAMiller et al., 2018
D. teissieri *M273.3NANA
D. teissieriMCT02NANA
D. yakubaFNA14021-0261.01NAMiller et al., 2018
D. erectaFNA14021-0224.01NAMiller et al., 2018
eugracilisD. eugracilisFNA14026-0451.02NAMiller et al., 2018
suzukiiD. subpulchrellaML1NANA
D. biarmipesMF361.0 iso1 l-11 GENOME strain 114023-0361.10NAmodENCODE strain
takahashiiD. takahashiiFIR98-3 E-12201NAE-912201inbred derivative of Ehime stock IR98-3
ficusphilaD. ficusphilaF631.0-iso1 l-10 GENOME14025-0441.05NAmodENCODE strain
rhopaloaD. carrolliMFKB866NANA
D. rhopaloaMFBaVi067 GENOME14029-0021.01E-24701modENCODE strain
D. kurseongensisFSaPa58NANA
D. fuyamaiFKB-121714029-0011.01NA
elegansD. elegansFHK0461.03 GENOME14027-0461.03NAmodENCODE strain
suzukiiD. oshimaiMMT-04NANA
montiumD. bocquetiMYAK3_mont-66NANA
D. sp aff chauvacaeMmont_up-71NANA
D. jambulinaMFst-214028-0671.01NA
D. kikkawaiF561.0-iso4 l-10 GENOME14028-0561.14NAmodENCODE strain
D. rufaFEH091 iso-C L_3NA914802inbred derivative of Ehime stock EH091
D. triaurariaFNA14028-0691.9NAMiller et al., 2018; previously mis-identified as D. kikkawai
ananassaeD. malerkotliana pallensFpalQ-isoGNANA
D. malerkotliana malerkotlianaMFmal0-isoC14024-0391.00NAinbred derivative of strain 14024-0391.00
D. bipectinataMF4-4-2-3-1-1-1-1-1 BackUp14024-0381.04NAInbred derivative of NDSSC strain
D. parabipectinataMFpar2-isoB14024-0401.02NAinbred derivative of strain 14024-0401.02 (now extinct)
D. pseudoananassae pseudoananassaeFWau 125NANA
D. pseudoananassae nigrensFVT04-31NANA
D. ananassaeF14024-0371.13NANAMiller et al., 2018
D. variansMFCKM15-L1NANA
D. ercepeaceMF164-1414024-0432.00NA
obscuraobscuraD. ambiguaMR42NANAisofemale strain from the wild
D. tristisMD2NANAisofemale strain from the wild
D. obscuraMBZ-5NANAisofemale strain from the wild
D. subobscuraMKüsnachtNANAstandard laboratory strain
pseudoobscuraD. persimilisFNA14011-0111.01NAMiller et al., 2018
D. pseudoobscuraFNA14011-0121.94NAMiller et al., 2018
willistoniwillistoniD. willistoni (Uruguay) *ML-G314030-0811.17NA
D. willistoniFNA14030-0811.00NAMiller et al., 2018
D. paulistorum L06 *M(Heed) H66.1C14030-0771.06NA
D. paulistorum L12ML1214030-0771.12NA
D. tropicalisM(Heed) H65.214030-0801.00NA
D. insularisMjp01iNANAisofemale line from J. Powell
bocainensisD. sucineaM49.1514030-0791.01NA
D. sucinea**MH176.1014030-0761.01NANDSSC strain is misidentified as D. nebulosa
saltanssaltansD. saltansM(Heed) H180.4014045-0911.00NA
D. prosaltansM(Heed) H29.614045-0901.02NA
neocordataD. neocordataM2536.714041-0831.00NA
sturtevantiD. sturtevantiFH191.2314043-0871.01NA
LordiphosamikiL. clarofinisMFGuizhou062018LCNANALine inbred for 2 generations in the lab before sequencing
L. stackelbergiMFUCILTSSapporo052019LSNANAPool of 50 wild-caught flies
L. magnipectinataMFUCKTSapporo052019LMNANAPool of 50 wild-caught flies
fenestrarumL. collinellaMFUCKTSapporo052019LCNANAPool of 30 wild-caught flies
L. mommaiMFMMSapporo052014LMNANA
DrosophilaZaprionusvittigerZ. nigranusMst01nNANAline derived from wild collection
Z. camerounensisMjd01camNANAisofemale line from J. David
Z. lachaiseiMjd01lNANAline derived from wild collection
Z. vittigerMjd01vNANAisofemale line from J. David
Z. davidiMjd01dNANAisofemale line from J. David
Z. taronusMst01tNANAline derived from wild collection
Z. capensisMjd01capNANAisofemale line from J. David
Z. gabonicusMjd01gabNANAisofemale line from J. David
Z. indianus RCR04MRCR04NANA
Z. indianus 16GNV01M16GNV01NANA
Z. indianus BS02 *MBS02NANA
Z. indianus CDD18MCDD18NANA
Z. africanusMBS06NANA
Z. ornatusMjd01oNANAisofemale line from J. David
tuberculatusZ. tsacasiMcar7-4NANA
Z. tsacasi *Mjd01tNANAisofemale line from J. David
inermisZ. kolodkinaeMjd01kNANAisofemale line from J. David
Z. inermisM18BSZ10NANA
Z. ghesquiereiMjd01gheNANAisofemale line from J. David
cardinidunniD. dunniMH254.2115182-2291.00NA
D. arawakanaMMONHI050227(B)-10415182-2261.03NA
cardiniD. cardiniMNA15181-2181.03917701
funebrisfunebris?undescribed (Sao Tome mushroom)Mst01mNANAundescribed species collected on mushroom, Sao Tome
funebrisD. funebrisMfst01NANAline derived from wild collection
immigransimmigransD. immigrans *FFK05-1915111.1731.12NA
D. immigrans kari17Mkari17NANA
(incertae sedis)D. pruinosaMiso-A1 l-9NANA
quadrilineataD. quadrilineataMquad-TMUNA914402
tumiditarsusD. repletoidesMISZ-isoB I-10NANA
ScaptomyzaScaptomyzaS. montanaMFiso-CA-L1NANA
S. graminumFTMU-2019NANA30 wild-caught females
ParascaptomyzaS. pallidaMFiso-CA-L1NANA
HemiscaptomyzaS. hsuiMFiso-CA-L1NANA
HawaiianDrosophilaorphnopezaD. sproatiMFDKPTOMS02NANAPool of wild-caught flies
D. murphyiMFDKPHETFM01NANAFlies from recently established but not inbred lab line
grimshawiD. grimshawiFNA15287-2541.00NASame line as caf1 genome
virilisvirilisD. virilisFNA15010-1051.87NAMiller et al., 2018
D. americanaM3367.115010-0951.00NAAlso called Anderson strain
D. littoralisMKilpisjärvi 1NANAOriginally misidentified as D. ezoana (Lankinen 1986, J Comp Physiol A 159: 123-142)
repletarepletaD. repletaMkari30NANA
mulleriD. mojavensisF15081-1352.22NANAMiller et al., 2018
genus: LeucophengaL. variaMnc01vNANASequenced single wild-caught fly, no amplification
genus: ChymomyzaC. costataMSapporoNANA
  1. * denotes the genome of best quality when multiple assemblies are available for a species.

Key resources table
| Reagent type (species) or resource | Designation | Source or reference | Identifiers | Additional information |
| Strain, strain background (Drosophila spp. and relatives) | See Table 1 and Supplementary files 1 and 6 for sample information, strain designations, stock center line identifiers (when applicable), biomaterial provider, and NCBI accession numbers. | | | |
| Commercial assay or kit | Blood and Cell Culture DNA Mini Kit | Qiagen | cat # 13323 | |
| Commercial assay or kit | Ligation Sequencing Kit | Oxford Nanopore | SQK-LSK109 | Superseded by SQK-LSK110 |
| Commercial assay or kit | Flow cell wash kit | Oxford Nanopore | EXP-WSH003 | Superseded by EXP-WSH004 |
| Commercial assay or kit | Short Read Eliminator kit | Circulomics | SKU # SS-100-101-01 | |
| Commercial assay or kit | Companion Module for ONT Ligation Sequencing | NEBNext | cat # E7180S | |
| Commercial assay or kit | Nextera XT DNA Library Preparation Kit | Illumina | cat # FC-131–1002 | Superseded by version 2 |
| Commercial assay or kit | Kapa HyperPrep Kit | Roche | cat # KK8502 | |
| Software, algorithm | Flye | Kolmogorov et al., 2019 | 2.6 | |
| Software, algorithm | Canu | Koren et al., 2017 | 1.8 | |
| Software, algorithm | Miniasm | Li, 2016 | 0.3 | |
| Software, algorithm | Guppy | Oxford Nanopore | 3.2.4 | |
| Software, algorithm | Medaka | Oxford Nanopore | 0.9.1 | |
| Software, algorithm | Minimap2 | Li, 2016 | 2.17 | |
| Software, algorithm | SAMtools | Li et al., 2009 | 1.12 | |
| Software, algorithm | Racon | Vaser et al., 2017 | 1.4.3 | |
| Software, algorithm | BUSCO | Simão et al., 2015 | 3.0.2 | |
| Software, algorithm | BUSCO | Seppey et al., 2019 | 4.0.6 | |
| Software, algorithm | Purge_haplotigs | Roach et al., 2018 | 1.1.1 | |
| Software, algorithm | npScarf | Cao et al., 2017 | 1.9-2b | |
| Software, algorithm | Pilon | Walker et al., 2014 | 1.23 | |
| Software, algorithm | BLAST | Altschul et al., 1990 | 2.10.0 | |
| Software, algorithm | SPAdes | Bankevich et al., 2012 | 3.11.1 | |
| Software, algorithm | FMLRC | Wang et al., 2018 | 1.0.0 | |
| Software, algorithm | LINKS | Warren et al., 2015 | 1.8.7 | |
| Software, algorithm | RepeatMasker | Smit et al., 2013 | 4.1.0 | |
| Software, algorithm | Dfam repeat database | Hubley et al., 2016 | 3.1 | Library for RepeatMasker |
| Software, algorithm | RepBase RepeatMasker edition | Bao et al., 2015 | 20181026 | Library for RepeatMasker |
| Software, algorithm | cross_match | Green, 2009 | 1.090518 | |
| Software, algorithm | Tandem Repeat Finder | Benson, 1999 | 4.0.9 | |
| Software, algorithm | Bioawk | Li, 2017 | 1.0 | |
| Software, algorithm | GenomeScope | Vurture et al., 2017 | 1.0.0 | |
| Software, algorithm | Jellyfish | Marçais and Kingsford, 2011 | 2.2.3 | |
| Software, algorithm | Sambamba | Tarasov et al., 2015 | 0.8.0 | |
| Software, algorithm | PEPPER-Margin-DeepVariant | Shafin et al., 2021 | 0.4 | |
| Software, algorithm | BCFtools | Li, 2011 | 1.12 | |
| Software, algorithm | Merqury | Rhie et al., 2020 | 1.3 | |
| Software, algorithm | Pomoxis | Oxford Nanopore | 0.3.7 | |
| Software, algorithm | bedtools | Quinlan and Hall, 2010 | 2.30.0 | |
| Software, algorithm | HALtools | Hickey et al., 2013 | 2.1 | |
| Software, algorithm | Integrative Genomics Viewer | Robinson et al., 2011b | 2.9.4 | |
| Software, algorithm | MAFFT | Katoh and Standley, 2013 | 7.453 | |
| Software, algorithm | RAxML-NG | Kozlov et al., 2019 | 0.9.0 | |
| Software, algorithm | ASTRAL-MP | Yin et al., 2019 | 5.14.7 | |
| Software, algorithm | ForceAtlas2 | Jacomy et al., 2014 | | Implemented in R package https://github.com/analyxcompany/ForceAtlas2 |
| Software, algorithm | ape | Paradis and Schliep, 2019 | 5.4.1 | R package |
| Software, algorithm | Docker | docker.com | | |
| Software, algorithm | Singularity | sylabs.io | | |

Additional files

Supplementary file 1

Detailed information on both long-read and short-read data used for this project, including accession numbers if publicly available data were used for assembly.

Supplementary file 2

Assembly summary statistics and genome size estimates.

Supplementary file 3

Counts of SNPs, indels, and per-site heterozygosity estimated from both long reads and short reads.

Supplementary file 4

Consensus quality scores estimated with reference-free and reference-based methods.

Supplementary file 5

Characterization of all coding sequence indel differences between Nanopore and Release 6 reference D. melanogaster assemblies.

Supplementary file 6

Detailed sample information.

Transparent reporting form

Highly contiguous assemblies of 101 drosophilid genomes
eLife 10:e66405.