Highly contiguous assemblies of 101 drosophilid genomes

  1. Bernard Y Kim  Is a corresponding author
  2. Jeremy R Wang
  3. Danny E Miller
  4. Olga Barmina
  5. Emily Delaney
  6. Ammon Thompson
  7. Aaron A Comeault
  8. David Peede
  9. Emmanuel RR D'Agostino
  10. Julianne Pelaez
  11. Jessica M Aguilar
  12. Diler Haji
  13. Teruyuki Matsunaga
  14. Ellie E Armstrong
  15. Molly Zych
  16. Yoshitaka Ogawa
  17. Marina Stamenković-Radak
  18. Mihailo Jelić
  19. Marija Savić Veselinović
  20. Marija Tanasković
  21. Pavle Erić
  22. Jian-Jun Gao
  23. Takehiro K Katoh
  24. Masanori J Toda
  25. Hideaki Watabe
  26. Masayoshi Watada
  27. Jeremy S Davis
  28. Leonie C Moyle
  29. Giulia Manoli
  30. Enrico Bertolini
  31. Vladimír Košťál
  32. R Scott Hawley
  33. Aya Takahashi
  34. Corbin D Jones
  35. Donald K Price
  36. Noah Whiteman
  37. Artyom Kopp
  38. Daniel R Matute  Is a corresponding author
  39. Dmitri A Petrov  Is a corresponding author
  1. Department of Biology, Stanford University, United States
  2. Department of Genetics, University of North Carolina, United States
  3. Department of Pediatrics, Division of Genetic Medicine, University of Washington and Seattle Children’s Hospital, United States
  4. Department of Evolution and Ecology, University of California Davis, United States
  5. School of Natural Sciences, Bangor University, United Kingdom
  6. Biology Department, University of North Carolina, United States
  7. Department of Integrative Biology, University of California, Berkeley, United States
  8. Molecular and Cellular Biology Program, University of Washington, United States
  9. Department of Biological Sciences, Tokyo Metropolitan University, Japan
  10. Faculty of Biology, University of Belgrade, Serbia
  11. University of Belgrade, Institute for Biological Research "Siniša Stanković", National Institute of Republic of Serbia, Serbia
  12. School of Ecology and Environmental Science, Yunnan University, China
  13. Hokkaido University Museum, Hokkaido University, Japan
  14. Biological Laboratory, Sapporo College, Hokkaido University of Education, Japan
  15. Graduate School of Science and Engineering, Ehime University, Japan
  16. Department of Biology, University of Kentucky, United States
  17. Department of Biology, Indiana University, United States
  18. Neurobiology and Genetics, Theodor Boveri Institute, Biocentre, University of Würzburg, Germany
  19. Institute of Entomology, Biology Centre, Academy of Sciences of the Czech Republic, Czech Republic
  20. Department of Molecular and Integrative Physiology, University of Kansas Medical Center, Stowers Institute for Medical Research, United States
  21. School of Life Science, University of Nevada, United States

Abstract

Over 100 years of studies in Drosophila melanogaster and related species in the genus Drosophila have facilitated key discoveries in genetics, genomics, and evolution. While high-quality genome assemblies exist for several species in this group, they only encompass a small fraction of the genus. Recent advances in long-read sequencing allow high-quality genome assemblies for tens or even hundreds of species to be efficiently generated. Here, we utilize Oxford Nanopore sequencing to build an open community resource of genome assemblies for 101 lines of 93 drosophilid species encompassing 14 species groups and 35 sub-groups. The genomes are highly contiguous and complete, with an average contig N50 of 10.5 Mb and greater than 97% BUSCO completeness in 97/101 assemblies. We show that Nanopore-based assemblies are highly accurate in coding regions, particularly with respect to coding insertions and deletions. These assemblies, along with a detailed laboratory protocol and assembly pipelines, are released as a public resource and will serve as a starting point for addressing broad questions of genetics, ecology, and evolution at the scale of hundreds of species.

Introduction

The rise of long-read sequencing alongside the continuously decreasing costs of next-generation sequencing have served to greatly democratize the process of genome assembly, making it feasible to assemble high-quality genomes at a previously unthinkable scale. Currently, a number of large consortia are leading well-publicized efforts to assemble the genomes of many taxa throughout the Tree of Life. Some often overlapping examples include the Vertebrate Genomes Project (Rhie et al., 2021), the Bird 10,000 Genomes Project (Feng et al., 2020), the Zoonomia Project (Zoonomia Consortium et al., 2020), the Darwin Tree of Life (Threlfall and Blaxter, 2021), the Earth Biogenome Project (Lewin et al., 2018), and the 5000 Arthropod Genomes Initiative (Robinson et al., 2011a). In addition to establishing new standards for modern large-scale genomics projects and opening avenues for genomic research that were previously only feasible in model organisms across a multitude of species, these projects are creating an opportunity to study genetic variation and address fundamental biological questions at a scope that was simply not possible before.

In many respects, the foundation for modern genomics was built by those studying the vinegar (also called fruit or pomace) fly Drosophila melanogaster and related species in the family Drosophilidae. As a premier model organism for genetic and biological research since the foundational work of Morgan and colleagues, D. melanogaster was, after C. elegans, the second metazoan organism to undergo whole-genome sequencing (Adams et al., 2000). At that time, the completion of the D. melanogaster genome proved the viability of shotgun sequencing approaches and paved the way for larger, more complicated genomes (Hales et al., 2015). The genomic tractability that made drosophilids attractive for this work has led to their continued widespread use as model organisms in the genomic era: the whole-genome sequencing of 12 Drosophila species (Clark et al., 2007) and the characterization of functional elements in Drosophila genomes (Roy et al., 2010) are prominent milestones in the history of modern genomics.

As it is a popular model system, an extensive collection of genomic resources exists for drosophilids today. Excluding genomes from this study, there are representative genome assemblies available on NCBI databases (GenBank and RefSeq) for about 75 different drosophilid species (Hotaling et al., 2021). About a third of these genomes are provided as chromosome-level scaffolds. Along with this diverse catalog of whole-genome sequences are collections of expression and regulation data (Chen et al., 2014; Roy et al., 2010), maps of constrained (i.e. functional) sequences inferred with comparative genomics tools (Stark et al., 2007), and population genomic data (e.g. Guirao-Rico and González, 2019; Lack et al., 2016; Signor et al., 2018). Well-studied D. melanogaster was among the first species to have high-quality genomes assembled for multiple individuals, revealing population variation in structural variants (Chakraborty et al., 2019; Long et al., 2018). Yet even with the intense scientific interest and effort thus far, only a small portion of the remarkably diverse drosophilids, a family which includes over 1600 described and possibly thousands of other undescribed species (O'Grady and DeSalle, 2018), is available for genomic study today.

There is much scientific potential to be unlocked by improving the catalog of genomic diversity within this group, and the simplification that long reads bring to the genome assembly process is key. Long reads have proved to be a way to quickly generate affordable yet high-quality genomes, in fact the cost of a highly contiguous and complete Drosophila assembly based on long-read sequencing was recently estimated to be about $1,000 US dollars (Miller et al., 2018; Solares et al., 2018), orders of magnitude less than the first D. melanogaster genome. While a number of studies have already used long reads to assemble the genomes of one or a few drosophilid species (Bracewell et al., 2019; Chakraborty et al., 2021; Comeault et al., 2020; Flynn et al., 2020; Hill et al., 2020; Mai et al., 2020; Miller et al., 2018; Paris et al., 2020; Rezvykh et al., 2021; Solares et al., 2018), a sequencing and genome assembly project at a scale similar to that of the large genome assembly consortia, especially without similar resources and funding, remains challenging even with the benefits of long reads. Yet, there continue to be rapid improvements to long-read sequencing that may alleviate some of these logistical challenges. Long-read sequencing costs have dropped significantly in the past few years as protocols, kits, and the underlying technology improves. Ultra-long (50–100 kb or longer) reads are obtainable with Oxford Nanopore (ONT) sequencing and under the right conditions should allow entire chromosomes to be fully assembled without additional time-consuming and costly scaffolding methods (e.g. Nurk et al., 2021). By simplifying the genome assembly process and reducing the cost of genome assembly even further, these techniques finally make it possible to assemble tens or hundreds of drosophilid genomes at a time.

Here, we present another step toward a comprehensive drosophilid genome dataset: a community resource of 101 de novo genome assemblies from 93 drosophilid species. These genomes were assembled using lines contributed by Drosophila researchers from across the world, and represent a diversity of ecologies and geographical distributions. We improve upon the Nanopore-based hybrid assembly (Nanopore plus Illumina) approach for Drosophila lines (Miller et al., 2018) to substantially increase the sequencing throughput contained in ultra-long reads while reducing overall costs. The contiguity, completeness, and quality of these genomes is assessed. We show that under ideal conditions, about two Drosophila lines (assuming an average 180 Mb genome) can be sequenced to at least 30× depth of coverage per ONT r9.4.1 (rev D) flow cell, at an approximate cost of 350 US dollars per line. Along with this manuscript and data, we provide a detailed Nanopore sequencing laboratory protocol specifically optimized for Drosophila lines, along with containerized computational pipelines. These genome assemblies and technical resources should facilitate the process of conducting large-scale genome projects in this key model clade and beyond.

Results and discussion

Taxon sampling

Our selection of species and strains for sequencing (Table 1) improves the geographic, ecological, and phylogenetic diversity of genomic data from the family Drosophilidae. Most (99 of 101) of the genome assemblies presented here are from 14 species groups in subgenera Drosophila and Sophophora of the subfamily Drosophilinae (Toda, 2020). One species of each of the genera Leucophenga and Chymomyza, both contained in less-studied sister subfamily Steganinae, have also been sequenced. We note some taxonomic inconsistencies arising from the paraphyly or polyphyly of certain drosophilid taxa (Finet et al., 2021; O'Grady and DeSalle, 2018; Yassin, 2013) but will make no attempt to address those issues here. The sequenced species originate from mainland and island locations in North America, Europe, Africa, and Asia; are distributed from northern (e.g. D. tristis, D.littoralis) to equatorial (e.g. D. bocqueti) latitudes; represent two independent transitions to leaf-mining herbivory (Scaptomyza and Lordiphosa); and for some species, like the pest Zaprionus indianus, represent reproductively isolated populations taken from throughout the range. For difficult to culture species, for instance Leucophenga varia and some Lordiphosa spp., only wild-caught flies were sequenced. Finally, we have sequenced lines in active research use. Additional genomic resources like gene expression or population data should be expected in the near future to accompany many of these assemblies. For species where multiple lines were assembled, we have selected a recommended line to use based on genome quality and denote this recommendation in Table 1.

Table 1
Species and strain information for all samples assembled for this work.

Note: Species group and subgroup information is taken from the NCBI Taxonomy Browser with slight modifications following O'Grady and DeSalle, 2018. Strain names along with corresponding NDSSC and Kyoto DGRC stock center numbers are provided to the best of our knowledge. See Supplementary file 1 and Supplementary file 6 for detailed information on samples and data. When multiple lines of a species are listed, * denotes the preferred assembly.

SubgenusGroupSubgroupSpeciesSexStrain nameNDSSCKyoto DGRC/
Ehime
Additional notes
SophophoramelanogastermelanogasterD. melanogasterMFISO-1 GENOME14021-0231.36NABDGP reference strain
D. mauritianaFNA14021-0241.01NAMiller et al., 2018
D. simulansFNA14021-0251.006NAMiller et al., 2018
D. sechelliaFNA14021-0248.01NAMiller et al., 2018
D. teissieri *M273.3NANA
D. teissieriMCT02NANA
D. yakubaFNA14021-0261.01NAMiller et al., 2018
D. erectaFNA14021-0224.01NAMiller et al., 2018
eugracilisD. eugracilisFNA14026-0451.02NAMiller et al., 2018
suzukiiD. subpulchrellaML1NANA
D. biarmipesMF361.0 iso1 l-11 GENOME strain 114023-0361.10NAmodENCODE strain
takahashiiD. takahashiiFIR98-3 E-12201NAE-912201inbred derivative of Ehime stock IR98-3
ficusphilaD. ficusphilaF631.0-iso1 l-10 GENOME14025-0441.05NAmodENCODE strain
rhopaloaD. carrolliMFKB866NANA
D. rhopaloaMFBaVi067 GENOME14029-0021.01E-24701modENCODE strain
D. kurseongensisFSaPa58NANA
D. fuyamaiFKB-121714029-0011.01NA
elegansD. elegansFHK0461.03 GENOME14027-0461.03NAmodENCODE strain
suzukiiD. oshimaiMMT-04NANA
montiumD. bocquetiMYAK3_mont-66NANA
D. sp aff chauvacaeMmont_up-71NANA
D. jambulinaMFst-214028-0671.01NA
D. kikkawaiF561.0-iso4 l-10 GENOME14028-0561.14NAmodENCODE strain
D. rufaFEH091 iso-C L_3NA914802inbred derivative of Ehime stock EH091
D. triaurariaFNA14028-0691.9NAMiller et al., 2018; previously mis-identified as D. kikkawai
ananassaeD. malerkotliana pallensFpalQ-isoGNANA
D. malerkotliana malerkotlianaMFmal0-isoC14024-0391.00NAinbred derivative of strain 14024-0391.00
D. bipectinataMF4-4-2-3-1-1-1-1-1 BackUp14024-0381.04NAInbred derivative of NDSSC strain
D. parabipectinataMFpar2-isoB14024-0401.02NAinbred derivative of strain 14024-0401.02 (now extinct)
D. pseudoananassae pseudoananassaeFWau 125NANA
D. pseudoananassae nigrensFVT04-31NANA
D. ananassaeF14024-0371.13NANAMiller et al., 2018
D. variansMFCKM15-L1NANA
D. ercepeaceMF164-1414024-0432.00NA
obscuraobscuraD. ambiguaMR42NANAisofemale strain from the wild
D. tristisMD2NANAisofemale strain from the wild
D. obscuraMBZ-5NANAisofemale strain from the wild
D. subobscuraMKüsnachtNANAstandard laboratory strain
pseudoobscuraD. persimilisFNA14011-0111.01NAMiller et al., 2018
D. pseudoobscuraFNA14011-0121.94NAMiller et al., 2018
willistoniwillistoniD. willistoni (Uruguay) *ML-G314030-0811.17NA
D. willistoniFNA14030-0811.00NAMiller et al., 2018
D. paulistorum L06 *M(Heed) H66.1C14030-0771.06NA
D. paulistorum L12ML1214030-0771.12NA
D. tropicalisM(Heed) H65.214030-0801.00NA
D. insularisMjp01iNANAisofemale line from J. Powell
bocainensisD. sucineaM49.1514030-0791.01NA
D. sucinea**MH176.1014030-0761.01NANDSSC strain is misidentified as D. nebulosa
saltanssaltansD. saltansM(Heed) H180.4014045-0911.00NA
D. prosaltansM(Heed) H29.614045-0901.02NA
neocordataD. neocordataM2536.714041-0831.00NA
sturtevantiD. sturtevantiFH191.2314043-0871.01NA
LordiphosamikiL. clarofinisMFGuizhou062018LCNANALine inbred for 2 generations in the lab before sequencing
L. stackelbergiMFUCILTSSapporo052019LSNANAPool of 50 wild-caught flies
L. magnipectinataMFUCKTSapporo052019LMNANAPool of 50 wild-caught flies
fenestrarumL. collinellaMFUCKTSapporo052019LCNANAPool of 30 wild-caught flies
L. mommaiMFMMSapporo052014LMNANA
DrosophilaZaprionusvittigerZ. nigranusMst01nNANAline derived from wild collection
Z. camerounensisMjd01camNANAisofemale line from J. David
Z. lachaiseiMjd01lNANAline derived from wild collection
Z. vittigerMjd01vNANAisofemale line from J. David
Z. davidiMjd01dNANAisofemale line from J. David
Z. taronusMst01tNANAline derived from wild collection
Z. capensisMjd01capNANAisofemale line from J. David
Z. gabonicusMjd01gabNANAisofemale line from J. David
Z. indianus RCR04MRCR04NANA
Z. indianus 16GNV01M16GNV01NANA
Z. indianus BS02 *MBS02NANA
Z. indianus CDD18MCDD18NANA
Z. africanusMBS06NANA
Z ornatusMjd01oNANAisofemale line from J. David
tuberculatusZ. tsacasiMcar7-4NANA
Z. tsacasi *Mjd01tNANAisofemale line from J. David
inermisZ. kolodkinaeMjd01kNANAisofemale line from J. David
Z. inermisM18BSZ10NANA
Z. ghesquiereiMjd01gheNANAisofemale line from J. David
cardinidunniD. dunniMH254.2115182-2291.00NA
D. arawakanaMMONHI050227(B)-10415182-2261.03NA
cardiniD. cardiniMNA15181-2181.03917701
funebrisfunebris?undescribed (Sao Tome mushroom)Mst01mNANAundescribed species collected on mushroom, Sao Tome
funebrisD. funebrisMfst01NANAline derived from wild collection
immigransimmigransD. immigrans *FFK05-1915111.1731.12NA
D. immigrans kari17Mkari17NANA
(incertae sedis)D. pruinosaMiso-A1 l-9NANA
quadrilineataD. quadrilineataMquad-TMUNA914402
tumiditarsusD. repletoidesMISZ-isoB I-10NANA
ScaptomyzaScaptomyzaS. montanaMFiso-CA-L1NANA
S. graminumFTMU-2019NANA30 wild-caught females
ParascaptomyzaS. pallidaMFiso-CA-L1NANA
HemiscaptomyzaS. hsuiMFiso-CA-L1NANA
HawaiianDrosophilaorphnopezaD. sproatiMFDKPTOMS02NANAPool of wild-caught flies
D. murphyiMFDKPHETFM01NANAFlies from recently established but not inbred lab line
grimshawiD. grimshawiFNA15287-2541.00NASame line as caf1 genome
virilisvirilisD. virilisFNA15010-1051.87NAMiller et al., 2018
D. americanaM3367.115010-0951.00NAAlso called Anderson strain
D. littoralisMKilpisjärvi 1NANAOriginally misidentified as D. ezoana (Lankinen 1986, J Comp Physiol A 159: 123-142)
repletarepletaD. repletaMkari30NANA
mulleriD. mojavensisF15081-1352.22NANAMiller et al., 2018
genus: LeucophengaL. variaMnc01vNANASequenced single wild-caught fly, no amplification
genus: ChymomyzaC. costataMSapporoNANA
  1. * denotes the genome of best quality when multiple assemblies are available for a species.

Near chromosome-scale assembly with ultra-long reads

We sequenced the fly samples using a ONT 1D ligation kit approach, replacing magnetic bead cleanups with size selective precipitation. This modified workflow is optimized for genomic DNA extractions from 15 to 30 whole flies, increases the yield of ultra-long reads relative to the standard ligation kit protocol, increases overall sequencing throughput, and significantly reduces the cost of library preparation. Sequencing runs varied with sample quality and type, and in general read lengths and throughput increased over the course of this work with improved iterations of the protocol. Under optimal conditions and with enough starting material (at least 2,000 ng of very high molecular weight DNA) to prepare at least three library loads (~1200–500 ng total prepared library, 350–500 ng per load), along with regular DNAse flushes to maintain yields, Nanopore sequencing runs following the supplied protocol should net 12–15 Gb of data per R9.4.1 flow cell with a read N50 greater than 20 kb, and about 30% of data in reads longer than 50 kb. We generated paired-end, 150 bp Illumina reads for most strains unless public datasets were available.

Deep (average 52×) sequencing coverage with a substantial fraction of ultra-long reads (Supplementary file 1) resulted in high-quality genome assemblies that were comparable to and often better than currently available reference genomes in terms of contiguity and completeness (Figure 1, Figure 1—figure supplement 1, Supplementary file 2). We chose Flye (Kolmogorov et al., 2019) as our assembler based on superior contiguity and favorable runtimes relative to Miniasm (Li, 2016) and Canu (Koren et al., 2017Figure 1—figure supplement 2). To provide standardization for measures of contiguity, we estimated genome size for each assembly using long-read coverage over single-copy BUSCO loci (Supplementary file 2).

Figure 1 with 4 supplements see all
Nanopore-based assemblies are highly contiguous and complete.

(A,B) Assembly contiguity is compared to the D. melanogaster v6.22 reference genome (blue) as well as five recently published, highly contiguous Illumina assemblies (red lines, D. birchii, D. bocki, D. bunnanda, D. kanapiae, D. truncata; Bronski et al., 2020). (A) Nx curves, or the (y-axis) size of each contig when contigs are sorted in descending size order, in relation to the (x-axis) cumulative proportion of the genome assembly that is covered. (B) The distribution of contig N50, the size of the contig at which 50% of the assembly is covered. (C) Assembly completeness assessed by BUSCO v4.0.6 (Seppey et al., 2019). Note, D. equinoxialis was evaluated with BUSCO v4.1.4 due to an issue with v4.0.6. L. stackelbergi has >10% missing BUSCOs. Individual assembly summary statistics are provided in Supplementary file 2.

Of 101 total assemblies, 94 contain over 98% of the assembly in contigs larger than 10 kb, and both contig N50s and NG50s exceed 1 Mb for these genomes (Figure 1A, Figure 1B, Figure 1—figure supplement 3, Supplementary file 2). Assembly sizes were highly correlated with estimated genome sizes (Figure 1—figure supplement 4). In addition to meeting the megabase contig N50 standard for new genomes proposed by the Vertebrate Genomes Project (Rhie et al., 2021), these statistics show that most of the genome is present in the assembly in megabase-sized contigs. In other words, the assemblies are nearly at the chromosome level. For comparison, of the 76 representative drosophilid genomes that were previously available on NCBI (Hotaling et al., 2021), only 25 have an N50 greater than 1 Mb (Figure 1—figure supplement 1). Moreover, many of these highly contiguous NCBI genomes are scaffolded, an additional step that would have added a significant amount of time and additional expenses to this study. Even when DNA was extracted from pools of wild-caught flies or a single fly (Leucophenga varia) resulting in sub-optimal read lengths and output, the assembly was comparable to existing short read assemblies (Figure 1A, Figure 1B). High contiguity resulted in benchmarking universal single-copy ortholog (BUSCO) completeness (Seppey et al., 2019; Simão et al., 2015) in the range of 97–99+% for all but the three most fragmented genomes (Figure 1C). As with contiguity, the completeness of these genomes is comparable to reference genomes on NCBI (Figure 1—figure supplement 1).

Estimates of sample diversity

We have utilized a variety of fly samples, from highly inbred lab lines to wild-caught flies, for genome assembly. We therefore sought to quantify the level of diversity inherent to each sample and use variant calls to estimate the error rate for each assembly. Long and short reads (if available) were mapped separately to each finished genome and variant calling was performed with PEPPER-Margin-DeepVariant (Shafin et al., 2021) for long reads and BCFtools (Danecek et al., 2021; Li, 2011) for short reads. After quality filtering and masking genomic regions annotated as repeats, the counts of single nucleotide polymorphisms (SNPs), indels, and the fraction of sites with a non-reference SNP were computed (Figure 2, Supplementary file 3). Note, when short reads were not from the same strain as used for the assembly, short read polishing was used to only correct indels, and called SNPs will not accurately represent the variation in the sample that was sequenced with Nanopore. Also note that SNP calls from Nanopore data should be relatively accurate but indel calls will not (Shafin et al., 2021).

Figure 2 with 1 supplement see all
Estimated heterozygosity in the data used for genome assembly.

Per-site SNP heterozygosity (number of heterozygous SNPs/number of callable sites) is plotted for each of the 101 assembled lines. Blue dots represent heterozygosity estimates from Nanopore reads with PEPPER-Margin-DeepVariant (Shafin et al., 2021). Orange dots represent heterozygosity estimates from short reads with BCFtools (Li, 2011). The genomes on the right are for species that did not have available short-read data. Numerical values for these estimates are provided in Supplementary file 4.

Large variation in sample diversity over several orders of magnitude was observed. Estimated SNP heterozygosity, the number of heterozygous SNPs divided by the number of callable sites, ranged from 0.00035% to 1.1% from long reads and 0.0015% to 2.1% from short reads, and heterozygosity estimated from long reads was systematically lower than that from short reads, particularly when sample diversity was high (Figure 2, Figure S6). Qualitative patterns of heterozygosity generally tracked the history of the samples (e.g. the highly inbred reference strains had very low diversity). Conditioning on datasets where both long and short reads were generated from the same sample, heterozygosity estimates from both types of reads were positively correlated (Pearson correlation R2=0.50, p=1.13×10–12). If we ignore Lordiphosa, the group with wild-caught or recently collected samples that was consequently the most challenging to assemble, this correlation is greatly increased (Pearson correlation R2=0.81, p<2.2×10–16). Interestingly, we did not observe a significant relationship (p=0.30) between estimated heterozygosity and assembly contiguity (Figure 2—figure supplement 1). The number of heterozygous non-reference variants almost always exceeded the number of homozygous variants (Supplementary file 3), as would be expected from residual diversity in the sequenced lines.

Estimates of sequence quality

Next, we estimated the genome-wide error rates in our assemblies using both the variant calls obtained previously and a reference-free method (Supplementary file 4). For the first approach, Phred-scaled (Ewing et al., 1998) consensus quality (QV) was estimated by assuming all sites with a non-reference variant were an error. The error rate was then computed by dividing the number of sites with at least one non-reference variant by the total number of callable bases. As expected from the patterns of heterozygosity estimated from long and short reads, there was a large amount of variability in quality scores. Estimates from short reads ranged from QV17 to QV45 and from long reads were slightly higher, from QV19 to QV52 (Supplementary file 4).

This method is likely to be biased by assembly features that affect the quality of read mapping, for example, we remove sequences annotated as repeats when filtering the variant calls. To address this bias, we employed the reference-free approach implemented in Merqury (Rhie et al., 2020) for the 94 assemblies which had some kind of short-read data available (Figure 3A, Supplementary file 4). Estimated quality scores ranged from QV16 to QV40, and once again, samples for which reads from a different strain or a genetically diverse sample (i.e. wild samples or recent isolates) were used had the lowest estimated QV. Merqury-estimated QV was on average higher than consensus quality estimated by the variant calling methods, but the relative ranking of QV estimates remained largely consistent with QV based on short-read (Spearman’s ρ=0.642, p<2.2e-16) and long-read (Spearman’s ρ=0.684, p<2.2e-16) variant calls.

Figure 3 with 2 supplements see all
Nanopore-based Drosophila assemblies are accurate, particularly in coding regions.

(A) Genome-wide, Phred quality scores estimated with the reference-free, k-mer based approach implemented in Merqury (Rhie et al., 2020). Merqury requires a short-read dataset to perform the evaluation. Filled circles represent QV estimates with short-read data from the same strain used for Nanopore sequencing, and empty circles denote estimates using short-read data from a different strain than used for Nanopore sequencing. (B, C, D) Phred quality score cutoffs for the bottom 10th percentile of 100 kb genomic windows, as evaluated with a reference-based approach, in coding sequences only. Quality scores are capped at 60 for visualization purposes. At least 90% of 100 kb windows are this accurate. Only Nanopore assemblies with an NCBI RefSeq genome counterpart of the same strain were evaluated. Accuracy is shown for SNVs (B), insertions (C), and deletions (D) separately. Additional details on quality score estimates are provided in Figure 3—figure supplement 1 and Supplementary file 4.

While these estimates showed our genomes to mostly fall below the often-recommended QV40 threshold for reference genomes (Koren et al., 2019; Rhie et al., 2021), there are many reasons to expect that sequence quality in certain regions of the genome will be far better than the average. As expected, we found that QV estimates were particularly low when short-read data from a different sample was used for the estimation, as any true variation between strains will inflate the error rate. Because we sequenced pools of flies, residual polymorphism will be found in the data even when long and short reads are sampled from the same pool of flies. In these cases QV might be considered as a lower bound estimate of the true accuracy of the assembly. Additionally, complex coding sequences are likely to be far more accurate than other regions of the genome, like repeats, due to better short-read mapping. The single genome-wide estimates of QV we report obscure this variation.

Nanopore-based assemblies are highly accurate in coding regions

For these reasons, we found it critical to further examine how errors are distributed in Nanopore assemblies. Of particular concern is the accuracy of coding sequences. Gene annotation is an important and obvious next step after assembling a new genome, but Nanopore sequences are known to systematically contain indels in homopolymer runs that cannot be called accurately when a run exceeds the size of the nanopore reader head. Indel disruptions to otherwise highly accurate coding sequences would have a disproportionately large negative impact on protein prediction (Watson and Warr, 2019). On the other hand, it is likely that coding sequences are generally more accurate than the rest of the genome since short-read mapping is generally more reliable there. In theory, most exons should be free of errors somewhere between a genome-wide quality of QV30 to QV40 (Koren et al., 2019), but many of our assemblies do not appear to reach this benchmark.

Reference-based quality assessments were used to better understand how error rates vary across different genomic elements. We downloaded the 8 NCBI RefSeq genome assemblies for which we had a Nanopore genome of the same species and strain: D. biarmipes, D. elegans, D. ficusphila, D. grimshawi, D. kikkawai, D. melanogaster, D. mojavensis, and D. rhopaloa. Using the ONT Pomoxis software, we aligned each Nanopore assembly to its corresponding reference genome and estimated QV in non-overlapping 100 kb windows, using the entire sequence, then only coding sequences, introns, intergenic regions, and repeats, using gene and repeat definitions provided through NCBI RefSeq. All differences between query and reference assemblies were considered to be errors.

As expected, we found that sequence accuracy varied greatly within each genome assembly (Figure 3—figure supplement 1). Mean genome-wide QV ranged from QV15 to QV24 while median QV across the 100 kb windows ranged from QV14 to QV36. When looking only at coding sequences, mean QV ranged from QV23 to QV29, while the median window accuracy, with the exceptions of D. grimshawi (QV25) and D. rhopaloa (QV30), indicated complete identity (>QV50) between assembly and reference. For D. grimshawi and D. rhopaloa, SNVs were the primary contributor to the error rate and the number of indels was similar to the other genomes (median QV(indel)>50). Sequence accuracy was lower when looking at introns, intergenic regions, and repeats, in that order. However, regardless of the genomic element type, median QV across the windows always exceeded mean QV, often by more than QV10, or an order of magnitude difference in the error rate. In other words, differences between Nanopore and reference assemblies were clustered heavily into a few genomic regions, and most coding sequences were very accurate despite the seemingly high mean error rate (Figure 3B, Figure 3C, Figure 3D). Further caution is warranted in the interpretation of these quality scores: we have assumed that all differences between our Nanopore-based genomes and the reference genomes are errors in the Nanopore assembly, rather than errors in the reference, or true differences between the two sequenced samples. We will shortly show that reference-based comparisons might be heavily biased against the Nanopore assemblies.

To better understand the nature of putative indel errors in coding sequences, we focused on the D. melanogaster reference strain, where we have the best information about the genome from multiple independent high-quality assemblies (Kim et al., 2014; Koren et al., 2017; Solares et al., 2018). While D. melanogaster is a best-case scenario for genome assembly with a fly line, we think it reasonable to expect errors in other assemblies, for which we have utilized the same genome assembly workflow, to be similar in nature. Across the 22,209,264 bp of our D. melanogaster genome that aligned to reference coding sequences, our assembly contained 15 insertions and 17 deletions in 21 out of 13,913 (0.15%) queried protein-coding genes, with a total of 10,092 inserted and 46 deleted base pairs relative to the reference. All deletions (15 out of 15) were under 50 bp, 8 out of 15 insertions were under 50 bp, and the remaining 7 out of 15 insertions ranged from 120 bp to 4410 bp (Figure 3—figure supplement 2, Supplementary file 5). These larger insertions account for nearly all (99.3%) of the coding sequence differences between our genome and the reference. There is a clear, disproportionate impact of these large insertions in an otherwise nearly identical protein-coding genome.

We followed up on each of these 32 coding indels through manual curation with the genome browser IGV (Robinson et al., 2011b). Using the Release 6 D. melanogaster genome (Hoskins et al., 2015) as the reference, we aligned Nanopore and Illumina reads, a different Nanopore-based de novo assembly of the reference strain (Solares et al., 2018), a PacBio-based de novo assembly (Kim et al., 2014; Koren et al., 2017), and our assembly. We were particularly interested in the two other long-read assemblies, as we wondered if they might independently support any of the large variants in our assembly. RepeatMasker annotations for the Nanopore-based assembly were lifted over into Release 6 coordinates to see if these indels overlapped with a repetitive element.

This manual curation process revealed that the coding indels, in addition to being exceedingly rare, could be straightforwardly explained by regions of poor short read mapping and the presence of a duplicate contig in the assembly (Supplementary file 5). A series of large and small indels, including four out of the five insertions longer than 100 bp, overlapped a tandem repeat in genes CR44666, Mu68Ca, and Mu68E. While short reads mapped poorly to this region, limiting our ability to determine accuracy locally, long reads spanning the entire region and the two other long-read assemblies supported the large insertions. The remaining long (1414 bp) insertion was similarly supported by long reads and the other assemblies, but did not overlap with a repeat. Again, these insertions account for more than 99% of the indel differences between our genome and the reference. The remaining indels occurred in either repetitive regions (simple repeats and long interspersed nuclear element retrotransposons), in homopolymer runs in regions with poor short read mapping, or along a single contig that appeared to be a short duplicated segment of chromosome 2L. The other contig was error-free. All indels occurred on contigs with poor short-read mapping, suggesting they were a consequence of locally ineffective short read polishing, but also that sensible filtering based on short read depth or map quality would prevent these issues from propagating into downstream analyses. Importantly, these results suggest that reference-based quality analyses can be heavily biased against long-read assemblies and further support our caution against a naive projection of genome-wide quality score estimates onto coding regions.

A comparative genomics resource

To demonstrate the potential this dataset holds for the study of genome evolution and chromosome organization, we revisit a classic result with our highly contiguous assemblies. Although the ordering of genes in drosophilid chromosomal (Muller) elements has been extensively shuffled throughout ~53 million years of evolution (Suvorov et al., 2021), the gene content of each element remains largely conserved (Bracewell et al., 2019; Ranz et al., 2001; Sturtevant and Novitski, 1941). To examine synteny in our assemblies, many of which contain several contigs tens of megabases in length, we constructed an undirected graph using single-copy orthologous markers (i.e. BUSCOs). The number of times two markers were connected by assemblies determined the weight of the graph’s edges. A graph layout method was applied to spatialize (map) these relationships, clustering together BUSCOs that are frequently connected in the assemblies. We found that BUSCOs formed six major clusters following the D. melanogaster chromosome arm on which they are found, consistent with the expected conservation of gene content in Muller elements across drosophilids (Figure 4). Furthermore, the lack of a clear order within groups is consistent with extensive shuffling within Muller elements. This demonstrates that our dataset can be used for studies of genome evolution. New reference-free, whole-genome alignment methods (Armstrong et al., 2020) should substantially facilitate more detailed comparative analyses.

Gene content of Muller elements is conserved across drosophilids while gene order changes.

Each node in this graph represents an orthologous marker corresponding to single-copy orthologs annotated by BUSCOv4 (Seppey et al., 2019). An edge between two nodes represents the number of times that BUSCO pair is directly connected within an assembly. Each BUSCO is colored by the chromosome arm in D. melanogaster that it is found on. The ForceAtlas2 (Jacomy et al., 2014) graph layout algorithm was used for visualization.

Repeat content

A large number of genome assemblies enables comparative analysis of repeat variation against a wide range of genome assembly sizes (140–450 Mb), for example the independent expansions of satellite repeats in D. grimshawi or retroelements in D. paulistorum, D. bipectinata, or D. subpulchrella (Figure 5). Within our dataset alone, RepeatMasker annotations reveal large variation in repeat content among drosophilids (Figure 5). No correlation exists between assembly contiguity and repeat content (Figure 5—figure supplement 1), suggesting long-read sequencing overcomes many of the challenges to drosophilid genome assembly posed by repetitive sequences. Additionally, we observe a positive relationship between the size of repetitive sequences and non-repetitive sequences, suggesting that genome size is influenced by expansions and contractions of both portions of the genome (Figure 5—figure supplement 2). Some discretion is warranted in the interpretation of these results. Repeats are likely to be better annotated in genomes from well-studied species groups, since they are more likely to be well-characterized in the repeat databases we used. Nevertheless, the high continuity of these assemblies should allow for the proper identification of new transposable elements in the genomes and allow for the analyses of transposable element evolution at the level of individual transposable elements or transposable element families in a way that is not feasible with more fragmented genome assemblies (Clark et al., 2007).

Figure 5 with 2 supplements see all
Repeat content varies greatly between drosophilid groups.

For each species, the proportion of each genome annotated with a particular repeat type is depicted. Species relationships were inferred by randomly selecting 250 of the set of BUSCOs (Seppey et al., 2019) that were complete and single-copy in all assemblies. RAxML-NG (Kozlov et al., 2019) was used to build gene trees for each BUSCO then ASTRAL-MP (Yin et al., 2019) to infer a species tree. Repeat annotation was performed with RepeatMasker (Smit et al., 2013) using the Dfam 3.1 (Hubley et al., 2016) and RepBase RepeatMasker edition (Bao et al., 2015) databases. ASTRAL local posterior probabilities are reported at each node.

Next steps

We have built an open resource of 101 nearly chromosome-level drosophilid genome assemblies, adding to the rapidly growing number of high-quality genomes available for this model system (Hotaling et al., 2021; Suvorov et al., 2021). We envision this dataset being used to address a large number of outstanding questions entailing large comparative analyses among species, including the comparison of population genomic data between a large number of species, providing unprecedented resolution to investigate fundamental questions about the evolutionary process. In addition, we provide detailed laboratory and computational workflows that we hope will provide a jumping off point for future genome assembly projects in drosophilids or other taxa. While we hope this to already be a valuable resource to the scientific community, we acknowledge there is much to be done to build upon the resource and to improve its usability.

Despite our best efforts to improve species diversity, both the species we sequenced and the drosophilid genomes available today are significantly biased towards well-studied, easy-to-maintain species already in use for scientific research. Reducing sampling bias, with respect to phylogenetic diversity, geographic distribution, and ecology should be a goal of future genome assembly projects in this group. The high input DNA requirement for PCR-free long-read sequencing is a major limitation of our assembly workflow in this context. Our protocol requires a DNA extraction from multiple flies, ideally from an inbred line to minimize genetic diversity in the sample used for assembly. High diversity in the sample usually results in a fragmented assembly with many duplicated sequences, and while these issues can be addressed with computational tools, the quality of the final assembly is still affected. However, some species, for instance the Lordiphosa spp. or Hawaiian Drosophila we sequenced, cannot be quickly raised in the lab on standard media and thus cannot easily be inbred like other drosophilids. Many other species are simply understudied and sample availability is limited to a few flies collected from the wild and possibly preserved in ethanol for many years. Methods for assembling genomes with small quantities of DNA from single insects (Adams et al., 2020; Kingan et al., 2019; Schneider et al., 2021) or dealing with degraded specimens from older collections will be particularly important as the scope of future work expands beyond stock center and laboratory lines.

Some of these sequencing challenges will be better addressed by new technology and techniques. While we hesitate to make specific recommendations due to the rapidly changing landscape of long-read sequencing and genome assembly methods, there are a few clear ways in which many recently assembled long-read genomes can be improved. Even in the short time since we performed the sequencing for this work, there have been remarkable improvements to library preparation workflows, the accuracy of base calling algorithms, and assembly tools. At a minimum, we plan to iteratively update these assemblies using newer base calling methods to maximize the usefulness of the dataset.

This alone is unlikely to future-proof Nanopore R9 flow cell-based assemblies when the ultimate goal is to build genomes that are free of errors, and we recommend that a genome assembly project initiated today look beyond a Nanopore and Illumina approach. There is ample room to reduce the per-assembly cost while improving both contiguity and accuracy. The current major obstacle to high genome-wide accuracy is the difficulty of calling bases accurately in homopolymer runs combined with the limitations of short reads for correcting these errors when they occur in genomic regions with poor short read mappability. One new strategy to address this is to generate supplementary lower coverage data from high fidelity long read sequencing, for instance with PacBio HiFi (Nanopore versions are currently in development). New polishing tools are specifically designed to polish Nanopore assemblies with higher-fidelity reads (e.g. Shafin et al., 2021) and users should see greatly improved overall sequence accuracy.

This kind of hybrid long-read assembly approach may prove to be even more efficient than the assembly workflow we have presented. Interestingly, we find that high contiguity can be achieved even with minimal (10–20×) coverage of moderately long (read N50 25 kb) Nanopore reads (Figure 6). Similar coverage with even longer reads could serve as a cheap way to generate almost chromosome-level contigs, which will then be polished with higher fidelity long reads or Illumina reads. Ultra-long Nanopore sequencing is also significantly more accessible than before. Recently (as of March 2021), Oxford Nanopore and Circulomics released new ultra-long sequencing kits that, under ideal conditions, allows users to perform ultra-long Nanopore sequencing runs where read N50s exceed 100 kb while nearly doubling overall flow cell throughput compared to the sequencing runs performed for this study. Further cost savings should be possible if sequencing is done with the ONT PromethION or PacBio CLR, depending on the scale of the project. Both technologies have a lower per-base cost than MinION sequencing and similarly long reads can be obtained.

Highly contiguous assemblies can be obtained with lower coverage of ultra-long reads.

The NGx curve is shown for Drosophila jambulina assemblies at varying levels of coverage. The length of the assembly with the full data is assumed to be the genome size. Read sets used for each assembly were obtained by randomly downsampling the basecalled reads (read N50 ~27.5 kb) to varying (5× to 30×) depth of coverage. Proportionally, these read sets contain ~55% of total sequenced bases in reads longer than 25 kb, ~25% of bases in reads longer than 50 kb, and ~7% of bases in reads longer than 100 kb. Near chromosome scale assemblies (N50>20Mb) were achievable even at 15× to 20× depth with this read length distribution. This corresponds to approximately 8× to 10× depth in reads longer than 25 kb.

Finally, we are in the process of improving the utility of this resource by generating a suite of comparative genomics tools and annotations to be released in the upcoming months. Specifically, we are utilizing Progressive Cactus (Armstrong et al., 2020), a reference-free whole-genome aligner that is designed to be scalable to modern genomic datasets and that has already been applied to hundreds of mammal and bird genomes generated by the Zoonomia (Zoonomia Consortium et al., 2020) and Bird 10K (Feng et al., 2020) projects. These alignments will be used to create sequence conservation maps (Hickey et al., 2013; Pollard et al., 2010; Siepel et al., 2005), the precision of which should be close to single nucleotide resolution given the large number of drosophilid genomes that are now available. While ultimately RNA-seq across all species will be needed for annotation, we plan to quickly generate the first round of gene annotations using comparative annotation tools. For new assemblies where a previously annotated reference genome is available, LiftOff (Shumate and Salzberg, 2020) provides a way to quickly transfer annotations to a new genome. For the more challenging task of gene annotation in species that do not already have a well-annotated reference, we are using the Comparative Annotation Toolkit (Fiddes et al., 2018), software to perform first-pass annotations assisted by homology information from the Progressive Cactus alignment. New RNA-seq data will be generated for select species in clades without a well-annotated member (e.g. Zaprionus). These tools will provide a framework for anyone to apply iterative improvements as new data become available.

Reproducibility

Detailed laboratory protocols, computational pipelines, and computational container recipes are provided as a reference and to maximize reproducibility. The protocol is publicly available at Protocols.io and pipeline scripts along with associated compute containers are provided in a public GitHub repository. See Materials and methods for additional details on compute containers, accession numbers, and web links to these resources.

Materials and methods

Key resources table
Reagent type
(species) or resource
DesignationSource or referenceIdentifiersAdditional information
Strain, strain background (Drosophila spp. and relatives)See Table 1 and Supplementary files 16 for sample information, strain designations, stock center line identifiers (when applicable), biomaterial provider, and NCBI accession numbers.
Commercial assay or kitBlood and Cell Culture DNA Mini KitQiagencat # 13323
Commercial assay or kitLigation Sequencing KitOxford NanoporeSQK-LSK109Superseded by SQK-LSK110
Commercial assay or kitFlow cell wash kitOxford NanoporeEXP-WSH003Superseded by EXP-WSH004
Commercial assay or kitShort Read Eliminator kitCirculomicsSKU # SS-100-101-01
Commercial assay or kitCompanion Module for ONT Ligation SequencingNEBNextcat # E7180S
Commercial assay or kitNextera XT DNA Library Preparation KitIlluminacat # FC-131–1002Superseded by version 2
Commercial assay or kitKapa HyperPrep KitRochecat # KK8502
Software, algorithmFlyeKolmogorov et al., 20192.6
Software, algorithmCanuKoren et al., 20171.8
Software, algorithmMiniasmLi, 20160.3
Software, algorithmGuppyOxford Nanopore3.2.4
Software, algorithmMedakaOxford Nanopore0.9.1
Software, algorithmMinimap2Li, 20162.17
Software, algorithmSAMtoolsLi et al., 20091.12
Software, algorithmRaconVaser et al., 20171.4.3
Software, algorithmBUSCOSimão et al., 20153.0.2
Software, algorithmBUSCOSeppey et al., 20194.0.6
Software, algorithmPurge_haplotigsRoach et al., 20181.1.1
Software, algorithmnpScarfCao et al., 20171.9-2b
Software, algorithmPilonWalker et al., 20141.23
Software, algorithmBLASTAltschul et al., 19902.10.0
Software, algorithmSPAdesBankevich et al., 20123.11.1
Software, algorithmFMLRCWang et al., 20181.0.0
Software, algorithmLINKSWarren et al., 20151.8.7
Software, algorithmRepeatMaskerSmit et al., 20134.1.0
Software, algorithmDfam repeat databseHubley et al., 20163.1Library for RepeatMasker
Software, algorithmRepBase RepeatMasker editionBao et al., 201520181026Library for RepeatMasker
Software, algorithmcross_matchGreen, 20091.090518
Software, algorithmTandem Repeat FinderBenson, 19994.0.9
Software, algorithmBioawkLi, 20171.0
Software, algorithmGenomeScopeVurture et al., 20171.0.0
Software, algorithmJellyfishMarçais and Kingsford, 20112.2.3
Software, algorithmSambambaTarasov et al., 20150.8.0
Software, algorithmPEPPER-Margin-DeepvariantShafin et al., 20210.4
Software, algorithmBCFtoolsLi, 20111.12
Software, algorithmMerquryRhie et al., 20201.3
Software, algorithmPomoxisOxford Nanopore0.3.7
Software, algorithmbedtoolsQuinlan and Hall, 20102.30.0
Software, algorithmHALtoolsHickey et al., 20132.1
Software, algorithmIntegrative Genomics ViewerRobinson et al., 2011b2.9.4
Software, algorithmMAFFTKatoh and Standley, 20137.453
Software, algorithmRAxML-NGKozlov et al., 20190.9.0
Software, algorithmASTRAL-MPYin et al., 20195.14.7
Software, algorithmForceAtlas2Jacomy et al., 2014Implemented in R package https://github.com/analyxcompany/ForceAtlas2
Software, algorithmapeParadis and Schliep, 20195.4.1R package
Software, algorithmDockerdocker.com
Software, algorithmSingularitysylabs.io

Taxon sampling and sample collection

Request a detailed protocol

The selection of species used for this study was driven by several key objectives. First, we aimed to provide data for ongoing research projects. Second, we aimed to supplement existing genomic data, both as a benchmarking resource against well-studied references (e.g. D. melanogaster) and to provide a technological update to some older assemblies (Roy et al., 2010). Third, we aimed to increase the phylogenetic and ecological diversity of publically available Drosophila genome assemblies.

In most cases, genomic DNA was collected from lab-raised flies, which were either derived from lines maintained at public Drosophila stock centers and individual labs or, in a few cases, from F1 or F2 progeny of flies recently collected in the wild. We collected specimens from the wild with standard fruit or mushroom-baited traps, sweep netting, and aspiration. We established isofemale lines from individual females collected using these baits unless otherwise specified (Supplementary file 1). For species difficult to culture in the lab (all Lordiphosa spp. except Lo. clarofinis, D. sproati, D. murphyi, Le. varia, S. graminum), either wild-caught flies or flies from a transient lab culture were used. In accordance with domestic and international shipping laws, these flies were either fixed in ethanol before transport (Lordiphosa spp., D. subobscura, D. obscura, C. costata, D. littoralis, D. tristis, D. ambigua) or transported with permits (P526P-15–02964 to D. Matute, P526P-20–02787 and P526P-19–01521 to A. Kopp, and Hawaii State permit I1302 to D. Price).

Of 101 total assemblies, we include 13 genomes assembled with re-analyzed sequences from Miller et al., 2018; 60 genomes from stock center lines or established lab cultures; 22 genomes from lab-raised flies derived from recent wild collections; and six genomes from wild-caught flies. Of note, 6 Zaprionus lines used in this study (Z. africanus, Z. indianus, Z. tsacasi, Z. nigranus, Z. taronus) were assembled by Comeault et al., 2020, but updated higher contiguity assemblies are provided with this manuscript with the exception of Z. indianus line 16GNV01 (see ‘Alternative hybrid assembly process’ section below). Details on each sample including (if available) line designations and collection information, are provided in Table 1 and Supplementary file 6.

DNA extraction and nanopore sequencing

Request a detailed protocol

A high molecular weight (HMW) genomic DNA (gDNA) extraction and ONT library prep was performed for each sample, with slight variation in the protocol through time and to deal with differences in sample quality or preservation. Here, we briefly describe a recommended general protocol for HMW gDNA extraction and library prep from 15 to 30 flies. This protocol is sufficient to reproduce all results from this manuscript at the same or higher levels of data quality. Detailed step-by-step instructions are provided at Protocols.io (see Data availability). We note one exception made necessary by sample availability and shipping laws. Scaptomyza graminum gDNA was extracted by using the Qiagen Blood and Cell Culture DNA Mini Kit (Qiagen, Germantown, MD) from 30 unfrozen flies and prepared with the ONT LSK109 kit (Oxford Nanopore, Oxford, UK) without any modifications to the manufacturer's instructions.

Genomic DNA was prepared from about 30 flash frozen or ethanol fixed adult flies. For non-inbred samples, we tried to use 15 flies or less to minimize the genetic diversity of the sample. In the absence of amplification, about 1.5–3 μg of input DNA is needed to prepare 3–4 library loads with the ONT LSK109 kit. Sufficient input DNA is particularly important when selecting for longer reads. Ethanol preserved samples were soaked in a rehydration buffer (400 mM NaCl, 20 mM Tris-HCl pH 8.0, 30 mM EDTA) for 30 min at room temperature (~23°C), dabbed dry with a Kimwipe, then frozen for 1 hr at −80°C before extraction. Frozen flies were ground in 1.5 mL of homogenization buffer (0.1M NaCl, 30 mM Tris HCl pH 8.0, 10 mM EDTA, 0.5% Triton X-100) with a 2 mL Kimble (DWK Life Sciences, Millville, NJ) Kontes Dounce homogenizer. The homogenate was centrifuged for 5 min at 2000 ×g, the supernatant discarded by decanting, and the pellet resuspended in 100 μL of fresh homogenization buffer. This mixture was then added to a tube with 380 μL extraction buffer (0.1M Tris-HCl pH 8.0, 0.1M NaCl, 20 mM EDTA) along with 10 μL of 20 mg/mL Proteinase K (Thermo Fisher Scientific, Waltham, MA), 10 μL SDS (10% w/v), and 2 μL of 10 mg/mL RNAse A (Millipore Sigma, Hayward, CA). This tube was incubated at 50°C for 4 hr, with mixing at 30–60 min intervals by gentle inversion.

High-molecular-weight gDNA was purified with a standard phenol-chloroform extraction. The lysate was extracted twice with an equal volume of 25:24:1 v/v phenol chloroform isoamyl alcohol (Thermo Fisher Scientific, Waltham, MA) in a 2 mL light phase lock gel tube (Quantabio, Beverly, MA). Next, the aqueous layer was decanted into a fresh 2 mL phase lock gel tube then extracted once with an equal volume of chloroform (Millipore Sigma, Hayward, CA). The use of the phase lock gel tube reduces DNA shearing at this stage by minimizing pipette handling. HMW DNA was precipitated by adding 0.1 vol of 3M sodium acetate and 2.0–2.4 volumes of cold absolute ethanol. Gentle mixing resulted in the precipitation of a white, stringy clump of DNA, which was then transferred to a DNA LoBind tube (Eppendorf, Hamburg, Germany) and washed twice with 70% ethanol. After washing, the DNA was pelleted by centrifugation and all excess liquid removed from the tube. The pellet was allowed to air dry until the moment it became translucent, resuspended in 65 μL of 1× Tris-EDTA buffer on a heat block at 50°C for 60 min, then incubated for at least 48 hr at 4°C. After 48 hr, the viscous DNA solution was mixed by gentle pipetting with a P1000 tip. This controlled shearing step encourages resuspension of HMW DNA and improves library prep yield. DNA was quantified with Qubit (Thermo Fisher Scientific, Waltham, MA) and Nanodrop (Thermo Fisher Scientific, Waltham, MA) absorption ratios were checked to ensure 260/280 was greater than 1.8 and 260/230 was greater than 2.0.

The sequencing library was prepared following the ONT Ligation Sequencing Kit (SQK-LSK109) protocol, with two important modifications. First, we started with approximately 3 μg of input DNA, three times the amount recommended by the manufacturer. Second, we utilized a form of size-selective polymer precipitation (Paithankar and Prasad, 1991) with the Circulomics Short Read Eliminator (SRE) buffer (Circulomics, Baltimore, MD) plus centrifugation to isolate DNA instead of magnetic beads. We found this to be necessary because magnetic beads irreversibly clumped with viscous HMW gDNA, decreasing library yield and limiting read lengths. The manner in which this was performed was specific to the cleanup step. After the end-prep/repair step (New England Biolabs, Ipswich, MA), the SRE buffer was used according to the manufacturer’s instructions. After adapter ligation, DNA was pelleted by centrifuging the sample at 10,000×g for 30 min without the addition of any reagents, since DNA readily precipitated upon addition of the ligation buffer. Ethanol washes were avoided past this step since ethanol will denature motor proteins in the prepared library. Instead, the DNA pellet was washed with 100 μL SFB or LFB (interchangeably) from the ligation sequencing kit instead of 70% ethanol. If library yield was sufficient (>50 ng/μL), the Circulomics SRE buffer was used for a final round of size selection, replacing the ethanol wash with LFB/SFB as described above. Of note, a cheaper and open-source alternative made with polyethylene glycol MW 8000 (PEG 8000), although less effective at size selection, to the SRE buffer is described by Tyson, 2020 (dx.doi.org/10.17504/protocols.io.7euhjew). A 1:1 dilution of the PEG 8000 solution described in that protocol can be substituted for SFB or LFB in the washing steps described above.

The typical yield of a library prepared in this manner is in the range of 1–1.5 μg. Approximately 350 ng of the prepared library was loaded for each sequencing run. To maintain flow cell throughput and read length, flow cells were flushed every 8–16 hr with the ONT Flow Cell Wash Kit (EXP-WSH003) and reloaded with a fresh library.

Obtaining short read datasets for polishing

Request a detailed protocol

We performed 2×150 bp Illumina sequencing for most of the strains that did not have publicly available short read data available. Illumina libraries were prepared from the same gDNA extractions as the Nanopore library for most samples, with some exceptions as described in Supplementary file 1. The libraries were prepared in either of two manners. For the majority of samples, sequencing libraries were prepared with a modified version of the Nextera DNA Library Kit (Illumina, San Diego, CA) protocol (Baym et al., 2015) and sequencing was performed by Admera Health on NextSeq 4000 or HiSeq 4000 machines. Alternatively, Illumina libraries were prepared with the KAPA Hyper DNA kit (Roche, Basel, Switzerland) according to the manufacturer’s protocol and sequenced at the UNC sequencing core on a HiSeq 4000 machine. In either case, all samples on a lane were uniquely dual indexed. Illumina sequencing was not performed for D. equinoxialis, D. funebris, D. subpulchrella, D. tropicalis, Le. varia, Z. lachaisei, Z. taronus, and the unidentified São Tomé mushroom feeder due to material unavailability (line extinction/culling). Details for each sample, including accession numbers for any public data used in this work, are provided in Supplementary file 1.

Choice of long read assembly program

Request a detailed protocol

Flye v2.6 (Kolmogorov et al., 2019) was used due to its quick CPU runtime, low memory requirements, excellent assembly contiguity, and its consistent performance on benchmarking datasets (Wick and Holt, 2020). We additionally validated the performance of Flye for Drosophila genomes using Nanopore data previously generated by Miller et al., 2018 and 60× depth of new Nanopore sequencing of the Berkeley Drosophila Genome Project ISO-1 strain of D. melanogaster. We assembled genomes with Flye v2.6 and Canu v1.8 (Koren et al., 2017) to evaluate simple benchmarks of assembly contiguity and run time and to provide a comparison to the Miniasm (Li, 2016) assemblies from Miller et al., 2018 Canu produced relatively contiguous assemblies, but a single assembly took several days on a 92-core cloud server and even longer when a large number of extra-long (>50kb) reads were present in the data. This was determined to be too costly when scaled to >100 species. In addition to a much shorter (8–12 hr wall-clock time) runtime, Flye also produced significantly more contiguous assemblies than those reported by Miller et al. (Figure 1—figure supplement 2). Note, several new long read assemblers have been released and these assembly programs have been significantly updated since this work was performed. Assembler performance should be evaluated with up-to-date versions in any future work.

Assembly and long read polishing

Request a detailed protocol

After Nanopore sequencing was performed, raw Nanopore data were basecalled with Guppy v3.2.4, using the high-accuracy caller (option: -c dna_r0.4.1_450bps_hac.cfg). Raw Nanopore data previously generated by Miller et al., 2018 were processed in the same manner.

Next, basecalled reads were assembled using Flye v2.6 with default settings. Genome size estimates (option: --genomeSize) were obtained through a web search or taken from a closely related species. If no such information was available, an initial estimate of 200 Mb was used. The specific genome size estimate provided to Flye (separate from the one estimated later from BUSCO coverage) is provided in Supplementary file 2.

After generating a draft assembly, we performed long read polishing using Medaka following the developer’s instructions (https://nanoporetech.github.io/medaka/draft_origin.html). Reads were aligned to the draft genome with Minimap2 v2.17 (Li, 2016) and parsed with SAMtools v1.12 (Danecek et al., 2021; Li et al., 2009) before each round of polishing (option: -ax ont). The draft was polished with two rounds of Racon v1.4.3 (Vaser et al., 2017) (options: -m8 -x 6 g 8 w 500) and then a single round of Medaka v0.9.1.

Haplotig identification and removal

Request a detailed protocol

Next, we assessed each Medaka-polished assembly for the presence of duplicated haplotypes (haplotigs) using BUSCO v3.0.2 (Simão et al., 2015; Waterhouse et al., 2018) along with the OrthoDB v9 dipteran gene set (Zdobnov et al., 2017). If the BUSCO duplication rate exceeded 1%, haplotig identification and removal was performed, but on the draft assembly produced by Flye rather than the polished assembly. Purge_haplotigs v1.1.1 (Roach et al., 2018) was run on these sequences following the guidelines provided by the developer (https://bitbucket.org/mroachawri/purge_haplotigs). Illumina reads were mapped to the draft assembly with Minimap2 (option: -ax sr) to obtain read depth information. The optional clipping step was performed to remove overlapping (duplicate) contig ends. Finally, remaining contigs were re-scaffolded with Nanopore reads using npScarf v1.9-2b (Cao et al., 2017), with support from at least four long reads required to link two contigs (option: --support=4). These sequences were polished with Racon and Medaka as described above.

Final polishing and decontamination

Request a detailed protocol

The Medaka-polished assembly was further polished with Illumina data and any contigs identified as microbial sequences were removed. Illumina reads were mapped to the draft assembly with Minimap2 (option: -ax sr) and the assembly polished with Pilon v1.23 (--fix snps,indels) (Walker et al., 2014). If a genome did not have an accompanying short read dataset but Illumina reads were available from a different strain of the same species (Supplementary file 1), Pilon was run without correcting SNVs (option: --fix indels). We found that allowing Pilon to fix gaps or local misassemblies in default mode introduced large spurious indels in regions where short reads map poorly such as tandem repeats. These variants were not supported by long reads or by comparison to a reference assembly. Thus, we chose to use Pilon to only fix base-level errors.

Assembly decontamination

Request a detailed protocol

After Pilon polishing, assembly completeness was assessed again with BUSCO v3.0.2. We used BLAST v2.10.0 (Altschul et al., 1990) to remove any contigs not associated with at least one BUSCO that were also of bacterial, protozoan, or fungal origin. Finally, any sequences flagged by the NCBI Contamination Screen were excluded or trimmed.

A flow chart outline of the full genome assembly process described here is provided in Figure 7.

Flow chart depiction of the assembly pipeline.

Alternative hybrid assembly process

Request a detailed protocol

Zaprionus indianus line 16GNV01 had insufficient Nanopore data for a Flye assembly. For this line only and to consolidate all assemblies as a single resource, the same genome assembly from Comeault et al., 2020 is both reported here and associated with the NCBI BioProject associated with this work. An alternative assembly strategy was taken for this line. Briefly, short-read sequence data was assembled first using SPAdes v3.11.1 (Bankevich et al., 2012) using default parameters. Nanopore reads were corrected with Illumina data using FMLRC v.1.0.0 (Wang et al., 2018) and subsequently used to scaffold the SPAdes assembly using LINKS v.1.8.7 (Warren et al., 2015) using the recommended iterative approach of 33 iterations with incrementally increasing k-mer distance threshold. The resulting scaffolds were polished with four rounds of Racon followed by four rounds of Pilon (but without Medaka) as described above.

Repeat annotation and masking

Request a detailed protocol

Each draft assembly was soft repeat masked with RepeatMasker v4.1.0 (Smit et al., 2013) at medium sensitivity, with both Dfam 3.1 (Hubley et al., 2016) and RepBase RepeatMasker edition (Bao et al., 2015) repeat libraries installed (options: --species Drosophila --xsmall). RepeatMasker was initialized with cross_match v1.090518 (Green, 2009) as the sequence search engine and Tandem Repeat Finder v4.0.9 (Benson, 1999).

Genome size estimation

Request a detailed protocol

Genome size was estimated with Nanopore and Illumina data separately (Supplementary file 2). To estimate genome size from Illumina reads, we used the k-mer counting approach implemented in GenomeScope v1.0.0 (Vurture et al., 2017). Briefly, we followed the developer-provided workflow (https://github.com/schatzlab/genomescope) and generated a k-mer count histogram using a k-mer size of 21 (option: -m 21) with Jellyfish v2.2.3 (Marçais and Kingsford, 2011). The histogram was passed to the genomescope.R script to estimate the haploid genome size. We found these estimates to be somewhat unreliable, particularly when we tried to estimate genome size from a non-inbred sample. Due to this issue and because some samples were missing short read data, we took additional steps to estimate genome size from long reads.

Since the higher error rate of Nanopore reads (5–15%) precludes the use of k-mer based reference-free approaches for genome size estimation, we instead used regions annotated as a single-copy BUSCO gene to estimate genome size. Our rationale was that non-duplicated complete BUSCOs in each assembly could reasonably be assumed to be true single-copy markers and serve a similar function as unique k-mers for genome size estimation. Then, genome size can be roughly estimated from the depth of coverage across single-copy BUSCOs:

genome size = (total bases in Nanopore reads)/(coverage)

To perform this estimation, Nanopore reads were aligned to the coding sequences with Minimap2, only keeping primary alignments (options: --ax map-ont --secondary=no). Read depth was computed from genomic regions annotated as a single-copy BUSCO with SAMtools. If some proportion of the genome assembly was identified as non-fly and removed during the contaminant removal step, we adjusted the genome size estimate based on the total length of removed sequence:

genome size=(total bases in reads)(1proportion of assembly removed)/(mean depth of coverage)

This assumes uniform Nanopore coverage across fly and contaminant sequence in the assembly and serves only as a rough approximation.

Assessing assembly contiguity and completeness

Request a detailed protocol

Assembly contiguity statistics were computed using a series of custom shell and R scripts. Fasta files were parsed with Bioawk v1.0 (Li, 2017) and summary statistics were computed in the standard manner with the custom scripts. Contig N50 and NG50 were computed in the standard manner, in the latter case using the long-read based estimates of genome size. In addition to these statistics, we present contiguity in terms of auN. The auN statistic (Li, 2020) is the area under an Nx curve, and can be computed by multiplying the length of each contig (Li) by the proportion of the assembled genome it accounts for (Li /∑Li), then summing these values for all i contigs:

auN=i(LiLijLj)

Contig N50 represents a single point on the Nx curve and may or may not be affected by assembly breaks, but auN is always sensitive to a break in the assembly. Therefore, auN is a fairer statistic for comparison between different versions of the same assembly.

Assembly completeness was assessed with BUSCO v4.0.6 (Seppey et al., 2019), using the OrthoDB v10 Diptera database (Kriventseva et al., 2019) (options: --m geno -l diptera_odb10 --augustus_species fly). Note, the BUSCO version used here is different from what was used during the assembly process. When this work was started, BUSCO v3 was the current version. Version 4 was released while the project was ongoing. For consistency, version three was used during the assembly process for all assemblies, but the completeness of all final assemblies was assessed with BUSCO v4.For D. equinoxialis only, BUSCO v4.1.4 was used instead of v4.0.6 due to the presence of a bug that precluded the use of earlier versions.

Computation of sample heterozygosity and sequence quality from long and short reads

Request a detailed protocol

Sample diversity was estimated by counting the number of non-reference single-nucleotide polymorphisms (SNPs) and indels, called separately from long and short reads. We mapped ONT reads to the finished genome with Minimap2 (option: -ax map-ont) then sorted the output with sambamba 0.8.0 (Tarasov et al., 2015). Variants were called with PEPPER-Margin-Deepvariant r0.4 (Shafin et al., 2021), following the developer’s Singularity container-based ‘Nanopore variant calling’ instructions (https://github.com/kishwarshafin/pepper), to generate both variant call format (VCF) and banded genomic variant call format (gVCF) files, that is, a variant call file including intervals of invariant sites. Similarly, we mapped Illumina reads to each finished genome genome with Minimap2 (option: -ax sr), sorted and removed duplicates with sambamba, then performed variant calling, with output including all invariant sites, with BCFtools v1.12 (Danecek et al., 2021; Li, 2011). For both types of variant calls, we performed additional quality filtering using BCFtools. Only sites with minimum read depth 10, site quality score 30, and (if applicable) genotype quality score 30 filters were kept. The number of callable sites was estimated by adding the number of sites and the lengths of the intervals that passed these quality filters.

Sequence quality was estimated from variant calls following the standard workflow (e.g. Koren et al., 2017; Solares et al., 2018). Error was estimated by counting the number of non-reference variants (SNPs or indels), in either heterozygous or homozygous form, then dividing this count by the number of informative bases for variant calling: Perror = (Number of variants)/(Number of callable sites). A Phred-scaled quality score (QV) was computed in the standard manner: QV = −10 * log10(Perror).

Reference-free consensus quality scores

Request a detailed protocol

Reference-free quality score estimates were computed with Merqury v1.3 (Rhie et al., 2020), following the instructions provided by the developer on the GitHub repository (https://github.com/marbl/merqury). Briefly, we used the tools included with the installation of Merqury to estimate an optimal k-mer size for each genome assembly, at a collision rate of 0.001. Then, we built a k-mer database using the Illumina reads used to polish the genome assembly. Note, in some cases, Illumina reads from a different strain were used, and polishing was only used to correct indels. Finally, we ran the main Merqury script on the assembly of interest to estimate a genome-wide Phred-scaled consensus quality score (QV).

Reference-based quality assessment

Request a detailed protocol

Reference-based quality score estimates were computed ONT Pomoxis v0.3.7 (https://github.com/nanoporetech/pomoxis) for assemblies where a well-annotated counterpart of not only the same species but the same strain was available through the NCBI RefSeq database. Gene and repeat annotations were downloaded from NCBI and coding regions, introns, intergenic regions, and repeats were parsed into BED formatted intervals with bedtools v2.30.0 (Quinlan and Hall, 2010). Introns were computed as the within-gene complement of exons, and intergenic regions were computed as the complement of genic regions. Then, we ran Pomoxis, which aligns each Nanopore-based assembly to the reference genome and assessed differences between the two genomes in 100 kb windows (option: -c 100000). Consensus quality was estimated by counting SNVs, insertions, and deletions and dividing the number of affected base pairs by the length of the alignment. This computation was performed separately for exons, introns, intergenic regions, repeats, and for the whole genome, using the genomic intervals described above.

For manual validation, we first used Pomoxis, as described above, to generate a list of all 1 bp or longer (option: -l 1) indel differences between our Nanopore-based assembly and the Release six assembly (Hoskins et al., 2015) of the D. melanogaster reference strain. The CA 8.2 MHAP version of the PacBio-based (Kim et al., 2014; Koren et al., 2017) D. melanogaster ISO-1 assembly was obtained from GenBank accession GCA_000778455.1. The iso1_onta2_quickmerge_scaffolds version from Solares et al., 2018 was downloaded from the GitHub repository associated with that project (https://github.com/danrdanny/Nanopore_ISO1). We aligned short reads, long reads, and each of the non-reference genomes to the Release six reference genome using Minimap2 (option, for short reads: -ax sr; for long reads: -ax map-ont; for genomes: -ax asm5), then sorted and parsed the output into BAM format with SAMtools. Repeat annotations for the Nanopore-based assembly were generated as described previously, then lifted over into reference coordinates. The liftover was performed with HALtools v2.1 (Hickey et al., 2013). Specifically, we aligned our Nanopore assembly to the Release six assembly with Minimap2 (options: --cx asm5 --cs long), converted the PAF alignment to MAF with Minimap2’s paftools.js program, MAF alignment to HAL with HALtools’ hal2maf program, and executed the liftover with HALtools’ halLiftover program. Alignments and genomic intervals were viewed in the Integrative Genomics Viewer v2.9.4 (Robinson et al., 2011b).

Species tree inference from BUSCO orthologs

Request a detailed protocol

We inferred species relationships using complete and single-copy orthologs identified by the BUSCO analysis. Amino acid sequences were used instead of nucleotide sequences to achieve better alignments in the face of high-sequence divergence (Bininda-Emonds, 2005). Out of 990 single-copy orthologs present in all assemblies, we randomly selected 250 to construct gene trees. The predicted protein sequence of each ortholog was aligned separately with MAFFT v7.453 (Katoh and Standley, 2013), using the E-INS-i algorithm (options: --ep 0 --genafpair --maxiterate 1000). Gene trees were inferred with RAxML-NG v0.9.0 (Kozlov et al., 2019), using the Le and Gascuel, 2008 amino acid substitution model (options: --msa-format FASTA --data-type AA --model LG). The summary method ASTRAL-MP v.5.14.7 (Yin et al., 2019) was run with default settings to reconstruct the species tree. We note that this is not intended to be a definitive phylogenetic reconstruction of species relationships; see Suvorov et al., 2021 for a time-calibrated phylogeny utilizing 158 drosophilid whole genomes.

Analysis of chromosome organization

Request a detailed protocol

Syntenic comparisons were performed by representing the genome assemblies as paths through an undirected graph. The path each genome traverses can be considered a series of connections between single copy orthologous markers (i.e. BUSCOs). Using BUSCO v4 annotations for each final genome, we constructed a 3285 by 3285 symmetric adjacency matrix, with row and column headers (nodes) corresponding to 3285 possible BUSCOs from the diptera_odb10 database. Off-diagonal entries in each matrix (edges) were the number of times two single-copy BUSCOs were found as connected and immediate neighbors in the assemblies. Sequences of three or more BUSCOs were not considered. The graph was then visualized in two dimensions using the ForceAtlas2 graph layout algorithm (Jacomy et al., 2014) as implemented in the ForceAtlas2 R package (https://github.com/analyxcompany/ForceAtlas2). While this method is primarily designed for flexible, user-friendly tuning of graph visualization, it is similar in effect to other nonlinear dimensionality reduction techniques (Böhm et al., 2020). ForceAtlas2 was run with the settings: tolerance=1, gravity=1, iterations=3000. D. equinoxialis was omitted from this analysis due to the BUSCO v4 issues mentioned previously.

Repeat content and genome size analysis

Request a detailed protocol

The contribution of repeat content to genome size variation in Drosophila was examined by comparing the number of bases in each genome annotated as a type of repeat (previously described) to the number of bases not annotated as repetitive sequence. Phylogenetic independent contrasts (Felsenstein, 1985) were computed for the counts of bases in both categories using the R package ape v5.4.1 (Paradis and Schliep, 2019) using the species tree described above with the root age set to 53 million years following the estimate in Suvorov et al., 2021.

Compute containers

Request a detailed protocol

While the overall computational demands of this work were high, the unique computational challenge we faced was the variety of computational resources used for various stages of the assembly process. Assemblies took place across local servers, institutional clusters, and cloud computing resources. A key factor in ensuring reproducibility across computing environments was the use of computing containers, which is like a lightweight virtual machine that can be customized such that sets of programs and their dependencies are packaged together. Specifically, we used the programs Docker and Singularity to manage containers. These programs allow containers to be built and packaged as an image file which is transferred to another computer. A Dockerfile, a text file containing instructions to set up an image, is used to select the Linux operating system and the suite of programs to be installed within a Docker container. Singularity is used to package the Docker container as an image file that can be transferred to and used in a cluster or cloud environment without the need for administrative permissions. Standard commands are then run inside the container environment. The files and instructions necessary to build these containers, which will allow for the exact reproduction of the computing environment in which this work was performed, are provided at: https://github.com/flyseq/drosophila_assembly_pipelines, (copy archived at swh:1:rev:4e40d28d0bdcd1bc7e4eabb7709f301df9ad7eadKim, 2021). We hope these files will facilitate the work of researchers new to Nanopore sequencing or the genome assembly process.

Data availability

All sequencing data and assemblies generated by this study are deposited at NCBI SRA and GenBank under NCBI BioProject PRJNA675888. Accession numbers for all data used but not generated by this study are provided in the supporting files. Dockerfiles and scripts for reproducing pipelines and analyses are provided on GitHub (https://github.com/flyseq/drosophila_assembly_pipelines; copy archived at https://archive.softwareheritage.org/swh:1:rev:4e40d28d0bdcd1bc7e4eabb7709f301df9ad7ead). A detailed wet lab protocol is provided at https://Protocols.io (https://doi.org/10.17504/protocols.io.bdfqi3mw).

The following data sets were generated
    1. Kim BY
    2. Wang JR
    (2020) NCBI BioProject
    ID PRJNA675888. Nanopore-based assembly of many drosophilid genomes.
The following previously published data sets were used
    1. Miller DE
    (2018) NCBI BioProject
    ID PRJNA427774. Sequencing and assembly of 14 Drosophila species.
    1. The Drosophila modENCODE Project
    (2011) NCBI BioProject
    ID 63477. modENCODE Drosophila reference genome sequencing (fruit flies).
    1. Yang H
    (2018) NCBI BioProject
    ID PRJNA484408. DNA-seq of sexed Drosophila grimshawi, Drosophila silvestris, and Drosophila heteroneura.
    1. Bronski M
    (2019) NCBI BioProject
    ID PRJNA554346. Drosophila montium Species Group Genomes Project.
    1. Rane R
    (2018) NCBI BioProject
    ID 476692. Invertebrate sample from Drosophila repleta.
    1. National Institute of Genetics [Japan]
    (2016) NCBI BioProject
    ID PRJDB4817. Genome sequences of 10 Drosophila species.
    1. Ellison C
    (2019) NCBI BioProject
    ID PRJNA550077. Raw genomic sequencing data from 16 Drosophila species.

References

    1. Adams MD
    2. Celniker SE
    3. Holt RA
    4. Evans CA
    5. Gocayne JD
    6. Amanatides PG
    7. Scherer SE
    8. Li PW
    9. Hoskins RA
    10. Galle RF
    11. George RA
    12. Lewis SE
    13. Richards S
    14. Ashburner M
    15. Henderson SN
    16. Sutton GG
    17. Wortman JR
    18. Yandell MD
    19. Zhang Q
    20. Chen LX
    21. Brandon RC
    22. Rogers YH
    23. Blazej RG
    24. Champe M
    25. Pfeiffer BD
    26. Wan KH
    27. Doyle C
    28. Baxter EG
    29. Helt G
    30. Nelson CR
    31. Gabor GL
    32. Abril JF
    33. Agbayani A
    34. An HJ
    35. Andrews-Pfannkoch C
    36. Baldwin D
    37. Ballew RM
    38. Basu A
    39. Baxendale J
    40. Bayraktaroglu L
    41. Beasley EM
    42. Beeson KY
    43. Benos PV
    44. Berman BP
    45. Bhandari D
    46. Bolshakov S
    47. Borkova D
    48. Botchan MR
    49. Bouck J
    50. Brokstein P
    51. Brottier P
    52. Burtis KC
    53. Busam DA
    54. Butler H
    55. Cadieu E
    56. Center A
    57. Chandra I
    58. Cherry JM
    59. Cawley S
    60. Dahlke C
    61. Davenport LB
    62. Davies P
    63. de Pablos B
    64. Delcher A
    65. Deng Z
    66. Mays AD
    67. Dew I
    68. Dietz SM
    69. Dodson K
    70. Doup LE
    71. Downes M
    72. Dugan-Rocha S
    73. Dunkov BC
    74. Dunn P
    75. Durbin KJ
    76. Evangelista CC
    77. Ferraz C
    78. Ferriera S
    79. Fleischmann W
    80. Fosler C
    81. Gabrielian AE
    82. Garg NS
    83. Gelbart WM
    84. Glasser K
    85. Glodek A
    86. Gong F
    87. Gorrell JH
    88. Gu Z
    89. Guan P
    90. Harris M
    91. Harris NL
    92. Harvey D
    93. Heiman TJ
    94. Hernandez JR
    95. Houck J
    96. Hostin D
    97. Houston KA
    98. Howland TJ
    99. Wei MH
    100. Ibegwam C
    101. Jalali M
    102. Kalush F
    103. Karpen GH
    104. Ke Z
    105. Kennison JA
    106. Ketchum KA
    107. Kimmel BE
    108. Kodira CD
    109. Kraft C
    110. Kravitz S
    111. Kulp D
    112. Lai Z
    113. Lasko P
    114. Lei Y
    115. Levitsky AA
    116. Li J
    117. Li Z
    118. Liang Y
    119. Lin X
    120. Liu X
    121. Mattei B
    122. McIntosh TC
    123. McLeod MP
    124. McPherson D
    125. Merkulov G
    126. Milshina NV
    127. Mobarry C
    128. Morris J
    129. Moshrefi A
    130. Mount SM
    131. Moy M
    132. Murphy B
    133. Murphy L
    134. Muzny DM
    135. Nelson DL
    136. Nelson DR
    137. Nelson KA
    138. Nixon K
    139. Nusskern DR
    140. Pacleb JM
    141. Palazzolo M
    142. Pittman GS
    143. Pan S
    144. Pollard J
    145. Puri V
    146. Reese MG
    147. Reinert K
    148. Remington K
    149. Saunders RD
    150. Scheeler F
    151. Shen H
    152. Shue BC
    153. Sidén-Kiamos I
    154. Simpson M
    155. Skupski MP
    156. Smith T
    157. Spier E
    158. Spradling AC
    159. Stapleton M
    160. Strong R
    161. Sun E
    162. Svirskas R
    163. Tector C
    164. Turner R
    165. Venter E
    166. Wang AH
    167. Wang X
    168. Wang ZY
    169. Wassarman DA
    170. Weinstock GM
    171. Weissenbach J
    172. Williams SM
    173. Woodage T
    174. Worley KC
    175. Wu D
    176. Yang S
    177. Yao QA
    178. Ye J
    179. Yeh RF
    180. Zaveri JS
    181. Zhan M
    182. Zhang G
    183. Zhao Q
    184. Zheng L
    185. Zheng XH
    186. Zhong FN
    187. Zhong W
    188. Zhou X
    189. Zhu S
    190. Zhu X
    191. Smith HO
    192. Gibbs RA
    193. Myers EW
    194. Rubin GM
    195. Venter JC
    (2000) The genome sequence of Drosophila melanogaster
    Science 287:2185–2195.
    https://doi.org/10.1126/science.287.5461.2185
    1. Clark AG
    2. Eisen MB
    3. Smith DR
    4. Bergman CM
    5. Oliver B
    6. Markow TA
    7. Kaufman TC
    8. Kellis M
    9. Gelbart W
    10. Iyer VN
    11. Pollard DA
    12. Sackton TB
    13. Larracuente AM
    14. Singh ND
    15. Abad JP
    16. Abt DN
    17. Adryan B
    18. Aguade M
    19. Akashi H
    20. Anderson WW
    21. Aquadro CF
    22. Ardell DH
    23. Arguello R
    24. Artieri CG
    25. Barbash DA
    26. Barker D
    27. Barsanti P
    28. Batterham P
    29. Batzoglou S
    30. Begun D
    31. Bhutkar A
    32. Blanco E
    33. Bosak SA
    34. Bradley RK
    35. Brand AD
    36. Brent MR
    37. Brooks AN
    38. Brown RH
    39. Butlin RK
    40. Caggese C
    41. Calvi BR
    42. Bernardo de Carvalho A
    43. Caspi A
    44. Castrezana S
    45. Celniker SE
    46. Chang JL
    47. Chapple C
    48. Chatterji S
    49. Chinwalla A
    50. Civetta A
    51. Clifton SW
    52. Comeron JM
    53. Costello JC
    54. Coyne JA
    55. Daub J
    56. David RG
    57. Delcher AL
    58. Delehaunty K
    59. Do CB
    60. Ebling H
    61. Edwards K
    62. Eickbush T
    63. Evans JD
    64. Filipski A
    65. Findeiss S
    66. Freyhult E
    67. Fulton L
    68. Fulton R
    69. Garcia AC
    70. Gardiner A
    71. Garfield DA
    72. Garvin BE
    73. Gibson G
    74. Gilbert D
    75. Gnerre S
    76. Godfrey J
    77. Good R
    78. Gotea V
    79. Gravely B
    80. Greenberg AJ
    81. Griffiths-Jones S
    82. Gross S
    83. Guigo R
    84. Gustafson EA
    85. Haerty W
    86. Hahn MW
    87. Halligan DL
    88. Halpern AL
    89. Halter GM
    90. Han MV
    91. Heger A
    92. Hillier L
    93. Hinrichs AS
    94. Holmes I
    95. Hoskins RA
    96. Hubisz MJ
    97. Hultmark D
    98. Huntley MA
    99. Jaffe DB
    100. Jagadeeshan S
    101. Jeck WR
    102. Johnson J
    103. Jones CD
    104. Jordan WC
    105. Karpen GH
    106. Kataoka E
    107. Keightley PD
    108. Kheradpour P
    109. Kirkness EF
    110. Koerich LB
    111. Kristiansen K
    112. Kudrna D
    113. Kulathinal RJ
    114. Kumar S
    115. Kwok R
    116. Lander E
    117. Langley CH
    118. Lapoint R
    119. Lazzaro BP
    120. Lee SJ
    121. Levesque L
    122. Li R
    123. Lin CF
    124. Lin MF
    125. Lindblad-Toh K
    126. Llopart A
    127. Long M
    128. Low L
    129. Lozovsky E
    130. Lu J
    131. Luo M
    132. Machado CA
    133. Makalowski W
    134. Marzo M
    135. Matsuda M
    136. Matzkin L
    137. McAllister B
    138. McBride CS
    139. McKernan B
    140. McKernan K
    141. Mendez-Lago M
    142. Minx P
    143. Mollenhauer MU
    144. Montooth K
    145. Mount SM
    146. Mu X
    147. Myers E
    148. Negre B
    149. Newfeld S
    150. Nielsen R
    151. Noor MA
    152. O'Grady P
    153. Pachter L
    154. Papaceit M
    155. Parisi MJ
    156. Parisi M
    157. Parts L
    158. Pedersen JS
    159. Pesole G
    160. Phillippy AM
    161. Ponting CP
    162. Pop M
    163. Porcelli D
    164. Powell JR
    165. Prohaska S
    166. Pruitt K
    167. Puig M
    168. Quesneville H
    169. Ram KR
    170. Rand D
    171. Rasmussen MD
    172. Reed LK
    173. Reenan R
    174. Reily A
    175. Remington KA
    176. Rieger TT
    177. Ritchie MG
    178. Robin C
    179. Rogers YH
    180. Rohde C
    181. Rozas J
    182. Rubenfield MJ
    183. Ruiz A
    184. Russo S
    185. Salzberg SL
    186. Sanchez-Gracia A
    187. Saranga DJ
    188. Sato H
    189. Schaeffer SW
    190. Schatz MC
    191. Schlenke T
    192. Schwartz R
    193. Segarra C
    194. Singh RS
    195. Sirot L
    196. Sirota M
    197. Sisneros NB
    198. Smith CD
    199. Smith TF
    200. Spieth J
    201. Stage DE
    202. Stark A
    203. Stephan W
    204. Strausberg RL
    205. Strempel S
    206. Sturgill D
    207. Sutton G
    208. Sutton GG
    209. Tao W
    210. Teichmann S
    211. Tobari YN
    212. Tomimura Y
    213. Tsolas JM
    214. Valente VL
    215. Venter E
    216. Venter JC
    217. Vicario S
    218. Vieira FG
    219. Vilella AJ
    220. Villasante A
    221. Walenz B
    222. Wang J
    223. Wasserman M
    224. Watts T
    225. Wilson D
    226. Wilson RK
    227. Wing RA
    228. Wolfner MF
    229. Wong A
    230. Wong GK
    231. Wu CI
    232. Wu G
    233. Yamamoto D
    234. Yang HP
    235. Yang SP
    236. Yorke JA
    237. Yoshida K
    238. Zdobnov E
    239. Zhang P
    240. Zhang Y
    241. Zimin AV
    242. Baldwin J
    243. Abdouelleil A
    244. Abdulkadir J
    245. Abebe A
    246. Abera B
    247. Abreu J
    248. Acer SC
    249. Aftuck L
    250. Alexander A
    251. An P
    252. Anderson E
    253. Anderson S
    254. Arachi H
    255. Azer M
    256. Bachantsang P
    257. Barry A
    258. Bayul T
    259. Berlin A
    260. Bessette D
    261. Bloom T
    262. Blye J
    263. Boguslavskiy L
    264. Bonnet C
    265. Boukhgalter B
    266. Bourzgui I
    267. Brown A
    268. Cahill P
    269. Channer S
    270. Cheshatsang Y
    271. Chuda L
    272. Citroen M
    273. Collymore A
    274. Cooke P
    275. Costello M
    276. D'Aco K
    277. Daza R
    278. De Haan G
    279. DeGray S
    280. DeMaso C
    281. Dhargay N
    282. Dooley K
    283. Dooley E
    284. Doricent M
    285. Dorje P
    286. Dorjee K
    287. Dupes A
    288. Elong R
    289. Falk J
    290. Farina A
    291. Faro S
    292. Ferguson D
    293. Fisher S
    294. Foley CD
    295. Franke A
    296. Friedrich D
    297. Gadbois L
    298. Gearin G
    299. Gearin CR
    300. Giannoukos G
    301. Goode T
    302. Graham J
    303. Grandbois E
    304. Grewal S
    305. Gyaltsen K
    306. Hafez N
    307. Hagos B
    308. Hall J
    309. Henson C
    310. Hollinger A
    311. Honan T
    312. Huard MD
    313. Hughes L
    314. Hurhula B
    315. Husby ME
    316. Kamat A
    317. Kanga B
    318. Kashin S
    319. Khazanovich D
    320. Kisner P
    321. Lance K
    322. Lara M
    323. Lee W
    324. Lennon N
    325. Letendre F
    326. LeVine R
    327. Lipovsky A
    328. Liu X
    329. Liu J
    330. Liu S
    331. Lokyitsang T
    332. Lokyitsang Y
    333. Lubonja R
    334. Lui A
    335. MacDonald P
    336. Magnisalis V
    337. Maru K
    338. Matthews C
    339. McCusker W
    340. McDonough S
    341. Mehta T
    342. Meldrim J
    343. Meneus L
    344. Mihai O
    345. Mihalev A
    346. Mihova T
    347. Mittelman R
    348. Mlenga V
    349. Montmayeur A
    350. Mulrain L
    351. Navidi A
    352. Naylor J
    353. Negash T
    354. Nguyen T
    355. Nguyen N
    356. Nicol R
    357. Norbu C
    358. Norbu N
    359. Novod N
    360. O'Neill B
    361. Osman S
    362. Markiewicz E
    363. Oyono OL
    364. Patti C
    365. Phunkhang P
    366. Pierre F
    367. Priest M
    368. Raghuraman S
    369. Rege F
    370. Reyes R
    371. Rise C
    372. Rogov P
    373. Ross K
    374. Ryan E
    375. Settipalli S
    376. Shea T
    377. Sherpa N
    378. Shi L
    379. Shih D
    380. Sparrow T
    381. Spaulding J
    382. Stalker J
    383. Stange-Thomann N
    384. Stavropoulos S
    385. Stone C
    386. Strader C
    387. Tesfaye S
    388. Thomson T
    389. Thoulutsang Y
    390. Thoulutsang D
    391. Topham K
    392. Topping I
    393. Tsamla T
    394. Vassiliev H
    395. Vo A
    396. Wangchuk T
    397. Wangdi T
    398. Weiand M
    399. Wilkinson J
    400. Wilson A
    401. Yadav S
    402. Young G
    403. Yu Q
    404. Zembek L
    405. Zhong D
    406. Zimmer A
    407. Zwirko Z
    408. Jaffe DB
    409. Alvarez P
    410. Brockman W
    411. Butler J
    412. Chin C
    413. Gnerre S
    414. Grabherr M
    415. Kleber M
    416. Mauceli E
    417. MacCallum I
    418. Drosophila 12 Genomes Consortium
    (2007) Evolution of genes and genomes on the Drosophila phylogeny
    Nature 450:203–218.
    https://doi.org/10.1038/nature06341
    1. Felsenstein J
    (1985)
    Phylogenies and the Comparative Method
    The American Naturalist 125:1–15.
    1. Feng S
    2. Stiller J
    3. Deng Y
    4. Armstrong J
    5. Fang Q
    6. Reeve AH
    7. Xie D
    8. Chen G
    9. Guo C
    10. Faircloth BC
    11. Petersen B
    12. Wang Z
    13. Zhou Q
    14. Diekhans M
    15. Chen W
    16. Andreu-Sánchez S
    17. Margaryan A
    18. Howard JT
    19. Parent C
    20. Pacheco G
    21. Sinding MS
    22. Puetz L
    23. Cavill E
    24. Ribeiro ÂM
    25. Eckhart L
    26. Fjeldså J
    27. Hosner PA
    28. Brumfield RT
    29. Christidis L
    30. Bertelsen MF
    31. Sicheritz-Ponten T
    32. Tietze DT
    33. Robertson BC
    34. Song G
    35. Borgia G
    36. Claramunt S
    37. Lovette IJ
    38. Cowen SJ
    39. Njoroge P
    40. Dumbacher JP
    41. Ryder OA
    42. Fuchs J
    43. Bunce M
    44. Burt DW
    45. Cracraft J
    46. Meng G
    47. Hackett SJ
    48. Ryan PG
    49. Jønsson KA
    50. Jamieson IG
    51. da Fonseca RR
    52. Braun EL
    53. Houde P
    54. Mirarab S
    55. Suh A
    56. Hansson B
    57. Ponnikas S
    58. Sigeman H
    59. Stervander M
    60. Frandsen PB
    61. van der Zwan H
    62. van der Sluis R
    63. Visser C
    64. Balakrishnan CN
    65. Clark AG
    66. Fitzpatrick JW
    67. Bowman R
    68. Chen N
    69. Cloutier A
    70. Sackton TB
    71. Edwards SV
    72. Foote DJ
    73. Shakya SB
    74. Sheldon FH
    75. Vignal A
    76. Soares AER
    77. Shapiro B
    78. González-Solís J
    79. Ferrer-Obiol J
    80. Rozas J
    81. Riutort M
    82. Tigano A
    83. Friesen V
    84. Dalén L
    85. Urrutia AO
    86. Székely T
    87. Liu Y
    88. Campana MG
    89. Corvelo A
    90. Fleischer RC
    91. Rutherford KM
    92. Gemmell NJ
    93. Dussex N
    94. Mouritsen H
    95. Thiele N
    96. Delmore K
    97. Liedvogel M
    98. Franke A
    99. Hoeppner MP
    100. Krone O
    101. Fudickar AM
    102. Milá B
    103. Ketterson ED
    104. Fidler AE
    105. Friis G
    106. Parody-Merino ÁM
    107. Battley PF
    108. Cox MP
    109. Lima NCB
    110. Prosdocimi F
    111. Parchman TL
    112. Schlinger BA
    113. Loiselle BA
    114. Blake JG
    115. Lim HC
    116. Day LB
    117. Fuxjager MJ
    118. Baldwin MW
    119. Braun MJ
    120. Wirthlin M
    121. Dikow RB
    122. Ryder TB
    123. Camenisch G
    124. Keller LF
    125. DaCosta JM
    126. Hauber ME
    127. Louder MIM
    128. Witt CC
    129. McGuire JA
    130. Mudge J
    131. Megna LC
    132. Carling MD
    133. Wang B
    134. Taylor SA
    135. Del-Rio G
    136. Aleixo A
    137. Vasconcelos ATR
    138. Mello CV
    139. Weir JT
    140. Haussler D
    141. Li Q
    142. Yang H
    143. Wang J
    144. Lei F
    145. Rahbek C
    146. Gilbert MTP
    147. Graves GR
    148. Jarvis ED
    149. Paten B
    150. Zhang G
    (2020) Dense sampling of bird diversity increases power of comparative genomics
    Nature 587:252–257.
    https://doi.org/10.1038/s41586-020-2873-9
  1. Software
    1. Green P
    (2009) Phrap
    Phrap.
  2. Software
    1. Kim BY
    (2021)
    Drosophila genome assembly paper workflows, version swh:1:rev:4e40d28d0bdcd1bc7e4eabb7709f301df9ad7ead
    Software Heritage, https://archive.softwareheritage.org/swh:1:rev:4e40d28d0bdcd1bc7e4eabb7709f301df9ad7ead.
  3. Software
    1. Li H
    (2017) Bioawk
    Bioawk.
  4. Book
    1. Seppey M
    2. Manni M
    3. Zdobnov EM
    (2019) BUSCO: Assessing Genome Assembly and Annotation Completeness
    In: Kollmar M, editors. Gene Prediction: Methods and Protocols. Springer. pp. 227–245.
    https://doi.org/10.1007/978-1-4939-9173-0_14
  5. Software
    1. Tyson J
    (2020) Bead-Free Long Fragment LSK109 Library Preparation
    Bead-Free Long Fragment LSK109 Library Preparation.

Decision letter

  1. Graham Coop
    Reviewing Editor; University of California, Davis, United States
  2. Patricia J Wittkopp
    Senior Editor; University of Michigan, United States
  3. Timothy B Sackton
    Reviewer; Harvard University, United States

Our editorial process produces two outputs: i) public reviews designed to be posted alongside the preprint for the benefit of readers; ii) feedback on the manuscript for the authors, including requests for revisions, shown below. We also include an acceptance summary that explains what the editors found interesting or important about the work.

Acceptance summary:

Drosophila species have long served as an important model system for genetics and genomics. The authors have developed an important community resource of high standard genomes for many species across the Drosophila clade. This resource will serve to empower the next generation of Drosophila research and provides an important road map for similar efforts in other groups of organisms.

Decision letter after peer review:

Thank you for submitting your article "Highly contiguous assemblies of 101 drosophilid genomes" for consideration by eLife. Your article has been reviewed by 3 peer reviewers, and the evaluation has been overseen by a Reviewing Editor and Patricia Wittkopp as the Senior Editor. The following individual involved in review of your submission has agreed to reveal their identity: Timothy B Sackton (Reviewer #2).

The reviewers have discussed their reviews with one another, and the Reviewing Editor has drafted this to help you prepare a revised submission.

Essential revisions:

The reviewers and we appreciated the impressive community resource that has been carefully brought together and that this acts as an important road map for other clade-wide sequencing efforts.

The strong points of consensus that emerge across the reviews, and in the discussion with the reviewers, were:

1) The need for base accuracy to be incorporated into the quality metrics discussed.

2) The need to place the work into a broad context of large-scale genome efforts to expand the readership outside of Drosophila researchers and show the expanding wave of community efforts focused on high-quality reference genomes. Suggestions included the vertebrate genome project and the earth biogenome project.

3) The reviewers all noted that acknowledgement of previous work in Drosophilidae was lacking. Drosophila researches have clearly been leading this charge for a while and it would be good to place the current effort into prior work more thoroughly.

4) While annotation of these genomes is beyond the scope of the current paper, it would be good to include more of a discussion of the planned road map for annotation going forward.

These points all should be addressed in a revised version. Below we have copied each review in its entirety. The reviewers have read each other's reviews and agree that all of the points are reasonable. Please provide a point by point response to the reviews, with a focus on the above broad goals.

Reviewer #1 (Recommendations for the authors):

– Data availability. I found all (or at least most) of the raw sequencing data. I couldn't find assembly accessions.

– Guidance for future sequencing work

A back of the envelope calculation based on typical core prices for PacBio Sequel II CLR sequencing and Illumina Novaseq PE150 reads suggest a price total price of $14.50 / GB (< $2,175 / 150GB) and $9 / GB (< $6,750 / 750 GB) respectively. To sample a 150MB genome at 100-fold depth (15GB) for both platforms would seem to imply a cost of ~$350 (comparable to the price cited by the authors) or less. Accepting lower depth would lead to concomittant decreases in price / genome. The average sequencing depth for this project appears from Table S1 to be 9.8GB for long reads and 12.8GB for short reads, implying that the PacBio approach might be a bit cheaper. I don't think a detailed price analysis is necessary (or even advisable), but communicating the fact that the authors' approach is one of at least two more or less equally viable approaches would be both valuable and accurate.

– Quality control metrics: sequencing error and sample polymorphism

A description of the consensus error rate of the assemblies would be an important piece of documentation serving two purposes. It permits users to quantify the amount of error they might expect from this particular resource. Relatedly, since many of these samples are conducted on strains for which near isogenic samples are difficult to acquire, measuring and reporting the heterozygosity would help guide users as to the extent of this property of the material from which the assembly was derived, especially if such users intend to use the reference strains described here for conducting genetic work.

– Context via comparison to existing resources

Existing resources for highly contiguous assemblies (operatively, contig N50 > 1MB)

– The total number of existing assemblies is already at least 75 with N50 >= 1MB (57 from NCBI[1], 13 from Miller[2], 4 from Comeault[3], 1 from Rezvykh[4]).

– Obviously, this includes a lot of within-species samples (especially for D. melanogaster, including many assemblies of the reference strain) and many assemblies that have been subsequently improved by the authors. However, the authors are also sampling the same species multiple times (including the D. melanogaster reference strain) to their total count, so this is at least consistent with their counting.

– Importantly, this resource triples unique species possessing highly contiguous assemblies from 34 to 102 and expands species group representation from 8 to 15. Although, the quinaria group, represented by D. innubila, already existed and wasn't sampled again here, they re-sample 7 of the previous 8 species groups and add 7 more.

– As far as I can tell, until now, the most distant relative of D. melanogaster within Drosophilidae with a highly contiguous genome was Scaptodrosophila lebanonensis, which is in the tribe Colocasiomyini. The manuscript adds two additional members of this tribe, Leucophenga varia and Chymomyza costata. So, the number of distant relatives is tripled, but the actual phylogenetic breadth isn't (if my understanding of Drosophilid taxonomy is correct, and it may not be)

Scholarship

The $1,000 Nanopore genome was cited by both [2] and [5].

"Future work to improve biological and taxonomic diversity, particularly for species difficult to culture, should employ single fly sequencing and assembly workflows (Adams et al., 2020)."

An earlier long read precedent can be found in [6].

In addition to enumerating other high-quality Drosophila genomes that already exist, it would be extremely useful to users to see a comparison of the quality of the resources when they have published descriptions (in order to guide authors as to what types of contiguity, completeness, and error they can expect, especially in comparison to this work). At the least, I think the authors should put their work in the context of best previous assemblies for each species (an operative definition of contig N50 > 1 Mb seems consistent with their own thinking), particularly when that work has been published. To be consistent with their own accounting, they might even consider addressing works like [7], and species that have experienced extensive high quality sequencing effort like D. obscura, D. simulans, and D. pseudoobscura. I have attached a table with entries corresponding to every species in Table S2 as well as additional species exceeding contig N50 of 1Mb, including citations and NCBI accession numbers when they could be found.

The authors have actually cited many of these assemblies in other work [8], including genomes that have yet to be published, so it would seem that the authors are aware of them and trust the quality enough to incorporate into their own work, so improving the scholarship should be straightforward.

Typo

– In the abstract. There are 93 species represented, not 95.

Reviewer #2 (Recommendations for the authors):

While this manuscript presents a large amount of valuable new data, and is inherently important for that reason alone, I believe that some key improvements and additional analyses could greatly strengthen this manuscript and really improve the value to the community.

1) Improvements to quality metrics.

This paper reports genomes that are at the low-cost end of the cost/quality tradeoff in genome assembly. This is a extremely valuable contribution, because many other large-scale projects in genomics (most notably, the Vertebrate Genome Project) have focused on the other end of this spectrum. Yet, for many researchers, a low-cost way to produce 10 genomes from related species may be higher value than a "complete" assembly from one species. However, it remains somewhat unclear exactly how good these genomes are, beyond the observation that the gene space is largely complete, and contig N50s are generally high.

Therefore, I think the biggest and most important improvement that would increase the reach and usefulness of this manuscript is improvements to quality metrics. A recent preprint from the Vertebrate Genome Project team (Rhie et al., 2020; https://www.biorxiv.org/content/10.1101/2020.05.22.110833v1.full) provides a number of potentially useful quality metrics that may be worth considering applying here, although of course not all will be relevant to this project, and I realize the computational burden of trying to do everything could be large. Nonetheless, I think it is crucial to be able to give some sense of consensus quality, as base-level errors in assemblies has negative effects on many downstream applications. Based on Koren et al., 2019, there appears to be a large drop in likelihood of disrupting a gene due to a indel error between QV30 and QV40, and Rhie et al., 2020 has a lot more detail on various aspects of consensus quality metrics. I realize that many existing tools, e.g. Merqury (also Rhie et al., 2020, in Genome Biology) make the assumption that Illumina data is available for the same individual as the genome assembly, which is not universally true here (even in approximation, e.g. treating a strain as an individual). Still, some attempt to tackle this problem seems necessary, even if it cannot be done perfectly.

2) Assembly content

Related to the first point, basic descriptions of genome size (e.g., estimated from k-mers) would help to contextualize the resource produced, as would a definition of "near chromosome level contiguity" and validation of which of the newly reported assemblies here reach that threshold (especially as contig N50s vary by several orders of magnitude). Again, I don't think the VGP definitions are the only possible way to approach this question, but there is value in having some systematic summary of overall contiguity.

This paper does not describe the extent to which heterogametic sex chromosomes (not expected in all species based on sampling) or mitochondrial genomes are recovered. Presumably at least the presence of mt scaffolds is picked up in the NCBI screens, and is the kind of information that would be relatively straightforward to add to a table.

3) Drosophila focus.

There is a tension throughout this manuscript between describing a basically Drosophila-specific resource, and describing a more generally applicable approach to low-cost, clade-wide assembly. I think that the latter is really necessary and important, since not every community or group has the resources (in money, samples, compute) or desire (for their scientific questions) to use a "VGP-style" approach (with long reads, short reads, HiC, and optical mapping to produce as close to error-free chromosomal assemblies as possible). But the value of this manuscript as a blueprint for low-cost community genomics is somewhat limited by the Drosophila-centric nature of the results.

The most obvious Drosophila-specific assumptions are the availability of a inbred strain, and a genome size of 100-250 Mb or so, with a few exceptions. Notably, the assemblies of the larger genomes (and the ones derived from wild-caught flies) tend to be worse, with lower contig N50s and auN metrics, more contigs, and more fragmented or missing BUSCOs.

Of course it would be well beyond the scope of this manuscript to attempt to validate any of these approaches in other clades, or provide a simple recipe for how to assembly any possible genome. Nonetheless, it would certainly be possible to broaden the discussion, and be clearer in the text when certain statements are Drosophila specific (e.g., the $350 in sequencing costs assumes a genome on the order of 100-200 Mb).

4) Limitations of the existing resource and future prospects for improvement:

The genomes presented here do not include annotations, or any other form of supplemental resource such as whole genome alignments (e.g. Armstrong et al., 2020), and don't use HiC or any kind of scaffolding to obtain true chromosomal scaffolds. I think these are understandable and defensible choices, given the computational and technical requirements to extend this work in those directions. However, it may be valuable to discuss more explicitly what this resource is and is not. At a minimum, doing so could prevent the corresponding authors from receiving many emails asking where the gene annotations for species X are once this work is published.

Reviewer #3 (Recommendations for the authors):

This is really an impressive resource and has the potential to be widely used both in terms of the data itself and also the methodology. I have several suggestions that may improve the manuscript.

There are several species (willistoni, paulistorum, etc.) that are sequenced more than once without any reference to why. It might be useful to describe that somewhere (Table 1?). As a resource, it would be simpler to use if it was clear when one isolate per species vs. multiple were appropriate to use for analyses.

Lines 59-61 – a bit more detail about modifications here since you are writing methods last.

Lines 107-111 – I'm concerned about the conclusions drawn about repeat content based upon the way the data was analyzed. A comprehensive analysis of repeat content is probably beyond the scope of this manuscript, but without de novo characterization of repeat sequences, I'm worried that satellites and TEs in more distantly related species may be missed if they are lineage restricted. I believe TRF would get at this to a point, but maybe not robustly. Other software like RepeatModeler might be better. However, my suggestion is not that these genomes are individually de novo annotated for repeats. It is just that the conclusions about relationships between contiguity or genome size and repeat content are presented with more caveats.

Figure 2 – This figure is a bit difficult to intuit, so this is another place where more detail in the main text would be useful.

Figure 3 – though the authors argue that this tree is not meant as a robust measure of phylogenetic relationships, it would be nice to put some support values on the tree.

Line 336 – I think the bioawk people would appreciate a citation (though all I can find on the internet is to cite the github page).

Line 340 – Please describe the auN statistic (what is L and what are we summing over?) in more detail. The Li github page describes nicely.

https://doi.org/10.7554/eLife.66405.sa1

Author response

Essential revisions:

The reviewers and we appreciated the impressive community resource that has been carefully brought together and that this acts as an important road map for other clade-wide sequencing efforts.

The strong points of consensus that emerge across the reviews, and in the discussion with the reviewers, were:

1) The need for base accuracy to be incorporated into the quality metrics discussed.

The revised manuscript now contains several new sections with evaluations of sample diversity and base quality. In the text, see the new “Estimates of sample diversity”, “Estimates of sequence quality”, and “Nanopore-based assemblies are highly accurate in coding regions” sections in the Results and Discussion and their accompanying sections in the Methods. The figures associated with these new sections are: Figure 2 along with Figure 2—figure supplement 1, Figure 3 along with Figure 3—figure supplements 1-2, and Supplementary Files 3-5. As with the rest of the manuscript, sample scripts and any raw data underlying the figures is provided in the GitHub repository. We will briefly summarize our new results below.

As the reviewers recognized, a big challenge in conducting quality evaluation properly was the variation in usable data for each assembly. As a brief summary, Illumina data were generated for most strains, downloaded from NCBI for other strains, and for a small number of strains we did not have Illumina reads (due to strain extinction/culling in 2020). A limited number of these genomes had reference assemblies of the same strain. This limited the usefulness of both reference-free and reference-based methods to specific assemblies, for example, we could not effectively utilize Merqury on genomes without short reads. Because of this, we chose to utilize several complementary strategies to evaluate both sample diversity and accuracy using long reads, short reads, and reference assemblies so that: (1) each genome had at least one associated measurement of sample diversity and quality and (2) a broad enough evaluation was performed with each method such that readers could infer a sense of quality even if some data were missing.

Briefly, we first performed variant calling using long reads and short reads separately, to both get a sense of the heterozygosity in each line and to get a count of non-reference variants to estimate the error rate. The number of non-reference variants divided by the number of callable sites was used as a proxy for the error rate similar to how error rates were estimated by Solares et al., (2018) and Koren et al., (2017). Reference-free estimates of sequence quality were computed with Merqury, using any short read data used for the assembly (when available). We find that overall, Nanopore-based assemblies do not seem to meet the often-cited QV40 “reference quality” threshold, but this is not unexpected as the effectiveness of short-read mapping and polishing will vary greatly across the genome.

We therefore followed up on genome-wide quality assessments with reference-based comparisons for the few genomes where we assembled the same line as the NCBI reference genomes. These comparisons allowed us to directly assess how our assemblies differed from the reference genomes in specific genomic regions (CDS, introns, repeats, etc.).

These reference-based comparisons showed our assemblies to indeed be highly accurate in coding regions, with a few important subtleties to note. At first glance, the mean consensus quality for coding sequences is concerningly low, ranging between QV20-QV30 even for D. melanogaster. However, when assessing quality in genomic windows or gene by gene, it quickly becomes clear that most coding sequences in the Nanopore assemblies are identical to the reference genomes.

After seeing this pattern, we thought to perform a manual assessment of every indel difference between our D. melanogaster genome and the reference, bringing in other long-read assemblies of the reference strain to help determine whether these large indels might be real or artifacts of our assembly. We found that nearly all (99.7%) protein-coding genes in our D. melanogaster genome were identical to the reference in coding sequences, and out of the total coding sequence differences between the two genomes, 99% of “erroneous” bases were explained by large indels that were also supported by the other long-read assemblies.

In the “Next Steps” section, we point out these issues and recommend that future work add lower-coverage high-fidelity sequencing to improve genome assemblies outside of complex coding sequences.

2) The need to place the work into a broad context of large-scale genome efforts to expand the readership outside of Drosophila researchers and show the expanding wave of community efforts focused on high-quality reference genomes. Suggestions included the vertebrate genome project and the earth biogenome project.

We have carefully re-examined our manuscript based on the reviewer comments and agree that too many relevant background details were omitted for the sake of brevity. Significant changes have been applied to the Introduction and Results and Discussion to address this shortcoming.

The Introduction is now started by citing the growing wave of large-scale genome assembly projects, to show the timeliness of the work. We also clarify why a cheap approach to high-quality genome assembly is valuable: time and cost are still big limitations to generating clade-scale genome datasets if one is not a part of a well-funded consortium. Ultra-long reads play a key role in constructing high-quality genomes at low cost. We hope this will provide readers with a better sense of the timeliness and significance of this work.

In addition, we have also revised the Next Steps section with better descriptions of forthcoming additions to this resource (annotations and whole-genome alignments). Many of the other large genome assembly projects are building (or have built) similar resources; we hope to both let the reader know these resources are forthcoming but also that we are utilizing modern tools to do so. Specifics are provided below (Point 4).

3) The reviewers all noted that acknowledgement of previous work in Drosophilidae was lacking. Drosophila researches have clearly been leading this charge for a while and it would be good to place the current effort into prior work more thoroughly.

This is an important point and we certainly did not mean to downplay the important role that Drosophilia labs have played in modern genomics. We have added both historical context and more background on the current state of Drosophila genomes to address this shortcoming. The revised Introduction emphasizes the important roles of D. melanogaster, the 12 Drosophila genomes, and the modENCODE project. The quantity and quality of drosophilid genome assemblies that were available prior to this work are more clearly discussed.

Our original plan for adding background on existing drosophilid genome assemblies was to include a new supplementary table listing the genome assemblies currently available on NCBI alongside some quality comparisons. However, a recently released (in the time since this manuscript was submitted) review of arthropod genome assemblies by Hotaling et al., (2021) provides exactly this: lists of genomes, accessions, the technology used to build the assembly, and contiguity/completeness statistics. Seventy-six drosophilid genomes for 75 species (two D. pseudoobscura subspecies) are listed in a file provided at the following link:

https://github.com/pbfrandsen/insect_genome_assemblies

Instead of essentially duplicating this table for this manuscript, we decided to point the reader to Hotaling et al. We compare the contiguity and completeness of our genomes to the other drosophilid assemblies and show that, while not fully chromosome-level, our genomes are comparable to many of the best reference genomes (Figure 1—figure supplement 1).

4) While annotation of these genomes is beyond the scope of the current paper, it would be good to include more of a discussion of the planned road map for annotation going forward.

Related to the last point from Point 2, we have revised the “Next Steps” section to clearly state what kinds of resources we plan to release in the near future.

Briefly, our plan follows the well-established workflows of the UCSC Comparative Genomics toolkits, similar to some recently published genome consortium studies (e.g. Genereux et al., 2020 and Feng et al., 2020). We have a whole-genome alignment with ~170 genomes, built with the Progressive Cactus aligner, in hand and plan to release it shortly. The alignment will greatly facilitate other kinds of analyses. Using the HALtools and PHAST software, we have prepared pipelines for generating PhastCons, PhyloP, and GERP tracks. We have also set up LiftOff (same species) and Comparative Annotation Toolkit (different species) based pipelines as a first pass at gene annotation. In the select cases where clades do not contain a well-annotated representative (e.g. Zaprionus), we plan to generate RNA-seq data for gene annotation.

While this annotation strategy was chosen because we do not have the centralization, personnel, or funding of the large genome consortia, the plan is to make these resources easily accessible for anyone wishing to utilize them or to perform their own iterative improvements such as annotation for specific taxa. Like this study, we will release these resources along with carefully documented workflows.

These points all should be addressed in a revised version. Below we have copied each review in its entirety. The reviewers have read each other's reviews and agree that all of the points are reasonable. Please provide a point by point response to the reviews, with a focus on the above broad goals.

Reviewer #1 (Recommendations for the authors):

– Data availability. I found all (or at least most) of the raw sequencing data. I couldn't find assembly accessions.

The assemblies should now be available on NCBI. The assemblies were submitted to NCBI before the initial draft of this paper was submitted. We apologize for the delay; it seems COVID has significantly delayed the genome submission process.

– Guidance for future sequencing work

A back of the envelope calculation based on typical core prices for PacBio Sequel II CLR sequencing and Illumina Novaseq PE150 reads suggest a price total price of $14.50 / GB (< $2,175 / 150GB) and $9 / GB (< $6,750 / 750 GB) respectively. To sample a 150MB genome at 100-fold depth (15GB) for both platforms would seem to imply a cost of ~$350 (comparable to the price cited by the authors) or less. Accepting lower depth would lead to concomitant decreases in price / genome. The average sequencing depth for this project appears from Table S1 to be 9.8GB for long reads and 12.8GB for short reads, implying that the PacBio approach might be a bit cheaper. I don't think a detailed price analysis is necessary (or even advisable), but communicating the fact that the authors' approach is one of at least two more or less equally viable approaches would be both valuable and accurate.

This is a great point. Although this manuscript and the accompanying methods are very specific to the Nanopore and Illumina hybrid approach, there are certainly other equally viable approaches to be considered if one were planning a similar project today. Long-read sequencing with PromethION or PacBio CLR would almost certainly result in a lower $/GB cost with comparable read lengths. Assembly quality would also benefit greatly from lower coverage of higher-accuracy long reads like PacBio HiFi.

We are hesitant to make a specific recommendation given there have been significant (one might say game-changing) improvements to Nanopore protocols, computational tools, and new flow cell types in the last year, with more promised to come soon. The best practices also depend on sample type, quality, and other factors that can vary from experiment to experiment. There are a few new developments available today that are worth mentioning. It is not uncommon to see Nanopore runs with half the data contained in reads 100kb or longer, and new kits are reported to have substantially increased (reportedly doubled) the throughput of Nanopore flow cells. Even Nanopore-only assemblies are far more accurate than before simply due to improved basecalling methods. For example, in internal testing of Nanopore-only assemblies, we have seen genome-wide consensus accuracy increase by more than 2-fold simply by updating the basecaller. Still, homopolymer indels and regions of poor short read mapping remain problematic in the final assembly.

Because of this, we think it is important (if possible) to start looking past the Nanopore+Illumina approach and at alternative strategies. A Nanopore ultra-long read and lower-coverage high-accuracy long read (e.g. HiFi) approach will be far more effective at improving accuracy outside of coding regions, and seems to be gaining traction in the literature. Combining ultra-long and high-fidelity long reads could be a similarly affordable and more future-proof alternative to Nanopore+Illumina.

We have added this discussion to the “Next Steps” section and hope it is useful to readers of the manuscript that are planning their own assembly project today.

– Quality control metrics: sequencing error and sample polymorphism

A description of the consensus error rate of the assemblies would be an important piece of documentation serving two purposes. It permits users to quantify the amount of error they might expect from this particular resource. Relatedly, since many of these samples are conducted on strains for which near isogenic samples are difficult to acquire, measuring and reporting the heterozygosity would help guide users as to the extent of this property of the material from which the assembly was derived, especially if such users intend to use the reference strains described here for conducting genetic work.

We have added measures of sample diversity and sequence quality to the manuscript; please see Points 1 and 2 above for specific details.

– Context via comparison to existing resources

Existing resources for highly contiguous assemblies (operatively, contig N50 > 1MB)

– The total number of existing assemblies is already at least 75 with N50 >= 1MB (57 from NCBI[1], 13 from Miller[2], 4 from Comeault[3], 1 from Rezvykh[4]).

– Obviously, this includes a lot of within-species samples (especially for D. melanogaster, including many assemblies of the reference strain) and many assemblies that have been subsequently improved by the authors. However, the authors are also sampling the same species multiple times (including the D. melanogaster reference strain) to their total count, so this is at least consistent with their counting.

– Importantly, this resource triples unique species possessing highly contiguous assemblies from 34 to 102 and expands species group representation from 8 to 15. Although, the quinaria group, represented by D. innubila, already existed and wasn't sampled again here, they re-sample 7 of the previous 8 species groups and add 7 more.

– As far as I can tell, until now, the most distant relative of D. melanogaster within Drosophilidae with a highly contiguous genome was Scaptodrosophila lebanonensis, which is in the tribe Colocasiomyini. The manuscript adds two additional members of this tribe, Leucophenga varia and Chymomyza costata. So, the number of distant relatives is tripled, but the actual phylogenetic breadth isn't (if my understanding of Drosophilid taxonomy is correct, and it may not be).

As described in Point 3 above, we added comparisons to genome assemblies on NCBI, using the summaries provided in the very recent review of long-read insect genomes by Hotaling et al., 2021. As there are several genome versions for several species, of different or the same strain, and because we have re-analyzed or added to data previously used by Miller et al., 2018 and Comeault et al., 2020 (the earlier versions are also not on NCBI), we opted to perform genome comparisons using only the representative genome on NCBI. This also made it easier since we directly utilize Hotaling et al.’s table for any comparisons.

Citations to other studies generating high-quality drosophilid genomes, including the references provided by the reviewer, have been added to the Introduction. We also substantially revised the “Taxon Sampling” and the “Next Steps” sections to better reflect how our genomes fit into the phylogeny relative to the ones that exist: that is, that genomes we assembled were mostly from already well-studied groups and that subfamily Steganinae (along with several other clades) remains poorly sampled.

Scholarship

The $1,000 Nanopore genome was cited by both [2] and [5].

"Future work to improve biological and taxonomic diversity, particularly for species difficult to culture, should employ single fly sequencing and assembly workflows (Adams et al., 2020)."

An earlier long read precedent can be found in [6].

References to these studies have been added in the appropriate locations.

In addition to enumerating other high-quality Drosophila genomes that already exist, it would be extremely useful to users to see a comparison of the quality of the resources when they have published descriptions (in order to guide authors as to what types of contiguity, completeness, and error they can expect, especially in comparison to this work). At the least, I think the authors should put their work in the context of best previous assemblies for each species (an operative definition of contig N50 > 1 Mb seems consistent with their own thinking), particularly when that work has been published. To be consistent with their own accounting, they might even consider addressing works like [7], and species that have experienced extensive high quality sequencing effort like D. obscura, D. simulans, and D. pseudoobscura. I have attached a table with entries corresponding to every species in Table S2 as well as additional species exceeding contig N50 of 1Mb, including citations and NCBI accession numbers when they could be found.

The authors have actually cited many of these assemblies in other work [8], including genomes that have yet to be published, so it would seem that the authors are aware of them and trust the quality enough to incorporate into their own work, so improving the scholarship should be straightforward.

As discussed in the Essential revisions, we now compare the contiguity and completeness of Nanopore assemblies to representative genomes currently available on NCBI. Additionally, we conduct reference-based quality assessments for Nanopore genomes that have an NCBI RefSeq counterpart of the same strain. This should give the reader a better sense of how our Nanopore-based assemblies stack up against the highest-quality reference genomes. Importantly, we find coding sequences in Nanopore assemblies to be nearly identical to NCBI RefSeq genomes.

Typo

– In the abstract. There are 93 species represented, not 95.

We initially counted genomes of subspecies (D. malerkotliana malerkotliana and D. malerkotliana pallens, D. pseudoananassae pseudoananassae and D. pseudoananassae nigrens) as separate species. This was incorrect and the number has been revised to 93.

Reviewer #2 (Recommendations for the authors):

While this manuscript presents a large amount of valuable new data, and is inherently important for that reason alone, I believe that some key improvements and additional analyses could greatly strengthen this manuscript and really improve the value to the community.

1) Improvements to quality metrics.

This paper reports genomes that are at the low-cost end of the cost/quality tradeoff in genome assembly. This is an extremely valuable contribution, because many other large-scale projects in genomics (most notably, the Vertebrate Genome Project) have focused on the other end of this spectrum. Yet, for many researchers, a low-cost way to produce 10 genomes from related species may be higher value than a "complete" assembly from one species. However, it remains somewhat unclear exactly how good these genomes are, beyond the observation that the gene space is largely complete, and contig N50s are generally high.

Therefore, I think the biggest and most important improvement that would increase the reach and usefulness of this manuscript is improvements to quality metrics. A recent preprint from the Vertebrate Genome Project team (Rhie et al., 2020; https://www.biorxiv.org/content/10.1101/2020.05.22.110833v1.full) provides a number of potentially useful quality metrics that may be worth considering applying here, although of course not all will be relevant to this project, and I realize the computational burden of trying to do everything could be large. Nonetheless, I think it is crucial to be able to give some sense of consensus quality, as base-level errors in assemblies has negative effects on many downstream applications. Based on Koren et al., 2019, there appears to be a large drop in likelihood of disrupting a gene due to a indel error between QV30 and QV40, and Rhie et al., 2020 has a lot more detail on various aspects of consensus quality metrics. I realize that many existing tools, e.g. Merqury (also Rhie et al., 2020, in Genome Biology) make the assumption that Illumina data is available for the same individual as the genome assembly, which is not universally true here (even in approximation, e.g. treating a strain as an individual). Still, some attempt to tackle this problem seems necessary, even if it cannot be done perfectly.

Briefly, we now include both reference-free and reference-based (if possible) quality assessments of our genomes in the revised text. We found the latter to be highly informative given the non-uniform distribution of errors in the Drosophila genomes. Although overall sequence quality often appears to be quite far from the QV30-40 (i.e. reference quality) level, we show that most coding sequences have reference-quality accuracy and most protein-coding genes are unaffected by indel errors. Moreover, large indels, many of which could be real, have a disproportionate effect on the reference-based error rate, leading to large underestimates of true sequence quality.

2) Assembly content

Related to the first point, basic descriptions of genome size (e.g., estimated from k-mers) would help to contextualize the resource produced, as would a definition of "near chromosome level contiguity" and validation of which of the newly reported assemblies here reach that threshold (especially as contig N50s vary by several orders of magnitude). Again, I don't think the VGP definitions are the only possible way to approach this question, but there is value in having some systematic summary of overall contiguity.

We now provide genome size estimates and genome quality measures (NGx curves, NG50) in Figure 1—figure supplement 3 and Supplementary File 2. Similar to the quality assessment, we ran into issues trying to estimate genome size from k-mers. Not all samples had short reads plus samples with high diversity and/or low coverage had unusual k-mer count histograms leading to odd genome size estimates. Although we have long read data for every genome, the noisiness of Nanopore read (5-15% error rates) precludes k-mer based estimation of genome size.

To get around these issues, we estimated genome size by mapping Nanopore reads back to each assembly and looking at depth of coverage over single-copy BUSCO genes. Our reasoning here was that both noisy long reads and short reads should be mappable to the reference genome, and that single-copy BUSCO genes could safely be assumed to be reliably mappable single-copy regions in each genome. Assuming this, genome size can be roughly estimated from the depth of coverage and the summed length of all Nanopore reads.

These estimates of genome size turned out to be surprisingly consistent with the size of the assemblies. In most cases they were slightly larger than the assembly size. Of course there are caveats here, the most significant one being how we handled datasets with lots of apparent contamination (e.g., the assemblies from wild-collected Hawaiian Drosophila). Here we assumed uniform coverage over all contigs and adjusted the genome size estimate by the proportion of the assembly identified as contaminant sequence.

This paper does not describe the extent to which heterogametic sex chromosomes (not expected in all species based on sampling) or mitochondrial genomes are recovered. Presumably at least the presence of mt scaffolds is picked up in the NCBI screens, and is the kind of information that would be relatively straightforward to add to a table.

Admittedly we did not adopt a rigorous way to identify either sex chromosomes and mitochondrial genomes/contigs from the beginning, although we certainly should have the data to do so for most samples. Another complication is that high-coverage contigs were tagged for removal when running purge_haplotigs, so it is possible that many mitochondrial sequences were removed. This was roughly consistent with the NCBI contamination screen as mtDNA did not show up for many of the genomes. Unfortunately, we have not retained the NCBI contamination screen files, so the information on mtDNA contigs is not as straightforward to recover. We attempted to run the contamination screen through Docker (https://github.com/NCBI-Hackathons/ContamFilter) but there is an issue with the NCBI databases the Docker image points to and we could not get it to run.

Despite these shortsighted initial decisions, we expect this information to be eventually released. At least one person we know of is started working on identifying the sex chromosomes after we released the data. This is not a formal collaboration so we do not have a timeline for this work. We are also planning to run the VGP mitochondrial assembly pipeline (mitoVGP) with our sequences and attempt to properly assemble the mitochondrial genomes. This is not ready to go into this manuscript, but we plan to include it in a future update.

3) Drosophila focus.

There is a tension throughout this manuscript between describing a basically Drosophila-specific resource, and describing a more generally applicable approach to low-cost, clade-wide assembly. I think that the latter is really necessary and important, since not every community or group has the resources (in money, samples, compute) or desire (for their scientific questions) to use a "VGP-style" approach (with long reads, short reads, HiC, and optical mapping to produce as close to error-free chromosomal assemblies as possible). But, the value of this manuscript as a blueprint for low-cost community genomics is somewhat limited by the Drosophila-centric nature of the results.

The most obvious Drosophila-specific assumptions are the availability of a inbred strain, and a genome size of 100-250 Mb or so, with a few exceptions. Notably, the assemblies of the larger genomes (and the ones derived from wild-caught flies) tend to be worse, with lower contig N50s and auN metrics, more contigs, and more fragmented or missing BUSCOs.

Of course it would be well beyond the scope of this manuscript to attempt to validate any of these approaches in other clades, or provide a simple recipe for how to assembly any possible genome. Nonetheless, it would certainly be possible to broaden the discussion, and be clearer in the text when certain statements are Drosophila specific (e.g., the $350 in sequencing costs assumes a genome on the order of 100-200 Mb).

In the revised manuscript, we have tried to make it clearer that the $350 assembly is assuming Drosophila-specific parameters, both in terms of genome size and the amount of genomic DNA one might start the library prep process with. We have also improved the “Next Steps” to improve the discussion around sequencing species that cannot easily be reared in the lab or are only available as a few ethanol-preserved specimens. While the text remains somewhat specific to drosophilids, the discussion of the challenging samples we face and plans to address these challenges should have broader relevance to those not sequencing drosophilids.

4) Limitations of the existing resource and future prospects for improvement:

The genomes presented here do not include annotations, or any other form of supplemental resource such as whole genome alignments (e.g. Armstrong et al., 2020), and don't use HiC or any kind of scaffolding to obtain true chromosomal scaffolds. I think these are understandable and defensible choices, given the computational and technical requirements to extend this work in those directions. However, it may be valuable to discuss more explicitly what this resource is and is not. At a minimum, doing so could prevent the corresponding authors from receiving many emails asking where the gene annotations for species X are once this work is published.

This is mostly addressed in our reply to Point 4, but to add a few specific comments here: we thought this was a very important point and have addressed it in the updated manuscript. We revised the “Next Steps” section to state clearly that whole genome alignments are forthcoming, and that whole genome alignments will allow us to start the annotation process in a logistically efficient manner.

Reviewer #3 (Recommendations for the authors):

This is really an impressive resource and has the potential to be widely used both in terms of the data itself and also the methodology. I have several suggestions that may improve the manuscript.

There are several species (willistoni, paulistorum, etc.) that are sequenced more than once without any reference to why. It might be useful to describe that somewhere (Table 1?). As a resource, it would be simpler to use if it was clear when one isolate per species vs. multiple were appropriate to use for analyses.

To be completely forthcoming, we ended up performing multiple assemblies for a few species without specifically intending to do so. Two D. immigrans strains were sequenced separately due to a miscommunication; Z. tsacasi car7-4 was originally misidentified as Z. tuberculatus; and the additional D. willistoni line (14030-0811.17) was originally sequenced for a grad student project in planned in South America, but this plan was disrupted by the pandemic. Separate lines of D. teissieri, D. paulistorum, and Z. indianus were sequenced because the lines appear to be reproductively isolated (observed by D. Matute).

We have tried to address any confusion the reader might experience by denoting a preferred assembly with an asterisk in Table 1 of the revised manuscript. Clarifying text regarding preferred assemblies has also been added to “Taxon Sampling.” Preferred assemblies were selected on the basis of contiguity, completeness, and whether Illumina data from the same sample was used for the assembly.

Lines 59-61 – a bit more detail about modifications here since you are writing methods last.

Added text clarifies the protocol modifications that improve the yield of ultra-long reads while improving library prep and sequencing efficiency (hence reducing costs).

Lines 107-111 – I'm concerned about the conclusions drawn about repeat content based upon the way the data was analyzed. A comprehensive analysis of repeat content is probably beyond the scope of this manuscript, but without de novo characterization of repeat sequences, I'm worried that satellites and TEs in more distantly related species may be missed if they are lineage restricted. I believe TRF would get at this to a point, but maybe not robustly. Other software like RepeatModeler might be better. However, my suggestion is not that these genomes are individually de novo annotated for repeats. It is just that the conclusions about relationships between contiguity or genome size and repeat content are presented with more caveats.

This is a good point. As mentioned, we would like other groups with more expertise in analyzing repeat content to be able to freely work on these data as that is not the focus of this work. We have added text to the “Repeat content” section to indicate that these results are conditional on what is currently in the repeat databases and that this resource is intended to be a starting point to improve repeat databases rather than a comprehensive characterization of repeats in drosophilid genomes.

Figure 2 – This figure is a bit difficult to intuit, so this is another place where more detail in the main text would be useful.

The main text has been revised to specifically state that the graph layout method is being used to create a map where BUSCOs that are connected in the assemblies will cluster together.

Figure 3 – though the authors argue that this tree is not meant as a robust measure of phylogenetic relationships, it would be nice to put some support values on the tree.

We have added local posterior probabilities (as reported by ASTRAL) to provide a measure of support for each node.

Line 336 – I think the bioawk people would appreciate a citation (though all I can find on the internet is to cite the github page).

This is our mistake, a citation to this tool should have been included. In the revised version of the manuscript, a citation to the GitHub page is included.

Line 340 – Please describe the auN statistic (what is L and what are we summing over?) in more detail. The Li github page describes nicely.

Based on Reviewer 2’s comments, we decided to de-emphasize auN as the main summary statistic for contiguity as most readers will have an immediate intuitive understanding of N50/NG50 but not auN. N50 and NG50 are now presented in Figure 1 and Figure 1—figure supplement 3, respectively.

Nevertheless, we still think auN provides the best comparison of contiguity between assemblies and plan to utilize it in future work, for instance, we plan to update these genomes with improved Nanopore basecalling algorithms. We have thus clarified the calculation of this statistic further in the Methods and still provide auN in Supplementary File 2.

https://doi.org/10.7554/eLife.66405.sa2

Article and author information

Author details

  1. Bernard Y Kim

    Department of Biology, Stanford University, Stanford, United States
    Contribution
    Conceptualization, Resources, Data curation, Software, Formal analysis, Funding acquisition, Validation, Investigation, Visualization, Methodology, Writing - original draft, Project administration, Writing - review and editing
    Contributed equally with
    Jeremy R Wang, Daniel R Matute and Dmitri A Petrov
    For correspondence
    bernardkim@stanford.edu
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-5025-1292
  2. Jeremy R Wang

    Department of Genetics, University of North Carolina, Chapel Hill, United States
    Contribution
    Conceptualization, Resources, Data curation, Software, Investigation, Methodology, Writing - original draft, Writing - review and editing
    Contributed equally with
    Bernard Y Kim, Daniel R Matute and Dmitri A Petrov
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-0673-9418
  3. Danny E Miller

    Department of Pediatrics, Division of Genetic Medicine, University of Washington and Seattle Children’s Hospital, Seattle, United States
    Contribution
    Conceptualization, Data curation, Investigation, Methodology, Writing - original draft, Writing - review and editing
    Competing interests
    No competing interests declared
  4. Olga Barmina

    Department of Evolution and Ecology, University of California Davis, Davis, United States
    Contribution
    Resources, Investigation
    Competing interests
    No competing interests declared
  5. Emily Delaney

    Department of Evolution and Ecology, University of California Davis, Davis, United States
    Contribution
    Resources, Data curation, Investigation, Writing - review and editing
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0003-3609-5702
  6. Ammon Thompson

    Department of Evolution and Ecology, University of California Davis, Davis, United States
    Contribution
    Data curation, Validation
    Competing interests
    No competing interests declared
  7. Aaron A Comeault

    School of Natural Sciences, Bangor University, Bangor, United Kingdom
    Contribution
    Resources, Data curation, Investigation, Writing - review and editing
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0003-3954-2416
  8. David Peede

    Biology Department, University of North Carolina, Chapel Hill, United States
    Contribution
    Investigation
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-4826-0464
  9. Emmanuel RR D'Agostino

    Biology Department, University of North Carolina, Chapel Hill, United States
    Contribution
    Investigation
    Competing interests
    No competing interests declared
  10. Julianne Pelaez

    Department of Integrative Biology, University of California, Berkeley, Berkeley, United States
    Contribution
    Resources, Validation, Writing - review and editing
    Competing interests
    No competing interests declared
  11. Jessica M Aguilar

    Department of Integrative Biology, University of California, Berkeley, Berkeley, United States
    Contribution
    Resources, Writing - review and editing
    Competing interests
    No competing interests declared
  12. Diler Haji

    Department of Integrative Biology, University of California, Berkeley, Berkeley, United States
    Contribution
    Resources, Writing - review and editing
    Competing interests
    No competing interests declared
  13. Teruyuki Matsunaga

    Department of Integrative Biology, University of California, Berkeley, Berkeley, United States
    Contribution
    Resources, Writing - review and editing
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-6433-622X
  14. Ellie E Armstrong

    Department of Biology, Stanford University, Stanford, United States
    Contribution
    Resources, Data curation, Methodology, Writing - review and editing
    Competing interests
    No competing interests declared
  15. Molly Zych

    Molecular and Cellular Biology Program, University of Washington, Seattle, United States
    Contribution
    Investigation
    Competing interests
    No competing interests declared
  16. Yoshitaka Ogawa

    Department of Biological Sciences, Tokyo Metropolitan University, Hachioji, Japan
    Contribution
    Resources, Investigation, Methodology
    Competing interests
    No competing interests declared
  17. Marina Stamenković-Radak

    Faculty of Biology, University of Belgrade, Belgrade, Serbia
    Contribution
    Resources, Writing - review and editing
    Competing interests
    No competing interests declared
  18. Mihailo Jelić

    Faculty of Biology, University of Belgrade, Belgrade, Serbia
    Contribution
    Resources, Writing - review and editing
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-1637-0933
  19. Marija Savić Veselinović

    Faculty of Biology, University of Belgrade, Belgrade, Serbia
    Contribution
    Resources, Writing - review and editing
    Competing interests
    No competing interests declared
  20. Marija Tanasković

    University of Belgrade, Institute for Biological Research "Siniša Stanković", National Institute of Republic of Serbia, Belgrade, Serbia
    Contribution
    Resources, Writing - review and editing
    Competing interests
    No competing interests declared
  21. Pavle Erić

    University of Belgrade, Institute for Biological Research "Siniša Stanković", National Institute of Republic of Serbia, Belgrade, Serbia
    Contribution
    Resources, Writing - review and editing
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-0053-1982
  22. Jian-Jun Gao

    School of Ecology and Environmental Science, Yunnan University, Kunming, China
    Contribution
    Resources, Writing - review and editing
    Competing interests
    No competing interests declared
  23. Takehiro K Katoh

    School of Ecology and Environmental Science, Yunnan University, Kunming, China
    Contribution
    Resources
    Competing interests
    No competing interests declared
  24. Masanori J Toda

    Hokkaido University Museum, Hokkaido University, Sapporo, Japan
    Contribution
    Resources, Writing - review and editing
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0003-0158-1858
  25. Hideaki Watabe

    Biological Laboratory, Sapporo College, Hokkaido University of Education, Sapporo, Japan
    Contribution
    Resources, Writing - review and editing
    Competing interests
    No competing interests declared
  26. Masayoshi Watada

    Graduate School of Science and Engineering, Ehime University, Matsuyama, Japan
    Contribution
    Resources, Writing - review and editing
    Competing interests
    No competing interests declared
  27. Jeremy S Davis

    Department of Biology, University of Kentucky, Lexington, United States
    Contribution
    Resources, Writing - review and editing
    Competing interests
    No competing interests declared
  28. Leonie C Moyle

    Department of Biology, Indiana University, Bloomington, United States
    Contribution
    Resources, Supervision, Writing - review and editing
    Competing interests
    No competing interests declared
  29. Giulia Manoli

    Neurobiology and Genetics, Theodor Boveri Institute, Biocentre, University of Würzburg, Würzburg, Germany
    Contribution
    Resources
    Competing interests
    No competing interests declared
  30. Enrico Bertolini

    Neurobiology and Genetics, Theodor Boveri Institute, Biocentre, University of Würzburg, Würzburg, Germany
    Contribution
    Resources
    Competing interests
    No competing interests declared
  31. Vladimír Košťál

    Institute of Entomology, Biology Centre, Academy of Sciences of the Czech Republic, Prague, Czech Republic
    Contribution
    Resources, Writing - review and editing
    Competing interests
    No competing interests declared
  32. R Scott Hawley

    Department of Molecular and Integrative Physiology, University of Kansas Medical Center, Stowers Institute for Medical Research, Kansas City, United States
    Contribution
    Resources, Supervision, Writing - review and editing
    Competing interests
    No competing interests declared
  33. Aya Takahashi

    Department of Biological Sciences, Tokyo Metropolitan University, Hachioji, Japan
    Contribution
    Resources, Data curation, Supervision, Investigation, Writing - review and editing
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-8391-7417
  34. Corbin D Jones

    Biology Department, University of North Carolina, Chapel Hill, United States
    Contribution
    Resources, Supervision, Funding acquisition, Writing - review and editing
    Competing interests
    No competing interests declared
  35. Donald K Price

    School of Life Science, University of Nevada, Las Vegas, United States
    Contribution
    Resources, Data curation, Funding acquisition, Writing - review and editing
    Competing interests
    No competing interests declared
  36. Noah Whiteman

    Department of Integrative Biology, University of California, Berkeley, Berkeley, United States
    Contribution
    Conceptualization, Resources, Supervision, Funding acquisition, Project administration, Writing - review and editing
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0003-1448-4678
  37. Artyom Kopp

    Department of Evolution and Ecology, University of California Davis, Davis, United States
    Contribution
    Conceptualization, Resources, Supervision, Funding acquisition, Visualization, Project administration, Writing - review and editing
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0001-5224-0741
  38. Daniel R Matute

    Biology Department, University of North Carolina, Chapel Hill, United States
    Contribution
    Conceptualization, Resources, Supervision, Funding acquisition, Methodology, Writing - original draft, Project administration, Writing - review and editing
    Contributed equally with
    Bernard Y Kim, Jeremy R Wang and Dmitri A Petrov
    For correspondence
    dmatute@email.unc.edu
    Competing interests
    No competing interests declared
  39. Dmitri A Petrov

    Department of Biology, Stanford University, Stanford, United States
    Contribution
    Conceptualization, Resources, Formal analysis, Supervision, Funding acquisition, Visualization, Methodology, Writing - original draft, Project administration, Writing - review and editing
    Contributed equally with
    Bernard Y Kim, Jeremy R Wang and Daniel R Matute
    For correspondence
    dpetrov@stanford.edu
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-3664-9130

Funding

National Institute of General Medical Sciences (F32GM135998)

  • Bernard Y Kim

National Institute of General Medical Sciences (R35GM118165)

  • Dmitri A Petrov

National Institute of Diabetes and Digestive and Kidney Diseases (K01DK119582)

  • Jeremy R Wang

National Science Foundation (DEB-1457707)

  • Corbin D Jones

National Institute of General Medical Sciences (R01GM121750)

  • Daniel R Matute

National Institute of General Medical Sciences (R01GM125715)

  • Daniel R Matute

Google (Google Cloud Research Credits)

  • Bernard Y Kim
  • Jeremy R Wang

National Institute of General Medical Sciences (R35GM122592)

  • Artyom Kopp

National Institute of General Medical Sciences (R35GM119816)

  • Noah Whiteman

Uehara Memorial Foundation (201931028)

  • Teruyuki Matsunaga

Ministry of Education, Science and Technological Development of the Republic of Serbia (451-03-68/2020-14/200178)

  • Marina Stamenković-Radak
  • Mihailo Jelić
  • Marija Savić Veselinović

Ministry of Education, Science and Technological Development of the Republic of Serbia (451-03-68/2020-14/200007)

  • Marija Tanasković
  • Pavle Erić

National Natural Science Foundation of China (32060112)

  • Jian-Jun Gao

Japan Society for the Promotion of Science (JP18K06383)

  • Masayoshi Watada

Horizon 2020 - Research and Innovation Framework Programme (765937-CINCHRON)

  • Giulia Manoli
  • Enrico Bertolini

Czech Science Foundation (19-13381S)

  • Vladimír Košťál

Japan Society for the Promotion of Science (JP19H03276)

  • Aya Takahashi

National Science Foundation (1345247)

  • Donald K Price

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

We thank Brandon Cooper, Antonio Serrato-Capuchina, and David Turissini for help with collections and field logistics; Sarah CR Elgin, Wilson Leung, Elena Gracheva, and Sophia Bieser for help with modENCODE fly lines; Jonathan Chang for helpful discussions about phylogenetic methods; Charlotte Helfrich-Förster for providing lab resources for G Manoli and E Bertolini; and John Tyson along with the staff at Circulomics, in particular Kelvin Liu and Michelle Kim, for many illuminating discussions about long read library prep and sequencing.

Senior Editor

  1. Patricia J Wittkopp, University of Michigan, United States

Reviewing Editor

  1. Graham Coop, University of California, Davis, United States

Reviewer

  1. Timothy B Sackton, Harvard University, United States

Version history

  1. Preprint posted: December 15, 2020 (view preprint)
  2. Received: January 11, 2021
  3. Accepted: July 16, 2021
  4. Accepted Manuscript published: July 19, 2021 (version 1)
  5. Version of Record published: August 4, 2021 (version 2)
  6. Version of Record updated: March 18, 2022 (version 3)

Copyright

© 2021, Kim et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Metrics

  • 7,497
    Page views
  • 928
    Downloads
  • 50
    Citations

Article citation count generated by polling the highest count across the following sources: PubMed Central, Crossref, Scopus.

Download links

A two-part list of links to download the article, or parts of the article, in various formats.

Downloads (link to download the article as PDF)

Open citations (links to open the citations from this article in various online reference manager services)

Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)

  1. Bernard Y Kim
  2. Jeremy R Wang
  3. Danny E Miller
  4. Olga Barmina
  5. Emily Delaney
  6. Ammon Thompson
  7. Aaron A Comeault
  8. David Peede
  9. Emmanuel RR D'Agostino
  10. Julianne Pelaez
  11. Jessica M Aguilar
  12. Diler Haji
  13. Teruyuki Matsunaga
  14. Ellie E Armstrong
  15. Molly Zych
  16. Yoshitaka Ogawa
  17. Marina Stamenković-Radak
  18. Mihailo Jelić
  19. Marija Savić Veselinović
  20. Marija Tanasković
  21. Pavle Erić
  22. Jian-Jun Gao
  23. Takehiro K Katoh
  24. Masanori J Toda
  25. Hideaki Watabe
  26. Masayoshi Watada
  27. Jeremy S Davis
  28. Leonie C Moyle
  29. Giulia Manoli
  30. Enrico Bertolini
  31. Vladimír Košťál
  32. R Scott Hawley
  33. Aya Takahashi
  34. Corbin D Jones
  35. Donald K Price
  36. Noah Whiteman
  37. Artyom Kopp
  38. Daniel R Matute
  39. Dmitri A Petrov
(2021)
Highly contiguous assemblies of 101 drosophilid genomes
eLife 10:e66405.
https://doi.org/10.7554/eLife.66405

Further reading

    1. Evolutionary Biology
    2. Genetics and Genomics
    Carolina A Martinez-Gutierrez, Josef C Uyeda, Frank O Aylward
    Research Article

    Microbial plankton play a central role in marine biogeochemical cycles, but the timing in which abundant lineages diversified into ocean environments remains unclear. Here, we reconstructed the timeline in which major clades of bacteria and archaea colonized the ocean using a high-resolution benchmarked phylogenetic tree that allows for simultaneous and direct comparison of the ages of multiple divergent lineages. Our findings show that the diversification of the most prevalent marine clades spans throughout a period of 2.2 Ga, with most clades colonizing the ocean during the last 800 million years. The oldest clades – SAR202, SAR324, Ca. Marinimicrobia, and Marine Group II – diversified around the time of the Great Oxidation Event, during which oxygen concentration increased but remained at microaerophilic levels throughout the Mid-Proterozoic, consistent with the prevalence of some clades within these groups in oxygen minimum zones today. We found the diversification of the prevalent heterotrophic marine clades SAR11, SAR116, SAR92, SAR86, and Roseobacter as well as the Marine Group I to occur near to the Neoproterozoic Oxygenation Event (0.8–0.4 Ga). The diversification of these clades is concomitant with an overall increase of oxygen and nutrients in the ocean at this time, as well as the diversification of eukaryotic algae, consistent with the previous hypothesis that the diversification of heterotrophic bacteria is linked to the emergence of large eukaryotic phytoplankton. The youngest clades correspond to the widespread phototrophic clades Prochlorococcus, Synechococcus, and Crocosphaera, whose diversification happened after the Phanerozoic Oxidation Event (0.45–0.4 Ga), in which oxygen concentrations had already reached their modern levels in the atmosphere and the ocean. Our work clarifies the timing at which abundant lineages of bacteria and archaea colonized the ocean, thereby providing key insights into the evolutionary history of lineages that comprise the majority of prokaryotic biomass in the modern ocean.