Genetics and Genomics

Chromosome-scale genome assembly of the European common cuttlefish Sepia officinalis

Simone Rencken
Georgi Tushev
David Hain
Elena Ciirdaeva
Oleg Simakov
Gilles Laurent author has email address

Max Planck Institute for Brain Research, Frankfurt am Main, Germany
Radboud University, Donders Institute for Brain, Cognition and Behaviour, Nijmegen, Netherlands
Faculty of Biological Sciences, Goethe University, Frankfurt am Main, Germany
Department of Neuroscience and Developmental Biology, University of Vienna, Vienna, Austria

https://doi.org/10.7554/eLife.107393.2

Open access
Copyright information

Figures and data

Sepia officinalis assembly statistics and quality control.
A) Specimen of S. officinalis (credit: Stephan Junek, MPI for Brain Research). B) Overview of the genome assembly workflow. Genome size was estimated from short DNA reads (Illumina) using GenomeScope^77,78. The primary assembly was generated from long DNA reads (PacBio Sequel II) and chromatin conformation capture (Hi-C) reads (Dovetail OmniC) with hifiasm¹⁶⁷. Assembly was scaffolded with YAHS⁸² and residual small scaffolds were manually placed in chromosomes. C) Snail plot of chromosome-scale S. officinalis assembly generated using blobtools2¹⁹¹ showing scaffold statistics (e.g. number of scaffolds, median scaffold length N50), base composition and completeness measured using Benchmarking Universal Single-Copy Orthologs (BUSCO)⁸⁰ against the metazoa_odb12 database. D) Hi-C heatmap showing the 47 chromosome-scale scaffolds with few sequences remaining in unplaced scaffolds. X and y-axes show the genome position in Mbp. The heatmap was generated using juicebox⁸³, 0-7039 observed counts (balanced) are shown.

Statistics of S. officinalis assemblies from two independent datasets, assembled using a common pipeline.

Comparison of two Sepia officinalis chromosome-scale assemblies indicates chromosome number of 1n=46.
Datasets were collected from two S. officinalis animals, one as described in this study (MPIBR), the second by the Darwin Tree of Life consortium (DToL)⁷⁵. Both datasets were assembled using a common pipeline (hifiasm and YAHS). A) Hi-C contact map of the MPIBR primary assembly, scaffolded using YAHS without manual curation. Assembled 47 chromosome scaffolds are shown as blue boxes. B) Hi-C contact map of DToL primary assembly, scaffolded using YAHS without manual curation, showing 49 assembled chromosome scaffolds as blue boxes. C) Whole-genome alignment of both scaffolded assemblies using Winnowmap2⁸⁹, showing DToL on x-axis and MPIBR on the y-axis. The 4 “breakpoints” of chromosomes in either of the assemblies (3 breaks in DToL chromosomes compared to MPIBR, 1 break in MPIBR compared to DToL) are highlighted in different colors. D) Ribbon diagram showing the four breakpoints from C) compared to the chromosome-scale assembly from another cuttlefish, Acanthosepion esculentum (1n=46). The color of breakpoints are the same in panels C+D.

Syntenic comparison of three decapod species.
A) Taxonomy of selected cephalopod species showing their genome size (in gigabases, Gb) and haploid chromosome numbers. Taxonomy information was downloaded from NCBI taxonomy browser, divergence times for Coleoidea and Decapodiformes from¹⁹² and for Sepiidae from⁹⁴. B) Genome-wide syntenic relationship between chromosomes of E. scolopes⁴⁹ (top), D. pealeii⁴⁹ (middle) and S. officinalis (bottom). Colored braids connect syntenic regions across genomes, with chromosomes drawn to physical scale. Euprymna chromosomes 45 and 46 are not shown because they contain too few orthogroups. C) Detailed synteny of Sepia chromosomes 40 (magenta) and 43 (darkblue) shown, that are joined in the other species and cause the different haploid chromosome number in Sepia. Riparian plots were generated using GENESPACE v1.2.3¹⁷⁵.

Genome annotation for Sepia officinalis.
A) Annotation of repeat landscape of the S. officinalis genome, annotated using RepeatModeler⁹⁷. Full repeat landscape is shown on the left, annotated repeats (excluding unclassified or simple repeats) are shown on the right. B-C) Quality control of gene annotation and comparison to two other cuttlefish species using OMArk¹¹⁷. Results shown for Acanthosepion lycidas (GCA_963932145.1, Ensembl Genebuild), Sepia officinalis (BRAKER, this study) and Acanthosepion pharaonis¹⁹³ (BRAKER). Lophotrochozoa was used as the ancestral clade. B) Completeness assessed by the presence of genes conserved in the clade, classified as single or multiple copies (duplicated), or missing. C) Consistency assessed by the proportion of proteins placed in the correct lineage (consistent); placement in incorrect lineages randomly (inconsistent) or to specific species (contamination), or no placement in known gene families (unknown). D) Phylogenetic tree of 13 molluscan species used for analysis of gene families with Orthofinder¹²². Species are colored by clade: purple = coleoid cephalopods, blue = nautiloid (non-coleoid cephalopod), green = non-cephalopod mollusk. E) Heatmap of largest gene families (orthogroups from Orthofinder, with more than 100 genes in any species), ordered from largest gene count across all species on the left. Families with at least one gene in S. officinalis are depicted. Rows show gene counts for each species (color capped at 500 genes), columns show orthogroups and their annotation by eggNOG mapper^120,121 or InterProScan¹¹⁹, if available. Clade colors match D).

Overview of gene annotation of 13 molluscan species used for gene family analysis.

Expression of expanded gene families in tissue bulk RNA-seq data.
Bulk RNA-seq data collected from one adult S. officinalis from different brain tissues (optic lobes - yellow, basal lobes - turquoise, vertical and subvertical lobes - orange, posterior subesophageal mass - purple), retina (red) and skin (blue, from the dorsal mantle). Tissue color code is identical throughout the figure. A) Principal component analysis (PCA) of the data, showing the first 2 PCs, colored by tissue. B) Barplot showing number of differentially expressed (DE) genes (i.e. marker genes) for each tissue, calculated against all other tissues using DESeq2¹⁸⁷. C) Largest gene families (orthogroups) with differential expression in bulk RNA-seq data. Dot size shows the number of DE genes for each tissue. Families with enriched GO terms are highlighted in grey. D)+E) Dotplots of enriched gene ontology^133,134 (GO) terms for large gene families, enriched using clusterProfiler¹⁸⁹ using a hypergeometric test. Dot size shows the number of expressed genes per family with this GO term, x-axis shows the percentage of expressed genes from all genes with this GO term. Dot color shows the adjusted p-value after Benjamini-Hochberg false discovery rate (FDR) correction. CC: cellular component, MF: molecular function, BP: biological process. F) Heatmap of z-scored expression of all DE genes from the largest gene families with enriched GO terms.

HapHiC scaffolding for different numbers of expected chromosome scaffolds show 47 chromosomes as most supported.
Hi-C contact maps from HapHiC⁸¹ are shown for 46, 47, 48, 49 and 50 expected chromosome scaffolds. Assembled chromosomes are shown as blue boxes, Hi-C signal indicating a false (unsupported) merger is shown by cyan arrow, false splits are shown by black arrows. The contact maps differ from the map shown in Figure 1, which was created using YAHS and manual curation.

BUSCO completeness results.
A) Comparison of two S. officinalis chromosome-scale assemblies, which were constructed from two independent datasets (this study: MPIBR, Darwin Tree of Life project: DToL), assembled using a common pipeline (hifiasm¹⁶⁷ with PacBio HiFi and Hi-C reads). Results for the database metazoa_odb12, the zoom in shows only duplicated, fragmented and missing fractions to improve readability. The DToL assemblies have slightly higher completeness than MPIBR, due to the higher sequencing coverage used as input. In both datasets, compared to the primary assembly (“.hic”), the phased haplotypes (“.hic.hap1” and “hic.hap2”) have less duplicated but more missing genes. B) BUSCO results for the mollusca_odb12 database, showing the same trend as in A). C) Comparison of different BUSCO databases odb10 and odb12 on the manually curated assembly (“sepoff241117”). For the mollusca gene sets (top), a strong improvement in completeness was observed between odb10 and odb12, reflecting that the updated gene set is more concise and conserved across species. For the metazoa gene sets (bottom) the completeness was marginally increased for odb12 compared to odb10.

Analysis of raw data at breakpoints between S. officinalis assemblies hints at a technical cause of breakpoints.
A) Coverage of HiC and HiFi data shown for pairs of scaffolds exhibiting breakpoints. Blue shows MPIBR data, orange shows DToL data. For each breakpoint, trans HiC contacts are shown on top across the full scaffold, with terminal 200 kb windows highlighted in yellow. Both terminal windows are shown below with aligned HiFi reads (grey horizontal bars) and normalized HiFi read density. Trans HiC contacts are shown as purple dots. Right grey box: same data shown for the complete breakpoint scaffold of the other assembly, with trans HiC contacts calculated to a size-matched scaffold. B) Distribution of normalized trans HiC contact rate (pairs per Mb²) for random scaffold pairs (“background pairs”, grey) and within scaffolds (“intra scaffold”, green) for MPIBR (left) and DToL (right) data. Values for scaffolds with breakpoints are indicated in blue and orange, respectively. C) Histogram of contact rates from B) shown for random scaffold pairs and breakpoint pairs. Contact rates and empirical p-values of breakpoint pairs are indicated in blue (left, MPIBR) and orange (right, DToL). Joint p-value for three rates for DToL breakpoints is indicated in box (Wilcoxon rank-sum, one-tailed). D) Repeat analysis of 200 kb scaffold ends at breakpoints and control scaffolds (grey box). Overall repeat content (% of base pairs) and type are shown.

Syntenic relationship between S. officinalis and D. pealeii chromosomes.
Dot plot showing finer-resolution syntenic anchor hits (perfectly collinear blast hits within the same orthogroup). Genes are ordered along the chromosomes, gridlines are shown every 1000 genes. Only chromosome pairs with a minimum synteny score of 10 and at least 10 syntenic genes are shown. Synteny analysis and visualization were performed using GENESPACE v1.2.3¹⁷⁵.

Syntenic relationship between S. officinalis and E. scolopes chromosomes.
Dot plot showing finer-resolution syntenic anchor hits (perfectly collinear blast hits within the same orthogroup). Genes are ordered along the chromosomes, gridlines are shown every 1000 genes. Only chromosome pairs with a minimum synteny score of 10 and at least 10 syntenic genes are shown. E. scolopes chromosomes 45 and 46 are not shown because they contain too few orthogroups. Synteny analysis and visualization were performed using GENESPACE v1.2.3¹⁷⁵.

Syntenic comparison of four decapod species hints at a cephalopod sex chromosome.
A) Riparian plot showing synteny relationships of chromosomes from four decapod species, generated using GENESPACE¹⁷⁵ with orthogroups. Euprymna chromosomes 45 and 46 are not shown because they contain too few orthogroups. Chromosome split in S. officinalis compared to other species is shown in purple, putative sex chromosome as identified recently⁹⁶ is shown in cyan. B) Normalized coverage of sequencing data in S. officinalis chromosomes. C) Normalized coverage of short reads to female A. esculentum genome, reproduced from⁹⁶. Decrease in read coverage for chromosome 46 is visible, the putative Z sex chromosome. Read depth was calculated from Illumina gDNA reads in windows of 500,000 bp and normalized to the median coverage of chromosome 1. Box plots showing median divergence (box dividing line), interquartile range (box), and 1.5 times the interquartile range (whiskers). The putative Z chromosome is highlighted in cyan. Chromosomes with significantly reduced read coverage (orange label) were identified by a one-sided Wilcoxon rank-sum test of each chromosome’s normalized depth windows against all remaining chromosomes (Benjamini-Hochberg-corrected, at least 10% decrease in median normalized depth, * p < 0.5, ** p < 0.01, *** p < 0.001).

Gene family expansion analysis.
A) Gene family expansion analysis using CAFE5¹²⁸ with a gamma model (k=3) on all smaller gene families (less than 100 genes in any species). 30 families with the most change in different categories are shown (expanded only in S. officinalis (pink), in all coleoids (orange), in all species (yellow), in non-cephalopod mollusks (green) or overall contraction (blue)). Rows show change (expansion or contraction) of gene families in any species, columns show orthogroups and annotation, if available. Dots show significant change (p < 0.05), gene counts are shown for at any orthogroup with at least 12 genes in any species. B) Gene families with differential expression in bulk RNA-seq data. Dot size shows the number of DE genes for each tissue. D) Dotplots of enriched (GO) terms for large gene families, enriched using clusterProfiler using a hypergeometric test. Dot size shows the number of expressed genes per family with this GO term, x-axis shows percentage of expressed genes from all genes with this GO term. Dot color shows adjusted p-value after Benjamini-Hochberg false discovery rate (FDR) correction. CC: cellular component, MF: molecular function, BP: biological process. D) Heatmap of z-scored expression of all DE genes from the gene families with enriched GO terms.

Sign up for email alerts