Introduction

The Xerces Blue butterfly (Glaucopsyche xerces)1 was native to the coastal sand dunes of San Francisco in association with the common Deerwood (Acmispon glaber), which was the preferred food source for larval stage2. It was notable for its iridescent blue colouration on the dorsal (upper) wing surface, and conspicuous, variable white spots on the ventral surface3. With the growth of San Francisco and the destruction of sand dune habitats, the Xerces Blue became restricted to a few sites in what is now Golden Gate National Recreation Area. The last specimens were reportedly collected by entomologist W. Harry Lange on March 23, 19413. It is considered the first butterfly to have been driven to global extinction by human activities3.

The Xerces Blue and the closely related Silvery Blue (Glaucopsyche lygdamus) were recently proposed to be distinct species based on mtDNA data from a single Xerces Blue specimen4. However, two nuclear genes analysed (ribosomal 28S and histone H3) were invariable and genome-wide data were unavailable for the Xerces Blue, hampered by the inherent difficulties of retrieving genome-wide data from historical insect specimens5,6 and the absence of a suitable reference genome. The genus Glaucopsyche consists of 18 extant species distributed across the temperate regions of the northern hemisphere. To provide a relevant reference, we generated an annotated genome from the Palearctic Green-Underside Blue butterfly Glaucopsyche alexis7. Using DNA extracted from five Xerces Blue and seven Silvery Blue (Glaucopsyche lygdamus) historical specimens from the vicinity of San Francisco, and also from a modern Silvery Blue male from Canada, we generated whole genome resequencing data for both species and investigated their relationships and historical population genetics.

Results

Historic and modern butterfly genomes

We extracted DNA from 12 historical specimens (5 G. xerces, 7 G. lygdamus) (Table S1). One Xerces Blue sample did not yield detectable DNA in two independent extractions. For each of the successful extracts we prepared a single library which was shotgun sequenced on the HiseqX Illumina platform. We mapped 124,101,622 and 184,084,237 unique DNA reads of Xerces Blue and Silvery Blue, respectively, against the G. alexis reference genome (Table 1 and Table S2). The DNA reads exhibited typical ancient DNA features, such as short mean read length (ranging from 47.55 to 67.41 bases on average, depending on the specimen (Fig. S1)) and post-mortem deamination patterns at the 5’ and 3’ ends (Fig. S2 and S3). As listed in the original museum records, we found one Silvery Blue and two Xerces Blue females (Table S2). Inter-individual comparisons suggested no close kinship link among the studied individuals.

Mapping Statistics of the analysed historical specimens.

Mapping statistics of the 4 historical G. xerces (L003, L005, L007 and L009) and the 7 historical G. lydagmus (L002, L004, L006, L008, L011, L012 and L013) specimens mapped against the G. alexis reference genome. Average depth is displayed for the covered regions of each individual.

The historical genomes covered 49.3% (Xerces Blue) and 55.2% (Silvery Blue) of the G. alexis reference genome, largely because repetitive chromosomal regions cannot be confidently assessed with short, ancient DNA sequence reads (Fig. S4 – S7). To estimate the mappable fraction of the reference G. alexis genome, we randomly fragmented it to 50 to 70 nucleotides and mapped the generated fragments back to the complete genome. An average of 57.8% of the G. alexis genome was covered with these read lengths (Table S2). We suggest that reduced coverage from the historical specimens may be due to genomic divergence of G. xerces and G. lygdamus from the G. alexis reference (Fig. S8). The annotation of genes located in those unrecoverable regions provided a putative list of 14 nuclear genes with diverse functions obtained from BLAST, that should be further explored to understand the uniqueness of the extinct species (Table S4).

Phylogenetic relationships

Maximum likelihood phylogenetic inference using whole mitochondrial genomes showed that the Xerces Blue specimens form a monophyletic clade, as do the Silvery Blue specimens (Fig. 1A). We inferred a time-calibrated Bayesian phylogenetic tree from protein-coding genes Analysis and 12 related butterflies in Polyommatinae subfamily, revealing high support for the sister group relationship (posterior probability=1). We found the specie Shijimiaeoides (Sinia) divina inside the Glaucopsyche clade, in agreement with previous phylogenetic studies8. Because there are no known fossils to calibrate the time since divergence, we first used a molecular clock that spanned the range of rates frequently used for arthropod mitochondrial genes (1.5-2.3% divergence/Ma). Our dated analysis yielded an origin of this subgroup of Polyommatinae at 12.4 Ma (8.82-16.27 Ma 95% HPD interval) and divergence of the Xerces Blue from the Silvery Blue at 900,000 years ago (0.61-1.19 Ma 95% HPD interval, Fig. 1B). A second estimate based on larger-scale fossil-based calibrations9 fixed the origin of the subgroup to ca. 33 Ma10, inferred the subsequent divergence of the Xerces Blue and Silvery Blue to 2.40 Ma (1.95-2.73 Ma 95% HPD interval, Fig 1B). The recent speciation of Xerces and Silvery blue is not obviously due to infection with the Wolbachia, as no evidence of infection of the sampled specimens with this alpha-proteobacterium is detected in the raw read data.

Phylogenetic placement of the Xerces Blue.

A: Maximum likelihood tree from whole mitochondrial genomes of Xerces Blue, Silvery Blue and Green-Underside Blue. Node labels are bootstrap support values. B: Time-calibrated phylogeny from Bayesian inference using mitochondrial protein-coding genes of Xerces Blue and related butterflies. Node values show median age estimates from dating analysis with a molecular clock (above nodes) or from fixing the age of the root (below nodes). Bars are 95% HPD intervals for node ages. All posterior probabilities were 1, except for one node annotated in black.

Principal Component Analysis (PCA) using PCAngsd11 and nuclear genome polymorphisms for the three Glaucopsyche species supports the relationships among them; the historical specimens are equally distant to G. alexis in the first PC, explaining 52.81% of the variance (Fig. S9). The second PC separates the Xerces Blue from the Silvery Blue specimens.

Demographic history and diversity

We used the Pairwise Sequentially Markovian coalescent (PSMC) algorithm12 to evaluate the demographic histories of both butterfly species, first exploring the two specimens with highest coverage (L05 and L13) (Fig. 2). We found an increase in effective population size in both species that is roughly coincident with the interglacial Marine Isotopic Stage 7 (approximately from 240,000 to 190,000 years ago13). After this timepoint the trends differ. We estimated a continuous decrease in Xerces Blue population size in parallel to the Wisconsin Glacial Episode, which started about 75,000 years ago. However, both the modern and the historical Silvery Blue do not appear to have been negatively affected by this event (Fig. S10), suggesting different adaptive strategies to cope with cooling temperatures and/or food plant availability.

PSMC plot of one Xerces Blue (Glaucopsyche xerces) (L05) specimen and one Silvery Blue specimen (Glaucopsyche lygdamus).

The two historical samples are those with higher average coverage. Individual PSMC plots were bootstrapped 100 times each (lighter lines). One year of generation time and a mutation rate of μ=1.9×10-9 were used. The peak of the Marine Isotopic Stage 7 interglacial is marked in yellow.

Second, we generated PSMC curves from the remaining lower-coverage individuals and down-sampled data from specimen L05 to 50% and 75% of the total coverage to explore the effects of coverage on estimation of heterozygous sites. Although there was a reduction in the effective population size estimates, as expected, the temporal trajectories in lower-coverage individuals were similar to their respective, higher-coverage Xerces Blue and Silvery Blue references (Fig. S10).

We subsequently explored the heterozygosity of each individual and found that Xerces Blue had 22% less heterozygosity on average than the Silvery Blue historical samples, a difference that is statistically significant (T-test; p=0.0072) (Fig. S8, Table S2). We searched for runs of homozygosity (RoH) that can indicate the existence of inbreeding in a dwindling population. The total fraction of the genome presenting RoH, although limited, is much higher in Xerces Blue (up to 6% of the genome) than in Silvery Blue, especially in short RoH of size between 100 and 500 kb (Fig. 3 and Fig. S11), consistent with background inbreeding. The limited presence of long RoH discards consanguinity as a common scenario in Xerces Blue.

Runs of Homozygosity (RoH) in the genomes of Xerces Blue and Silvery Blue (modern and historical).

a: Percentage of the autosomal genome in RoH by size bins: very short RoH (<100 Kb), short RoH (100-500 Kb), intermediate RoH (500Kb-1Mb) and long (1-5Mb). Short RoH reflect LD patterns, intermediate size RoH describe background inbreeding due to genetic drift and long RoH appear in the case of very recent inbreeding due to consanguinity. Error bars show the standard deviation. b: Distribution of RoH in the autosomal genome of a Xerces specimen, L05 c) Distribution of RoH in the autosomal genome of a Silvery specimen L13.

We identified amino acid-changing alleles that may be suggestive of a deleterious genetic load associated with long-term low population numbers in the Xerces Blue. The average Ka/Ks ratio is higher in Xerces Blue than in Silvery Blue; the former also carries a higher fraction of nonsense and functionally high-to-moderate effect variants in homozygosity and RoHs with an increased concentration of high-to-moderate effect variants (Fig. 4), as predicted with a functional prediction toolbox, SnpEff14.

Functional effect prediction on the fixed amino acid-changing alleles observed in Xerces Blue and Silvery Blue.

a: Wide genome Ka/Ks ratio comparison. b: High-to-moderate effect variant comparison in homozygous sites. c: High-to-moderate effect variant comparison in heterozygous sites. d: Presence of high-to-moderate variants in regions of the genome in RoH. Error bars show the standard deviation.

Discussion

We have used a modern reference genome and ancient DNA genome sequence data from museum specimens to explore the relationships and historical population genetic history of an extinct butterfly, the Xerces Blue; to our knowledge, this is the first ancient genome ever generated from an extinct insect. Based upon a near-complete mtDNA genome from a Xerces Blue specimen, Grewe et al. (2021)4 proposed that the Xerces Blue and the Silvery Blue were distinct species. We confirm this finding using full mitochondrial genomes and extensive nuclear genomic data from multiple specimens. Given the lack of evidence for Wolbachia infection, the recent speciation of Xerces Blue and Silvery Blue seems unrelated to cytoplasmic incompatibility cause by this endosymbiont15,16; a detailed analysis of genomic architectures could help identify barriers to introgression between these species.

Our analyses indicate that the Xerces Blue had experienced a severe demographic decline for tens of thousands of years, likely associated with changing climatic factors. Thus, the destruction of the Xerces Blue habitat by humans was likely the final blow in the extinction process. We provide evidence for low population size in Xerces Blue, correlated with low genetic variation, a higher proportion of runs of homozygosity and increased frequency of deleterious, amino acid-changing alleles 1719. However, there was no genetic evidence of recent inbreeding.

Inbreeding genetic signals in the form of long chromosomal sections with no variation sometimes occur in critically endangered species 20,21 and in extinct species such as the last Mammoths from Wrangel Island22 or the Altai Neanderthal23. The PSMC shows a continuous low effective population size for Xerces Blue; demographic declines are also seen in some extinct species, including Wrangel Mammoths19 but not in others such as the Woolly Rhino that showed a pre-extinction demographic stability and relatively low inbreeding signals24. In many endangered species there is little concordance between genome diversity, population sizes and conservation status21; this decoupling was also observed in the genomes of the extinct passenger pigeon that despite being one of the world’s most numerous vertebrates, showed a surprisingly low genetic diversity25. Despite being notoriously abundant, insects, and in particular butterflies, are very sensitive to climate fluctuations; therefore, we suggest that insects with observations of demographic traits indicative of long-term low effective population size such as those found in Xerces Blue should be considered to be especially vulnerable to extinction events.

Our study further demonstrates the value of museum insect specimens for estimating temporal changes in genetic diversity at a population scale26. Despite being notoriously abundant, insects, and in particular butterflies, are very sensitive to climate fluctuations. We suggest that insects with genetic observations of long-term low effective population size such as those found in Xerces Blue should be considered to be especially vulnerable to extinction events. However, being the insect numbers usually very high, it is likely that their genomic signals of extinction could be different to those described in vertebrates in many cases. Therefore, this is a subject that should be further explored with genomic data from other declining insects.

Methods

Historical butterfly specimens

The Xerces Blue specimens analysed belong to the Barnes collection deposited at the Smithsonian National Museum of Natural History. Two of them were collected on April 26th, 1923. The Silvery Blue specimens were mostly collected between 1927 and 1948, in Haywood City, Santa Cruz, Oakland, San José, Fairfax and Marin County (these locations surround San Francisco Bay) (Table S1).

DNA extraction and sequencing of Xerces Blue and Silvery Blue specimens

All DNA extraction and initial library preparation steps (prior to amplification) were performed in a dedicated clean lab, physically-isolated from the laboratory used for post-PCR analyses. Strict protocols were followed to minimize the amount of human DNA in the ancient DNA laboratory, including the wearing a full body suit, sleeves, shoe covers, clean shoes, facemask, hair net and double gloving, as well as frequent bleach cleaning of benches and instruments. DNA extraction was performed from 12 abdominal samples of historical Xerces Blue and Silvery Blue, as well as a modern Silvery Blue specimen from Canada. Experimental procedures are described in detail in the Supplementary Material.

Glaucopsyche alexis genome sequencing and annotation

Glaucopsyche alexis was chosen as a congeneric reference to compare the demographic histories of both the Xerces Blue and the Silvery Blue. We generated a G. alexis reference genome from a male specimen collected in Alcalá de la Selva in Teruel (Spain). Its genome has a sequence length of 619,543,730 bp on 24 chromosomes – including the Z sex chromosome – and the mitochondrial genome. The genome sequence is biologically complete (BUSCO Lepidoptera completeness 97.1%)27. The G. alexis genome was sequenced at the Sanger Institute as part of the Darwin Tree of Life Project following the extraction, sequencing and assembly protocols developed for Lepidoptera7.

Xerces Blue and Silvery Blue mapping and variant calling

The ancient DNA reads were clipped using AdapterRemoval228, and only reads longer than 25bp were kept. Filtered reads were mapped against the G. alexis assembly with Burrows-Wheeler Aligner (BWA)29, with parameters optimised for the analysis of aDNA (Supplementary Materials). Basic mapping statistics were generated using Qualimap230 (Table S2). We used bedtools31 to assess genome coverage across the reference using windows of 1mbp for the nuclear fraction of the genome (Fig. S4 – S6), as well as depth of coverage (Fig. S7), read length (Fig. S1) and edit distance distribution (Fig. S8). Authenticity of the sequences was assessed by characterising aDNA damage patterns with pmdtools32 and MapDamage233 (Fig S2 and S3).

We used snpAD34, a program for genotype calling in ancient specimens. The mapped sequences were transformed from bam-format into snpAD-format files, priors for base composition estimated, and genotypes were called using standard settings. The VCFs were combined and concatenated with CombineVariants and GatherVcfs from GATK35 and filtered with vcftools36 to keep only sites within the mappable fraction of the genome previously obtained with minimum read depth of 2, max read depth of 30, genotype quality > 30, maximum missingness of 0.6, minor allele frequency of 5% and excluding indels and multiallelic sites.

Genotype likelihoods were obtained with ANGSD37 using the GATK model with the following parameters for all the samples: -uniqueOnly 1 -remove_bads 1 - only_proper_pairs 1 -trim 10 -C 50 -baq 1 -minInd 5 -skipTriallelic 1 -GL 2 -minMapQ 30 -doGlf 2 -doMajorMinor 1 -doMaf 2 -minMaf 0.05 -SNP_pval 1e-6.

Sex determination

The sex of the specimens was determined by differential coverage of the Lepidopteran Z chromosome (females are the heterogametic sex in the Lepidoptera and show reduced coverage on the Z chromosome) (Table S2).

Mitochondrial phylogenetic tree and divergence dating

Haploid variants were called using bcftools38 with a ploidy of 1, filtering low quality indels and variants, after which a consensus sequence was exported. We downloaded 14 complete mitochondrial genomes for Polyommatinae from NCBI (Table S3).

All mitochondrial genomes were annotated with MitoFinder39 using Shijimiaeoides divina as the reference. The 11 protein-coding genes were aligned with the codon-aware aligner MACSE40 and the ribosomal rRNAs were aligned with MAFFT l-ins-i41. We first investigated phylogenetic relationships among five G. xerces and eight G. lygdamus individuals, with G. alexis as the outgroup. We used IQ-TREE242 to select the best fitting nucleotide substitution model for each partition and merge similar partitions43, built a maximum likelihood tree and assessed support with 1000 ultrafast bootstrap replicates44.

To infer a time-calibrated phylogenetic hypothesis, we selected one individual of Xerces Blue (L003) and Silvery Blue (RVcoll10-B005) and analysed with 13 other Polyommatinae species. We used BEAST245 with the bModelTest46 package to perform phylogenetic site model averaging for each of the merged partitions. Because there is no accepted molecular clock rate for butterflies and no fossils to apply in this part of the phylogeny, we used two strategies to apply time constraints to the analysis. First, we used two published molecular clock rates for the mitochondrial COX1 gene (1.5% divergence/Ma estimated for various invertebrates47, and the ‘standard’ insect mitochondrial clock 2.3% divergence/Ma48). We applied a strict clock with a normal prior set up to span 1.5-2.3% with the 95% HPD interval (mean=1.9%, sigma=0.00119). Second, we borrowed the age of the most recent common ancestor of our sampled taxa from fossil-calibrated analyses across butterflies10,49. We fixed the root age to 33 Ma and allowed the remaining node ages to be estimated using a strict clock. Analyses were run twice from different starting seeds for 10 million MCMC generations and trees were sampled every 1000 generations. Runs were checked for convergence with Tracer and all effective sample size (ESS) values were >200. Runs were combined with the BEAST2 package LogCombiner50, after removing the first 10% of topologies as burn-in, and a maximum credibility tree was generated with TreeAnnotator50. Phylogenetic analyses were performed on the National Life Science Supercomputing Center - Computerome 2.0 (www.computerome.dk).

Xerces Blue and Silvery Blue population histories

We used the Pairwise Sequentially Markovian Coalescent (PSMC) model12 to explore the demographic history of both butterfly species. We obtained a consensus fastq sequence of the mappable fraction of the genome for each autosomal chromosome (total of 22 chromosomes of G. alexis assembly). Only positions with a depth of coverage above 4X and below 15X were kept. Posteriorly a PSMC was built using the following parameters: -N25 -t15 -r5 -p “28*2+3+5”. We used 1 year for the generation time and a mutation rate of 1.9×10-9, estimated in Heliconius melpomene51. Considering that calling consensus sequences from low coverage samples (<10x) can underestimate heterozygous sites52, and given the different coverage between samples, we corrected by False Negative Rate the samples with coverage lower than the coverage of L005 (for Xerces Blue) and L013 (for Silvery Blue), as recommended by the developers of the software, so that all samples are comparable with each other. However, since in our dataset we do not reach a coverage >20x, we acknowledge that we are not capturing the whole diversity and thus our PSMC might infer lower historical effective population sizes.

Population stratification and average genome heterozygosity

Principal Component Analysis (PCA) was performed using PCAngsd11 after obtaining genotype likelihoods with ANGSD including all individuals. To assess global levels of heterozygosity, the unfolded SFS was calculated for each sample separately using ANGSD37 and realSFS with the following quality filter parameters: -uniqueOnly 1 - remove_bads 1 -only_proper_pairs 1 -trim 10 -C 50 -baq 1 -minMapQ 30 -minQ 30 - setMaxDepth 200 - doCounts 1 -GL 2 -doSaf 1.

Runs of Homozygosity (RoH)

RoH were called based on the density of heterozygous sites in the genome using the implemented Hidden Markov Model (HMM) in bcftools38 roh with the following parameters: -G30 --skip-indels --AF-dflt 0.4 --rec-rate 1e-9 from the mappable fraction of the genome with the filtered VCF file. We kept the RoH with a phred score > 85. We divided the RoH into different size bins: very short RoH (<100 kb), short RoH (100-500 kb), intermediate RoH (500 kb-1Mb) and long (1-5 Mb or > 5Mb). Short RoH reflect LD patterns, intermediate size RoH describe background inbreeding due to genetic drift and long RoH appear in the case of recent inbreeding53.

Deleterious load

We used the G. alexis annotations to create a SNPeff database that we used to annotate our callings. Using SNPeff14 again and the set of variants discovered by angsd, we predicted the putative effect of those variant in the analysed individuals (Table S2). In addition to wide genome mutations, we specifically focused on mutations present in homozygosis, heterozygosis and the previously annotated RoH.

Unrecoverable regions

To further explore how the genomic divergence can influence our genome reconstruction success, we undertook a similar approach as the genome of the Christmas Island rat54, and explored the chromosomal regions in the G. alexis reference that were significantly depleted of Xerces DNA reads. We used bedtools31 and some in-home bash scripting to calculate the mean coverage per gene of the G. alexis genome for Xerces Blue sequencing DNA reads. We first used bedtools’ algorithms bamtobed and genomecov to estimate the genome-wide per-site coverage of the reference genome in these two species. Then, we extracted the coordinates of all protein coding genes from the annotation file (gff file) and used the intersect to estimate the average coverage of each protein coding gene. We performed a functional analysis of all genes uncovered in G. alexis, excluding those that are present in G. lygdamus with more than 5x coverage (as we were looking for evolutionary novelties in the Xerces Blue lineage alone) using profile-IntersProScan55 and sequence similarity-based (blasp) searches56.

Wolbachia screening

Wolbachia are endosymbiotic alpha-proteobacteria that are present in about 70% of butterfly species and induce diverse reproductive alterations, including genetic barriers when two different strains infect the same population or when two populations – one infected and one uninfected – meet15. As potential evidence for a reproductive barrier promoting the separation of Xerces Blue and Silvery Blue, we searched for Wolbachia DNA reads in our specimens, taking advantage of the high coverage and the shotgun approach. First, we collapsed unique reads from the butterfly-free sequences with BBmap57 and removed from the dataset low complexity sequences using Prinseq58. Afterwards, we used Kraken259 to assign reads against the standard plus human Kraken2 database (bacteria, archaea, fungi, protozoa and viral). The historical specimens did not display enough reads assigned to Wolbachia for us to suspect of the presence of the bacteria in those samples (Table S6).

Acknowledgements

C.L.-F. is supported by a PID2021-124590NB-100 grant (MCIU/AEI/FEDER, UE) of Spain; T.M.-B. is supported by funding from the European Research Council (ERC) (grant agreement No. 864203), BFU2017-86471-P (MINECO/FEDER, UE), “Unidad de Excelencia María de Maeztu”, funded by the AEI (CEX2018-000792-M), Howard Hughes International Early Career, and Generalitat de Catalunya, GRC 2017-SGR-880; R.V. is supported by grant PID2019-107078GB-I00, funded by MCIN/AEI/ 10.13039/501100011033, and by GRC 2017-SGR-991 (Generalitat de Catalunya). We are grateful to the SCIENCE Faculty at University of Copenhagen for free access to Computerome 2.0. This research was funded in whole, or in part, by the Wellcome Trust Grants 206194 and 218328 (MU, MB). For the purpose of Open Access, the author has applied a CC BY public copyright license to any Author Accepted Manuscript version arising from this submission.

Competing Interest Statement

The authors declare no competing interest.

Author Contributions

C.L.-F., R.V., R.R. and P.R. conceived the project. R.R. studied and sampled the specimens. L.L. and E.L. performed experimental work. T.d-D., C.F., J.S., A.S.G., M.U.-S., C.W., S.C. and P.R. undertook different computational analyses. C.L.-F., T.M.-B., A.N. and M.B. coordinated different computational teams. J.S. worked in visualization. C.L.-F., J.S. and R.R. wrote the manuscript with input from all coauthors.

Data Accessibility

The genetic data generated is publicly available; the accession numbers for the Xerces Blue and Silvery Blue genomes reported in this study are in the European Nucleotide Archive (ENA): PRJEB47122. Data on G. alexis are available in INSDC under BioProject PRJEB43798 and genome assembly accessions GCA_905404095.1 (primary haplotype) and GCA_905404225.1 (secondary, alternate haplotype).