1. Genetics and Genomics
Download icon

The genetic basis for ecological adaptation of the Atlantic herring revealed by genome sequencing

  1. Alvaro Martinez Barrio
  2. Sangeet Lamichhaney
  3. Guangyi Fan
  4. Nima Rafati
  5. Mats Pettersson
  6. He Zhang
  7. Jacques Dainat
  8. Diana Ekman
  9. Marc Höppner
  10. Patric Jern
  11. Marcel Martin
  12. Björn Nystedt
  13. Xin Liu
  14. Wenbin Chen
  15. Xinming Liang
  16. Chengcheng Shi
  17. Yuanyuan Fu
  18. Kailong Ma
  19. Xiao Zhan
  20. Chungang Feng
  21. Ulla Gustafson
  22. Carl-Johan Rubin
  23. Markus Sällman Almén
  24. Martina Blass
  25. Michele Casini
  26. Arild Folkvord
  27. Linda Laikre
  28. Nils Ryman
  29. Simon Ming-Yuen Lee
  30. Xun Xu
  31. Leif Andersson  Is a corresponding author
  1. Uppsala University, Sweden
  2. University of Macau, China
  3. BGI-Shenzhen, China
  4. Qingdao University, China
  5. Stockholm University, Sweden
  6. Southeast University, China
  7. Swedish University of Agricultural Sciences, Sweden
  8. University of Bergen, Norway
  9. Hjort Center of Marine Ecosystem Dynamics, Norway
  10. Institute of Marine Research, Norway
  11. Texas A&M University, United States
Research Article
  • Cited 30
  • Views 4,564
  • Annotations
Cite this article as: eLife 2016;5:e12081 doi: 10.7554/eLife.12081

Abstract

Ecological adaptation is of major relevance to speciation and sustainable population management, but the underlying genetic factors are typically hard to study in natural populations due to genetic differentiation caused by natural selection being confounded with genetic drift in subdivided populations. Here, we use whole genome population sequencing of Atlantic and Baltic herring to reveal the underlying genetic architecture at an unprecedented detailed resolution for both adaptation to a new niche environment and timing of reproduction. We identify almost 500 independent loci associated with a recent niche expansion from marine (Atlantic Ocean) to brackish waters (Baltic Sea), and more than 100 independent loci showing genetic differentiation between spring- and autumn-spawning populations irrespective of geographic origin. Our results show that both coding and non-coding changes contribute to adaptation. Haplotype blocks, often spanning multiple genes and maintained by selection, are associated with genetic differentiation.

https://doi.org/10.7554/eLife.12081.001

eLife digest

The Atlantic herring is one of the most common fish in the world and has been a crucial food resource in northern Europe. One school of herring may comprise billions of fish, but previous studies had only revealed very few genetic differences in herring from different geographic regions. This was unexpected since Atlantic herring is one of the few marine species that can reproduce throughout the brackish Baltic Sea, which can be about a tenth as salty as the Atlantic Ocean.

This unexpected finding could be explained in at least two different ways. Firstly, perhaps Atlantic herring are flexible enough to adapt to very different environments (i.e. high or low salinity) without much genetic change. Secondly, the previous studies only looked at a handful of sites in the Atlantic herring’s genome and so it is possible that genetic differences at other genes control this fish’s adaptation instead.

Now, Martinez Barrio, Lamichhaney, Fan, Rafati et al. have sequenced entire genomes from groups of Atlantic herring and revealed hundreds of sites that are associated with adaptation to the Baltic Sea. The analysis also identified a number of genes that control when these fish reproduce by comparing herring that spawn in the autumn with those that spawn in spring. This is important because natural populations must carefully time when they reproduce to maximize the survival of their young.

These new findings provide compelling evidence that changes in protein-coding genes and stretches of DNA that regulate the expression of other genes both contribute to adaptation in herrings. The analysis also clearly shows that variants of genes that contribute to adaptation were likely to evolve over time by accumulating multiple sequence changes affecting the same gene. Furthermore, these gene variants essentially form a rich “tool-box” that underlies the Atlantic herring’s adaptation to its environment, and different subpopulations of herring were found to have their own optimal sets of gene variants. For instance, autumn-spawning herring and spring-spawning herring from the Baltic Sea both have gene variants that favor adaptation to low salinity. However, autumn-spawning Baltic herring also share gene variants that favor spawning in the autumn with autumn-spawning herring from the North Sea, but not with spring-spawning Baltic herring.

The next step will be to study how the 500 or so genes identified affect adaptation at the molecular level. This will likely involve experiments with other model fish such as zebrafish and sticklebacks. Finally, these new findings can be directly applied to monitor stocks of herring to make herring fisheries more sustainable.

https://doi.org/10.7554/eLife.12081.002

Main text

The Atlantic herring (Clupea harengus) is a pelagic fish that occurs in huge schools, up to billions of individuals. The herring fishery has been crucial for food security and economic development in Northern Europe and currently ranks among the five largest fisheries in the world with nearly 2 million tons fish landed annually (FAO, 2014). The herring is one of few marine fishes that reproduce throughout the Baltic Sea where the salinity drops to 2–3‰ in the Bothnian Bay, compared with 35‰ in the Atlantic Ocean (Figure 1A). This ecological adaptation must be recent because the brackish Baltic Sea has only existed for 10,000 years following the last glaciation (Andrén et al., 2011). Fishery biologists have for more than a century recognized stocks of herring defined by spawning location, spawning time, morphological characters and life history parameters (Iles and Sinclair, 1982; McQuinn, 1997). Several decades of genetic studies based on limited numbers of genetic markers (allozymes, microsatellites or SNPs) have not been able to verify this divergence; extremely low levels of differentiation even between geographically distant populations as well as between spring- and autumn-spawning herring have been observed (Andersson et al., 1981; Ryman et al., 1984; Larsson et al., 2007; 2010, Limborg et al., 2012). It has been proposed that lack of precision in homing behaviour of herring causes sufficient gene flow between stocks to counteract genetic differentiation (McQuinn, 1997). However, in a recent study we constructed an exome assembly and used this in combination with whole genome sequencing of eight population samples and found more than 400,000 SNPs (Lamichhaney et al., 2012). We confirmed lack of differentiation at most loci, whereas a small percentage showed highly significant differentiation. Simulations demonstrated that the distribution of fixation index (FST)-values among herring populations deviated significantly from expectation for selectively neutral loci.

Figure 1 with 1 supplement see all
Demographic history and phylogeny.

(A) Geographic location of samples. The salinity of the surface water in different areas is indicated schematically. Autumn spawners are marked with an asterisk. (B) Demographic history. Black circles indicate effective population size over time estimated by diCal (Sheehan et al., 2013); estimates are averages from four arbitrarily chosen genomic regions. The grey field is confidence interval ( ± 2 sd), while light grey lines show the underlying estimates from each genomic region. (C) Neighbor-joining phylogenetic tree. The evolutionary distance between Atlantic and Pacific herring was calculated using mtDNA cytochrome B sequences; right panel, zoom-in on the cluster of Atlantic and Baltic herring populations. Colour codes for sampling locations are the same as in Figure 1A. (D) Global distribution of FST –values based on 19 populations of Atlantic and Baltic herring. The inset illustrates the tail of the distribution. The mean and median of this distribution are indicated. To reduce the FST sampling variance, we only used SNPs with ≥30x coverage in each population.

https://doi.org/10.7554/eLife.12081.003

Genetic studies of ecological adaptation in natural populations is challenging because genetic differentiation caused by natural selection is often confounded with genetic differences due to genetic drift caused by restricted effective population sizes. An ideal species for studying the genetic basis of ecological adaptation should comprise subpopulations of infinite size and exposed to different ecological conditions. In such a species there is minute genetic drift and genetic differentiation is caused by selection resulting in local adaptation. The herring is close to being such an ideal subject for studies of ecological adaptation due to the extremely low levels of genetic differentiation at most loci as documented in previous studies (Andersson et al., 1981; Ryman et al., 1984; Larsson et al., 2007; 2010; Limborg et al., 2012; Lamichhaney et al., 2012). This unique opportunity together with herring being such a valuable natural resource prompted us to generate a genome assembly and perform genome sequencing of populations adapted to different ecological conditions.

Here we present a high-quality genome assembly for the Atlantic herring, and results of whole genome sequencing of 20 population samples using pooled DNA. The results were verified by individual genotyping using a custom-made 70k SNP array. Our study addresses two fundamentally different types of adaptations; one example of niche expansion (adaptation to low salinity), and one example of sympatric balancing selection (variation in the timing of reproduction). The results provide a comprehensive list of hundreds of independent loci underlying ecological adaptation and shed light on the relative importance of coding and non-coding variation. The results have important implications for sustainable fishery management, and provide a road map for cost effective high-resolution characterization of genetic diversity in natural populations.

Results

Genome assembly and annotation

Clupeiformes represents an early diverging clade of the otomorpha (Near et al., 2012) (Figure 2A). The genome size for herring has been estimated at ~850 Mb (Hinegardner and Rosen, 1972; Ida et al., 1991; Ohno et al., 1969) with no recent whole genome duplications reported. We performed whole genome assembly based on short read sequencing of libraries ranging from 170 bp to 20 kb insert sizes (Supplementary file 1A). The 808 Mb assembly had a scaffold N50 of 1.84 Mb with 23,336 predicted coding gene models. It showed a high degree of completeness based on RNAseq alignments, core gene analyses and comparisons to other fish gene sets (Table 1, Supplementary files 2, 3A–D, Figure 2B, Figure 2—figure supplements 12). The GC content was 44%, and repetitive elements made up 31% of the assembly (Table 1). Alignments of synthetic long reads (SLRs; Illumina) failed to significantly improve the assembly due to coincidental gaps between the assembly and the SLRs, but proved useful in phasing parental alleles (Materials and methods; Figure 2—figure supplements 34) and dramatically improved the discovery of indels larger than 30 bp compared to short Illumina reads (Supplementary file 1F). We identified 150 endogenous retroviruses (ERVs) constituting ~0.14% of the genomic sequence but none included open reading frames in all gag, pol and env genes (Supplementary file 1, Figure 2—figure supplement 5).

Figure 2 with 5 supplements see all
Genome assembly and annotation.

(A) Phylogeny of ray-finned fishes (Actinopterygii) from the Devonian to the present, time-calibrated to the geological time scale based on Near et al. (2012). Geological abbreviations: C (Carboniferous), CZ (Cenozoic), D (Devonian), J (Jurassic), K (Cretaceous), Ng (Neogene), P (Permian), Pg (paleogene) and Tr (Triassic). Dating of the specific rounds of whole genome duplication is based on Glasauer and Neuhauss (2014). Abbreviations: Ts3R (teleost-specific third round) and Ss4R (salmonid-specific fourth round) of duplication. The number of species with a genome assembly available is marked within parentheses after their group’s name. Atlantic herring belongs to Clupeiformes, the order indicated in red letters. (B) Orthologous gene families across four fish genomes (C. harengus, D. rerio, L. chalumnae and G. morhua).

https://doi.org/10.7554/eLife.12081.005
Table 1

Summary of the herring assembly compared to other sequenced fish genomes.

https://doi.org/10.7554/eLife.12081.011
SpeciesHerring (Clupea
harengus)
Zebrafish
(Danio
rerio)
Cod (Gadus
morhua)
Coelacanth
(Latimeria
chalumnae)
Stickleback (Gasteosteus
aculeatus)
Estimated genome size (Mb)8501,454a830b3,530c530d
Assembly size (Mb)8081,412753b2,861e463f
Contig N50 (kb)21.325.02.812.783.2
Scaffold N50 (Mb)1.841.550.690.9210.8
Sequencing technologygIS+IR+IIS
Repeat content30.952.225.427.725.2
%GC content44.136.745.443.044.6
Heterozygosity1/309n.a.1/5001/4351/700
Protein-coding gene count23,33626,45922,15419,03320,787
  1. cGenome size calculated as pg x 0.978 × 109 bp/pg; picogram values taken from Cimino and Bahr (1974)

  2. gI=Illumina sequencing; S=Sanger sequencing; R=Roche 454 n.a.=not available

Population genetics and demographic history

Whole genome pooled sequencing was done using 20 population samples of herring from the Baltic Sea, Skagerrak, Kattegat, North Sea, Atlantic Ocean and Pacific Ocean (Figure 1A; Table 2); the latter sample represents the closely related Pacific herring (Clupea pallasii). Each pool comprised 47–100 fish and was sequenced to ~30x coverage. Furthermore, 16 fish, eight Baltic and eight Atlantic herring (Table 2), were sequenced individually to ~10x coverage. All data were aligned to the reference assembly and SNPs were called after rigorous quality filtering. We found 8.83 million SNPs when Pacific herring was included and 6.04 million among Atlantic and Baltic herring.

Table 2

Samples of herring used for whole genome resequencing.

https://doi.org/10.7554/eLife.12081.012
LocalityaSamplenPositionSalinity (‰)Date
(yy/mm/dd)
Spawning
season
Baltic Sea
Gulf of Bothnia (Kalix)bBK47N 65°52’E 22°43’3800629spring
Bothnian Sea (Hudiksvall)BU100N 61°45’E 17°30’6120419spring
Bothnian Sea (Gävle)BÄV100N 60°43’E 17°18’6120507spring
Bothnian Sea (Gävle)BÄS100N 60°43’E 17°18’6120718summer
Bothnian Sea (Gävle)BÄH100N 60°44’E 17°35’6120904autumn
Bothnian Sea (Hästskär)cBH50N 60°35’E 17°48’6130522spring
Central Baltic Sea (Vaxholm)bBV50N 59°26’E 18°18’6790827spring
Central Baltic Sea (Gamleby)bBG49N 57°50’E 16°27’7790820spring
Central Baltic Sea (Kalmar)BR100N 57°39’E 17°07’7120509spring
Central Baltic Sea (Karlskrona)BA100N 56°10’E 15°33’7120530spring
Central Baltic SeaBC100N 55°24’E 15°51’8111018unknown
Southern Baltic Sea (Fehmarn)bBF50N 54°50’E 11°30’12790923autumn
Kattegat, Skagerrak, North Sea, Atlantic Ocean
Kattegat (Träslövsläge)bKT50N 57°03’E 12°11’20781023unknown
Kattegat (Björköfjorden)KB100N 57°43’E 11°42’23120312spring
Skagerrak (Brofjorden)SB100N 58°19’E 11°21’25120320spring
Skagerrak (Hamburgsund)bSH49N 58°30’E 11°13’25790319spring
North SeabNS49N 58°06’E 06°10’35790805autumn
Atlantic Ocean (Bergen)bAB149N 64°52’E 10°15’35800207spring
Atlantic Ocean (Bergen)cAB28N 60°35’E 05°00’33130522spring
Atlantic Ocean (Höfn)AI100N 65°49’W 12°58’35110915spring
Pacific Ocean
Strait of Georgia (Vancouver)PH50--35121124-
  1. aPlaces where the sample was landed (if known) are given in parenthesis

  2. bSamples from previous study (Lamichhaney et al., 2012)

  3. cEight Baltic herring from the BH sample and eight Atlantic herring from the AB2 sample were used for individual sequencing n=number of fish

Average nucleotide diversity was estimated by counting the frequency of heterozygous sites in the reference individual after stringent filtering for sequence quality and coverage (within one standard deviation of mean coverage). The estimate was one heterozygous site per 309 bp, giving a nucleotide diversity of 0.32%; no estimate based on the 16 herring sequenced individually deviated significantly from this value and there was no significant difference between Atlantic and Baltic herring. The average decay of linkage disequilibrium between loci was very steep, with average r2 falling to 0.1 at a distance of 100 base pairs (Figure 1—figure supplement 1A).

The allele frequency distribution deviated significantly from the one expected for selectively neutral alleles at genetic equilibrium (p<2x10-16, Kolmogorov-Smirnov test), due to an excess of rare alleles (Figure 1—figure supplement 1B) consistent with population expansion. The result is supported by the genome-wide distribution of Tajima’s D, which shows a global shift towards negative values (mean=−0.57 ± 0.01; Figure 1—figure supplement 1C). A demographic analysis using the diCal software (Sheehan et al., 2013) confirmed that herring have experienced an expansion in effective population size, roughly five- to ten-fold, and that the current Ne is on the order of 106 individuals (Figure 1B); the results for Baltic and Atlantic herring were essentially identical. The result indicates that the effective population size minimum occurred at around one to two MYA, after the onset of the Quaternary ice age.

Phylogeny

The neighbor-joining phylogenetic tree including Atlantic, Baltic and Pacific herring shows a large phylogenetic distance between Pacific and Atlantic herring, as compared with the tiny genetic divergence among samples of Atlantic and Baltic herring (Figure 1C). We estimated the split between Atlantic and Pacific herring to ~2.2 million years ago based on mtDNA cytochrome B sequence divergence. The phylogenetic tree is consistent with minute differentiation at selectively neutral loci in Atlantic herring (Ryman et al., 1984; Lamichhaney et al., 2012); all subpopulations in the Eastern North Atlantic may have expanded from a common ancestral population after the last glaciation as indicated by demographic analysis (Figure 1B).

A closer examination of the tight cluster of Atlantic and Baltic herring populations reveals some structure consistent with geographic origin (Figure 1C). Samples from the Baltic Sea cluster on one half while samples from marine waters cluster on the other half of the tree. Only three populations are located at intermediate positions. Two of these are autumn-spawners from the Baltic Sea (BÄH and BF), indicating that autumn-spawning herring are genetically distinct from spring- and summer-spawning herring. The third sample (KT) at an intermediate position was sampled outside the spawning season and at the border between Kattegat and Baltic Sea and may represent a mixed sample of local Kattegat population and fish that spawn in the Baltic Sea but migrate into Kattegat for feeding.

Genetic adaptation to a new niche environment

The Atlantic (Clupea harengus harengus) and Baltic herring (Clupea harengus membras) were classified as subspecies by Linnaeus (1761) in the 18th century. They are adapted to strikingly different environments, in particular regarding salinity that ranges from 2–3‰ in the Gulf of Bothnia to 12‰ in Southern Baltic Sea, whereas salinity in Kattegat, Skagerrak, North Sea and Atlantic Ocean is in the range 20‰–35‰ (Figure 1A; Table 2). To reveal loci underlying genetic adaptation associated with the recent niche expansion into brackish waters after the last glaciation we compared allele frequencies, SNP by SNP, in two superpools: one Atlantic including all populations from Atlantic Ocean, Skagerrak and Kattegat and a pool comprising all samples collected in Baltic Sea; this is justified by low differentiation at neutral loci as documented by the low FST-values when comparing all samples of Atlantic and Baltic herring (Figure 1D). Samples of autumn-spawning herring, a possible confounding factor, were excluded from the analysis. We used a stringent significance threshold of p<1x10-10 (Bonferroni correction, p=8.2x10-9).

We identified 46,045 SNPs that showed an allele frequency difference with p<1x10-10 in the χ2 test (Figure 3A; Supplementary file 3A). An important question is how many independent loci these represent. A conservative estimate of 472 independent loci was obtained (i) by only using SNPs with p<1x10-20, (ii) by taking into account gaps in the assembly and (iii) by using the Comb-P software (Pedersen et al., 2012) to combine strongly correlated SNPs from the same genomic region (see Materials and methods). Figure 3A (lower panel) illustrates one of the most striking associations. For a large part of scaffold 218 there are no significant differences among Atlantic and Baltic samples whereas there are striking allele frequency differences over a 119.4 kb region; this is a characteristic pattern for differentiated regions, indicating that genetic adaptation typically occur as large haplotype blocks, often including multiple genes. A phylogenetic tree based on SNPs showing genetic differentiation between Atlantic and Baltic (Figure 3B) differs profoundly from the tree based on all SNPs (Figure 1C). With the exception of the two autumn-spawning populations BF and BÄH from the Baltic Sea, the position of all other populations match the variation in salinity perfectly with the population samples from the North Sea and Atlantic Ocean (35‰) at one end of the tree and samples from the brackish Baltic Sea (3‰–12‰) at the other end and with samples from Skagerrak (25‰) and Kattegat (20‰) at intermediate positions. The low genetic differentiation among Baltic samples, excluding the two autumn-spawning populations BF and BÄH, suggests that adaptation to brackish waters is a derived state.

Figure 3 with 2 supplements see all
Genetic differentiation between Atlantic and Baltic herring.

(A) Manhattan plot of significance values testing for allele frequency differences between pools of herring from marine waters (Kattegat, Skagerrak, Atlantic Ocean) versus the brackish Baltic Sea. Lower panel, corresponding plot for scaffold 218 only; both P- and FST-values are shown. (B) Neighbor-joining phylogenetic tree based on all SNPs showing genetic differentiation in this comparison (p<10-10). (C) Comparison of allele frequencies in five strongly differentiated regions. The major allele in the AB1 sample (Atlantic Ocean) was used as reference at each SNP. Lower panel, neighbor-joining tree based on haplotypes formed by 128 differentiated SNPs from scaffold 218. (D) Heat map showing copy number variation partially overlapping the HCE gene. Orientation of transcription is marked with an arrow; the position of SNPs significant in the χ2 test is indicated by stars. Population samples and salinity at sampling locations are indicated to the right; abbreviations are explained in Table 2. (E) Strong genetic differentiation between Atlantic and Baltic herring in a region downstream of SLC12A3; statistical significance based on the χ2 test is indicated.

https://doi.org/10.7554/eLife.12081.013

Figure 3C (upper panel) shows estimated allele frequencies for highly differentiated SNPs from five genomic regions in six population samples, each region showing an underlying genetic architecture with large and distinctly defined haplotype blocks. The Atlantic Ocean and North Sea samples are both nearly fixed for the reference allele at these SNPs. In contrast, the samples of Baltic herring were close to fixation for the alternate alleles. Interestingly, the sample (SB) collected in Skagerrak (salinity ~25‰) is most similar to the Atlantic Ocean and North Sea samples, but consistently shows a trend towards more intermediate allele frequencies at these loci.

We developed a 70k custom SNP chip to study differentiated regions in more detail and to use data from individual fish to confirm associations detected by pooled sequencing. The chip included 13,355 neutral SNPs evenly distributed across the genome and 59,205 SNPs showing genetic differentiation between subpopulations. Thirty fish each from 12 populations were used in the SNP screen. There was an excellent correlation between allele frequencies estimated with pooled sequencing and with the SNP chip (Figure 3—figure supplement 1). We constructed a phylogenetic tree (Figure 3C, lower panel) for haplotypes of highly differentiated SNPs from scaffold 218 present among individual fish from six representative populations, after phasing haplotypes using BEAGLE (Browning and Browning, 2007). As expected all fish from Atlantic Ocean and North Sea carried closely related “Atlantic” haplotypes. Two major haplotype groups were present among Baltic herring and with few exceptions Baltic herring carried only “Baltic” haplotypes. Fish from Skagerrak predominantly carried Atlantic haplotypes, but with a considerable proportion of Baltic haplotypes. Phylogenetic trees for other top scaffolds are presented in Figure 3—figure supplement 2.

There are many environmental and ecological differences between Atlantic Ocean and Baltic Sea e.g. temperature variability, eutrophication of the Baltic Sea, zooplankton and predator populations), but the most obvious difference concerns salinity. We used the Bayenv 2.0 (Günther and Coop, 2013) software to reveal which of the 472 independent loci detected with the χ2 test showed the most consistent correlation with salinity. This analysis identified 3,335 SNPs from 122 independent regions with highly significant association to salinity (Supplementary file 3A). Twenty-one of the genes in these regions have previously been associated with hypertension in human and 36 of these genes showed differential expression in sticklebacks kept in freshwater or sea water (Supplementary file 3A).

Here we present three loci with striking association to salinity. Firstly, the 11 kb region in scaffold 899 (Figure 3C) contains a single gene, prolactin receptor (PRLR), that is essential for mammalian reproduction but has a central role for osmoregulation in fish (Manzon, 2002), and possibly in mammals (Schennink et al., 2015). Secondly, strong genetic differentiation was also observed at scaffold 346 (Figure 3A; p<1x10-39). This signal overlaps HCE encoding high choriolytic enzyme. This locus was also identified as one of the most differentiated region in our screen for structural changes (Supplementary file 3B). A 4 kb region including part of the coding sequence showed a massive copy number amplification that had a strong negative correlation with salinity (Figure 3D). The outgroup, Pacific herring, showed an intermediate copy number. Interestingly, the Pacific herring spawns exclusively in shallow nearshore waters (Hay et al., 2009) often in estuaries and tidal zones where salinity varies, in contrast to deeper-spawning Atlantic herring. HCE is a protease, also denoted hatching enzyme, that solubilizes the inner layer of the egg envelope during hatching and adaptive evolution of this protein in relation to salinity has been reported (Kawaguchi et al., 2013). In herring, we found no coding changes implying altered transcriptional regulation. In fact, massive amplification of the promoter region is expected to alter gene expression. Hatching of the egg is probably a particularly challenging stage of development for a marine fish adapting to brackish conditions. Thirdly, a ~65 kb region downstream of solute carrier family 12 (sodium/chloride transporter) member 3 (SLC12A3) shows strong correlation with salinity (Figure 3E, Supplementary file 3A). SLC12A3, which has an established role in regulating osmotic balance, is associated with hypertension in human and shows differential expression in kidney tissue between sticklebacks kept in freshwater or sea water (Wang et al., 2014).

Genetic basis underlying timing of reproduction

Herring spawn from early spring to late fall. Prior to this study it was unknown if spawning time is entirely due to phenotypic plasticity, set by nutritional status and environmental conditions, or if genetic factors contribute (McQuinn, 1997). For example, it has been hypothesized that spawning time in the Baltic Sea is regulated by productivity of the system affecting maturation of fish prior to spawning (Aneer, 1985). To study this important question we collected spawning herring from the same geographic area, close to Gävle (Sweden), in May, July and September (Table 2). Our sampling included two other autumn-spawning populations collected in 1979, one from North Sea and the other from Southern Baltic Sea. We formed two superpools including three autumn-spawning and 10 spring-spawning population samples, respectively; the summer-spawners and one population of non-spawning herring (KT in Table 2) were excluded from the initial analysis. We identified 10,195 SNPs with significant allele frequency differences between pools (p<1x10-10) and 69 regions with copy number variation (p<0.001) (Figure 4A); the highly differentiated SNPs represented at least 125 independent loci based on our strict criteria (see Materials and methods). The result demonstrates for the first time that autumn- and spring-spawning herring are genetically distinct and indicates that genetic factors affect spawning time. In a phylogenetic tree based on these 10,195 SNPs the autumn-spawning populations from the Baltic Sea and North Sea tended to cluster with spring-spawning herring from the Atlantic Ocean (Figure 4B).

Figure 4 with 2 supplements see all
Genetic differentiation between spring- and autumn-spawning herring.

(A) Manhattan plot of significance values testing for allele frequency differences. (B) Neighbor-joining phylogenetic tree based on all SNPs showing genetic differentiation in this comparison (p<10-10). (C) Comparisons of allele frequencies in four strongly differentiated regions. The major allele in the AB1 sample (Atlantic Ocean) was set as reference at each SNP. Scaffolds 190 and 1420 have been merged in this plot since it was obvious that they were overlapping. *The signal in scaffold s1440 is present ~27 kb upstream of SOX11 and ~46 kb downstream of DCDC2/ALLC. (D) Neighbor-joining tree based on haplotypes formed by 70 differentiated SNPs from scaffold 190/1420; same populations as in Figure 4C. (E) Plot of average heterozygosity, per SNP in 5 kb windows, across scaffold 1420 indicating a selective sweep among spring-spawners in the region marked with vertical hatched lines. Autumn-spawning populations are marked by an asterisk.

https://doi.org/10.7554/eLife.12081.016

A general linear mixed model was used to identify which of the 125 independent loci showed the most consistent allele frequency differences between spring and autumn spawners. This analysis revealed 17 independent genomic regions that passed the stringent significance threshold of p<10–10 (Bonferroni correction, p=4.9x10-6) (Supplementary file 3C). We then illustrate the striking allele frequency differences at the four most significant regions using data from six different populations. As observed for the genetic adaptation to declined salinity (above), the most significant regions underlying seasonal reproductive timing typically consists of large haplotype blocks often containing multiple genes. Spring-spawning Atlantic and Baltic herring showed nearly identical allele frequencies at these loci while autumn-spawning herring from Baltic Sea and North Sea showed high frequencies of the alternate alleles (Figure 4C). Remarkably, summer-spawning herring showed a clear trend towards intermediate allele frequencies at all loci, most pronounced for scaffold 481 (Figure 4C). This may either reflect that this sample is an admixture of spring- and autumn-spawning herring or that it represents a distinct population. To explore this we investigated deviations from Hardy-Weinberg equilibrium using the FIT statistics because we expect a heterozygote deficiency if this is a mix of two populations. The results, based on 1,500 SNPs all showing strong genetic differentiation between spring- and autumn-spawners and genotyped individually using the SNP chip, showed that the summer spawners (BÄS) did not deviate markedly from FIT = 0 and in fact to a lesser extent than the spring-spawning population (BÄV) sampled at the same locality (Figure 4—figure supplement 1). For instance, individual genotyping of the highly differentiated SNPs from scaffold 481 (Figure 4C) resulted in mean FIT = −0.10 (excess of heterozygotes) for the summer spawners (BÄS) whereas if the sample had constituted an equal mix of spring- and autumn spawners from the same locality (BÄV and BÄH) the expected FIT-value would have been 0.46 (strong heterozygote deficiency). Thus, the data strongly suggest that these summer spawners represent a distinct population rather than admixture. Spawning time may be fine-tuned by the dosage of alleles affecting spawning time. The three populations from Gävle showed nearly identical allele frequencies at loci with strong genetic differentiation between Atlantic Ocean and Baltic Sea (Figure 3C), whereas they showed dramatic allele frequency differences at loci associated with spawning time (Figure 4C).

We used SNP-chip data to construct a haplotype tree based on highly differentiated SNPs in scaffold 190/1420. Two haplotype groups were strongly associated with spring- and autumn spawning (Figure 4D); haplotype trees for other top scaffolds are in Figure 4—figure supplement 2. The estimated average heterozygosity per polymorphic site across scaffold 1420 indicated a selective sweep among spring-spawning herring but not in autumn-spawning populations (Figure 4E). However, the nucleotide diversity did not show a significant difference between groups (spring: 0.24% ± 0.004%; autumn: 0.27% ± 0.003%). Thus, the number of variable sites are higher among spring-spawning herring, but the average heterozygosity per site is lower. One possible explanation for this observation is that a selective sweep happened at this locus in the past in spring-spawning herring, which was then followed by a population expansion allowing the accumulation of new mutations. This interpretation is supported by strong negative Tajima’s D-values in this region among spring-spawning Atlantic and Baltic herring (Figure 1—figure supplement 1E).

Genetic differences in spawning time are expected to involve photoperiodic regulation of reproduction. Interestingly, our strongest signals (p<1x10-120) in this contrast is located within and up to 25 kb upstream of TSHR encoding thyroid-stimulating hormone receptor, which has a central role in this pathway in birds and mammals (Nakao et al., 2008; Ono et al., 2008; Hanon et al., 2008). Further, a second gene in the same scaffold (190/1420), calmodulin has a role in initiating reproduction following secretion of gonadotropin-releasing hormone (GnRH) (Melamed et al., 2012) downstream of TSHR signalling in photoperiodic regulation of reproduction. SOX11, one of the genes in the associated region in scaffold 1440 (Figure 4C), encodes a transcription factor that controls GnRH expression in GnRH-secreting neurons (Kim et al., 2011). Finally, ESR2a, in scaffold 312, encodes estrogen receptor beta that has a well established function in reproductive biology (Bondesson et al., 2015). Interestingly, a previous experimental study in sticklebacks also indicate that estrogen receptor signaling is involved in photoperiodic regulation of reproduction since treatment with aromatase inhibitors, which leads to an inhibition of the conversion of androgens to estrogens, altered photoperiodic regulation of male sexual maturation (Bornestaf et al., 1997). Also, the expression of ESR2 but not ESR1 is regulated by circadian factors in mice (Cai et al., 2008), consistent with our data suggesting that estrogen receptor beta (encoded by ESR2) is more important than estrogen receptor alpha (encoded by ESR1) for photoperiodic regulation of reproduction.

Adaptive haplotype blocks are maintained by selection

A common feature for the signatures of selection for adaptation to low salinity and for seasonal reproduction in herring is the presence of haplotype blocks (10–200 kb in size) showing strong differentiation (Figures 3C, 4C), despite the rapid decay of linkage disequilibrium at selectively neutral sites (Figure 1—figure supplement 1A). A possible explanation for the pattern is the presence of inversions suppressing recombination as previously shown in three-spined stickleback (Jones et al., 2012). We constructed 3.3 kb Nextera mate pair libraries for two Atlantic and two Baltic herring individuals to scan for inversions with a particular focus on regions under selection. However, few convincing inversion candidates were detected and none coincided with the regions highlighted in Figures 3C, 4C. Thus, inversions do not appear to be an important explanation for the presence of haplotype blocks.

Having excluded inversions as a major explanation for the long haplotype blocks, two other possible explanations were considered. Haplotype blocks may occur as a consequence of recent fast selective sweeps that leads to hitchhiking of neutral polymorphism in close genetic linkage with causal variants (Maynard-Smith and Haigh, 1974; Charlesworth et al., 1997). Alternatively, haplotype blocks involving multiple causal mutations may be maintained by natural selection. These two models give entirely different predictions as regards nucleotide diversity in the differentiated regions of the genome. The hitchhiking model predicts reduced levels of genetic diversity in the differentiated region whereas the haplotype evolution model implies that nucleotide diversity in the differentiated regions, even within populations, may be as high or even higher than in neutral regions because the haplotypes are expected to have been maintained during an evolutionary process. We decided to test this by comparing nucleotide diversity for the 30 most differentiated regions in the contrast Atlantic vs. Baltic within and between one population of Atlantic herring (Bergen) and one population of Baltic herring (Kalix). The nucleotide diversity turned out to be significantly higher in the differentiated regions than in random regions of the genome both within and between populations (Figure 5A). The same conclusion emerged from the analysis of the 30 most differentiated regions between autumn- and spring-spawning herring using the samples collected at the same locality (Gävle) in May and September (Figure 5B). Thus, we conclude that our data on genetic differentiation in herring is consistent with the evolution of haplotype blocks harbouring multiple causal variants. The model also implies that the presence of multiple alleles containing different combinations of causal variants is expected.

Nucleotide diversity within and between samples with different ecological adaptations as regards

(A) salinity and (B) spawning time. For each contrast 30 strongly differentiated regions of the genome and 30 control regions showing no significant differentiation were used. The nucleotide diversity within and between populations for the control regions was estimated around 0.3% consistent with the genome average whereas diversity in differentiated regions was significantly higher. BK=Baltic herring, Kalix; AB=Atlantic herring, Bergen; BÄH=autumn-spawning Baltic herring from Gävle; BÄV, spring-spawning Baltic herring from Gävle; see Table 1. The data are presented as box plots; the central rectangle spans the first to third quartiles of the distribution, and the ‘whiskers’ above and below the box show the maximum and minimum estimates. The line inside the rectangle shows the median.

https://doi.org/10.7554/eLife.12081.019

Genomic distribution of causal variants

Genome-wide analysis combined with strong signatures of selection enabled us to explore the genomic distribution of sequence polymorphisms underlying ecological adaptation. We carried out an enrichment analysis as previously used to identify categories of SNPs showing differentiation between domestic and wild rabbits (Carneiro et al., 2014). We calculated the absolute allele frequency difference (dAF) for different categories of SNPs in the two contrasts Atlantic vs. Baltic and spring- vs. autumn spawning herring and sorted these into bins (dAF 0–0.05, etc.) for different categories of SNPs. In both contrasts the great majority of SNPs (>90%) showed a dAF lower than 0.10 (Figure 6, Supplementary file 3E).

Analysis of delta allele frequency (dAF) for different categories of SNPs.

(A) dAF calculated for the contrast marine vs. brackish water. (B) dAF calculated for the contrast spring- vs. autumn-spawning. The black line represents the total number of SNPs in each dAF bin and coloured lines represent M values of different SNP types. M values were calculated by comparing the frequency of SNPs in a given annotation category in a specific bin with the corresponding frequency across all bins.

https://doi.org/10.7554/eLife.12081.020

Non-synonymous substitutions showed the most striking enrichment in both contrasts and showed a steady increase above dAF=0.15 reaching a two-fold enrichment at dAF>0.50 (Figure 6, Supplementary file 3E). This enrichment must reflect natural selection acting on the protein sequence because synonymous substitutions did not show a similar strong enrichment at high dAF. All non-synonymous substitutions showing dAF>0.50 in any of the two contrasts are compiled in Supplementary file 3F. A striking feature of this list is the common occurrence of multiple high dAF SNPs in the same gene. The 74 non-synonymous changes with dAF>0.50 in the contrast Atlantic vs. Baltic occur in only 29 different genes and the corresponding figure for the contrast spring- vs. autumn-spawning is 21 non-synonymous changes in 9 genes. We excluded the possibility that the presence of multiple non-synonymous changes in many of the genes was explained by errors in gene models (non-coding sequences annotated as exons) by a comparative analysis with other teleosts. We identified the orthologous position for about two thirds of the positions listed in Supplementary file 3F, the great majority of these (58/62) were annotated as coding sequence also in other species (Supplementary file 3F).

SNPs located in the 5’untranslated and 3’untranslated regions (UTRs) showed a more consistent enrichment compared to synonymous changes implying that this enrichment is unlikely to be caused entirely by close linkage to coding sequences under selection. Thus, changes in UTRs have contributed to ecological adaptation in the herring, most likely due to their role in regulating mRNA stability and translation efficiency. In this analysis we combined 5’UTR and 3’UTR SNPs to avoid too small classes for the extremely high dAF. However, an analysis based on all SNPs showing a dAF > 0.1 in the Atlantic vs. Baltic contrast and all SNPs showing a dAF > 0.2 for the spring- vs. autumn-spawning contrast demonstrated that both 5’UTR and 3’UTR SNPs are overrepresented at high dAF and the trend is particularly strong for 5’UTR SNPs (Supplementary file 3G).

The importance of regulatory changes underlying ecological adaptation is evident from the highly significant enrichment of SNPs within 5 kb upstream and downstream of coding sequences (Figure 6, Supplementary file 3E). Further, the excess is particularly pronounced within 1 kb upstream of the coding sequence where the promoter is expected to be located (Supplementary file 3H). The enrichment is not as high as for non-synonymous changes but this does not mean that regulatory changes are less important than coding changes because a much higher proportion of SNPs within the 5 kb region flanking coding sequences are expected to be selectively neutral compared with those causing non-synonymous changes. Thus, it is possible that the enrichment of non-coding SNPs would be much higher if there was a better annotation of the functional significance of non-coding sequences in Atlantic herring.

Intergenic and intronic SNPs were in general underrepresented among SNPs showing high dAF (Figure 6). For the most differentiated SNPs (dAF > 0.50) the intergenic SNPs showed a marked underrepresentation in the Atlantic – Baltic contrast (M=-0.64; p=5.1 x 10-25; Supplementary file 3E) while intronic SNPs were most underrepresented in the spring- vs. autumn-spawning contrast (M=-0.55; p=6.7 x 10-7; Supplementary file 3E).

We also explored the possibility that loss of function-mutations have contributed to ecological adaptation. We identified a total of 469 nonsense mutations but expect that many of these will be false predictions due to errors in the gene model. Eight predicted nonsense mutations had a dAF higher than 0.20 in one of the contrasts and were further examined. Seven of these were unlikely to be correct annotations since the positions were not annotated as coding in zebrafish, and the remaining one had a dAF of 0.21 but was far from statistical significance. Thus, we conclude that gene inactivation is not a common mechanism for ecological adaptation.

Discussion

We have generated an Atlantic herring genome assembly and used this for a comprehensive analysis of the genetic basis for ecological adaptation. Hundreds of independent loci underlying ecological adaptation were revealed by comparing spring- and autumn-spawners as well as populations adapted to marine and brackish waters. The data show that both coding and non-coding changes contribute to ecological adaptation and we find that haplotype blocks spanning up to hundreds of kb show strong genetic differentiation.

The genetic architecture of multifactorial traits and disorders is an important topic in current biology. Genome-wide association studies (GWAS) in humans as well as in livestock have indicated that most multifactorial traits and disorders are controlled by large number of loci each explaining a tiny fraction of trait variation (Wood et al., 2014; Meuwissen et al., 2013). Thus, if ecological adaptation has a similar complex genetic background, in particular in a species with a large population size where each base in the genome is expected to mutate many times each generation, it may be difficult to reveal individual loci underlying adaptation. In contrast, this and our previous study (Lamichhaney et al., 2012) have revealed that genomic regions harbouring a small portion of all SNPs show strong genetic differentiation in the herring whereas the rest of the genome shows very low levels of genetic differentiation. However, there are some important differences between the herring and human data. Firstly, human GWAS reveal loci that contribute to standing genetic variation and therefore includes deleterious alleles that have not yet been eliminated by purifying selection. Secondly, the phenotypic effects of the loci reported here in the herring may be small and the strong genetic differentiation may have accumulated over many generations. There is also plenty of room for natural selection to operate in a species with a large reproductive output like the herring. Thirdly, our study gives no insight in how much of the genetic variation in ecological adaptation these loci control since we do not have information on genotype-phenotype relationships for individual fish. We cannot exclude the possibility that there are additional loci with tiny differences in allele frequency between populations or loci with an extensive allelic heterogeneity that are not detected using our approach. The question how much of the genetic variation the loci reported in this study explains needs to be addressed in future experimental studies.

An important finding was the presence of large haplotype blocks (10–200 kb in size) showing strong genetic differentiation, standing in sharp contrast to the rapid decay of linkage disequilibrium at selectively neutral sites (Figure 1—figure supplement 1A). Although it is expected that the majority of sequence polymorphisms associated with these haplotype blocks are selectively neutral, the data presented here is consistent with a scenario where haplotype blocks evolve over time by the accumulation of multiple, consecutive mutations affecting one or more genes similar to the evolution of haplotypes carrying multiple causal mutations as has been documented in domestic animals (Andersson, 2013) as well as suggested for the evolution of the blunt beak ALX1 haplotype in Darwin’s finches (Lamichhaney et al., 2015). Under this scenario, the shift from one allelic state to another rarely happens through a single mutational event since the fitness of a haplotype depends on the combined effect of multiple sequence polymorphisms affecting function. Furthermore, it is expected that there will be selection for supressed recombination within these regions to avoid that favoured haplotype blocks break up. Our analysis showing that nucleotide diversity is higher within the differentiated regions than in the rest of the genome (Figure 5) strongly supports our hypothesis that the large haplotype blocks are maintained by selection rather than being the consequence of genetic hitchhiking (Maynard-Smith and Haigh, 1974; Charlesworth et al., 1997). The common occurrence of multiple non-synonymous changes in genes showing strong genetic differentiation provides further support for the haplotype evolution model (Supplementary file 3F). The model proposed here is in line with the evolution of complex adaptive alleles in species with large current effective population sizes like modern Drosophila melanogaster populations (Karasov et al., 2010).

A long-standing question in evolutionary biology is the relative importance of genetic variation in regulatory and coding sequences. King and Wilson (1975) argued already 40 years ago that regulatory changes are more important than protein changes for phenotypic differences among primates. The large number of loci associated with ecological adaptation detected in the present study allowed us to explore their genomic distribution. There was a highly significant excess of non-synonymous changes as well as SNPs in UTRs and within 5 kb upstream and downstream of coding sequences among the loci showing strong genetic differentiation (Figure 6). Thus, both coding and non-coding changes contribute to ecological adaptation in the herring. The enrichment was clearly most pronounced for non-synonymous SNPs but it is likely that regulatory changes are in majority among the causal variants because there are more than 10 times as many non-coding as coding changes among the SNPs showing the strongest genetic differentiation (Supplementary file 3F). However, at present we cannot judge the relative importance of coding and non-coding changes, partially due to the strong linkage disequilibrium between coding and non-coding changes and partially because we have no data on the effect size of individual loci. We observed a highly significant excess of several categories of SNPs even for loci with only a 10–15% allele frequency difference between populations (Supplementary file 3E) suggesting that SNPs with such minor changes in allele frequencies contribute to ecological adaptation in the herring. Consistent with previous studies in domestic animals (Carneiro et al., 2014; Rubin et al., 2010), we did not find any indication that gene inactivation has contributed to adaptive evolution.

Timing of reproduction is of utmost importance for fitness in plants and animals and it is well documented that climate change affects reproductive success in both terrestrial (Visser et al., 2015) and aquatic organisms (Edwards and Richardson, 2004). We identified more than 100 independent loci showing strong genetic differentiation between spring- and autumn-spawners. Not all of these are expected to control reproduction since other life history parameters differ between populations. However, several of the most strongly associated regions overlapped genes with a role in photoperiodic regulation of reproduction in birds and mammals, such as thyroid-stimulating hormone receptor (TSHR), calmodulin and SOX11 (Ono et al., 2008; Hanon et al., 2008; Nakao et al., 2008; Melamed et al., 2012; Kim et al., 2011). Photoperiodic regulation in fish is poorly studied, but a recent study showed that the saccus vasculosus brain region is a sensor of changes in day length and suggested that changes in day length affect TSHR expression in this region in Masu salmon (Nakane et al., 2013). Interestingly, strong signatures of selection at TSHR in chicken (Rubin et al., 2010) and sheep (Kijas et al., 2012) may reflect selection against seasonal reproduction in domestic animals.

The population structure of Atlantic herring has been under debate for more than a century (McQuinn, 1997; Iles and Sinclair, 1982). The discussion has concerned the taxonomic status of stocks associated with different spawning and feeding locations, and whether populations are reproductively isolated. Our data are consistent with a metapopulation structure (McQuinn, 1997) in which subpopulations (stocks) are not reproductively isolated. Gene flow combined with large effective population sizes explains low genetic differentiation at selectively neutral loci. Despite this, natural selection is sufficiently strong to cause genetic differentiation at many loci underlying adaptation.

Many populations of marine fish, including the herring, have been severely affected by overfishing (Worm et al., 2006; Dickey-Collas et al., 2010). Our study shows how genomic technologies can be used in a cost-effective manner to make major leaps in characterization of population structure and genetic diversity. The study has important implications for sustainable fishery management of herring by providing a comprehensive list of genetic markers that can be used for stock assessments, including the first molecular tools to distinguish autumn- and spring-spawning herring. These can be used to complement the current use of otoliths (ear bones) microstructures. Moreover, the findings that spring- and autumn-spawners constitute distinct populations imply that fisheries management should aim to protect both populations separately, which is currently not the case in the Baltic Sea (ICES, 2014). Finally, the study also has implications for fish aquaculture due to the interest to alter seasonal reproduction and adaptation to different salinities.

Materials and methods

Genome assembly and annotation

Sample collection

A single Baltic herring (Clupea harengus membras) captured at Forsmark, east of Uppsala, Sweden on September 21, 2011 was used as the reference individual. Skeletal muscle was isolated, placed in 20% glycerol and stored in -80oC until DNA preparation was performed. DNA extraction was carried out with a standard salt precipitation method without vortexing to generate high molecular weight DNA.

Genome sequencing and assembly

Libraries of eight different insert sizes from the reference individual were sequenced on Illumina HiSeq2000 and Illumina MiSeq (chemistry v2) to a total depth of 127-fold coverage of quality-filtered data (Supplementary file 1A). Reads were filtered according to the following criteria: we eliminated (i) read pairs that contained more than 10% Ns in one of their reads; (ii) read pairs with more than 40% low quality bases (quality ASCII-64 ≤ 7); (iii) read pairs containing adapter sequence (with mismatches ≤ 3bp); (iv) for those libraries with insert size longer than the sum of both reads, if reads overlapped, we skipped reads with at least 10 bp overlap and mismatch in more than 10% of the bases overlapping; (v) finally, we avoided reads showing to be PCR duplicates. Genome assembly with SOAPdenovo v2.04 release 238 (Li et al., 2010) resulted in a scaffolded assembly of 834 Mb (Supplementary file 1B). An additional round of gap closing with an in-house pipeline was performed, utilizing both the overlapping HiSeq library and small unscaffolded contigs. To avoid short spurious contigs, which might appear in the assembly process, all contigs smaller than 1 kb were omitted from the final assembly. Potentially redundant contigs (no mismatches and maximal two gaps) were identified by self-aligning the complete assembly by BLAT, resulting in the removal of an additional 28 contigs. After aligning all reads back to the assembly, regions with no coverage (i.e. assembly valleys) were masked with N’s (3.6 Mb in total). In addition, 298,968 nucleotide positions where all the aligned reads conflicted with the assembly were corrected. The final assembly had a total size of 808 Mbp (Supplementary file 1B), which is in reasonable agreement with flow cytometry genome size estimates in Pacific herring (C. pallasii); 0.77–0.98 picograms (Hinegardner and Rosen, 1972, Ida et al., 1991, Ohno et al., 1969) corresponding to an average of ~850 Mb (mean(pg) x 0.978 × 109 bp/pg) (Doležel et al., 2003).

Synthetic long reads

We obtained data from Illumina’s synthetic long-read sequencing service (formerly Moleculo) (McCoy et al., 2014). Illumina prepared five libraries as a service following a detailed published protocol (Voskoboynik et al., 2013). In brief, pieces of genomic DNA of approximately 8 kb per molecule were distributed into 384 wells per library, with multiple molecules per well. The DNA in each well was fragmented, and sequencing adapters and well-specific barcodes were added. After pooling and sequencing on one lane of a HiSeq instrument, reads within each well were assembled separately in order to reconstruct the original molecules in each well. The contigs resulting from this process are called synthetic long reads (SLRs). They were filtered to be at least 1500 bp in size. From the five libraries, we obtained 1.3 million SLRs at an average length of 3.6 kb per read (4.7 Gb total sequence), with 20% longer than 5 kb (Figure 2—figure supplement 3). Since SLRs are assembled consensus sequences, base qualities were high, with an average Phred-scaled quality value of 36.

Reads were mapped with BWA-MEM 0.7.10 (Li and Durbin, 2009), with default parameters. The program was chosen since it was designed to work with long reads, and it also allows local alignments (split alignments of reads to different parts of the reference). This allows, for example, chimeric alignments in which the beginning and end of a read are aligned to different contigs. Overall, the resulting BAM file contained 2,743,529 alignments, corresponding to 2.09 alignments per read on average. There exist alignments for 99.95% of all long reads. Since this 'mapping rate' includes also reads that map only partially (due to local alignment), it is more informative to consider the fraction of mapped bases instead. We consider a base of a long read to be mapped if an alignment exists that includes that base. Under this definition, 4.47 of 4.69 Gbp (95.6%) could be aligned to the reference. On this level, we therefore observe a good agreement between reference and long reads.

Under ideal conditions, each long read would align to the reference in a single alignment that extends from the first to the last base of the read. Not being quite as strict, we considered a read to be fully aligned if there exists a single alignment of the read to the reference that involves at least 95% of the bases in the read. We observed 822,802 (62.68%) fully aligned reads. If we require 99% of the bases to be aligned, 771,192 (58.75%) reads remain. A limitation of this approach of assessing quality is that both the reference and the synthetic long reads are assemblies. An agreement between reference and SLR could therefore mean that both assemblies contain the same error. However, synthetic long reads are assumed to be more reliable since the assembly problem, being restricted to only a fraction of the whole genome, is inherently less complex for SLRs. On the other hand, disagreement between reference and SLR may represent normal variation between alleles, not necessarily an assembly error. Despite their length, SLRs did not improve the assembly since we observed that SLRs typically failed to assemble at the same loci at which the genome assembly contained gaps, preventing these gaps from being spanned and closed. The coverage from SLRs mapped to the assembly was also uneven, with 36.7% of genomic bases not being covered at all (Figure 2—figure supplement 4).

SLRs allowed an improved read-based phasing (Kuleshov et al., 2014) of the 3,896,765 variants called by FreeBayes (version 0.9.8) in the reference individual (not including SLRs). Using GATK’s ReadBackedPhasing (McKenna et al., 2010), 61.0% of those variants could be phased, yielding blocks that contain 14.7 variants on average. Providing the program also with the synthetic long reads improves this to 22.8 variants per block on average. On average, the block size improves from 1,254.32 bp to 2,476.07 bp. The largest phased block consists of 824 variants without SLRs (length=81,861 bp), but 1664 variants when SLRs are included (length=188,052 bp).

Indels in the reference individual were called separately with Illumina data and with SLR data using FreeBayes (version 0.9.8). The availability of long, high-quality reads makes it possible to call medium-sized indels. This type of indels are too long to be found using short reads, but too short to be found by techniques relying on mate-pair mapping distance so by using long reads we should improve our chances of finding them. Calling variants on the reference individual from only short reads results in 537,186 indels of quality 100 or better. Only 580 of those have a length of 30 bp or more. Feeding the variant caller with the SLRs resulted in 8,372 additional indels of length 30 bp or more (Supplementary file 1F).

Mappability calculation

We calculated per base mappability, a measure of base pair uniqueness in the reference sequence, with the GEM library (Marco-Sola et al., 2012). We translated the program’s output to a bounded score per base pair along the genome. Thereafter, we binned these scores using 1 kb non-sliding windows in order to identify regions of the genome that are either highly unique or repetitive. By doing so, we could collate this information track and other annotations when searching for structural changes to better interpret the results.

CEGMA

The completeness of the assembled genome was evaluated by analysing a set of 248 ultra-conserved eukaryotic genes using hidden Markov models (HMM) as implemented in CEGMA (v2.4) (Parra et al., 2007, 2009). 84% of the core genes were scored as 'complete' in the assembly (>70% aligned), and only 2.5% were missing from the assembly (<30% aligned), which indicates that the gene space is well represented in the assembly (Supplementary file 1C).

RNA sequencing, transcriptome assembly and annotation

Liver and kidney tissue were collected from the reference individual and stored in RNAlater. After extraction of RNA, using Qiagen RNeasy Fibrous Tissue Mini Kit, and poly-A selection, strand-specific dUTP libraries were produced. Fragments with an insert size of 200 bp were then sequenced by Illumina Hi-Seq instrument using 101 cycles per run producing ~200 million paired-end reads for each library (Supplementary file 1A).

We reconstructed the herring transcriptome combining RNAseq data for liver and kidney, from the reference herring, in addition to the previously published transcriptome from skeletal muscle from another Baltic herring individual (Lamichhaney et al., 2012).

Both the Swedish genome annotation platform and the BGI team then carried out genome annotation using a custom annotation pipeline. This process utilised various programs detailed in Figure 2—figure supplement 1. A combination of data from evidence sources (protein homology, transcripts, repeats) and ab initio predictions were used to discover 23,336 coding gene models (Supplementary file 1D).

Genome annotation

A custom annotation pipeline using the Maker package (version 2.31.6) (Cantarel et al., 2008) was applied to combine evidence data (protein homology, transcripts, repeats) and ab initio predictions into gene annotations (Figure 2—figure supplement 1). First, using TopHat2 (v2.0.9) (Kim et al., 2013) and cufflinks (v2.2.1) (Trapnell et al., 2010), we reconstructed individual genome-guided RNA-seq assemblies of the reference herring transcriptome from liver and kidney tissues, and the previously published transcriptome from skeletal muscle from another Baltic herring individual (Lamichhaney et al., 2012). In order to obtain a high-confidence set of coding transcripts, we set the minimum isoform fraction (-F) to 0.25 in cufflinks (this allows isoforms to be reported if they accounted for 25% of the expression in a given sample) and the pre-mRNA fraction (-j) to 0.6 (suppressing reads that are spanned by splice junctions and are expressed at 60% or lower when compared to the splice junction). While this approach may loose some data, we found it to yield satisfactory results with regards to noise levels and spurious transcription, judged by coding potential and structure (31,374 transcripts built from muscle, 50,404 from kidney and 41,285 from liver). In a complementary approach, we computed a de novo assembly of normalized, merged samples of liver and kidney, using the Trinity package (Grabherr et al., 2011) with default settings (248,721 transcripts).

We also collected further evidences at the protein level by querying the Uniprot database (Magrane and Consortium, 2011) for sequences belonging to the Teleostei taxonomic group (n=46,288 proteins). Proteins gathered had to be supported at either the proteomic or transcriptomic level and could not be fragmented. Additionally, we downloaded the Uniprot-Swissprot reference data set (on 2014-05-15) (n=545,388 proteins) for a wider taxonomic coverage with high-confidence proteins. The two protein data sets yielded a total of 587,735 non-redundant sequences that provided guidance to the putative structure and CDS phases of annotated loci.

To increase the accuracy of the annotation and annotate repeats, we used the existing reference repeat library included in the RepeatMasker (v4.0.3) (Smit et al., 2015) enhanced by novel repeats detected with the RepeatModeler package (v1.0.8) (Smit and Hubley, 2010). Candidate repeat sequences identified were vetted against a set of putative transposon sequences contained within our protein data set (referred as the 587,735 non-redundant set above) to exclude any nucleotide motif stemming from low-complexity coding sequences. After augmenting this repeat library, we utilized RepeatMasker and RepeatRunner (Stanke et al., 2008) to assign repeat sequences to genomic loci. RepeatRunner is a program that integrates RepeatMasker with BLASTX allowing the analysis of highly divergent partial or complete repeats as well as protein coding parts within retroelements and retroviruses undetected by RepeatMasker. Overall, 30.9% of the assembly contains various families of repeats (Supplementary file 1E).

We generated different gene builds using the Maker pipeline (version 2.31.6) (Holt and Yandell, 2011). We first constructed a so-called evidence build, which was computed directly from the sequence data we had compiled (transcripts and proteins) without using any ab initio predictions. This yielded a set of 19,762 gene models and 24,551 mRNAs, subsequently referred to as candidate 4 (rc4). However, evidence-based annotation alone is limited by the available sequence data, potentially returning fragmented structures or missing loci entirely. To prevent this happening, we next performed an ab initio aided re-annotation of the initial gene build. Information from the evidence build was used to curate 1100 non-redundant, high-confidence transcript models (i.e. full length as judged by synteny to genes in zebrafish as well as transcriptome and protein alignments) and to train the augustus gene predictor (version 2.7) (Stanke et al., 2008).

With the aid of the ab initio model, we performed a second pass (re-annotation) and replaced any existing loci where a longer putative CDS could be predicted by the gene finder or filled in gene predictions where sufficient evidence was lacking for the construction of evidence models. This improved annotation was called rc5 and yielded 22,380 genes (and 26,259 mRNAs), adding 2767 novel loci compared to the previous rc4 build. In contrast, 357 loci had been omitted from the re-annotation but were present in the rc4 so we added them back to the rc5, upgrading the gene count to 22,737 (and 26,712 mRNAs). In order to verify the added value of this two-step build strategy, we utilized a quality check referred as Annotation Edit Distance (AED) comparing mRNAs from rc4 and rc5. This metric means to quantify the congruency between a gene annotation and its supporting evidence. The closer an AED score is to 0, the 'better' the annotation is. The comparison of the density distribution of AED scores between gene builds at this step (Figure 2—figure supplement 2) shows that the complementary usage of the ab initio predictor achieves an overall improved congruency between models and supporting evidence.

Finally, as an additional polishing step, we used the PASA package (Haas et al., 2003) in combination with de-novo assembled transcripts from the liver and kidney tissue samples with the aim to further refine individual gene models and check all rc5-derived transcript models for proper coding potential (Haas et al., 2003). This final gene set release (rc6) yielded a total count of 23,336 genes. Statistical evaluation of the final rc6 set was performed using the GAG package (Hall et al., 2014). Based on this data, around 8% of the genome retains coding potential (Supplementary file 1D). Apart from the protein coding genes, we also discovered a total number of 993 tRNAs with a total gene length of 73,203 bp using tRNAscan (v1.3.1) (Lowe and Eddy, 1997).

With the initial gene build completed, we proceeded to infer putative functions for all coding mRNAs. To this end, we first predicted functional domains using InterProscan (v5.7–48) (Jones et al., 2014) to retrieve Interpro (Hunter et al., 2012), PFAM (Finn et al., 2014) and GO (Ashburner et al., 2000) annotations. In order to assign protein and gene names to this dataset, we performed a BLASTp (version 2.2.28+) search with each of the predicted protein sequences against the Uniprot/Swissprot reference data set (downloaded on 2014-05-15) with default e-value parameters (1x10-5). Outputs from both analyses were parsed using the Annie annotation tool (Tate et al., 2014) to extract and reconcile relevant metadata into predictions. Only 4,838 transcripts (3,375 genes) remained with no functional information available.

The standard annotation pipeline combining evidence based and ab initio predictions applied to our present assembly predicted a total of 23,336 gene models in the herring genome (Supplementary file 1D). We performed clustering of orthologous genes among 15 species (H. sapiens, D. rerio, L. chalumnae, G. morhua, T. rubripes, T. nigroviridis, O. latipes, C. harengus, G. aculeatus, P. marinus, O. niloticus, L. oculatus, X. maculatus, A. mexicanus, P. formosa), downloaded from Ensembl 78, with OrthoMCL (version 2.0.9) and used granularity of 1.5 as recommended for the mcl algorithm (Li et al., 2003). We first filtered the proteomes to keep only the longest isoform per gene. Then we filtered by length, keeping only those with more than 50 amino acid residues. None of the herring proteins were removed but 208 proteins from the three proteomes of the 4way-fish comparison were removed. We found that the herring assembly contained 14,107 orthologous gene families, 9,634 of which were common to four fish genomes (C. harengus, D. rerio, L. chalumnae, G. morhua), and 573 of these gene families were specific only to the herring genome (Figure 2B). The difference between both figures is likely to be a reflection of the different annotation statuses for current fish genomes, some of them more fragmented and poorly annotated, possibly yielding some spurious clustering in the 15way comparison.

Endogenous retroviruses (ERVs)

Analyses of the herring genome using RetroTector (Sperber et al., 2007) identified 150 endogenous retroviruses (ERVs) in scaffolds or contigs larger than 12 kb constituting about 0.13% of the genomic sequence (Supplementary file 2), none of which presented open reading frames in all gag, pol and env genes. The number of identified ERVs is somewhat lower than in most vertebrate hosts but comparable to other fish genomes (Hayward et al., 2015). Epsilon retroviruses such as the Walleye dermal sarcoma virus (WDSV) are typical findings in fish genomes and 8 epsilon-like ERVs could be determined by phylogenetic analysis (Figure 2—figure supplement 5). Additionally, three ERVs group together with the Snakehead fish retrovirus (SnRV) and 51 ERVs form a large basal clade in the phylogenetic tree without known reference sequences, possibly a transition between known retroviral sequences and gypsy retrotransposons represented by Cer1 in the root of the tree.

Genome resequencing and data analyses

Sampling

Tissue samples from 47–100 fish per population were collected from different localities in the Baltic Sea, Skagerrak, Kattegat, the Atlantic Ocean and the Pacific Ocean (Table 2). Genomic DNA was isolated by standard procedures and DNA from all individuals per sampling location was pooled in equimolar concentrations. We also used eight DNA samples of Baltic herring collected close to Gävle, Sweden and eight Atlantic herring collected close to Bergen, Norway for individual sequencing.

Resequencing, alignment and SNP calling

Sequencing libraries (average fragment size about 400 bp) were constructed for each population pool and the 16 individuals, and 2x100 bp paired-end reads were generated using Illumina HiSeq2000 sequencers. The amount of sequence per pool was targeted to ~30x coverage. These sequences in addition to the data from eight herring populations from our previous study (Lamichhaney et al., 2012) (Table 2) were aligned to the herring reference genome using BWA-MEM v0.7.1 (Li and Durbin, 2009). SNP calling was done using a standard GATK pipeline (McKenna et al., 2010). The quality filtering of the raw variant calls was done using GATK using the following cut-offs, QD < 2.0, MQ < 40.0, FS > 60.0, MQRankSum <-12.5, ReadPosRankSum < -8.0 and DP < 100. In addition, only SNPs with an average sequence coverage of 30-50x in each pool were retained for downstream analysis to avoid regions of the genome that are difficult to sequence using current technology and regions oversampled due to for instance duplications. Similar filtering criteria were used for individual sequences using GATK (QUAL < 100, MQ < 50.0, MQRankSum < -4.0, ReadPosRankSum < -2.0, QD < 2.0, HaplotypeScore > 10.0, FS > 60.0, DP < 12, DP > 720.0).

Population genetics and demographic history

The filtered SNP dataset was used to estimate genetic diversity within and between populations using Plink (Purcell et al., 2007) and to generate neighbor-joining trees using Phylip (Felsenstein, 1989). The split between Atlantic and Pacific herring was dated based on sequence divergence for mtDNA cytochrome B sequence using the molecular clock calibration for this sequence from fish (Burridge et al., 2008). Average nucleotide diversity was calculated using the 16 individual sequences, by counting the number of heterozygous sites in each individual and dividing by the total length of the used scaffolds. In order to avoid edge effects, only scaffolds longer than one Mb were included in the calculation. Decay of linkage disequilibrium, measured as correlation between genotypes, and Tajima’s D were calculated using VCFtools (Danecek et al., 2011).

In order to compare the observed allele frequency distribution with the expected one at genetic equilibrium we performed a simulation study. We first used the coalescence simulation software 'ms' (Ewing and Hermisson, 2010) to generate 100,000 independent segregating loci in a large population (n=1000) under either a constant population size model (command: “ms 1000 100000 -s 1”), or a model with an expansion event in the recent past (command. “ms 1000 100000 -s 1 -eN. 1. 35”). Thereafter we randomly sampled 32 chromosomes in order to get a distribution that was comparable to our empirical sample, which contains data from 16 individuals. The sampling procedure was repeated 10 times, allowing for estimation of the variation in each allele frequency bin. However, due to the large number of sampled loci, this was close to zero. The demographic history of the Atlantic herring was explored using the diCal software (Sheehan et al., 2013). This software implements an extension of the Pairwise Sequentially Markovian Coalescent (PSMC) method (Li and Durbin, 2009). The extension makes it possible to jointly analyse several individuals, which increases the total number of coalescence events and thus improves resolution, particularly for recent time periods. In absence of mutation rate rates from Atlantic herring or any closely related species, we used a mutation transition matrix based on human-chimp data, while the point mutation rate and the recombination rate per base were assumed to be 1.25 x 10-8, in accordance with Sheehan et al. (2013). For the discrete intervals at which effective population size was estimated (the input 'p' parameter in diCal), we used the string “3+2+2+2+2+2+2+2+2+2+3”, for a total of 11 independent time periods.

We analysed the Baltic and Atlantic populations separately, using phased sequence data from eight individuals (i.e. 16 unique chromosomes) from each population. Due to memory usage constraints, we performed the analysis for a limited number of genomic regions, each containing approximately 105 SNPs, arbitrarily chosen from parts of the genome that did not show strong signs of selection. A representative population history was then reconstructed by averaging across the analysed regions.

Screening for signatures of selection

We classified populations based on their adaptations to different environmental variables, marine (Atlantic Ocean) vs. brackish (Baltic Sea) waters, and different spawning seasons (spring vs. autumn). We then performed contingency χ2 tests using the allele frequency estimates at each locus to identify genomic regions showing significant allele frequency differences between populations sorted according to these contrasts. In order to control for the inflation in P values at positions with high coverage, we normalised the reference and variant allele read counts at these positions using genome-wide expected coverage. The Bonferroni correction threshold was p=8.2x10-9 and p<1x 10-10 was chosen as the stringent significance threshold. We also estimated pooled heterozygosity as previously described (Rubin et al., 2010) in 5 kb genomic window across the whole genome in each population to identify genomic regions with reduced heterozygosity that may have been targets of selection. All significant SNPs associated with spawning or adaptation to variation in salinity (p<10-20) were clustered as one independent genomic region under selection using the following steps. 1) We rescaled the coordinates by subtracting gaps from SNP positions along each scaffold. 2) We combined strongly correlated SNPs using the Comb-p software (Pedersen et al., 2012) and requested that independent regions should be separated by a distance of at least 20 kb with no SNPs reaching the significance threshold (p<10-20).

Bayenv2 (Günther and Coop, 2013) was used to scan the genome for correlation between salinity and population differentiation. This Bayesian method tests for correlation between allele frequency patterns and environmental factors; we used the salinity at each sampling locality as given in Table 2. First, we generated a variance-covariance matrix among all populations (except Pacific) with 100,000 Markov Chain Monte Carlo (MCMC) iterations based on a subset of randomly chosen SNPs (2500). Then, we calculated the correlation between allele frequencies and salinity across all 19 populations for significant SNPs (46,045 SNPs) detected using the χ2 test by running 100,000 MCMC iterations. Bayenv2 reports two statistics: i) Bayes’ factor (BF) and ii) Pearson correlation (r). We defined the significance threshold of these two statistics based on their distribution; log10 (BF)≥4 and |r|≥0.8.

In order to identify the loci showing the most consistent allele frequency differences between autumn and spring spawning herring we used a generalized linear mixed model analysis (GLMER, implemented in the R package 'lme4') (Bates et al., 2014), which took into account the variability between populations within groups. The following R-code was used:

glmer(y ~ Group+(1|populations), family=binomial(), nAGQ = 10, data=input_file), where, Group = autumn or spring spawning population, Population = specific population, Family = error distribution used in the model, nAGQ = the number of points per axis for evaluating the adaptive Gauss-Hermite approximation to the log-likelihood, Default = 1, that corresponds to the Laplace approximation. As the differences in allele frequencies between the populations within the groups were small in our dataset, we did not use the Laplace approximation, but used a greater nAGQ (i.e. 10) for more thorough likelihood maximization procedure. The Bonferroni correction threshold was p=4.9x10-6 and cut-off of p<1x10-10 was chosen as the stringent significance threshold.

Genomic distribution of genetic variants

SnpEff (v.3.4) (Cingolani et al., 2012) was used to annotate the genomic distribution of variants and classify them into different categories (non-synonymous, synonymous, UTR, 5 kb upstream, 5 kb downstream, intronic and intergenic). For both (1) Atlantic vs. Baltic and (2) spring- vs. autumn-spawning contrasts, we calculated the absolute allele frequency difference (dAF) and sorted them into bins (dAF -0.05, etc.) for each of these categories of SNPs. For unbiased estimation of dAF, SNPs with missing calls in at least one population were removed. The SnpEff prediction from less confident annotations (for instance, missing start and stop codons in the transcript) were excluded from the analysis. The expected number of SNPs for each category in each bin was calculated as p(category) X n(bin), where p(category) is the proportion of a specific SNP category in the entire genome and n(bin) is the total number of SNPs in a given bin. Log2 fold change of the observed SNP count for each category in each bin was compared against the expected SNP count (M-value) and statistical significance of the deviations from the expected values was tested with a standard χ2 tests.

Detection of structural changes

We extracted depth of coverage for all populations using GATK:DepthOfCoverage (McKenna et al., 2010), after filtering out reads with mapping quality below 20. We compared populations using 1 kb non-overlapping windows where all pools were normalized against the AB1 sample that showed highest average depth. In short, we created a correction factor per population and applied it on the depth of coverage value for each window. For all the contrasts, we performed an analysis of variance (ANOVA) as described (Carneiro et al., 2014).

We compared populations for two contrasts: 1) Atlantic vs. Baltic and 2) spring vs. autumn spawning. For the Atlantic-Baltic contrast we scanned 878,278 windows of which 79,809 windows were excluded due to low depth across all populations. 3,780 windows showed significant difference with a p<0.001. Stringent filtering based on p<0.001 and an |M-value|>0.6 resulted in the detection of 707 loci of which 491 had mappability above 0.5. The size of the structural changes ranged from 1–26 kb (Supplementary file 3B). In the second contrast, spawning, we compared populations for spawning time where we analysed 799,346 windows after filtering low depth regions. With the same criteria applied in salinity contrast, we identified 69 loci ranging in size from 1–19 kb (Supplementary file 3D).

We searched for inversions using mate-pair (2x100 bp) Nextera libraries (Illumina) with an average insert size of 3–4 kb generated from four individuals: two Atlantic (one female and male) and two Baltic (one female and male) sequenced to 3X coverage using Illumina HiSeq2000, After trimming and filtering low quality reads, we aligned the reads on herring genome by BWA-MEM (Li and Durbin, 2009). We used DELLY (Rausch et al., 2012) and BreakDancer (Chen et al., 2009) with default parameters, except that mapping quality was set to 10. We used bitwise flag in BAM files to extract deviant reads indicative of inversions overlapping sweep regions.

Genotyping of individual fish using high density SNP array

We designed an Affymetrix custom genotyping array with 72,560 SNPs, tiling the best strand for each SNP. Due to the fact that A/T and G/C SNPs require twice the features on the array that other markers require, the array layout covered 82,569 probe pair sets. These covered the majority of SNPs showing significant genetic differentiation together with the best 2000 monomorphic nucleotide sites to validate the individual plate/sample performance (in the dish quality metric, DQC). We submitted 36-mer nucleotide sequences around target SNPs to the manufacturer. SNPs flanked by other sequence polymorphisms were avoided as well as regions containing repetitive or low mappability sequences, Finally, we deprioritized A/T and C/G SNPs, as they take twice as much room on the array. Array experiments were performed by the Array and Analysis Facility, SciLifeLab (Uppsala, Sweden) according to standard protocol (Affymetrix Axiom 2.0 Assay Manual Workflow User Guide, P/N 702990 Rev3, Affymetrix).

Genotype calling was performed using the Affymetrix Power Tools (APT, version 1.15.2) followed by SNP-Polisher (version 1.4.0), according to the Best Practice Analysis Workflow described in Axiom Genotyping Solution Data Analysis Guide (P/N 702961 Rev. 2, Affymetrix). The DQC threshold was set to 0.95 according to the manufacturer´s instructions, based on previous data from herring arrays. Only the samples passing a Sample Call Rate (CR)>97% were retained after first genotyping round and were subjected to genotyping step 2. Twelve samples were excluded when using both DQC (n=2) and CR (n=10) filtering. At this point, all plates run were considered to be high quality plates. We utilized SNPolisher, an R package, for the purpose of genotyping the remaining samples. SNPolisher generates the different groups of clusters, AA, AB, BB and off target variant (OTV), with corresponding cluster plots for each SNP and evaluate quality. Thereby, this post-process analysis identifies the best probeset and assign to each one of six categories (PolyHighRes, MonoHighRes, NoMinorHom, CallRateBelowThreshold, OTVs, Other). It also calculates QC metrics for each SNP, classifies SNPs into categories, changes dubious SNP calls, and tests for intensity shifts between batches. We took only converted polymorphic SNPs (PolyHighRes, n= 41,575) into account for further analyses. Genotypes at these positions are referred to as high quality genotypes.

This SNP array was used to genotype 360 individual samples, 30 random fish from 12 of the populations included in pooled sequencing. A total of 348 samples retained enough quality to be used for downstream analysis.

References

  1. 1
  2. 2
  3. 3
  4. 4
    The Development of the Baltic Sea Basin During the Last 130 ka
    1. T Andrén
    2. S Björck
    3. E Andrén
    4. D Conley
    5. L Zillén
    6. J Anjar
    (2011)
    In: J Harff, S Bjorck, P Hoth, editors. The Baltic Sea Basin. Berlin, Heidelberg: Springer Berlin Heidelberg. pp. 75–97.
    https://doi.org/10.1007/978-3-642-17220-5_4
  5. 5
  6. 6
  7. 7
  8. 8
  9. 9
  10. 10
  11. 11
  12. 12
  13. 13
  14. 14
  15. 15
  16. 16
  17. 17
  18. 18
  19. 19
  20. 20
  21. 21
    Cytometry Part A,51A
    1. J Doležel
    2. J Bartoš
    3. H Voglmayr
    4. J Greilhuber
    (2003)
    127–128, Letter to the editor, Cytometry Part A,51A.
  22. 22
  23. 23
  24. 24
    Yearbook Fishery and Aquaculture Statistics
    1. FAO
    (2014)
    Rome: Fao.
  25. 25
    Phylip - phylogeny inference package (version 3.2)
    1. J Felsenstein
    (1989)
    Cladistics 5:164–166.
  26. 26
  27. 27
    Definition of the zebrafish genome using flow cytometry and cytogenetic mapping
    1. JL Freeman
    2. A Adeniyi
    3. R Banerjee
    4. S Dallaire
    5. SF Maguire
    6. J Chi
    7. BL Ng
    8. C Zepeda
    9. CE Scott
    10. S Humphray
    11. J Rogers
    12. Y Zhou
    13. LI Zon
    14. NP Carter
    15. F Yang
    16. C Lee
    (2007)
    BMC Genomics, 8, 10.1186/1471-2164-8-195.
  28. 28
  29. 29
  30. 30
  31. 31
  32. 32
  33. 33
  34. 34
  35. 35
    Pan-vertebrate comparative genomics unmasks retrovirus macroevolution
    1. A Hayward
    2. CK Cornwallis
    3. P Jern
    (2015)
    Proceedings of the National Academy of Sciences of the United States of America 112:464–469.
    https://doi.org/10.1073/pnas.1414980112
  36. 36
  37. 37
    Maker2: An annotation pipeline and genome-database management tool for second-generation genome projects
    1. C Holt
    2. M Yandell
    (2011)
    BMC Bioinformatics, 12, 10.1186/1471-2105-12-491.
  38. 38
    The zebrafish reference genome sequence and its relationship to the human genome
    1. K Howe
    2. MD Clark
    3. CF Torroja
    4. J Torrance
    5. C Berthelot
    6. M Muffato
    7. JE Collins
    8. S Humphray
    9. K McLaren
    10. L Matthews
    11. S McLaren
    12. I Sealy
    13. M Caccamo
    14. C Churcher
    15. C Scott
    16. JC Barrett
    17. R Koch
    18. GJ Rauch
    19. S White
    20. W Chow
    21. B Kilian
    22. LT Quintais
    23. JA Guerra-Assunção
    24. Y Zhou
    25. Y Gu
    26. J Yen
    27. JH Vogel
    28. T Eyre
    29. S Redmond
    30. R Banerjee
    31. J Chi
    32. B Fu
    33. E Langley
    34. SF Maguire
    35. GK Laird
    36. D Lloyd
    37. E Kenyon
    38. S Donaldson
    39. H Sehra
    40. J Almeida-King
    41. J Loveland
    42. S Trevanion
    43. M Jones
    44. M Quail
    45. D Willey
    46. A Hunt
    47. J Burton
    48. S Sims
    49. K McLay
    50. B Plumb
    51. J Davis
    52. C Clee
    53. K Oliver
    54. R Clark
    55. C Riddle
    56. D Elliot
    57. D Eliott
    58. G Threadgold
    59. G Harden
    60. D Ware
    61. S Begum
    62. B Mortimore
    63. B Mortimer
    64. G Kerry
    65. P Heath
    66. B Phillimore
    67. A Tracey
    68. N Corby
    69. M Dunn
    70. C Johnson
    71. J Wood
    72. S Clark
    73. S Pelan
    74. G Griffiths
    75. M Smith
    76. R Glithero
    77. P Howden
    78. N Barker
    79. C Lloyd
    80. C Stevens
    81. J Harley
    82. K Holt
    83. G Panagiotidis
    84. J Lovell
    85. H Beasley
    86. C Henderson
    87. D Gordon
    88. K Auger
    89. D Wright
    90. J Collins
    91. C Raisen
    92. L Dyer
    93. K Leung
    94. L Robertson
    95. K Ambridge
    96. D Leongamornlert
    97. S McGuire
    98. R Gilderthorp
    99. C Griffiths
    100. D Manthravadi
    101. S Nichol
    102. G Barker
    103. S Whitehead
    104. M Kay
    105. J Brown
    106. C Murnane
    107. E Gray
    108. M Humphries
    109. N Sycamore
    110. D Barker
    111. D Saunders
    112. J Wallis
    113. A Babbage
    114. S Hammond
    115. M Mashreghi-Mohammadi
    116. L Barr
    117. S Martin
    118. P Wray
    119. A Ellington
    120. N Matthews
    121. M Ellwood
    122. R Woodmansey
    123. G Clark
    124. J Cooper
    125. J Cooper
    126. A Tromans
    127. D Grafham
    128. C Skuce
    129. R Pandian
    130. R Andrews
    131. E Harrison
    132. A Kimberley
    133. J Garnett
    134. N Fosker
    135. R Hall
    136. P Garner
    137. D Kelly
    138. C Bird
    139. S Palmer
    140. I Gehring
    141. A Berger
    142. CM Dooley
    143. Z Ersan-Ürün
    144. C Eser
    145. H Geiger
    146. M Geisler
    147. L Karotki
    148. A Kirn
    149. J Konantz
    150. M Konantz
    151. M Oberländer
    152. S Rudolph-Geiger
    153. M Teucke
    154. C Lanz
    155. G Raddatz
    156. K Osoegawa
    157. B Zhu
    158. A Rapp
    159. S Widaa
    160. C Langford
    161. F Yang
    162. SC Schuster
    163. NP Carter
    164. J Harrow
    165. Z Ning
    166. J Herrero
    167. SM Searle
    168. A Enright
    169. R Geisler
    170. RH Plasterk
    171. C Lee
    172. M Westerfield
    173. PJ de Jong
    174. LI Zon
    175. JH Postlethwait
    176. C Nüsslein-Volhard
    177. TJ Hubbard
    178. H Roest Crollius
    179. J Rogers
    180. DL Stemple
    (2013)
    Nature 496:498–503.
    https://doi.org/10.1038/nature12111
  39. 39
  40. 40
    919, Report of the Baltic Fisheries Assessment Working Group (WGBFAS)., ICES HQ, Copenhagen, Denmark
    1. ICES
    (2014)
    919, Report of the Baltic Fisheries Assessment Working Group (WGBFAS)., ICES HQ, Copenhagen, Denmark.
  41. 41
    Karyotypes and cellular DNA contents of three species of the subfamily Clupeinae
    1. H Ida
    2. N Oka
    3. K Hayashigaki
    (1991)
    Japanese Journal of Ichthyology 38:289–294.
  42. 42
  43. 43
  44. 44
  45. 45
  46. 46
  47. 47
  48. 48
  49. 49
  50. 50
  51. 51
  52. 52
  53. 53
  54. 54
  55. 55
  56. 56
  57. 57
  58. 58
  59. 59
  60. 60
  61. 61
    Fauna Suecica
    1. C Linnaeus
    (1761)
    Fauna Suecica, Stockholm.
  62. 62
  63. 63
    Uniprot knowledgebase: A hub of integrated protein data
    1. M Magrane
    2. U Consortium
    (2011)
    Database, 2011, 10.1093/database/bar009.
  64. 64
  65. 65
  66. 66
  67. 67
  68. 68
  69. 69
  70. 70
  71. 71
  72. 72
    The saccus vasculosus of fish is a sensor of seasonal changes in day length
    1. Y Nakane
    2. K Ikegami
    3. M Iigo
    4. H Ono
    5. K Takeda
    6. D Takahashi
    7. M Uesaka
    8. M Kimijima
    9. R Hashimoto
    10. N Arai
    11. T Suga
    12. K Kosuge
    13. T Abe
    14. R Maeda
    15. T Senga
    16. N Amiya
    17. T Azuma
    18. M Amano
    19. H Abe
    20. N Yamamoto
    21. T Yoshimura
    (2013)
    Nature Communications, 4, 10.1038/ncomms3108.
  73. 73
  74. 74
  75. 75
    Chromosomes Today
    1. S Ohno
    2. J Muramoto
    3. J Klein
    4. NB Atkin
    (1969)
    Diploid-tetraploid relationship in clupeoid and salmonoid fish, Chromosomes Today, Edinburgh, Oliver & Boyd.
  76. 76
  77. 77
  78. 78
  79. 79
  80. 80
  81. 81
  82. 82
  83. 83
  84. 84
  85. 85
  86. 86
  87. 87
    Repeatmodeler open-1.0
    1. A Smit
    2. R Hubley
    (2010)
    Repeatmodeler open-1.0.
  88. 88
    Repeatmasker
    1. AFA Smit
    2. R Hubley
    3. P Green
    (2015)
    Repeatmasker.
  89. 89
  90. 90
  91. 91
  92. 92
    Annie: The annotation information extractor
    1. R Tate
    2. B Hall
    3. T Derego
    4. S Geib
    (2014)
    Annie: The annotation information extractor.
  93. 93
  94. 94
  95. 95
  96. 96
  97. 97
  98. 98
    Defining the role of common variation in the genomic and biological architecture of adult human height
    1. AR Wood
    2. T Esko
    3. J Yang
    4. S Vedantam
    5. TH Pers
    6. S Gustafsson
    7. AY Chu
    8. K Estrada
    9. J Luan
    10. Z Kutalik
    11. N Amin
    12. ML Buchkovich
    13. DC Croteau-Chonka
    14. FR Day
    15. Y Duan
    16. T Fall
    17. R Fehrmann
    18. T Ferreira
    19. AU Jackson
    20. J Karjalainen
    21. KS Lo
    22. AE Locke
    23. R Mägi
    24. E Mihailov
    25. E Porcu
    26. JC Randall
    27. A Scherag
    28. AA Vinkhuyzen
    29. HJ Westra
    30. TW Winkler
    31. T Workalemahu
    32. JH Zhao
    33. D Absher
    34. E Albrecht
    35. D Anderson
    36. J Baron
    37. M Beekman
    38. A Demirkan
    39. GB Ehret
    40. B Feenstra
    41. MF Feitosa
    42. K Fischer
    43. RM Fraser
    44. A Goel
    45. J Gong
    46. AE Justice
    47. S Kanoni
    48. ME Kleber
    49. K Kristiansson
    50. U Lim
    51. V Lotay
    52. JC Lui
    53. M Mangino
    54. I Mateo Leach
    55. C Medina-Gomez
    56. MA Nalls
    57. DR Nyholt
    58. CD Palmer
    59. D Pasko
    60. S Pechlivanis
    61. I Prokopenko
    62. JS Ried
    63. S Ripke
    64. D Shungin
    65. A Stancáková
    66. RJ Strawbridge
    67. YJ Sung
    68. T Tanaka
    69. A Teumer
    70. S Trompet
    71. SW van der Laan
    72. J van Setten
    73. JV Van Vliet-Ostaptchouk
    74. Z Wang
    75. L Yengo
    76. W Zhang
    77. U Afzal
    78. J Arnlöv
    79. GM Arscott
    80. S Bandinelli
    81. A Barrett
    82. C Bellis
    83. AJ Bennett
    84. C Berne
    85. M Blüher
    86. JL Bolton
    87. Y Böttcher
    88. HA Boyd
    89. M Bruinenberg
    90. BM Buckley
    91. S Buyske
    92. IH Caspersen
    93. PS Chines
    94. R Clarke
    95. S Claudi-Boehm
    96. M Cooper
    97. EW Daw
    98. PA De Jong
    99. J Deelen
    100. G Delgado
    101. JC Denny
    102. R Dhonukshe-Rutten
    103. M Dimitriou
    104. AS Doney
    105. M Dörr
    106. N Eklund
    107. E Eury
    108. L Folkersen
    109. ME Garcia
    110. F Geller
    111. V Giedraitis
    112. AS Go
    113. H Grallert
    114. TB Grammer
    115. J Gräßler
    116. H Grönberg
    117. LC de Groot
    118. CJ Groves
    119. J Haessler
    120. P Hall
    121. T Haller
    122. G Hallmans
    123. A Hannemann
    124. CA Hartman
    125. M Hassinen
    126. C Hayward
    127. NL Heard-Costa
    128. Q Helmer
    129. G Hemani
    130. AK Henders
    131. HL Hillege
    132. MA Hlatky
    133. W Hoffmann
    134. P Hoffmann
    135. O Holmen
    136. JJ Houwing-Duistermaat
    137. T Illig
    138. A Isaacs
    139. AL James
    140. J Jeff
    141. B Johansen
    142. Å Johansson
    143. J Jolley
    144. T Juliusdottir
    145. J Junttila
    146. AN Kho
    147. L Kinnunen
    148. N Klopp
    149. T Kocher
    150. W Kratzer
    151. P Lichtner
    152. L Lind
    153. J Lindström
    154. S Lobbens
    155. M Lorentzon
    156. Y Lu
    157. V Lyssenko
    158. PK Magnusson
    159. A Mahajan
    160. M Maillard
    161. WL McArdle
    162. CA McKenzie
    163. S McLachlan
    164. PJ McLaren
    165. C Menni
    166. S Merger
    167. L Milani
    168. A Moayyeri
    169. KL Monda
    170. MA Morken
    171. G Müller
    172. M Müller-Nurasyid
    173. AW Musk
    174. N Narisu
    175. M Nauck
    176. IM Nolte
    177. MM Nöthen
    178. L Oozageer
    179. S Pilz
    180. NW Rayner
    181. F Renstrom
    182. NR Robertson
    183. LM Rose
    184. R Roussel
    185. S Sanna
    186. H Scharnagl
    187. S Scholtens
    188. FR Schumacher
    189. H Schunkert
    190. RA Scott
    191. J Sehmi
    192. T Seufferlein
    193. J Shi
    194. K Silventoinen
    195. JH Smit
    196. AV Smith
    197. J Smolonska
    198. AV Stanton
    199. K Stirrups
    200. DJ Stott
    201. HM Stringham
    202. J Sundström
    203. MA Swertz
    204. AC Syvänen
    205. BO Tayo
    206. G Thorleifsson
    207. JP Tyrer
    208. S van Dijk
    209. NM van Schoor
    210. N van der Velde
    211. D van Heemst
    212. FV van Oort
    213. SH Vermeulen
    214. N Verweij
    215. JM Vonk
    216. LL Waite
    217. M Waldenberger
    218. R Wennauer
    219. LR Wilkens
    220. C Willenborg
    221. T Wilsgaard
    222. MK Wojczynski
    223. A Wong
    224. AF Wright
    225. Q Zhang
    226. D Arveiler
    227. SJ Bakker
    228. J Beilby
    229. RN Bergman
    230. S Bergmann
    231. R Biffar
    232. J Blangero
    233. DI Boomsma
    234. SR Bornstein
    235. P Bovet
    236. P Brambilla
    237. MJ Brown
    238. H Campbell
    239. MJ Caulfield
    240. A Chakravarti
    241. R Collins
    242. FS Collins
    243. DC Crawford
    244. LA Cupples
    245. J Danesh
    246. U de Faire
    247. HM den Ruijter
    248. R Erbel
    249. J Erdmann
    250. JG Eriksson
    251. M Farrall
    252. E Ferrannini
    253. J Ferrières
    254. I Ford
    255. NG Forouhi
    256. T Forrester
    257. RT Gansevoort
    258. PV Gejman
    259. C Gieger
    260. A Golay
    261. O Gottesman
    262. V Gudnason
    263. U Gyllensten
    264. DW Haas
    265. AS Hall
    266. TB Harris
    267. AT Hattersley
    268. AC Heath
    269. C Hengstenberg
    270. AA Hicks
    271. LA Hindorff
    272. AD Hingorani
    273. A Hofman
    274. GK Hovingh
    275. SE Humphries
    276. SC Hunt
    277. E Hypponen
    278. KB Jacobs
    279. MR Jarvelin
    280. P Jousilahti
    281. AM Jula
    282. J Kaprio
    283. JJ Kastelein
    284. M Kayser
    285. F Kee
    286. SM Keinanen-Kiukaanniemi
    287. LA Kiemeney
    288. JS Kooner
    289. C Kooperberg
    290. S Koskinen
    291. P Kovacs
    292. AT Kraja
    293. M Kumari
    294. J Kuusisto
    295. TA Lakka
    296. C Langenberg
    297. L Le Marchand
    298. T Lehtimäki
    299. S Lupoli
    300. PA Madden
    301. S Männistö
    302. P Manunta
    303. A Marette
    304. TC Matise
    305. B McKnight
    306. T Meitinger
    307. FL Moll
    308. GW Montgomery
    309. AD Morris
    310. AP Morris
    311. JC Murray
    312. M Nelis
    313. C Ohlsson
    314. AJ Oldehinkel
    315. KK Ong
    316. WH Ouwehand
    317. G Pasterkamp
    318. A Peters
    319. PP Pramstaller
    320. JF Price
    321. L Qi
    322. OT Raitakari
    323. T Rankinen
    324. DC Rao
    325. TK Rice
    326. M Ritchie
    327. I Rudan
    328. V Salomaa
    329. NJ Samani
    330. J Saramies
    331. MA Sarzynski
    332. PE Schwarz
    333. S Sebert
    334. P Sever
    335. AR Shuldiner
    336. J Sinisalo
    337. V Steinthorsdottir
    338. RP Stolk
    339. JC Tardif
    340. A Tönjes
    341. A Tremblay
    342. E Tremoli
    343. J Virtamo
    344. MC Vohl
    345. P Amouyel
    346. FW Asselbergs
    347. TL Assimes
    348. M Bochud
    349. BO Boehm
    350. E Boerwinkle
    351. EP Bottinger
    352. C Bouchard
    353. S Cauchi
    354. JC Chambers
    355. SJ Chanock
    356. RS Cooper
    357. PI de Bakker
    358. G Dedoussis
    359. L Ferrucci
    360. PW Franks
    361. P Froguel
    362. LC Groop
    363. CA Haiman
    364. A Hamsten
    365. MG Hayes
    366. J Hui
    367. DJ Hunter
    368. K Hveem
    369. JW Jukema
    370. RC Kaplan
    371. M Kivimaki
    372. D Kuh
    373. M Laakso
    374. Y Liu
    375. NG Martin
    376. W März
    377. M Melbye
    378. S Moebus
    379. PB Munroe
    380. I Njølstad
    381. BA Oostra
    382. CN Palmer
    383. NL Pedersen
    384. M Perola
    385. L Pérusse
    386. U Peters
    387. JE Powell
    388. C Power
    389. T Quertermous
    390. R Rauramaa
    391. E Reinmaa
    392. PM Ridker
    393. F Rivadeneira
    394. JI Rotter
    395. TE Saaristo
    396. D Saleheen
    397. D Schlessinger
    398. PE Slagboom
    399. H Snieder
    400. TD Spector
    401. K Strauch
    402. M Stumvoll
    403. J Tuomilehto
    404. M Uusitupa
    405. P van der Harst
    406. H Völzke
    407. M Walker
    408. NJ Wareham
    409. H Watkins
    410. HE Wichmann
    411. JF Wilson
    412. P Zanen
    413. P Deloukas
    414. IM Heid
    415. CM Lindgren
    416. KL Mohlke
    417. EK Speliotes
    418. U Thorsteinsdottir
    419. I Barroso
    420. CS Fox
    421. KE North
    422. DP Strachan
    423. JS Beckmann
    424. SI Berndt
    425. M Boehnke
    426. IB Borecki
    427. MI McCarthy
    428. A Metspalu
    429. K Stefansson
    430. AG Uitterlinden
    431. CM van Duijn
    432. L Franke
    433. CJ Willer
    434. AL Price
    435. G Lettre
    436. RJ Loos
    437. MN Weedon
    438. E Ingelsson
    439. JR O'Connell
    440. GR Abecasis
    441. DI Chasman
    442. ME Goddard
    443. PM Visscher
    444. JN Hirschhorn
    445. TM Frayling
    446. Electronic Medical Records and Genomics (eMEMERGEGE) Consortium
    447. MIGen Consortium
    448. PAGEGE Consortium
    449. LifeLines Cohort Study
    (2014)
    Nature Genetics 46:1173–1186.
    https://doi.org/10.1038/ng.3097
  99. 99

Decision letter

  1. Magnus Nordborg
    Reviewing Editor; Vienna Biocenter, Austria

In the interests of transparency, eLife includes the editorial decision letter and accompanying author responses. A lightly edited version of the letter sent to the authors after peer review is shown, indicating the most substantive concerns; minor comments are not usually included.

[Editors’ note: a previous version of this study was rejected after peer review, but the authors submitted for reconsideration. The first decision letter after peer review is shown below.]

Thank you for submitting your work entitled "The genetic basis for ecological adaptation revealed by genome sequencing of the Atlantic herring" for consideration by eLife. Your article has been reviewed by three peer reviewers, including Nick Barton and Magnus Nordborg, who is a member of our Board of Reviewing Editors. The evaluation was overseen by Diethard Tautz as the Senior Editor. Our decision has been reached after consultation between the reviewers. Based on these discussions and the individual reviews below, we regret to inform you that this manuscript cannot be considered further for publication in eLife. However, we would consider a completely reworked resubmission which addresses the concerns of the reviewers, and in particular goes much further in distinguishing between causal from non-causal alleles.

The consensus opinion of the reviewers is that this is a really interesting study in an excellent study system, but that the analysis is insufficient and several of the conclusions not supported. In particular, the reviewers were not convinced that you can actually identify the causal sites, and several conclusions rest on this. However, exploring the limits of what may be concluded is also of great interest, especially given the quality of the data.

In essence, the analysis does not do the data justice.

Reviewer #1:

This is a substantial piece of work addressing an important question using a beautiful and promising system. It is certainly the case that identifying loci that show evidence having been under selection is much easier in gigantic population with little population structure, as appears to be the case for herring.

Unfortunately, the paper suffers from several conceptual/logical flaws and I do not think the major conclusions are supported.

The confusion begins in the first sentence of the abstract, where it is stated that "Ecological adaptation is of major relevance…, but the underlying genetic factors are typically hard to study in natural populations due to confounding between population structure and signatures of selection".

It is indeed true that population structure makes it hard to identify signatures of selection, but who says that genes involved in ecological adaptation will exhibit such signatures? Generally, such signals arise as a consequence of strong selection on individual genes. For a quantitative trait, is not clear that there will be any such genes, and is also not clear that any such genes found will explain much of the variation. Furthermore, there is no phenotype in the present study, nor any heritability. Thus the comparison with human height (Discussion section) does not make sense. You have no idea how much of the variation for fitness your SNPs explain. Indeed it is formally possibly that all your SNPs are associated with defense against a parasite with a life-cycle that depends on salinity. I obviously don't believe this, but the point remains that in order to discuss the genetic architecture of adaptation, there needs to be genetics in the study. For example, a reciprocal transplant study that quantified the fitness effects of the identified loci.

Speaking of associations, I don't understand how the threshold for calling frequency differences significant was set (Figure 1—figure supplement 1). This is not explained ("used a QQ-plot" is not an explanation).

It is also stated that the presence of haplotype blocks was striking. Why? Presumably these are simply the consequence of selection? Are they longer than you would expect? Why? What is the evidence that they contain multiple causal sites as opposed to being the result of linkage drag?

Regardless of why they are there, the existence of these blocks mandate extreme caution when trying to decide which sites are causal. It is by now clear that local haplotype structure, especially when coupled with allelic heterogeneity causes large number of local spurious associations, and that, as a consequence, the most strongly associated SNP is NOT likely to be the causal one: it is likely to be an intermediate-frequency SNP that happens to tag the underlying variants well…

A consequence of this is that most of the conclusions about what kinds of SNPs do what (Subsection “Genomic distribution of causal variants”) are not supported. You find few significantly associated non-synonymous substitutions, but this could simply be because they are all rare, and can only be found by proxy.

It does make sense to compare different kinds of regions, as you do, but everything has to be controlled for allele frequency.

On a more population genetics level, I was struck by your low nucleotide diversity coupled with a rapid decay of linkage disequilibrium. Unless mutation- and recombination rates are very different from other organisms, they make no sense. My guess is that your nucleotide diversity is underestimated because you are detecting heterozygotes in single individuals, and that your rate of linkage disequilibrium decay is overestimated because of bad SNPs.

Third paragraph subsection “Genetic basis underlying timing of reproduction”: This something not right about this argument. Why would rare SNPs affect nucleotide diversity one way or another?

Subsection “Genomic distribution of causal variants”: "Some of these will be false positives due to close linkage to true positives but many will be true positives" I discuss this above, but it would be important to realise that, without assumptions about the meaning of "some" and "many", it is not possible to conclude any of what you conclude further down.

Same section: "Enrichment" compared to what?

Reviewer #2:

This paper uses whole-genome sequencing from 20 population pools, supported by SNP genotyping of ~360 individuals, to identify variants associated with differences between Baltic and Atlantic, and between autumn and spring spawning herring. This is an impressive set of data, and as the authors argue, the recent divergence, large population size and low differentiation makes herring exceptionally well suited to identifying genes responsible for adaptation. Thus, the study could in principle be a very nice contribution to eLife. However, the interpretation of the results seems quite naive: more detailed and rigorous argument is needed to be convincing.

SNP and structural variants associated with divergence are clustered into haplotype blocks that may span multiple genes, and up to 200kb. It seems to be assumed that all (or at least, many) of the variants within each block are causal, on the grounds that within the Atlantic population, LD decays very rapidly, over a hundred bp or so. This would be very interesting, implying that complex adaptive alleles build up as a result of successive substitutions in the same region. However, careful arguments are needed to exclude the more obvious alternative, that selection has raised specific haplotypes to high frequency. Suppose that the original population was indeed at linkage equilibrium. Under selection in a new environment, many alleles might increase at the same locus – either from standing variation or from new mutations. Such "soft sweeps" might not be detected in these data. However, at a subset of loci, one or a few haplotypes might increase, and their size would reflect the time since adaptation – presumably a few thousand generations, corresponding to perhaps 10-100kb. Even if two or three successive causal alleles were involved, the great majority of SNPs would still be neutral. Seeing LE within the base population does not address this issue; and the ascertainment bias means that only more or less hard sweeps can be detected.

Paragraph six subsection “Genetic adaptation to a new niche environment”: There needs to be a proper test of whether it is surprising that significant variants cover genes in certain functional classes – I have no idea how many genes may be involved in osmoregulation, for example. This is especially an issue because haplotype blocks may cover multiple genes. A proper permutation test is needed here.

Subsection “Genomic distribution of causal variants”: The section on genomic distribution of causal variants was confused. It should be possible to make a rigorous statistical estimate of the fraction of markers that are likely to be causal from the extent of enrichment, but this needs to be done carefully.

Reviewer #3:

This is a comprehensive and nice study and well-written paper, with possible implications for inference on ecological adaptation of species other than the Atlantic herring.

My major comment is that I was surprised by how the paper was written with respect to the literature. After reading Lamichhaney et al. 2012 from the same group it is clear that the current study is a kind of (nice) follow-up from that. Yet the authors only refer to Lamichhaney et al. 2012 once in the Introduction and once in the Results and not at all in the Discussion. In my opinion, the authors should be upfront of what they found in L2012, how this study goes well beyond that, and how the current results confirms or contradicts previously reported results.

My quick reading of Lamichhaney et al. 2012 is that the authors reported: (i) low level population differentiation, (ii) evidence for local adaptation (salinity), (iii) evidence for selection on haplotype blocks, (iv) candidate genes under natural selection (salinity) and (v) evidence of natural selection of reproduction (spawning season).

[Editors’ note: what now follows is the decision letter after the authors submitted for further consideration.]

Thank you for submitting your work entitled "The genetic basis for ecological adaptation of the Atlantic herring revealed by genome sequencing" for consideration by eLife. Your article has been reviewed by three peer reviewers, including Nick Barton and Magnus Nordborg, who is a member of our Board of Reviewing Editors, and the evaluation has been overseen Diethard Tautz as the Senior Editor.

The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission.

Summary:

As emphasized in the decision letter encouraging you to resubmit this manuscript, the reviewers unanimously agree that this is one of the best data sets in existence for addressing questions about the architecture of local adaptation. The resubmitted version is greatly improved, but is still somewhat frustrating in that it was felt that much more could be done with the data. However, it was recognized that this would take the analysis well beyond the current state of the art, and that it would not be constructive to demand this. We actually do not yet have the intellectual framework for thinking about these kinds of data. It says on page 5 that "the results provide a comprehensive and detailed view on the genetic architecture underlying ecological adaptation". So what is the answer? How far can we go without reciprocal transplant studies?

Better then to publish, and be clear about what has and has not been demonstrated.

Essential revisions:

The distinction between causal and non-causal SNPs needs to be more clearly spelled out. The paper identifies a very large number of SNPs that differ significantly between groups. These tend to be clustered on the genome, and also enriched for certain functional types. Some clusters may arise by chance, whilst others may include one or more causal alleles, which tend to be in candidate loci. The enrichment analysis shows the clearest evidence for selection, but it seems to us that this could be consistent with multiple causal alleles at ~ 20 loci, rather than the "thousands" of SNPs cited in the abstract. It should be possible to estimate the minimum number of causal loci consistent with the observed clustering and enrichment, at least roughly.

The existence of long haplotypes distinguishing the different population is emphasized, but it is not clear whether these are in any sense unusual given the low effective population (surprisingly low given the vast census sizes in this species). It should be possible to use standard coalescent simulation to at least test whether the observed haplotype lengths are long compared to neutral expectations under the estimated demography and a range of plausible values for the recombination rate.

https://doi.org/10.7554/eLife.12081.036

Author response

[Editors’ note: the author responses to the first round of peer review follow.]

The consensus opinion of the reviewers is that this is a really interesting study in an excellent study system, but that the analysis is insufficient and several of the conclusions not supported. In particular, the reviewers were not convinced that you can actually identify the causal sites, and several conclusions rest on this. However, exploring the limits of what may be concluded is also of great interest, especially given the quality of the data. In essence, the analysis does not do the data justice.

First we would like to thank the reviewers for their constructive criticism of our paper that has clearly helped us to improve the paper. The most important change in the revised version is that we have included a much improved analysis of the genomic distribution of causal changes and this shows that we had underestimated the role of changes in the coding sequence based on the previous analysis.

Reviewer #1:

This is a substantial piece of work addressing an important question using a beautiful and promising system. It is certainly the case that identifying loci that show evidence having been under selection is much easier in gigantic population with little population structure, as appears to be the case for herring. Unfortunately, the paper suffers from several conceptual/logical flaws and I do not think the major conclusions are supported. The confusion begins in the first sentence of the abstract, where it is stated that "Ecological adaptation is of major relevance…, but the underlying genetic factors are typically hard to study in natural populations due to confounding between population structure and signatures of selection". It is indeed true that population structure makes it hard to identify signatures of selection, but who says that genes involved in ecological adaptation will exhibit such signatures? Generally, such signals arise as a consequence of strong selection on individual genes. For a quantitative trait, is not clear that there will be any such genes, and is also not clear that any such genes found will explain much of the variation.

We have changed the sentence in the Abstract to “…due to confounding between genetic differentiation caused by natural selection and by genetic drift in subdivided populations” to make a more specific statement. Furthermore, we think this is a major merit of the paper that it demonstrates that very convincing evidence of natural selection can be detected in this system.

Furthermore, there is no phenotype in the present study, nor any heritability. Thus the comparison with human height (Discussion section) does not make sense. You have no idea how much of the variation for fitness your SNPs explain. Indeed it is formally possibly that all your SNPs are associated with defense against a parasite with a life-cycle that depends on salinity. I obviously don't believe this, but the point remains that in order to discuss the genetic architecture of adaptation, there needs to be genetics in the study. For example, a reciprocal transplant study that quantified the fitness effects of the identified loci.

We disagree with the statement that there is no phenotype or genetics in this study. We have sampled fish from the same locality spawning in May, July and September and these were phenotypically classified as spring-, summer- and autumn-spawners. Our sequence analysis revealed highly significant allele frequency differences at hundreds of loci between these groups. This is a genetic result! However, the reviewer is 100% correct that we cannot estimate heritabilities since we do not have genotype-phenotype relationships for individual fish. We have therefore revised the Discussion and now explain that we cannot determine how much of the phenotypic variation these loci control.

Speaking of associations, I don't understand how the threshold for calling frequency differences significant was set (Figure 1—figure supplement 1). This is not explained ("used a QQ-plot" is not an explanation).

We have decided to rather use a Bonferroni correction since that is a conservative threshold in a study like this. This means that the significance threshold has changed from p=10x10-20 to 10x10-10, but this did not have any major effects on the main conclusions in this study. We still keep the Q-Q plot as Figure 1—figure supplement 1D to demonstrate that the distribution of P-values in the association deviates dramatically from the expected distribution under the null hypothesis of no genetic differentiation.

It is also stated that the presence of haplotype blocks was striking. Why? Presumably these are simply the consequence of selection? Are they longer than you would expect? Why? What is the evidence that they contain multiple causal sites as opposed to being the result of linkage drag?

We agree that we cannot exclude the possibility that linkage drag is an important explanation for haplotype blocks so we have modified the text in the Discussion. However, we do think that the new data showing the presence of multiple non-synonymous substitutions in many of the genes showing the strongest genetic differentiation support the hypothesis of haplotype evolution involving several causal changes within the haplotype blocks.

Regardless of why they are there, the existence of these blocks mandate extreme caution when trying to decide which sites are causal. It is by now clear that local haplotype structure, especially when coupled with allelic heterogeneity causes large number of local spurious associations, and that, as a consequence, the most strongly associated SNP is NOT likely to be the causal one: it is likely to be an intermediate-frequency SNP that happens to tag the underlying variants well… A consequence of this is that most of the conclusions about what kinds of SNPs do what (Subsection “Genomic distribution of causal variants”

) are not supported. You find few significantly associated non-synonymous substitutions, but this could simply be because they are all rare, and can only be found by proxy. It does make sense to compare different kinds of regions, as you do, but everything has to be controlled for allele frequency.

This section has been completely revised and the text rewritten. In fact our new improved analysis gives a different and more correct picture about what types of genetic changes are underlying ecological adaptation in herring.

On a more population genetics level, I was struck by your low nucleotide diversity coupled with a rapid decay of linkage disequilibrium. Unless mutation- and recombination rates are very different from other organisms, they make no sense. My guess is that your nucleotide diversity is underestimated because you are detecting heterozygotes in single individuals, and that your rate of linkage disequilibrium decay is overestimated because of bad SNPs.

The nucleotide diversity is calculated in both the ~70x coverage reference individual as well as the 16 ~10x coverage individual samples with very consistent results, indicating that the diversity is not substantially underestimated due to lack of power to detect heterozygous sites. Also, while perhaps low in some contexts, the observed value is higher than for other species of fish (See Table 1).

Regarding the genotype quality and LD decay, we observe strong concordance between SNPs called from sequencing and SNP chip (in individuals ~90% of heterozygote sites have matching calls (data not shown), and see Supplementary Figure 3 for the correlation between allele frequency in pooled samples and SNP-chip data). Admittedly, the SNP-chip markers are not a random subset of all SNPs and are likely to be of better quality than the genomic average, but we still have reasons to believe that our genotypes are generally good; we have also used the SNP chip for some pedigree analysis (data not shown). Also, inaccurate genotype calls are not likely to correlate with physical distance between SNPs, and will therefore not affect the shape of the LD decay curve. So, perhaps mutation and/or recombination data differ from other organisms?

Third paragraph subsection “Genetic basis underlying timing of reproduction” This something not right about this argument. Why would rare SNPs affect nucleotide diversity one way or another?

The text has been revised. The message is that spring-spawning herring possess a larger number of variable sites but the average heterozygosity per site is lower.

Subsection “Genomic distribution of causal variants”: "Some of these will be false positives due to close linkage to true positives but many will be true positives" I discuss this above, but it would be important to realise that, without assumptions about the meaning of "some" and "many", it is not possible to conclude any of what you conclude further down.

This entire section has been revised and rewritten.

Same section: "Enrichment" compared to what?

The section has been rewritten and a new Figure 5 has been prepared. Now it is better explained that we see an enrichment compared with the genome-wide frequency.

Reviewer #2:

This paper uses whole-genome sequencing from 20 population pools, supported by SNP genotyping of ~360 individuals, to identify variants associated with differences between Baltic and Atlantic, and between autumn and spring spawning herring. This is an impressive set of data, and as the authors argue, the recent divergence, large population size and low differentiation makes herring exceptionally well suited to identifying genes responsible for adaptation. Thus, the study could in principle be a very nice contribution to eLife. However, the interpretation of the results seems quite naive: more detailed and rigorous argument is needed to be convincing. SNP and structural variants associated with divergence are clustered into haplotype blocks that may span multiple genes, and up to 200kb. It seems to be assumed that all (or at least, many) of the variants within each block are causal, on the grounds that within the Atlantic population, LD decays very rapidly, over a hundred bp or so. This would be very interesting, implying that complex adaptive alleles build up as a result of successive substitutions in the same region. However, careful arguments are needed to exclude the more obvious alternative, that selection has raised specific haplotypes to high frequency. Suppose that the original population was indeed at linkage equilibrium. Under selection in a new environment, many alleles might increase at the same locus – either from standing variation or from new mutations. Such "soft sweeps" might not be detected in these data. However, at a subset of loci, one or a few haplotypes might increase, and their size would reflect the time since adaptation – presumably a few thousand generations, corresponding to perhaps 10-100kb. Even if two or three successive causal alleles were involved, the great majority of SNPs would still be neutral. Seeing LE within the base population does not address this issue; and the ascertainment bias means that only more or less hard sweeps can be detected.

For sure, we do not think that all variants in a haplotype block is causal, in fact we assume that the majority of highly differentiated SNPs are linked to causal SNPs. However, we think it is unlikely that these fairly large haplotype blocks under selection are due to single causal variants but we agree that it is difficult to exclude this possibility and we have therefore softened the discussion of this issue.

Paragraph six subsection “Genetic adaptation to a new niche environment”: There needs to be a proper test of whether it is surprising that significant variants cover genes in certain functional classes – I have no idea how many genes may be involved in osmoregulation, for example. This is especially an issue because haplotype blocks may cover multiple genes. A proper permutation test is needed here.

We have attempted to carry out proper GO analysis to address this but our conclusion is that the GO annotations of fish genes are too poor to make this a powerful analysis. However, we find it encouraging that several of the top hits make perfect sense, such as the prolactin receptor and high choriolytic enzyme genes associated with adaptation to Baltic Sea – Atlantic Ocean that differs dramatically as regards salinity, and the TSHR, SOX11, CALM and ESRB loci associated with spawning time which all have an established role in reproductive biology.

Subsection “Genomic distribution of causal variants”: The section on genomic distribution of causal variants was confused. It should be possible to make a rigorous statistical estimate of the fraction of markers that are likely to be causal from the extent of enrichment, but this needs to be done carefully.

This section has been completely revised and the text rewritten.

Reviewer #3:

This is a comprehensive and nice study and well-written paper, with possible implications for inference on ecological adaptation of species other than the Atlantic herring. My major comment is that I was surprised by how the paper was written with respect to the literature. After reading Lamichhaney et al. 2012 from the same group it is clear that the current study is a kind of (nice) follow-up from that. Yet the authors only refer to Lamichhaney et al. 2012 once in the Introduction and once in the Results and not at all in the Discussion. In my opinion, the authors should be upfront of what they found in Lamichhaney et al. 2012, how this study goes well beyond that, and how the current results confirms or contradicts previously reported results. My quick reading of Lamichhaney et al. 2012, is that the authors reported: (i) low level population differentiation, (ii) evidence for local adaptation (salinity), (iii) evidence for selection on haplotype blocks, (iv) candidate genes under natural selection (salinity) and (v) evidence of natural selection of reproduction (spawning season).

We have revised the Introduction and now cite our previous work more extensively in the Introduction.

[Editors’ note: the author responses to the re-review follow.]

Summary: As emphasized in the decision letter encouraging you to resubmit this manuscript, the reviewers unanimously agree that this is one of the best data sets in existence for addressing questions about the architecture of local adaptation. The resubmitted version is greatly improved, but is still somewhat frustrating in that it was felt that much more could be done with the data. However, it was recognized that this would take the analysis well beyond the current state of the art, and that it would not be constructive to demand this. We actually do not yet have the intellectual framework for thinking about these kinds of data. It says on page 5 that "the results provide a comprehensive and detailed view on the genetic architecture underlying ecological adaptation". So what is the answer? How far can we go without reciprocal transplant studies?

We agree that we really do not know how much of the genetic variance our loci control so we have changed this sentence as follows:

“The results provide a comprehensive list of hundreds of independent loci underlying ecological adaptation and shed light on the relative importance of coding and non-coding variation.”

Better then to publish, and be clear about what has and has not been demonstrated. Essential revisions: The distinction between causal and non-causal SNPs needs to be more clearly spelled out. The paper identifies a very large number of SNPs that differ significantly between groups. These tend to be clustered on the genome, and also enriched for certain functional types. Some clusters may arise by chance, whilst others may include one or more causal alleles, which tend to be in candidate loci. The enrichment analysis shows the clearest evidence for selection, but it seems to us that this could be consistent with multiple causal alleles at ~ 20 loci, rather than the "thousands" of SNPs cited in the abstract. It should be possible to estimate the minimum number of causal loci consistent with the observed clustering and enrichment, at least roughly.

In the current version of the paper we provide conservative estimates of the minimum number of independent loci detected in this study. We requested a statistical support of at least 1x10-20 to consider a region to make sure that the number of false positives is very low if any. We took into account gaps in the assembly, we looked at the correlation between SNP P-values within regions and we required a distance of at least 20 kb without highly significant SNPs to close a region. Using these criteria we identified 472 independent loci associated with the Atlantic-Baltic contrast and 125 independent loci distinguishing autumn- and spring spawners. It is clear from Figures 3A and 4A that the number of loci cannot be as few as ~20.

The existence of long haplotypes distinguishing the different population is emphasized, but it is not clear whether these are in any sense unusual given the low effective population (surprisingly low given the vast census sizes in this species). It should be possible to use standard coalescent simulation to at least test whether the observed haplotype lengths are long compared to neutral expectations under the estimated demography and a range of plausible values for the recombination rate.

We think the most important question to explore is if the large haplotype blocks are caused by strong, fairly recent selective sweeps or the evolution of haplotypes harbouring multiple causal alleles. We took advantage of the fact that these two models give entirely different predictions as regards the nucleotide diversity within the differentiated regions within populations. The selective sweep/hitchhiking model is expected to lead to a loss of nucleotide diversity whereas the haplotype evolution model predicts that nucleotide diversity in the differentiated regions can be as high or even higher than in the rest of the genome if these haplotype blocks have been maintained by balancing selection. Our analysis presented in the revised section “Adaptive haplotype blocks are maintained by selection” and in the new Figure 5 shows that nucleotide diversity is in fact higher in the regions showing strong differentiation even within populations. We therefore conclude that our data are consistent with evolution of adaptive haplotypes carrying multiple causal variants. We think this is an important improvement of the paper.

https://doi.org/10.7554/eLife.12081.037

Article and author information

Author details

  1. Alvaro Martinez Barrio

    1. Science for Life Laboratory, Department of Medical Biochemistry and Microbiology, Uppsala University, Uppsala, Sweden
    2. Science for Life Laboratory, Department of Cell and Molecular Biology, Uppsala University, Uppsala, Sweden
    Contribution
    AMB, Contributed to the assembly and genome annotation, Performed bioinformatic analysis of population data, Wrote the paper
    Contributed equally with
    Sangeet Lamichhaney, Guangyi Fan and Nima Rafati
    Competing interests
    The authors declare that no competing interests exist.
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0001-5064-2093
  2. Sangeet Lamichhaney

    Science for Life Laboratory, Department of Medical Biochemistry and Microbiology, Uppsala University, Uppsala, Sweden
    Contribution
    SL, Performed bioinformatic analysis of population data, Wrote the paper
    Contributed equally with
    Alvaro Martinez Barrio, Guangyi Fan and Nima Rafati
    Competing interests
    The authors declare that no competing interests exist.
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0003-4826-0349
  3. Guangyi Fan

    1. State Key Laboratory of Quality Research in Chinese Medicine, Institute of Chinese Medical Sciences, University of Macau, Macau, China
    2. BGI-Shenzhen, Shenzen, China
    Contribution
    GF, Contributed to the assembly and genome annotation
    Contributed equally with
    Alvaro Martinez Barrio, Sangeet Lamichhaney and Nima Rafati
    Competing interests
    The authors declare that no competing interests exist.
  4. Nima Rafati

    Science for Life Laboratory, Department of Medical Biochemistry and Microbiology, Uppsala University, Uppsala, Sweden
    Contribution
    NRa, Performed bioinformatic analysis of population data
    Contributed equally with
    Alvaro Martinez Barrio, Sangeet Lamichhaney and Guangyi Fan
    Competing interests
    The authors declare that no competing interests exist.
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-3687-9745
  5. Mats Pettersson

    Science for Life Laboratory, Department of Medical Biochemistry and Microbiology, Uppsala University, Uppsala, Sweden
    Contribution
    MP, Performed bioinformatic analysis of population data
    Competing interests
    The authors declare that no competing interests exist.
  6. He Zhang

    1. BGI-Shenzhen, Shenzen, China
    2. College of Physics, Qingdao University, Qingdao, China
    Contribution
    HZ, Contributed to the assembly and genome annotation
    Competing interests
    The authors declare that no competing interests exist.
  7. Jacques Dainat

    1. Science for Life Laboratory, Department of Medical Biochemistry and Microbiology, Uppsala University, Uppsala, Sweden
    2. Bioinformatics Infrastructure for Life Sciences, Uppsala University, Uppsala, Sweden
    Contribution
    JD, Contributed to the assembly and genome annotation
    Competing interests
    The authors declare that no competing interests exist.
  8. Diana Ekman

    Science for Life Laboratory, Department of Biochemistry and Biophysics, Stockholm University, Stockholm, Sweden
    Contribution
    DE, Contributed to the assembly and genome annotation
    Competing interests
    The authors declare that no competing interests exist.
  9. Marc Höppner

    1. Science for Life Laboratory, Department of Medical Biochemistry and Microbiology, Uppsala University, Uppsala, Sweden
    2. Bioinformatics Infrastructure for Life Sciences, Uppsala University, Uppsala, Sweden
    Contribution
    MH, Contributed to the assembly and genome annotation
    Competing interests
    The authors declare that no competing interests exist.
  10. Patric Jern

    Science for Life Laboratory, Department of Medical Biochemistry and Microbiology, Uppsala University, Uppsala, Sweden
    Contribution
    PJ, Contributed to the assembly and genome annotation
    Competing interests
    The authors declare that no competing interests exist.
  11. Marcel Martin

    Science for Life Laboratory, Department of Biochemistry and Biophysics, Stockholm University, Stockholm, Sweden
    Contribution
    MM, Contributed to the assembly and genome annotation
    Competing interests
    The authors declare that no competing interests exist.
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-0680-200X
  12. Björn Nystedt

    Science for Life Laboratory, Department of Cell and Molecular Biology, Uppsala University, Uppsala, Sweden
    Contribution
    BN, Contributed to the assembly and genome annotation
    Competing interests
    The authors declare that no competing interests exist.
  13. Xin Liu

    BGI-Shenzhen, Shenzen, China
    Contribution
    XLi, Contributed to the assembly and genome annotation
    Competing interests
    The authors declare that no competing interests exist.
  14. Wenbin Chen

    BGI-Shenzhen, Shenzen, China
    Contribution
    WC, Contributed to the assembly and genome annotation
    Competing interests
    The authors declare that no competing interests exist.
  15. Xinming Liang

    BGI-Shenzhen, Shenzen, China
    Contribution
    XLia, Contributed to the assembly and genome annotation
    Competing interests
    The authors declare that no competing interests exist.
  16. Chengcheng Shi

    BGI-Shenzhen, Shenzen, China
    Contribution
    CS, Contributed to the assembly and genome annotation
    Competing interests
    The authors declare that no competing interests exist.
  17. Yuanyuan Fu

    1. BGI-Shenzhen, Shenzen, China
    2. School of Biological Science and Medical Engineering, Southeast University, Nanjing, China
    Contribution
    YF, Contributed to the assembly and genome annotation
    Competing interests
    The authors declare that no competing interests exist.
  18. Kailong Ma

    BGI-Shenzhen, Shenzen, China
    Contribution
    KM, Contributed to the assembly and genome annotation
    Competing interests
    The authors declare that no competing interests exist.
  19. Xiao Zhan

    BGI-Shenzhen, Shenzen, China
    Contribution
    XZ, Contributed to the assembly and genome annotation
    Competing interests
    The authors declare that no competing interests exist.
  20. Chungang Feng

    Science for Life Laboratory, Department of Medical Biochemistry and Microbiology, Uppsala University, Uppsala, Sweden
    Contribution
    CF, Performed experimental work
    Competing interests
    The authors declare that no competing interests exist.
  21. Ulla Gustafson

    Department of Animal Breeding and Genetics, Swedish University of Agricultural Sciences, Uppsala, Sweden
    Contribution
    UG, Performed experimental work
    Competing interests
    The authors declare that no competing interests exist.
  22. Carl-Johan Rubin

    Science for Life Laboratory, Department of Medical Biochemistry and Microbiology, Uppsala University, Uppsala, Sweden
    Contribution
    C-JR, Performed bioinformatic analysis of population data
    Competing interests
    The authors declare that no competing interests exist.
  23. Markus Sällman Almén

    Science for Life Laboratory, Department of Medical Biochemistry and Microbiology, Uppsala University, Uppsala, Sweden
    Contribution
    MSA, Analysis and interpretation of data, Contributed unpublished essential data or reagents
    Competing interests
    The authors declare that no competing interests exist.
  24. Martina Blass

    Department of Aquatic Resources, Institute of Coastal Research, Swedish University of Agricultural Sciences, Öregrund, Sweden
    Contribution
    MB, Provided population samples
    Competing interests
    The authors declare that no competing interests exist.
  25. Michele Casini

    Department of Aquatic Resources, Institute of Marine Research, Swedish University of Agricultural Sciences, Lysekil, Sweden
    Contribution
    MC, Provided population samples
    Competing interests
    The authors declare that no competing interests exist.
  26. Arild Folkvord

    1. Department of Biology, University of Bergen, Bergen, Norway
    2. Hjort Center of Marine Ecosystem Dynamics, Bergen, Norway
    3. Institute of Marine Research, Bergen, Norway
    Contribution
    AF, Provided population samples
    Competing interests
    The authors declare that no competing interests exist.
  27. Linda Laikre

    Department of Zoology, Stockholm University, Stockholm, Sweden
    Contribution
    LL, Provided population samples
    Competing interests
    The authors declare that no competing interests exist.
  28. Nils Ryman

    Department of Zoology, Stockholm University, Stockholm, Sweden
    Contribution
    NRy, Provided population samples
    Competing interests
    The authors declare that no competing interests exist.
  29. Simon Ming-Yuen Lee

    State Key Laboratory of Quality Research in Chinese Medicine, Institute of Chinese Medical Sciences, University of Macau, Macau, China
    Contribution
    SM-YL, Contributed to the assembly and genome annotation
    Competing interests
    The authors declare that no competing interests exist.
  30. Xun Xu

    BGI-Shenzhen, Shenzen, China
    Contribution
    XX, Contributed to the assembly and genome annotation
    Competing interests
    The authors declare that no competing interests exist.
  31. Leif Andersson

    1. Science for Life Laboratory, Department of Medical Biochemistry and Microbiology, Uppsala University, Uppsala, Sweden
    2. Department of Animal Breeding and Genetics, Swedish University of Agricultural Sciences, Uppsala, Sweden
    3. Department of Veterinary Integrative Biosciences, Texas A&M University, Texas, United States
    Contribution
    LA, Conceived the study, Led bioinformatic analysis of population data, Wrote the paper, Acquisition of data, Analysis and interpretation of data
    For correspondence
    leif.andersson@imbim.uu.se
    Competing interests
    The authors declare that no competing interests exist.
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-4085-6968

Funding

European Research Council (Bateson)

  • Leif Andersson

Swedish Research Council Formas (Research grant)

  • Leif Andersson

Knut och Alice Wallenbergs Stiftelse (Research grant)

  • Leif Andersson

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

We are grateful to Greg Barsh, Nick Barton, Örjan Carlborg, Miguel Carneiro, Peter and Rosemary Grant, Magnus Nordborg, Lars Rönnegård, Matthew Webster and an anonymous reviewer for valuable comments on the manuscript, and to Sam Barsh and David Tengbjörk for expert technical assistance. The work was funded by the ERC project BATESON, the Swedish Research Council and Formas. The BGI group was supported by the Special Project on the Integration of Industry, Education and Research of Guangdong Province (2013B090800017) and the Shenzhen Special Program on Future Industrial Development (JSGG20141020113728803). AMB, MM, DE and BN are supported by a grant from the Knut and Alice Wallenberg Foundation to the Wallenberg Advanced Bioinformatics Infrastructure. Sequencing was performed by the SNP&SEQ Technology Platform, supported by Uppsala University and Hospital, SciLifeLab and Swedish Research Council (80576801 and 70374401). SNP typing was performed by the Array and Analysis Facility, SciLifeLab, Uppsala, Sweden. Computer resources were provided by UPPMAX, Uppsala University.

Reviewing Editor

  1. Magnus Nordborg, Vienna Biocenter, Austria

Publication history

  1. Received: October 5, 2015
  2. Accepted: April 6, 2016
  3. Version of Record published: May 3, 2016 (version 1)

Copyright

© 2016, Martinez Barrio et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Metrics

  • 4,564
    Page views
  • 865
    Downloads
  • 30
    Citations

Article citation count generated by polling the highest count across the following sources: Scopus, Crossref, PubMed Central.

Download links

A two-part list of links to download the article, or parts of the article, in various formats.

Downloads (link to download the article as PDF)

Download citations (links to download the citations from this article in formats compatible with various reference manager tools)

Open citations (links to open the citations from this article in various online reference manager services)

Further reading

    1. Evolutionary Biology
    2. Genetics and Genomics
    John Grey Monroe et al.
    Research Article
    1. Evolutionary Biology
    2. Genetics and Genomics
    Eric CH Chen et al.
    Research Article