Copy number variation and population-specific immune genes in the model vertebrate zebrafish
Figures

Structure of zebrafish NLRs and a map showing the origin of wild zebrafish samples.
(A) Generalized, schematic representation of the domain architecture of an NLR-C protein. Each box represents a translated exon. The N-terminal repeats, the death-fold domain, as well as the B30.2 domain only occur in subsets of NLR-C genes. The number of N-terminal repeats and leucine-rich repeats can vary. Domains that can be either present or absent in different NLRs are surrounded by square brackets. (B) Sampling sites for wild zebrafish. All sites are located near the Bay of Bengal. Final sequenced sample sizes are indicated in parentheses. The map is based on geographic data collected and published by AQUASTAT from the Food and Agriculture Organization of the United Nations (FAO, 2021). The population DP is marked with an asterisk because its analysis and results are presented only in figure supplements.

Total counts of NLRs found per individual, shown for each population.
Black diamonds on the box plots denote means, horizontal lines denote medians. Left side: two laboratory strains; right side: three wild populations.
- Figure 2—source data 1: Source tables for Figure 2 and its supplements.
  https://cdn.elifesciences.org/articles/98058/elife-98058-fig2-data1-v2.xlsx
- Figure 2—source data 2: Sequences and target locations of RNA baits.
  https://cdn.elifesciences.org/articles/98058/elife-98058-fig2-data2-v2.xlsx

Sequencing and assembly statistics of circular consensus sequence (CCS) reads from NLR exons.
(A1) Absolute numbers of CCS reads from NLR exons per sequenced individual. (A2) Lengths of the CCS reads that map to NLR genes. (B1) Absolute numbers of assembled contigs containing an NLR exon per sequenced individual. Triangles above TU mark the numbers of NLR exons found in the reference genome. (B2) Lengths of the individual assembled contigs that contain an NLR exon. Outliers are not shown in the box plots. Black diamonds on the box plots denote means; horizontal black lines denote medians.

Assembled NLRs in the reference genome GRCz11.
(A) Proportions of unique FISNA-NACHT and NLR-B30.2 sequences that were successfully mapped to the reference genome GRCz11 with a mapping quality of 60, by population. (B) Distribution of mapping qualities for all unique NLR sequences that aligned to GRCz11, showing that most map either with very high (60) or very low quality.

Identification of B30.2 domains associated with zebrafish NLRs.
(A) Nucleotide sequence logo (on top, continued on bottom left) and amino acid logo (on bottom right) for a small, highly conserved 47 bp exon that precedes B30.2 in zebrafish NLRs but not in other genes. The first nucleotide of the exon was removed to generate the correct amino acid translation. Logos were created with WebLogo; the height of each base represents its information content in bits (Crooks et al., 2004). (B) Absolute numbers of contigs containing a B30.2 exon per sequenced individual, split by presence/absence of the NLR-specific exon. The black diamonds on the box plots mark the means. (C) Genomic distribution of FISNA-NACHT domains in the GRCz11 reference genome. (D) Genomic distribution of B30.2 domains in the GRCz11 reference genome.

Copy number variation of NLR genes.
(A) Sequence data from each individual zebrafish (vertical axis) were aligned to FISNA-NACHT exon sequences of the pan-NLRome (horizontal axis). Grayscale intensity shows, for each NLR, the proportion of NLR-aligning data in a given fish that matches this specific gene. Darker gray indicates a higher likelihood that this NLR is present in multiple copies in that individual; light gray indicates a single copy, and white indicates absence. For clarity, only the 1235 FISNA-NACHT exons for which at least one fish had a minimum of 10 mapped reads are shown. (B) Numbers of pan-NLRome sequences (based on FISNA-NACHT diagnosis) found in all three, two, or only one wild population. (C) Relative numbers of fish in which pan-NLRome sequences were found in wild populations. ‘Core’ pan-NLRome: genes found in at least 80% of the sample (from a total of 57 wild fish); ‘shell’: genes found in at least 20% but fewer than 80%; ‘cloud’: rare genes found in less than 20% of the sample. (D) Observed and estimated sizes of population-specific pan-NLRomes. Data points (filled circles and squares) show the average total number of NLR genes discovered (as identified via their FISNA-NACHT domain) when investigating a given number of fish. The dashed line is obtained by a non-linear fit of the data to the function given in Equation 2. For all populations, the extrapolated hypothetical pan-NLRome size is finite (see Table 1).
- Figure 3—source data 1: Source tables for Figure 3 and its supplements.
  https://cdn.elifesciences.org/articles/98058/elife-98058-fig3-data1-v2.xlsx
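The core/shell/cloud split described in panel C can be sketched in a few lines. This is a minimal illustration using the thresholds stated in the caption (at least 80% for core, at least 20% for shell, otherwise cloud); the function name `classify_pan_genome`, the gene labels, and the toy presence/absence data are hypothetical and not taken from the study.

```python
def classify_pan_genome(presence, core_min=0.80, shell_min=0.20):
    """Assign each gene to 'core', 'shell', or 'cloud'.

    presence: dict mapping a gene name to a list of booleans,
              one entry per fish (True = gene found in that fish).
    Thresholds follow the caption: core >= 80%, shell >= 20%, cloud < 20%.
    """
    classes = {}
    for gene, calls in presence.items():
        freq = sum(calls) / len(calls)  # fraction of fish carrying the gene
        if freq >= core_min:
            classes[gene] = "core"
        elif freq >= shell_min:
            classes[gene] = "shell"
        else:
            classes[gene] = "cloud"
    return classes


# Toy data (hypothetical): three NLRs scored in 10 fish.
presence = {
    "nlr_a": [True] * 9 + [False] * 1,  # found in 9 of 10 fish
    "nlr_b": [True] * 4 + [False] * 6,  # found in 4 of 10 fish
    "nlr_c": [True] * 1 + [False] * 9,  # found in 1 of 10 fish
}
classes = classify_pan_genome(presence)
print(classes)
```

With 57 wild fish, as in the caption, the same function applies unchanged; only the length of each presence list differs.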

Comparison of copy number variation in FISNA-NACHT and NLR-B30.2 exons.
(A1, A2) Numbers of private and shared NLR sequences in wild populations. (B1, B2) Numbers of unique NLR sequences (each with one or more copies per individual) found in fish of the sequenced strains. Black diamonds on the box plots denote means; horizontal lines denote medians. (C1, C2) Population-specific pan-NLRomes and sets of NLR-B30.2 domains. Data points (filled circles and squares) show the average total number of NLR genes discovered (as identified via their FISNA-NACHT domain) in a given number of individuals. The dotted lines represent the result of non-linear curve fitting (detailed in ‘Materials and methods’).

Copy number variation of NLR genes, including the DP population.
(A) Sequence data from each individual zebrafish (x-axis) were aligned to FISNA-NACHT exon sequences of the pan-NLRome (y-axis). Grayscale intensity shows, for each NLR, the proportion of NLR-aligning data in a given fish that matches this specific gene. Darker shades suggest multiple copies, lighter shades indicate a single copy, and white means that the sequence was not present. For clarity, only the 1235 FISNA-NACHT exons for which at least one fish had a minimum of 10 mapped reads are shown. (B) Relative numbers of fish in which the pan-NLRome sequences were found in wild populations. Some sequences belong to the core pan-NLRome (found in at least 80% of fish), while others are classified as shell (in at least 20% of fish) or cloud (in less than 20%). (C) Numbers of unique NLR sequences (each with one or more copies per individual) found in fish of the sequenced strains. Black diamonds on the box plot denote means; horizontal lines denote medians. (D, E) Principal component analysis of scaled-per-individual NLR (FISNA-NACHT) copy numbers. The first two components appear to separate the data by differences between wild and laboratory zebrafish (PC1) and by geographic distance (PC2).
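The per-individual scaling and the 10-read display filter mentioned in panel A can be illustrated as follows. This is only a sketch under stated assumptions: the input is taken to be a table of mapped-read counts per fish and exon, and the function names (`scale_per_individual`, `exons_to_display`), fish labels, and counts are hypothetical rather than the authors' pipeline.

```python
def scale_per_individual(read_counts):
    """Convert raw read counts into per-fish proportions.

    read_counts: dict fish -> dict exon -> number of mapped reads.
    Returns the same structure with each count divided by the fish's
    total NLR-aligning reads, so values within one fish sum to 1.
    """
    scaled = {}
    for fish, counts in read_counts.items():
        total = sum(counts.values())
        scaled[fish] = {
            exon: (n / total if total else 0.0) for exon, n in counts.items()
        }
    return scaled


def exons_to_display(read_counts, min_reads=10):
    """Keep exons for which at least one fish has >= min_reads mapped reads."""
    keep = set()
    for counts in read_counts.values():
        keep.update(exon for exon, n in counts.items() if n >= min_reads)
    return sorted(keep)


# Toy counts (hypothetical): two fish, three exons.
read_counts = {
    "fish1": {"exonA": 30, "exonB": 10, "exonC": 0},
    "fish2": {"exonA": 12, "exonB": 0, "exonC": 3},
}
scaled = scale_per_individual(read_counts)
shown = exons_to_display(read_counts)
print(scaled["fish1"]["exonA"], shown)
```

Scaling within each fish makes individuals with different sequencing depth comparable, which is presumably why the heat map shows proportions rather than raw counts.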

Single-nucleotide variation in NLR exons.
Pairwise nucleotide diversity and Watterson’s estimator of the scaled mutation rate for FISNA-NACHT (A) and NLR-associated B30.2 (B) exons. (C) Proportion of exons without any single-nucleotide polymorphisms. (D) Ratio of the two estimators. Only exons with at least one single-nucleotide polymorphism are shown. The dotted horizontal line marks a ratio of 1, the expected value under neutrality and constant population size. The black diamonds on the box plots denote means; horizontal lines denote medians.
- Figure 4—source data 1: Source tables for Figure 4 and its supplement.
  https://cdn.elifesciences.org/articles/98058/elife-98058-fig4-data1-v2.xlsx
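The two diversity estimators compared in this figure can be computed from aligned sequences with their textbook definitions. The sketch below assumes a small set of equal-length haploid sequences and reports per-site values; the function name `diversity_stats` and the toy alignment are hypothetical, and this is not a reconstruction of the variant-calling workflow used in the study.

```python
from itertools import combinations


def diversity_stats(seqs):
    """Return (pi, theta_w, segregating_sites) for aligned haploid sequences.

    pi      : mean number of pairwise differences, per site
    theta_w : Watterson's estimator S / a_n, per site,
              with a_n = sum_{i=1}^{n-1} 1/i for n sequences
    """
    n = len(seqs)
    length = len(seqs[0])

    # A site is segregating if more than one allele is observed in the column.
    seg_sites = sum(1 for column in zip(*seqs) if len(set(column)) > 1)
    a_n = sum(1.0 / i for i in range(1, n))
    theta_w = seg_sites / a_n / length

    # Average the Hamming distance over all unordered pairs of sequences.
    pairs = list(combinations(seqs, 2))
    total_diff = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs)
    pi = total_diff / len(pairs) / length
    return pi, theta_w, seg_sites


# Toy alignment (hypothetical): 4 sequences, 8 sites, 2 segregating sites.
seqs = ["ACGTACGT", "ACGTACGA", "ACGAACGT", "ACGTACGT"]
pi, theta_w, seg_sites = diversity_stats(seqs)
print(pi, theta_w, pi / theta_w)
```

Under neutrality and constant population size both estimators have the same expectation, so their ratio is expected to be close to 1, which is the reference line drawn in panel D.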

Single-nucleotide polymorphisms of different NLR exons shown by population, including DP.
(A) Nucleotide diversity and Watterson’s estimator for FISNA-NACHT exons. (B) Nucleotide diversity and Watterson’s estimator for NLR-associated B30.2 exons. (C) Proportion of exons that are completely monomorphic. (D) Ratio of the two estimators. Only exons with at least one variant are shown. The black dotted line marks a ratio of 1. The black diamonds on the box plots denote means; horizontal lines denote medians.
Tables
Values of fitted parameters and saturation limits for FISNA-NACHT and NLR-B30.2 exons, by population.
Population | FISNA-NACHT α | FISNA-NACHT β | FISNA-NACHT Limit | FISNA-NACHT Quantile* | NLR-B30.2 α | NLR-B30.2 β | NLR-B30.2 Limit | NLR-B30.2 Quantile* |
---|---|---|---|---|---|---|---|---|
TU | 178.274 | 1.43356 | 519.548 | 118 | 53.8579 | 1.40774 | 164.73 | 164 |
CGN | 257.207 | 1.62786 | 569.367 | 23 | 78.7156 | 1.61283 | 177.246 | 25 |
DP | 309.14 | 1.01231 | 25284 | 2930† | 69.3609 | 0.87454 | ∞ | na |
KG | 436.761 | 1.2152 | 2288.41 | 2060 | 145.715 | 1.1418 | 1113.23 | 6.41e6 |
SN | 479.892 | 1.26093 | 2152.12 | 3907 | 145.548 | 1.10183 | 1514.35 | 3.75e9 |
CHT | 416.712 | 1.18893 | 2451.81 | 1.12e5 | 135.677 | 1.11911 | 1218.54 | 1.41e8 |
* Sample size required to capture 90% of the population’s pan-NLRome.
† For DP, the required sample size refers to only 10% (instead of 90%) of its hypothetical pan-NLRome size.
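The relation between the fitted parameters and the last two columns can be checked numerically. Equation 2 is not reproduced in this section, so the sketch below rests on an assumption: that the fitted curve is a cumulative power law, with the number of genes found in n fish equal to α·Σ k^(−β) for k = 1…n. Under that assumption the limit is α·ζ(β) for β > 1 and infinite otherwise, which agrees with the tabulated limits. The function names are hypothetical.

```python
import math


def zeta(s, terms=1000):
    """Riemann zeta for s > 1: direct sum plus an Euler-Maclaurin tail."""
    head = sum(k ** -s for k in range(1, terms))
    tail = (terms ** (1 - s) / (s - 1)
            + 0.5 * terms ** -s
            + s * terms ** (-s - 1) / 12)
    return head + tail


def saturation_limit(alpha, beta):
    """Limit of alpha * sum_{k>=1} k^-beta (assumed model); inf if beta <= 1."""
    return alpha * zeta(beta) if beta > 1 else math.inf


def sample_size_for_fraction(alpha, beta, fraction=0.9, max_n=10**6):
    """Smallest n whose cumulative sum reaches `fraction` of the limit.

    Returns None if the limit is infinite or n would exceed max_n.
    """
    limit = saturation_limit(alpha, beta)
    if math.isinf(limit):
        return None
    target = fraction * limit
    found = 0.0
    for n in range(1, max_n + 1):
        found += alpha * n ** -beta
        if found >= target:
            return n
    return None


# TU, FISNA-NACHT row of the table above.
tu_limit = saturation_limit(178.274, 1.43356)
tu_n90 = sample_size_for_fraction(178.274, 1.43356)
# DP, NLR-B30.2 row: beta < 1, so the assumed model does not saturate.
dp_limit = saturation_limit(69.3609, 0.87454)
print(tu_limit, tu_n90, dp_limit)
```

For the TU FISNA-NACHT row this gives a limit close to the tabulated 519.548, and the infinite result for the DP NLR-B30.2 row matches the ∞ entry. Rows with β only slightly above 1 saturate very slowly, which is consistent with the very large quantiles listed for the wild populations.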
Reagent type (species) or resource | Designation | Source or reference | Identifiers | Additional information |
---|---|---|---|---|
Strain (Danio rerio) | Cologne zebrafish; CGN; KOLN | Other | 8 Cologne fish, AG Hammerschmidt, University of Cologne | |
Strain (D. rerio) | Tübingen zebrafish; TU | Other | 8 Tübingen fish, AG Hammerschmidt, University of Cologne | |
Biological sample (D. rerio) | DP | Other | 20 wild fish, Dandiapalli, India (22.22155, 84.79430) | |
Biological sample (D. rerio) | CHT | Other | 20 wild fish, Chittagong, Bangladesh (22.47400, 91.78300) | |
Biological sample (D. rerio) | KG | Other | 20 wild fish, Leturakhal, India (22.26189, 87.27881) | |
Biological sample (D. rerio) | SN | Other | 20 wild fish, Santoshpur, India (22.93765, 88.55311) | |
Sequence-based reagent | Baits; RNA baits; hybridization baits | Daicel Arbor Biosciences | Cat# Mybaits-1-24 | Sequences available in Figure 2—source data 2 |
Commercial assay or kit | MagAttract HMW DNA Kit | QIAGEN | Cat# 67563 | |
Commercial assay or kit | NucleoSpin Tissue Kit | MACHEREY-NAGEL | Cat# 740952.50 | |
Commercial assay or kit | NEBNext Ultra II DNA Library Prep Kit | New England Biolabs | Cat# E7645L | |
Sequence-based reagent | NEBNext Multiplex Oligos for Illumina | New England Biolabs | Cat# E7335L | Index Primers Set 1 |
Commercial assay or kit | Kapa HiFi Hotstart Readymix | Kapa Biosystems | Cat# 07958935001 | |
Commercial assay or kit | PreCR Repair Mix | New England Biolabs | Cat# M0309L | |
Commercial assay or kit | SMRTbell Template Prep Kit 1.0-SPv3 | Pacific Biosciences | Cat# 100-991-900 | |
Other | GRCz11 | NCBI RefSeq | RefSeq:GCF_000002035.6 | Zebrafish reference genome |
Other | M220 miniTUBE, Red | Covaris | Cat# 4482266 | Used to shear DNA on Covaris ultrasonicator |
Other | DB MyOne Streptavidin C1 | Thermo Fisher Scientific | Cat# 65001 | Used to retrieve bait-bound DNA fragments |
Other | AMPure XP | Beckman Coulter | Cat# A63881 | Size selection beads |
Other | Ampure PB | Pacific Biosciences | Cat# 100-265-900 | PacBio-compatible size selection beads |
Software, algorithm | lima | Pacific Biosciences | lima:v1.0.0; lima:v1.8.0; lima:v1.9.0; lima:v1.11.0 | |
Software, algorithm | ccs | Pacific Biosciences | ccs:v4.2.0 | |
Software, algorithm | pbmarkdup | Pacific Biosciences | pbmarkdup:v1.0.0 | |
Software, algorithm | pbmm2 | Pacific Biosciences | pbmm2:v1.3.0 | |
Software, algorithm | samtools | https://doi.org/10.1093/bioinformatics/btp352 | samtools:v1.7 | |
Software, algorithm | EMBOSS | https://doi.org/10.1016/s0168-9525(00)02024-2 | EMBOSS:v6.6.0.0 | |
Software, algorithm | HMMER | https://doi.org/10.1093/bioinformatics/btt403 | HMMER:v3.2.1 | |
Software, algorithm | blastn | https://doi.org/10.1186/1471-2105-10-421 | blastn:v2.11.0+ | |
Software, algorithm | hifiasm | https://doi.org/10.1038/s41592-020-01056-5 | hifiasm:v0.15.4-r347 | |
Software, algorithm | get_homologues | https://doi.org/10.1128/AEM.02411-13 | get_homologues:x86_64-20220516 | |
Software, algorithm | deepvariant | https://doi.org/10.1038/nbt.4235 | deepvariant:r1.0 | |
Software, algorithm | GLnexus | https://doi.org/10.1101/343970 | GLnexus:v1.2.7-0-g0e74fc4 | |
Software, algorithm | vcftools | https://doi.org/10.1093/bioinformatics/btr330 | vcftools:v0.1.16 |
Sequencing scheme for the zebrafish samples.
Libraries sequenced after the introduction of an improved (long run) sequencing chemistry are marked with LR. Samples that yielded no data after sequencing are marked with asterisks.
Individuals | Library | Sequencer |
---|---|---|
TU01, TU02, TU03, TU06 | TU L1 | Sequel |
TU08, TU10, TU12, TU14 | TU L2 | Sequel |
CGN1, CGN2, CGN3, CGN4 | CGN L1 | Sequel |
CGN5, CGN6, CGN7, CGN8 | CGN L2 | Sequel |
DP07, DP09, DP10, DP12 | DP L1 | Sequel |
DP15, DP20, DP23, DP24, DP25, DP28, DP31, DP34 | DP L2 | Sequel (LR) |
DP03, DP05, DP13, DP16, DP21, DP29, DP31, DP33 | DP L3 | Sequel (LR) |
KG35, KG41, KG42, KG43 | KG L1 | Sequel |
KG03, KG05, KG07, KG12, KG14, KG15, KG18, KG19 | KG L2 | Sequel (LR) |
KG20, KG22, KG24, KG26, KG29, KG32, KG33, KG44 | KG L3 | Sequel (LR) |
SN21, SN23, (SN24*), SN26 | SN L1 | Sequel |
SN03, SN04, SN08, SN09, SN10, SN11, SN12, SN24 | SN L2 | Sequel II (LR) |
SN13, SN14, SN15, SN16, SN17, SN18, SN19, SN20 | SN L3 | Sequel II (LR) |
CHT19, CHT23, CHT26, CHT28 | CHT L1 | Sequel |
CHT01 - CHT07, (CHT13*) | CHT L2 | Sequel II (LR) |
CHT08, CHT10 - CHT12, CHT14 - CHT16, (SN25*) | CHT L3 | Sequel II (LR) |
PCR program used for barcoding.
For library amplification, the same program was used with 26 or 31 cycles.
Step | Temperature (°C) | Duration | Cycles |
---|---|---|---|
Initialization | 98 | 4 min | |
Denaturation | 98 | 30 s | ×12 |
Annealing | 65 | 30 s | ×12 |
Elongation | 72 | 12 min | ×12 |
Final elongation | 72 | 20 min | |
Storage | 4 | ∞ | |
qPCR program for the evaluation of enrichment efficiency.
Step | Temperature (°C) | Duration | Cycles |
---|---|---|---|
Initialization | 95 | 12 min | |
Denaturation | 95 | 15 s | ×40 |
Annealing | 65 | 20 s | ×40 |
Elongation | 72 | 20 s | ×40 |
Sequences of qPCR primers used for evaluation of target enrichment.
Gene | Direction | Sequence |
---|---|---|
il1 | + | 5’-tgg-tga-acg-tca-tca-tcg-cc-3’ |
il1 | - | 5’-tcc-agc-acc-tct-ttt-tct-cca-a-3’ |
foxo6 intron | + | 5’-agt-tct-gtg-tgg-gaa-cag-gg-3’ |
foxo6 intron | - | 5’-gtg-cat-ctt-tag-cgt-tgg-ct-3’ |
NLR group 1 | + | 5’-cct-gac-aca-ggt-caa-caa-aac-a-3’ |
NLR group 1 | - | 5’-gat-tgt-ctt-ttc-ctt-cag-ccc-ag-3’ |
NLR group 2 | + | 5’-tgg-att-ggg-ctg-aag-gga-aa-3’ |
NLR group 2 | - | 5’-agg-ttc-agt-cct-tta-gtc-tct-gg-3’ |
NLR group 3 | + | 5’-ctg-ctg-gag-gtg-aaa-gat-cag-ac-3’ |
NLR group 3 | - | 5’-gat-tgt-tga-gca-gtg-agc-agg-a-3’ |
NLR group 4 | + | 5’-tac-ctg-gac-aag-aca-aag-cca-3’ |
NLR group 4 | - | 5’-ctc-ctt-ctc-ttc-agc-cca-gtc-3’ |