Effective population size does not explain long-term variation in genome size and transposable element content in animals

  1. Alba Marino (corresponding author)
  2. Gautier Debaecker
  3. Anna-Sophie Fiston-Lavier
  4. Annabelle Haudry
  5. Benoit Nabholz (corresponding author)
  1. ISEM, Université de Montpellier, CNRS, IRD, France
  2. Université Claude Bernard Lyon 1, LEHNA UMR 5023, CNRS, France
  3. Institut Universitaire de France, France
  4. Université Claude Bernard Lyon 1, LBBE UMR 5558, France
7 figures, 4 tables and 3 additional files

Figures

Phylogeny of the 807 species including ray-finned fishes (Actinopteri), birds (Aves), insects (Insecta), mammals (Mammalia), and molluscs (Mollusca).

Bars correspond to TE content (bp, blue), genome size (bp, green), and dN/dS estimations (values between 0 and 1, yellow). Branch lengths are amino-acid substitutions calculated on BUSCO genes. The tree was plotted with iTOL (Letunic and Bork, 2021).

Correlation between assembly sizes and C-values for 365 species with contig N50 ≥50 kb.

The grey line corresponds to the weighted least squares (WLS) regression used to predict the expected C-values (reported in the equation). The dark-red dashed line marks the hypothetical 1:1 relationship. FCM = Flow Cytometry, FD = Feulgen Densitometry, FIA = Feulgen Image Analysis.
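
As a rough illustration of how a weighted least-squares line can be fitted and then used to predict an expected C-value from an assembly size, consider the sketch below. The function names, the weights, and the toy values are assumptions made for this example; the legend does not state how the weights were chosen, and the fitted equation itself is the one reported in the figure.

```python
def wls_fit(x, y, w):
    """Weighted least squares for y = intercept + slope * x.

    x, y, w are equal-length sequences; w holds positive weights.
    Returns (intercept, slope).
    """
    sw = sum(w)
    mean_x = sum(wi * xi for wi, xi in zip(w, x)) / sw  # weighted mean of x
    mean_y = sum(wi * yi for wi, yi in zip(w, y)) / sw  # weighted mean of y
    sxy = sum(wi * (xi - mean_x) * (yi - mean_y) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - mean_x) ** 2 for wi, xi in zip(w, x))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def predict_c_value(assembly_size, intercept, slope):
    """Expected C-value for a species that has no measured C-value."""
    return intercept + slope * assembly_size


# Toy data (hypothetical, arbitrary units): assembly sizes and measured C-values.
assembly_sizes = [0.5, 1.0, 2.0, 3.0]
c_values = [0.1 + 1.1 * s for s in assembly_sizes]
weights = [1.0, 2.0, 2.0, 1.0]  # hypothetical weights

intercept, slope = wls_fit(assembly_sizes, c_values, weights)
expected = predict_c_value(1.5, intercept, slope)
```

A slope above 1 in such a fit would mean that assemblies tend to underestimate the measured genome size, which is the kind of deviation from the 1:1 line the figure is designed to show.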

Figure 2—source data 1

C-value records and assembly sizes used to fit the WLS regression.

Measurements made with flow cytometry (FCM), Feulgen densitometry (FD), and Feulgen image analysis (FIA) were retrieved from https://www.genomesize.com/ for species with contig N50 ≥ 50 kb.

https://cdn.elifesciences.org/articles/100574/elife-100574-fig2-data1-v1.xlsx
Figure 3 with 1 supplement
Genomic proportion occupied by repeats in 29 dipteran species as estimated by EarlGrey and by the dnaPipeTE wrapper pipelines.

The genome percentage is calculated proportionally to the assembly size in the case of EarlGrey, while it is calculated in relation to the genome size estimated in this study in the case of dnaPipeTE. DNA = DNA elements; RC = Rolling Circle; LTR = Long Terminal Repeats; LINE = Long Interspersed Nuclear Elements; SINE = Short Interspersed Nuclear Elements. ‘Other’ includes simple repeats, microsatellites, RNAs. ‘Unknown’ includes all repeats that could not be classified.
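
The effect of the two denominators can be sketched with a small helper. All numbers below are hypothetical, and for simplicity the same repeat length is used with both denominators; in practice the two pipelines also estimate the numerator differently (annotation of the assembly versus read sampling).

```python
def repeat_percentage(repeat_bp, denominator_bp):
    """Percentage of the denominator (assembly or genome size) covered by repeats."""
    if denominator_bp <= 0:
        raise ValueError("denominator must be positive")
    return 100.0 * repeat_bp / denominator_bp


# Hypothetical species: 30 Mb of repeats, 150 Mb assembly, 180 Mb estimated genome size.
repeat_bp = 30_000_000
pct_of_assembly = repeat_percentage(repeat_bp, 150_000_000)  # EarlGrey-style denominator
pct_of_genome = repeat_percentage(repeat_bp, 180_000_000)    # dnaPipeTE-style denominator
```

When the assembly is smaller than the estimated genome size, the same repeat length yields a higher percentage relative to the assembly, so percentages from the two pipelines are not directly interchangeable.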

Figure 3—source data 1

Quantity of repeated elements for 29 dipteran genomes as estimated by EarlGrey and dnaPipeTE.

Repeat content is reported overall and by order, both in base pairs and as a genomic percentage. Because the methods differ, percentages are based on assembly size for EarlGrey and on genome size for dnaPipeTE.

https://cdn.elifesciences.org/articles/100574/elife-100574-fig3-data1-v1.xlsx
Figure 3—figure supplement 1
Genomic length occupied by repeats in 29 dipteran species as estimated by EarlGrey and by the dnaPipeTE wrapper pipelines.

Lengths in kbp are reported as the total base pairs annotated in the assembly in the case of EarlGrey, and as the estimated coverage from read sampling in the case of dnaPipeTE. DNA = DNA elements; RC = Rolling Circle; LTR = Long Terminal Repeats; LINE = Long Interspersed Nuclear Elements; SINE = Short Interspersed Nuclear Elements. ‘Other’ includes simple repeats, microsatellites, RNAs. ‘Unknown’ includes all repeats that could not be classified.

Figure 4 with 3 supplements
Relationship between overall TE content, genome size, and dN/dS.

(A) Relationship between overall TE content and genome size (N=672, log-transformed): slope = 0.718, adjusted-R2=0.751, p-value <0.001. (B) Relationship between genome size and dN/dS (N=785): slope = 6.100, adjusted-R2=0.275, p-value <0.001. (C) Relationship between TE content and dN/dS (N=672): slope = 4.253, adjusted-R2=0.092, p-value <0.001. Statistics refer to linear regression; see the figure supplements and Tables 1 and 3 for Phylogenetic Independent Contrasts (PIC) results.

Figure 4—figure supplement 1
PIC regression of overall TE content as a predictor of genome size across the full dataset.

Slope = 0.219, adjusted-R2=0.417, p-value <0.001. Variables were log-transformed before regression.

Figure 4—figure supplement 2
PIC regressions of Coevol dN/dS estimated from the GC3-poor geneset as predictor of genome size, overall and recent TE content.

(A) Genome size: slope = –0.287, adjusted-R2=0.004, p-value = 0.039. (B) Overall TE content: slope = –0.903, adjusted-R2=0.004, p-value = 0.050. (C) Recent TE content: slope = –1.225, adjusted-R2=0.003, p-value = 0.089. Genomic traits were log-transformed before regression.

Figure 4—figure supplement 3
PIC regressions for Coevol dN/dS estimated from the GC3-poor geneset as predictor of body mass and longevity.

(A) Body mass: slope = 11.422, adjusted-R2=0.087, p-value <0.001. (B) Longevity: slope = 2.970, adjusted-R2=0.050, p-value <0.001. Life-history traits (LHTs) were log-transformed before regression.

Comparison of Bio++ dN/dS estimated from the full and pruned phylogenies.

To obtain the pruned phylogeny, terminal branches longer than 1 or shorter than 0.01 amino-acid substitutions were removed, leaving 485 tips. Pearson’s r=0.962, p-value <0.001. The corresponding dN/dS values are included in Table 3—source data 1.
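
The length filter described above can be sketched as a simple selection over terminal branch lengths. The species names and lengths are hypothetical, treating the thresholds as inclusive is an assumption, and a real pruning step would also drop the tips from the tree with a phylogenetics library rather than only list the survivors.

```python
def tips_to_keep(terminal_lengths, min_len=0.01, max_len=1.0):
    """Names of tips whose terminal branch length lies within [min_len, max_len]."""
    return sorted(
        tip for tip, length in terminal_lengths.items()
        if min_len <= length <= max_len
    )


# Hypothetical terminal branch lengths in amino-acid substitutions.
terminal_lengths = {"sp_a": 0.005, "sp_b": 0.2, "sp_c": 1.4, "sp_d": 1.0}
kept = tips_to_keep(terminal_lengths)
```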

Comparison of Bio++ and Coevol dN/dS estimations.

(A) GC3-poor geneset (N=785): Pearson’s r=0.902, p-value <0.001. (B) GC3-rich geneset (N=785): Pearson’s r=0.887, p-value <0.001. The corresponding dN/dS values are included in Table 3—source data 1.

Author response image 1

Tables

Table 1
Correlation between genome size and overall TE content based on phylogenetic independent contrasts.

Statistics are shown relative to the overall dataset and to each clade. Variables were log-transformed before regression. Original values used to infer PIC statistics are included in Table 3—source data 1. * 0.01 < p ≤ 0.05; ** 0.001 ≤ p ≤ 0.01; *** p < 0.001. Significant correlations are highlighted in bold.

| Dataset | Regression coefficient | Adjusted-R2 | p-value |
|---|---|---|---|
| Overall Dataset | 0.219 *** | 0.417 *** | <0.001 |
| Actinopteri | 0.300 *** | 0.610 *** | <0.001 |
| Aves | 0.042 *** | 0.039 ** | 0.001 |
| Insecta | 0.356 *** | 0.626 *** | <0.001 |
| Mammalia | 0.200 *** | 0.526 *** | <0.001 |
| Mollusca | 0.605 *** | 0.895 *** | <0.001 |

Table 2
Coevol correlations between genomic traits – genome size, TE content, and recent TE content – and LHTs.

Different LHTs are shown according to availability for a clade. Posterior probabilities lower than 0.1 indicate significant negative correlations; posterior probabilities higher than 0.9 indicate significant positive correlations. Expected, significant correlations are marked in bold black; significant correlations opposite to the expected trend are marked in bold red.

Coevol correlationsActinopteriAvesInsectaMammaliaMollusca
GC3-poorGC3-richGC3-poorGC3-richGC3-poorGC3-richGC3-poorGC3-richGC3-poorGC3-rich
Genome sizeCCPPCCPPCCPPCCPPCCPPCCPPCCPPCCPPCCPPCCPP
Body length0.0150.560.0150.56–0.0650.26–0.0810.21
Basal metabolic rate (ml/O2/hour)*0.0280.600.0640.72
Age at first birth–0.2230.02–0.2560.01
Population density0.2030.970.2000.96
Maximum longevity0.1350.850.1420.870.1160.910.1610.94–0.1060.14–0.1090.15
Mass0.0300.610.0400.650.20110.1850.99–0.0770.23–0.0970.16
Metabolic rate (W)–0.2740.003–0.2100.020.0470.680.0780.76
Sexual maturity–0.0110.450.0130.520.0390.640.0480.67–0.1760.05–0.1800.05
Depth range–0.0130.470.0000.51
Overall TE content0.79310.77910.29310.3351.000.84210.83410.6791.000.6781.000.90910.8951
Recent TE content0.66410.63210.23710.2831.000.80910.80410.4841.000.5111.000.89910.8971
TE contentCCPPCCPPCCPPCCPPCCPPCCPPCCPPCCPPCCPPCCPP
Body length–0.1030.17–0.0730.26–0.2380.013–0.2740.01
Basal metabolic rate (ml/O2/hour)*0.1640.890.2530.97
Age at first birth–0.4270–0.4730.00
Population density0.2740.980.2710.98
Maximum longevity0.1370.850.1550.870.0770.790.030.62–0.2310.013–0.2770.02
Mass–0.0910.22–0.0510.32–0.0220.38–0.0120.43–0.2330.017–0.2780.01
Metabolic rate (W)0.1640.940.1680.920.1970.940.2690.98
Sexual maturity–0.0310.430.0110.53–0.1000.2–0.130.11–0.3290–0.3430.00
Depth range–0.0260.430.0040.52
Recent TE contentCCPPCCPPCCPPCCPPCCPPCCPPCCPPCCPPCCPPCCPP
Body length–0.1330.12–0.0850.22–0.2970.00–0.3050.01
Basal metabolic rate (ml/O2/hour)*0.2290.940.3080.99
Age at first birth–0.4860.00–0.5060.00
Population density0.2380.940.2270.95
Maximum longevity0.1380.850.1760.90.0630.730.0200.57–0.3340.00–0.3470.01
Mass–0.1530.10–0.0960.21–0.0260.36–0.0070.47–0.2860.00–0.3080.01
Metabolic rate (W)0.1420.90.1440.880.2650.970.3130.99
Sexual maturity–0.0110.470.0510.62–0.1260.15–0.1510.1–0.3290.00–0.3260.01
Depth range–0.0360.390.0400.61
  1. * PanTHERIA.

  2. AnAge.
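
The posterior-probability thresholds stated in the Table 2 legend can be written as a small classifier. The function name and the example values passed to it are illustrative assumptions; only the 0.1 and 0.9 cut-offs come from the legend.

```python
def classify_pp(pp, low=0.1, high=0.9):
    """Label a Coevol posterior probability using the legend's cut-offs."""
    if not 0.0 <= pp <= 1.0:
        raise ValueError("posterior probability must lie in [0, 1]")
    if pp < low:
        return "significant negative"
    if pp > high:
        return "significant positive"
    return "not significant"


labels = [classify_pp(pp) for pp in (0.02, 0.56, 0.97)]
```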

Table 2—source data 1

Genome sizes, all LHTs records, overall, and recent TE contents for the selected 807 species.

These correspond to the original values provided to the input matrix of characters to run Coevol.

https://cdn.elifesciences.org/articles/100574/elife-100574-table2-data1-v1.xlsx
Table 3
PIC results for the correlations of LHTs, genome size, and TE content against dN/dS.

Results for Bio++ dN/dS are shown for the full dataset and for the phylogeny after removal of the longest (>1 amino-acid substitutions) and shortest (<0.01 amino-acid substitutions) terminal branches. Results for Coevol dN/dS are relative to the GC3-poor geneset. Only body mass and longevity are reported as LHTs (for an overview of all traits, see Table 2). For genomic traits, statistics are reported relative to the overall dataset and to each clade. Expected significant correlations of dN/dS with LHTs and genomic traits are marked in bold black; significant correlations opposite to the expected trend are marked in bold red. * 0.01 < p ≤ 0.05; ** 0.001 ≤ p ≤ 0.01; *** p < 0.001.

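| PIC correlations | Dataset | Bio++ (full phylogeny): regression coefficient | Bio++ (full phylogeny): adjusted-R² | Bio++ (full phylogeny): p-value | Bio++ (trimmed phylogeny): regression coefficient | Bio++ (trimmed phylogeny): adjusted-R² | Bio++ (trimmed phylogeny): p-value | Coevol: regression coefficient | Coevol: adjusted-R² | Coevol: p-value |
|---|---|---|---|---|---|---|---|---|---|---|
| Body mass (log gr)~dN/dS | Overall Dataset | 6.865 *** | 0.036 *** | <0.001 | 23.905 *** | 0.044 *** | <0.001 | 11.422 *** | 0.087 *** | <0.001 |
| Longevity (log years)~dN/dS | Overall Dataset | 2.147 ** | 0.025 ** | 0.004 | 11.349 *** | 0.097 *** | <0.001 | 2.970 *** | 0.050 *** | <0.001 |
| Genome size (log bp)~dN/dS | Overall Dataset | 0.199 | 0.001 | 0.175 | 0.114 | –0.002 | 0.858 | –0.287 * | 0.004 * | 0.039 |
| | Actinopteri | 0.915 | 0.016 | 0.066 | 0.687 | –0.009 | 0.701 | 0.282 | –0.005 | 0.583 |
| | Aves | 0.109 | 0.001 | 0.270 | 0.709 | 0.013 | 0.073 | 0.238 | –0.001 | 0.407 |
| | Insecta | 1.085 | –0.002 | 0.411 | –3.703 | 0.005 | 0.204 | –1.241 * | 0.026 * | 0.018 |
| | Mammalia | 0.071 | –0.005 | 0.701 | 0.220 | –0.011 | 0.777 | –0.165 | 0.003 | 0.227 |
| | Mollusca | 3.504 | –0.032 | 0.698 | 3.504 | –0.032 | 0.698 | –1.488 | –0.032 | 0.699 |
| TE content (log bp)~dN/dS | Overall Dataset | 0.798 | 0.004 | 0.062 | –0.216 | –0.002 | 0.899 | –0.903 * | 0.004 * | 0.050 |
| | Actinopteri | 2.139 | 0.015 | 0.098 | 1.393 | –0.012 | 0.751 | 3.407 * | 0.046 * | 0.013 |
| | Aves | 0.340 | –0.002 | 0.513 | –1.492 | –0.004 | 0.551 | 2.285 | 0.006 | 0.129 |
| | Insecta | 1.744 | –0.004 | 0.528 | –3.096 | –0.007 | 0.602 | 0.063 | –0.007 | 0.960 |
| | Mammalia | 1.001 | 0.0027 | 0.238 | 1.023 | –0.012 | 0.691 | –2.113 ** | 0.063 ** | 0.001 |
| | Mollusca | 22.930 | 0.026 | 0.225 | 22.930 | 0.026 | 0.225 | 0.881 | –0.050 | 0.936 |
| Recent TE content (log bp)~dN/dS | Overall Dataset | 1.963 ** | 0.012 ** | 0.003 | 0.691 | –0.002 | 0.727 | –1.225 | 0.0029 | 0.089 |
| | Actinopteri | 2.271 | 0.013 | 0.113 | 1.241 | –0.012 | 0.793 | 4.365 ** | 0.061 ** | 0.005 |
| | Aves | 0.545 | –0.001 | 0.384 | –0.385 | –0.006 | 0.890 | 4.982 ** | 0.028 ** | 0.006 |
| | Insecta | 1.725 | –0.004 | 0.530 | –4.792 | –0.004 | 0.428 | 0.536 | –0.006 | 0.668 |
| | Mammalia | 4.115 * | 0.024 * | 0.031 | 3.192 | –0.006 | 0.460 | –3.151 * | 0.024 * | 0.032 |
| | Mollusca | 14.730 | –0.024 | 0.481 | 14.730 | –0.024 | 0.481 | 2.986 | –0.047 | 0.8026 |
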
Table 3—source data 1

BUSCO Duplicated scores, genome sizes, body mass and longevity records, dN/dS estimations from Bio++ and Coevol, overall and recent TE contents for the selected 807 species.

These are the original values used to infer PIC correlations.

https://cdn.elifesciences.org/articles/100574/elife-100574-table3-data1-v1.xlsx
Table 4
Correlation coefficients (CC) and posterior probabilities (PP) estimated by Coevol with the GC3-poor and GC3-rich genesets for the coevolution of dN/dS with life history and genomic traits.

Different LHTs are shown according to availability for a clade. Posterior probabilities lower than 0.1 indicate significant negative correlations; posterior probabilities higher than 0.9 indicate significant positive correlations. Expected significant correlations of dN/dS with LHTs and genomic traits are marked in bold black; significant correlations opposite to the expected trend are marked in bold red. The original LHTs values used as input for Coevol are the same as those reported in Table 2—source data 1.

Coevol correlationsActinopteri dN/dSAves dN/dSInsecta dN/dSMammalia dN/dSMollusca dN/dS
GC3-poorGC3-richGC3-poorGC3-richGC3-poorGC3-richGC3-poorGC3-richGC3-poorGC3-rich
CCPPCCPPCCPPCCPPCCPPCCPPCCPPCCPPCCPPCCPP
Body length (cm)0.2460.920.2890.970.37110.2710.99
Basal metabolic rate (ml/O₂/hour)*-0.2580.02-0.3370.01
Age at first birth (days)0.42810.3841
Population density (individuals/km²)-0.3610.001-0.1880.09
Maximum longevity (years)0.4350.970.3300.950.3030.920.1090.750.09840.620.1570.670.2950.990.3761
Mass (g)0.2230.890.3080.970.0960.840.1550.950.02430.520.2210.760.41210.3161
Metabolic rate (Watt)0.0300.57-0.2440.070.2010.78-0.0380.45-0.2830.02-0.3770.003
Sexual maturity (days)0.3200.910.2860.880.0320.580.2840.99-0.2440.190.0720.580.46810.3291
Depth range0.3880.930.5841
Genome size (bp)0.0340.59-0.0270.420.0540.690.0720.77-0.3300.03-0.1100.24-0.0410.35-0.1400.120.1420.650.3060.82
Overall TE content (bp)0.1920.880.1670.840.0650.73-0.1950.03-0.2030.16-0.0170.45-0.2200.08-0.2490.06-0.1190.39-0.0210.47
Recent TE content (bp)0.2050.890.2580.950.1090.83-0.2400.01-0.1360.250.0490.63-0.1190.23-0.2960.06-0.1720.34-0.0040.50
  1. * PanTHERIA.

  2. AnAge.

Additional files

Supplementary file 1

Metadata, assembly metrics, BUSCO scores and genome sizes for the initial 3214 species.

C-values and expected C-values are reported only for species with Quast_ContigN50≥50 kb. Method and Notes_cvalue_method report the method used for genome size measurement (FCM = Flow Cytometry, FD = Feulgen Densitometry, FIA = Feulgen Image Analysis) and how C-value was chosen from https://www.genomesize.com/, respectively. Expected C-values are the C-values predicted from the WLS trained on the dataset in Figure 2—source data 1. C-values are employed as genome size for the species with a record in the Genome Size database, while the expected C-values are used for the species without a record.
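
The rule for choosing a genome size per species can be sketched as a simple fallback. The function and argument names are hypothetical, the values are toy numbers, and `None` stands in for a species without a record in the Genome Size database.

```python
def choose_genome_size(c_value, expected_c_value):
    """Prefer the measured C-value; otherwise fall back to the WLS-predicted one."""
    return c_value if c_value is not None else expected_c_value


with_record = choose_genome_size(1.25, 1.31)     # measured value is used
without_record = choose_genome_size(None, 0.87)  # predicted value is used
```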

https://cdn.elifesciences.org/articles/100574/elife-100574-supp1-v1.xlsx
Supplementary file 2

BUSCO genes used to calculate the clade phylogenies and the branch lengths of the whole tree.

https://cdn.elifesciences.org/articles/100574/elife-100574-supp2-v1.xlsx
MDAR checklist
https://cdn.elifesciences.org/articles/100574/elife-100574-mdarchecklist1-v1.docx

  1. Alba Marino
  2. Gautier Debaecker
  3. Anna-Sophie Fiston-Lavier
  4. Annabelle Haudry
  5. Benoit Nabholz
(2025)
Effective population size does not explain long-term variation in genome size and transposable element content in animals
eLife 13:RP100574.
https://doi.org/10.7554/eLife.100574.3