Biomarkers in a socially exchanged fluid reflect colony maturity, behavior, and distributed metabolism

  1. Sanja M Hakala
  2. Marie-Pierre Meurville
  3. Michael Stumpe
  4. Adria C LeBoeuf (corresponding author)
  1. Department of Biology, University of Fribourg, Switzerland
  2. Metabolomics and Proteomics Platform, Department of Biology, University of Fribourg, Switzerland
6 figures, 1 table and 6 additional files

Figures

Figure 1
Schematic of study design.

(A) Four comparisons, Young vs. Mature, Nurse vs. Forager, Field vs. Lab, and East vs. West, analyzed in this study with sample numbers indicated in parentheses. In all comparisons sample numbers indicate colonies with the exception of Nurse vs. Forager, where samples are from single individuals, ten each from four colonies. Palm trees indicate field samples and boxes indicate laboratory samples. (B) Schematic of analysis approach to find robustly differing proteins in each comparison. Sample information can be found in Supplementary file 1.

Figure 2 with 1 supplement
Protein presence in trophallactic fluid varies with biotic and abiotic factors.

(A) Mean ± SD of the proportion of proteins present in samples of a given type. The proportion of proteins present in all samples of a given type is highlighted in black. (B) Coefficient of variation (standard deviation/mean), calculated by sample type from the iBAQ values greater than zero of all identified proteins. Sample sizes per type are given under their names. Mature L and Mature N are mature colonies that were each sampled six times to assess within-colony variation in colony samples. Significance of comparisons is based on a negative binomial GLM (A) or a gamma GLM (B): NS indicates p > 0.05 (not significant), ** p < 0.01, *** p < 0.001 (full results in Figure 2—source data 1; Figure 2—source data 2).
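
The two quantities plotted in this figure can be illustrated with a minimal Python sketch. It is not the authors' code. It assumes that a protein counts as present when its iBAQ value is greater than zero, and that the standard deviation is the sample (n - 1) standard deviation; the function names are hypothetical.

```python
import statistics

def presence_proportion(sample):
    """Fraction of proteins with a non-zero iBAQ value in one sample.

    `sample` maps protein identifiers to iBAQ values (assumed layout).
    """
    return sum(1 for value in sample.values() if value > 0) / len(sample)

def coefficient_of_variation(values):
    """Standard deviation / mean over the values greater than zero.

    Assumption: sample standard deviation (n - 1 denominator).
    Returns None when fewer than two positive values are available.
    """
    positive = [v for v in values if v > 0]
    if len(positive) < 2:
        return None
    return statistics.stdev(positive) / statistics.mean(positive)

# One sample with four proteins, two of them detected.
example_sample = {"protA": 0.0, "protB": 1.5, "protC": 3.0, "protD": 0.0}
print(presence_proportion(example_sample))

# One protein measured in four samples of the same type; the zero is ignored.
print(coefficient_of_variation([0, 2, 4, 6]))
```

Dropping zeros before computing the coefficient of variation keeps undetected proteins from inflating the spread, which matches the "greater than zero" wording of the caption.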

Figure 2—source data 1

Coefficient of variation by sample type.

Post-hoc comparisons of gamma GLM on coefficient of variation by sample type.

https://cdn.elifesciences.org/articles/74005/elife-74005-fig2-data1-v3.pdf
Figure 2—source data 2

Protein number by sample type.

Post-hoc comparisons of negative binomial GLM on protein number explained by sample type.

https://cdn.elifesciences.org/articles/74005/elife-74005-fig2-data2-v3.pdf
Figure 2—figure supplement 1
Protein abundance and commonness.

Protein abundances of the 519 proteins, calculated without missing values (where no matching spectra were detected), in (A) the colony dataset and (B) the single individual dataset. The proteins highlighted in red are the most abundant ones when calculated including missing values in both datasets combined, as shown in Figure 4. The red dashed line shows the cut-off used for the classical frequentist statistical analyses; all proteins were included in the empirical Bayes and machine learning analyses.

Figure 3
Similarity across trophallactic fluid proteome samples of colonies and single individuals.

Principal component analysis of all proteins for (A) colony samples and (B) single individual samples from the four colonies. The symbols for the four colonies shown in (B) appear in maroon in (A). (C) Ranked self-similarity S for each sample type comparison. Self-similarity is the absolute value of the difference between dissimilarity within and across samples, divided by the average dissimilarity of all samples (dissimilarity measured as standardized Euclidean distance of protein abundance). Samples with higher S are more similar to samples of the same type, while samples with an S of zero are equidistant to the centroids of the two sample groups.
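
The self-similarity definition above can be read in more than one way; the authors' Matlab code (Figure 3—source code 1) is the authoritative version. The Python sketch below shows one possible reading, under these assumptions: dissimilarity is the standardized Euclidean distance (each protein's squared difference divided by that protein's variance across all samples), "within" is a sample's mean distance to the other samples of its own type, and "across" is its mean distance to samples of the other type. All names are hypothetical.

```python
import math
import statistics

def standardized_euclidean(a, b, variances):
    """Euclidean distance with each feature scaled by its variance.

    Features with zero variance carry no information and are skipped.
    """
    return math.sqrt(sum((x - y) ** 2 / v
                         for x, y, v in zip(a, b, variances) if v > 0))

def self_similarity(samples, labels):
    """Return one score per sample: |within - across| / overall mean distance.

    `samples` is a list of equal-length abundance vectors and `labels` gives
    the type of each sample. Each type needs at least two samples.
    """
    n = len(samples)
    n_features = len(samples[0])
    variances = [statistics.variance([s[j] for s in samples])
                 for j in range(n_features)]
    dist = [[standardized_euclidean(samples[i], samples[k], variances)
             for k in range(n)] for i in range(n)]
    overall = statistics.mean(dist[i][k]
                              for i in range(n) for k in range(i + 1, n))
    scores = []
    for i in range(n):
        within = [dist[i][k] for k in range(n)
                  if k != i and labels[k] == labels[i]]
        across = [dist[i][k] for k in range(n) if labels[k] != labels[i]]
        scores.append(abs(statistics.mean(within) - statistics.mean(across))
                      / overall)
    return scores

# Toy data: one protein, two well-separated sample types.
toy_samples = [[0.0], [1.0], [10.0], [11.0]]
toy_labels = ["young", "young", "mature", "mature"]
print(self_similarity(toy_samples, toy_labels))
```

A centroid-based variant, which the caption's last sentence hints at, would replace the mean pairwise distances with distances to each group's centroid.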

Figure 3—source code 1

Matlab source code to produce self-similarity scores, plots and PCA plots that make up Figure 3, https://github.com/dradri/variation2021.

https://cdn.elifesciences.org/articles/74005/elife-74005-fig3-code1-v3.zip
Figure 3—source data 1

Matlab MAT data file based on iBAQ values, gene names, and sample classes to produce self-similarity scores, plots and PCA plots that make up Figure 3.

https://cdn.elifesciences.org/articles/74005/elife-74005-fig3-data1-v3.mat
Figure 4 with 1 supplement
The sixty most abundant proteins in trophallactic fluid over 73 colony and 40 single individual samples.

Proteins are ranked by abundance (including missing values). From left to right: Drosophila melanogaster orthologs; proportion of colony samples and single individual samples in which the protein was identified; average iBAQ abundance across all samples; log2 of the fold change in abundance between types for a given comparison; yellow dots marking the comparisons for which the protein was significant in two out of three methods; and annotation terms. Annotation terms are in bold for the 25 of the 27 core trophallactic fluid proteins that are among the 60 most abundant proteins. The two remaining, less abundant core proteins are a cathepsin (26–29 p) and a myosin heavy chain (Mhc). For protein accession numbers, see Figure 4—figure supplement 1.
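
The difference between averaging with and without missing values, and the log2 fold change column, can be illustrated with a short Python sketch. It is an illustration under assumptions rather than the authors' pipeline: missing values are represented as zeros, and the fold change is taken as the ratio of group means, with the first group as numerator. The function names are hypothetical.

```python
import math

def mean_abundance(values, include_missing=True):
    """Mean iBAQ value, counting zeros (missing values) or skipping them."""
    used = values if include_missing else [v for v in values if v > 0]
    return sum(used) / len(used) if used else 0.0

def log2_fold_change(group_a, group_b):
    """log2 of mean(group_a) / mean(group_b); None if either mean is zero."""
    mean_a = mean_abundance(group_a)
    mean_b = mean_abundance(group_b)
    if mean_a == 0 or mean_b == 0:
        return None
    return math.log2(mean_a / mean_b)

# One protein, undetected in one of three samples.
ibaq = [0, 4, 8]
print(mean_abundance(ibaq))                          # zeros counted
print(mean_abundance(ibaq, include_missing=False))   # zeros skipped

# Positive values mean higher abundance in the first group.
print(log2_fold_change([8, 8], [2, 2]))
```

Counting zeros rewards proteins that are both abundant and consistently detected, which is why a ranking that includes missing values differs from one that excludes them.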

Figure 4—figure supplement 1
Most abundant proteins with accession numbers.

The sixty most abundant proteins in trophallactic fluid over 73 colony and 40 single individual samples. Proteins are ranked by abundance, including zero values. From left to right: accession numbers; proportion of colony samples and single individual samples in which the protein was identified; average iBAQ abundance across all samples; log2 of the fold change in abundance between types for a given comparison; yellow dots marking the comparisons for which the protein was significant in two out of three methods; and annotation terms. Annotation terms are in bold for the 25 of the 27 core trophallactic fluid proteins that are among the 60 most abundant proteins. The two remaining, less abundant core proteins are a cathepsin (26–29 p) and a myosin heavy chain (Mhc).

Figure 5 with 2 supplements
All proteins that significantly differ in two out of three of the analysis methods (frequentist, empirical Bayes, and random forest classification with SHAP values).

From left to right: Venn diagrams of significance overlap between methods; Drosophila melanogaster orthologs; proportion of colony samples and single individual samples in which the protein was identified; average iBAQ abundance across all samples, calculated without missing values; log2 of the fold change in abundance between types for a given comparison; yellow dots marking the comparisons for which the protein was significant in two out of three methods; and annotation terms. Annotation terms are in bold for the core trophallactic fluid proteins present in all samples. For visualization of each analysis method, see Figure 5—figure supplement 1. For protein accession numbers, see Figure 5—figure supplement 2. For all 135 proteins that differ significantly in any analysis, see Supplementary file 2. For full model results, see Supplementary files 3-5.
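
The "two out of three methods" rule used in this figure amounts to a simple vote over the sets of significant proteins from each method. A minimal Python sketch follows; the protein identifiers and the function name are hypothetical, and how each method defines significance is outside its scope.

```python
from collections import Counter

def consensus_hits(hits_by_method, min_methods=2):
    """Proteins reported as significant by at least `min_methods` methods."""
    counts = Counter(protein
                     for hits in hits_by_method.values()
                     for protein in set(hits))
    return sorted(p for p, n in counts.items() if n >= min_methods)

# Hypothetical significant proteins for one comparison.
hits = {
    "frequentist": {"P1", "P2"},
    "empirical_bayes": {"P2", "P3"},
    "random_forest_shap": {"P2", "P3", "P4"},
}
print(consensus_hits(hits))
```

Requiring agreement between methods with different assumptions trades some sensitivity for robustness: a protein flagged by only one method, such as P1 or P4 here, is left out.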

Figure 5—figure supplement 1
Visualization of all results.

Venn diagrams summarizing statistical methods, frequentist volcano plots, empirical Bayes volcano plots, and example SHAP value plots of feature importance for the top 20 proteins. Each SHAP plot is for one of the ten models trained. For significant proteins, see Supplementary file 2; for full model results, see Supplementary files 3-5.

Figure 5—figure supplement 2
Significantly differing proteins in two out of three analyses with accession numbers.

All proteins that significantly differ in two out of three of the analysis methods (frequentist, empirical Bayes, and random forest classification with SHAP values). From left to right: accession numbers; proportion of colony samples and single individual samples in which the protein was identified; average iBAQ abundance across all samples, calculated without zero values; log2 of the fold change in abundance between types for a given comparison; yellow dots marking the comparisons for which the protein was significant in two out of three methods; and annotation terms.

Figure 6
Gene set enrichment analysis of trophallactic fluid.

Significant terms for Drosophila melanogaster orthologs of (A) the 60 most abundant trophallactic fluid proteins, trophallactic fluid proteins significantly differing between (B) Young vs. Mature, (C) Nurse vs. Forager, and (D) Field vs. Lab, with -log10(FDR) indicated on y-axes. Deep purple indicates GO biological process; blue, GO molecular function; turquoise, GO cellular compartment; lime green, Reactome pathway; orange, KEGG pathway. Circle size indicates strength, log10(observed proteins / expected proteins in a random network of this size). Full results can be found in Figure 6—source data 1.
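
The two plotted quantities, -log10(FDR) and strength, follow directly from the formulas in the caption. The Python sketch below illustrates them. The way the expected count is derived here (network size multiplied by the background frequency of the term) is an assumption made for illustration, not a statement of how the enrichment tool computes it; the function names are hypothetical.

```python
import math

def expected_count(network_size, background_with_term, background_total):
    """Assumed expectation: network size times the term's background frequency."""
    return network_size * background_with_term / background_total

def enrichment_strength(observed, expected):
    """log10(observed proteins / expected proteins), as in the caption."""
    return math.log10(observed / expected)

def neg_log10_fdr(fdr):
    """Value plotted on the y-axes: larger means more significant."""
    return -math.log10(fdr)

# Hypothetical term: 20 of 50 network proteins carry it,
# against 400 of 10,000 proteins in the background.
expected = expected_count(50, 400, 10000)
print(expected)
print(enrichment_strength(20, expected))
print(neg_log10_fdr(0.001))
```

A strength of 1 therefore corresponds to ten times more annotated proteins than expected by chance.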

Figure 6—source data 1

Gene set enrichment analyses of trophallactic fluid proteins.

Sheet 1 provides the network characteristics for all gene set enrichment analyses. The detailed results for each network are presented in sheet 2 for the 60 most abundant proteins, and in sheet 3 for the trophallactic fluid proteins significantly differing in two out of three of our statistical methods, first combined and then separately for the three main comparisons. In each case, analysis was performed on D. melanogaster orthologs of C. floridanus proteins. Observed gene count indicates how many proteins in the network are annotated with the term. Background gene count indicates how many proteins in total have this term, in this network and in the background. Strength describes how large the enrichment effect is: log10(observed proteins / expected proteins in a random network of this size). False Discovery Rate describes how significant the enrichment is. p-Values are corrected for multiple testing within each category using the Benjamini–Hochberg procedure. The significant annotations are indicated for GO: Biological process (GO:BP), GO: Molecular function (GO:MF), GO: Cellular component (GO:CC), Reactome pathways, and KEGG pathways.

https://cdn.elifesciences.org/articles/74005/elife-74005-fig6-data1-v3.xlsx
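
The Benjamini–Hochberg correction mentioned in the description of this source data can be sketched in a few lines of Python. This is a generic textbook implementation for illustration, not the code of the enrichment tool; per the description, it would be applied separately within each annotation category.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical raw p-values for four terms in one category.
raw = [0.01, 0.04, 0.03, 0.005]
print(benjamini_hochberg(raw))
```

Each p-value is scaled by the number of tests divided by its rank, and the running minimum keeps the adjusted values from decreasing as the raw p-values increase.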

Tables

Key resources table
Reagent type (species) or resource | Designation | Source or reference | Identifiers | Additional information
Other | UniProt Reference proteome (Camponotus floridanus); accessed February 2020 | UniProt | UP000000311 |
Other | NCBI RefSeq Reference proteome (Camponotus floridanus), v7.5 | NCBI RefSeq | GCF_003227725.1 |
Biological sample (Camponotus floridanus) | Trophallactic fluid (see details in Supplementary file 1) | This paper | Supplementary file 1 | Supplementary file 1
Software, algorithm | MaxQuant v1.6.2.10 | MaxQuant | RRID:SCR_014485 |
Software, algorithm | Perseus v1.6.15.0 | Perseus | RRID:SCR_015753 |
Software, algorithm | R 3.6.1 | R | RRID:SCR_001905 |
Software, algorithm | Matlab 2020b | Mathworks | RRID:SCR_001622 |
Software, algorithm | R-package MASS 7.3–53 | R Project | RRID:SCR_019125 |
Software, algorithm | R-package LME4 | R Project | RRID:SCR_015654 |
Software, algorithm | R-package multcomp 1.4–15 | R Project | RRID:SCR_018255 |
Software, algorithm | LIMMA-pipeline-proteomics pipeline 3.0.0 | GitHub | 10.5281/zenodo.4050581 |
Software, algorithm | sklearn v0.22.1 | Scikit-learn | RRID:SCR_019053 |
Software, algorithm | Python 3.7.6 | Python | RRID:SCR_008394 |
Software, algorithm | SHapley Additive ExPlanations package v0.37.0 | GitHub | RRID:SCR_021362 |
Software, algorithm | OMA Browser | OMA Browser (Martínez et al., 2020 release) | RRID:SCR_011978 |
Software, algorithm | Flybase | Flybase | RRID:SCR_006549 |
Software, algorithm | STRING v11 | STRING | RRID:SCR_005223 |

Additional files

Supplementary file 1

The sampling scheme.

Trophallactic fluid (TF) sampled for proteomics analysis. Date (field) indicates when the colony extract was collected from the field site; Date (TF sampling) indicates the date of the trophallactic fluid collection. Volume and ants indicate, for each sample, the volume collected and the number of ants it was collected from. The lab2019 colonies were used for single individual trophallactic fluid samples. For the two mature colonies that were sampled six times, * marks the sample that was used in the main datasets.

https://cdn.elifesciences.org/articles/74005/elife-74005-supp1-v3.xlsx
Supplementary file 2

All 135 significantly differing proteins.

This supplementary file combines into a single sheet the results and additional information for all of the significantly differing proteins in our four comparisons (Young vs. Mature, Nurse vs. Forager, Field vs. Lab, East vs. West), by all of the three statistical methods (classical, empirical Bayes, machine learning). Protein accession numbers, presence in colony and individual datasets, abundance when present, fold changes by comparison and significance both by comparison and by model are shared.

https://cdn.elifesciences.org/articles/74005/elife-74005-supp2-v3.xlsx
Supplementary file 3

Full frequentist statistical results.

Statistical results for the classical frequentist models; the imputed data are also shared.

https://cdn.elifesciences.org/articles/74005/elife-74005-supp3-v3.xlsx
Supplementary file 4

Full empirical Bayes statistical results.

For the empirical Bayes LIMMA models, results are shared as raw output tables.

https://cdn.elifesciences.org/articles/74005/elife-74005-supp4-v3.xlsx
Supplementary file 5

Full random forest statistical results.

Accuracy, seed, and mean feature importances for each gene are reported for each model trained for the random forest analyses.

https://cdn.elifesciences.org/articles/74005/elife-74005-supp5-v3.xlsx
Transparent reporting form
https://cdn.elifesciences.org/articles/74005/elife-74005-transrepform1-v3.docx

Cite this article

Sanja M Hakala, Marie-Pierre Meurville, Michael Stumpe, Adria C LeBoeuf (2021)
Biomarkers in a socially exchanged fluid reflect colony maturity, behavior, and distributed metabolism
eLife 10:e74005.
https://doi.org/10.7554/eLife.74005