Biomarkers in a socially exchanged fluid reflect colony maturity, behavior, and distributed metabolism
Figures

Schematic of study design.
(A) Four comparisons, Young vs. Mature, Nurse vs. Forager, Field vs. Lab, and East vs. West, analyzed in this study with sample numbers indicated in parentheses. In all comparisons sample numbers indicate colonies with the exception of Nurse vs. Forager, where samples are from single individuals, ten each from four colonies. Palm trees indicate field samples and boxes indicate laboratory samples. (B) Schematic of analysis approach to find robustly differing proteins in each comparison. Sample information can be found in Supplementary file 1.

Protein presence in trophallactic fluid varies with biotic and abiotic factors.
(A) Mean ± SD of the proportion of proteins present in samples of a given type. The proportion of proteins present in all samples of a given type is highlighted in black. (B) Coefficient of variation (standard deviation/mean), calculated for the iBAQ values greater than zero of all the proteins identified by sample type. Sample sizes per type are given under their names. Mature L and Mature N are mature colonies that were sampled six times to assess within-colony variation in colony samples. Significance of comparisons based on gamma GLM (A) or negative binomial GLM (B): NS, p > 0.05; ** p < 0.01; *** p < 0.001 (full results in Figure 2—source data 1; Figure 2—source data 2).
-
Figure 2—source data 1
Coefficient of variation by sample type.
Post-hoc comparisons of gamma GLM on coefficient of variation by sample type.
- https://cdn.elifesciences.org/articles/74005/elife-74005-fig2-data1-v3.pdf
-
Figure 2—source data 2
Protein number by sample type.
Post-hoc comparisons of negative binomial GLM on protein number explained by sample type.
- https://cdn.elifesciences.org/articles/74005/elife-74005-fig2-data2-v3.pdf
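
The coefficient of variation described in panel (B) can be illustrated with a minimal sketch. It assumes that missing values are encoded as zeros, that the statistic is computed per protein, and that the sample standard deviation is used; none of these details is confirmed by the caption, and the function name and example values are hypothetical.

```python
from statistics import mean, stdev


def coefficient_of_variation(ibaq_values):
    """Return SD / mean over the iBAQ values greater than zero.

    Assumption: zeros encode missing values and are dropped, and the
    sample standard deviation (n - 1 denominator) is used.
    """
    present = [v for v in ibaq_values if v > 0]
    if len(present) < 2:
        return None  # the sample SD is undefined for fewer than two detections
    return stdev(present) / mean(present)


# Hypothetical iBAQ values for one protein across five samples of one type.
example = [2.0, 0.0, 4.0, 6.0, 0.0]
print(coefficient_of_variation(example))
```

Dropping zeros before computing the ratio keeps undetected proteins from inflating the apparent variability of a sample type.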

Protein abundance and commonness.
Protein abundances of the 519 proteins, calculated without missing values (where no matching spectra were detected), in (A) the colony dataset, and (B) in the single individual dataset. The proteins highlighted in red are the most abundant ones when calculated including missing values in both datasets combined, as shown in Figure 3. The red dashed line shows the cut-off used for classical frequentist statistical analyses – for the empirical Bayes and machine learning analyses all proteins were included.

Similarity across trophallactic fluid proteome samples of colonies and single individuals.
Principal component analysis for all proteins for (A) colony samples and (B) single individual samples from the four colonies. The four colonies shown in (B) are marked in maroon in (A). (C) Ranked self-similarity S for each sample type comparison. Self-similarity is the absolute value of the difference between dissimilarity within and across samples divided by the average dissimilarity of all samples (by standardized Euclidean distance of protein abundance). Samples with higher S are more similar to samples of the same type, while samples with an S of zero are equidistant to the centroids of the two sample groups.
-
Figure 3—source code 1
Matlab source code to produce self-similarity scores, plots and PCA plots that make up Figure 3, https://github.com/dradri/variation2021.
- https://cdn.elifesciences.org/articles/74005/elife-74005-fig3-code1-v3.zip
-
Figure 3—source data 1
Matlab MAT data file based on iBAQ values, gene names, and sample classes to produce self-similarity scores, plots and PCA plots that make up Figure 3.
- https://cdn.elifesciences.org/articles/74005/elife-74005-fig3-data1-v3.mat
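
One possible reading of the self-similarity score in panel (C) is sketched below. It is not the authors' Matlab code: it assumes that S is computed per sample from mean pairwise distances, that the standardized Euclidean distance scales each protein by its sample variance across all samples, and that each sample type has at least two samples. The function names are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def standardized_euclidean(a, b, variances):
    """Euclidean distance with each protein scaled by its variance."""
    return sqrt(sum((x - y) ** 2 / v
                    for x, y, v in zip(a, b, variances) if v > 0))


def self_similarity(samples, labels):
    """Return one score per sample: |within - across| / average dissimilarity."""
    n = len(samples)
    # Per-protein variance across all samples (columns of the abundance matrix).
    variances = [variance(column) for column in zip(*samples)]
    dist = [[standardized_euclidean(samples[i], samples[j], variances)
             for j in range(n)] for i in range(n)]
    # Average dissimilarity over all distinct sample pairs.
    overall = mean(dist[i][j] for i in range(n) for j in range(i + 1, n))
    scores = []
    for i in range(n):
        within = mean(dist[i][j] for j in range(n)
                      if j != i and labels[j] == labels[i])
        across = mean(dist[i][j] for j in range(n)
                      if labels[j] != labels[i])
        scores.append(abs(within - across) / overall)
    return scores


# Hypothetical abundances of two proteins in four samples of two types.
samples = [[1.0, 10.0], [1.2, 10.5], [5.0, 2.0], [5.3, 2.4]]
labels = ["young", "young", "mature", "mature"]
print(self_similarity(samples, labels))
```

Under this reading, a sample whose mean distance to its own type equals its mean distance to the other type scores zero, in line with the caption's description of S = 0.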

The sixty most abundant proteins in trophallactic fluid over 73 colony and 40 single individual samples.
Ranking of abundance (including missing values). Columns, from left to right: Drosophila melanogaster orthologs; proportion of samples in which the protein was identified in colony samples and single individual samples; average iBAQ abundance across all samples; log2 of the fold change in abundance between types for a given comparison; comparisons for which the protein was significant in two out of three methods, marked with yellow dots; and annotation terms. Annotation terms are bolded for the 25 out of 27 core trophallactic fluid proteins that are amongst the 60 most abundant proteins. The two additional, less abundant core proteins are a cathepsin (26–29 p) and a myosin heavy chain (Mhc). For protein accession numbers, see Figure 4—figure supplement 1.

Most abundant proteins with accession numbers.
The sixty most abundant proteins in trophallactic fluid over 73 colony and 40 single individual samples. Ranking of abundance included zero values. Columns, from left to right: accession numbers; proportion of samples in which the protein was identified in colony samples and single individual samples; average iBAQ abundance across all samples; log2 of the fold change in abundance between types for a given comparison; comparisons for which the protein was significant in two out of three methods, marked with yellow dots; and annotation terms. Annotation terms are bolded for the 25 out of 27 core trophallactic fluid proteins that are amongst the 60 most abundant proteins. The two additional, less abundant core proteins are a cathepsin (26–29 p) and a myosin heavy chain (Mhc).
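
The difference between ranking abundance with and without missing values, and the log2 fold change column, can be illustrated with a short sketch. It assumes that missing values are encoded as zeros and that the fold change is the ratio of group means; the direction of the ratio, the function names, and the protein labels are hypothetical and not taken from the paper.

```python
from math import log2
from statistics import mean


def mean_abundance(values, include_missing=True):
    """Mean iBAQ; missing values are assumed to be encoded as 0.0."""
    kept = list(values) if include_missing else [v for v in values if v > 0]
    return mean(kept) if kept else 0.0


def log2_fold_change(group_a, group_b):
    """log2 of mean(group_a) / mean(group_b); None if either mean is zero."""
    a, b = mean_abundance(group_a), mean_abundance(group_b)
    if a <= 0 or b <= 0:
        return None
    return log2(a / b)


def rank_proteins(ibaq, include_missing=True):
    """Protein names sorted from most to least abundant."""
    return sorted(ibaq, key=lambda p: mean_abundance(ibaq[p], include_missing),
                  reverse=True)


# Hypothetical proteins: protA is abundant but rarely detected.
ibaq = {"protA": [8.0, 0.0, 0.0, 0.0], "protB": [3.0, 3.0, 3.0, 3.0]}
print(rank_proteins(ibaq, include_missing=True))
print(rank_proteins(ibaq, include_missing=False))
```

Including zeros rewards proteins that are both abundant and commonly detected, whereas excluding them ranks by abundance when present.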

All proteins that significantly differ in two out of three of the analysis methods (frequentist, empirical Bayes, and random forest classification with SHAP values).
Columns, from left to right: Venn diagrams of significance overlap between methods; Drosophila melanogaster orthologs; proportion of samples in which the protein was identified in colony samples and single individual samples; average iBAQ abundance across all samples calculated without missing values; log2 of the fold change in abundance between types for a given comparison; comparisons for which the protein was significant in two out of three methods, marked with yellow dots; and annotation terms. Annotation terms are in bold for the core trophallactic fluid proteins present in all samples. For visualization of each analysis method, see Figure 5—figure supplement 1. For protein accession numbers, see Figure 5—figure supplement 2. For all the 135 proteins significantly differing in any analysis, see Supplementary file 2. For full model results, see Supplementary files 3-5.
-
Figure 5—source code 1
Jupyter notebook to run random forest analyses, https://github.com/dradri/variation2021.
- https://cdn.elifesciences.org/articles/74005/elife-74005-fig5-code1-v3.zip
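
The "two out of three methods" criterion used in Figure 5 amounts to a simple vote across the per-method lists of significant proteins. The sketch below shows that idea only; the function name and the protein identifiers are hypothetical, and it does not reproduce how each method defines significance.

```python
def consensus_hits(frequentist, empirical_bayes, random_forest, min_votes=2):
    """Return proteins flagged as significant by at least `min_votes` methods."""
    votes = {}
    for hits in (frequentist, empirical_bayes, random_forest):
        for protein in set(hits):  # count each protein once per method
            votes[protein] = votes.get(protein, 0) + 1
    return sorted(p for p, n in votes.items() if n >= min_votes)


# Hypothetical per-method results for one comparison.
frequentist = ["P1", "P2", "P3"]
empirical_bayes = ["P2", "P3", "P4"]
random_forest = ["P3", "P5"]
print(consensus_hits(frequentist, empirical_bayes, random_forest))
```

Setting `min_votes=1` would instead return every protein flagged by any method, which corresponds to the broader list described for Supplementary file 2.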

Visualization of all results.
Venn diagrams summarizing statistical methods, frequentist volcano plots, empirical Bayes volcano plots, and example SHAP value plots of feature importance for the top 20 proteins. Each SHAP plot is for one of the ten models trained. For significant proteins, see Supplementary file 2; for full model results, see Supplementary files 3-5.

Significantly differing proteins in two out of three analyses with accession numbers.
All proteins that significantly differ in two out of three of the analysis methods (frequentist, empirical Bayes, and random forest classification with SHAP values). Columns, from left to right: accession numbers; proportion of samples in which the protein was identified in colony samples and single individual samples; average iBAQ abundance across all samples calculated without zero values; log2 of the fold change in abundance between types for a given comparison; comparisons for which the protein was significant in two out of three methods, marked with yellow dots; and annotation terms.

Gene set enrichment analysis of trophallactic fluid.
Significant terms for Drosophila melanogaster orthologs of (A) the 60 most abundant trophallactic fluid proteins, and of trophallactic fluid proteins significantly differing between (B) Young vs. Mature, (C) Nurse vs. Forager, and (D) Field vs. Lab, with -log10(FDR) indicated on y-axes. Deep purple indicates GO biological process; blue, GO molecular function; turquoise, GO cellular component; lime green, Reactome pathway; orange, KEGG pathway. Circle size indicates strength, log10(observed proteins / expected proteins in a random network of this size). Full results can be found in Figure 6—source data 1.
-
Figure 6—source data 1
Gene set enrichment analyses of trophallactic fluid proteins.
Sheet 1 provides the network characteristics for all gene set enrichment analyses. The detailed results for each network are presented in sheet 2 for the 60 most abundant proteins, and in sheet 3 for the trophallactic fluid proteins significantly differing in two out of three of our statistical methods, first combined and then separately for the three main comparisons. In each case, analysis was performed on D. melanogaster orthologs of C. floridanus proteins. Observed gene count indicates how many proteins in the network are annotated with the term. Background gene count indicates how many proteins in total have this term, in this network and in the background. Strength describes how large the enrichment effect is: log10(observed proteins / expected proteins in a random network of this size). False Discovery Rate describes how significant the enrichment is. p-Values are corrected for multiple testing within each category using the Benjamini–Hochberg procedure. The significant annotations are indicated for GO: Biological process (GO:BP), GO: Molecular function (GO:MF), GO: Cellular component (GO:CC), Reactome pathways and KEGG pathways.
- https://cdn.elifesciences.org/articles/74005/elife-74005-fig6-data1-v3.xlsx
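
The two enrichment quantities defined above can be sketched as follows: strength as log10(observed / expected), and a Benjamini–Hochberg adjustment of p-values. This is a generic illustration of the standard procedure, not the code used by the enrichment tool; the function names and example numbers are hypothetical.

```python
from math import log10


def strength(observed, expected):
    """log10 of observed over expected protein counts for one term."""
    return log10(observed / expected)


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for four terms within one annotation category.
raw = [0.01, 0.04, 0.03, 0.005]
print(benjamini_hochberg(raw))
print(strength(observed=10, expected=1))
```

Because the correction is applied within each category, the list passed to `benjamini_hochberg` in this sketch would hold only the terms of a single category, such as GO:BP.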
Tables
Reagent type (species) or resource | Designation | Source or reference | Identifiers | Additional information |
---|---|---|---|---|
Other | UniProt Reference proteome (Camponotus floridanus); accessed February 2020 | UniProt | UP000000311 | |
Other | NCBI RefSeq Reference proteome (Camponotus floridanus), v7.5 | NCBI RefSeq | GCF_003227725.1 | |
Biological sample (Camponotus floridanus) | Trophallactic fluid (see details in Supplementary file 1) | This paper | Supplementary file 1 | Supplementary file 1 |
Software, algorithm | MaxQuant v1.6.2.10 | MaxQuant | RRID:SCR_014485 | |
Software, algorithm | Perseus v1.6.15.0 | Perseus | RRID:SCR_015753 | |
Software, algorithm | R 3.6.1 | R | RRID:SCR_001905 | |
Software, algorithm | Matlab 2020b | Mathworks | RRID:SCR_001622 | |
Software, algorithm | R-package MASS 7.3–53 | R Project | RRID:SCR_019125 | |
Software, algorithm | R-package LME4 | R Project | RRID:SCR_015654 | |
Software, algorithm | R-package multcomp 1.4–15 | R Project | RRID:SCR_018255 | |
Software, algorithm | LIMMA-pipeline-proteomics pipeline 3.0.0 | GitHub | 10.5281/zenodo.4050581 | |
Software, algorithm | sklearn v0.22.1 | Scikit-learn | RRID:SCR_019053 | |
Software, algorithm | Python 3.7.6 | Python | RRID:SCR_008394 | |
Software, algorithm | SHapley Additive ExPlanations package v0.37.0 | GitHub | RRID:SCR_021362 | |
Software, algorithm | OMA Browser | OMA Browser (Martínez et al., 2020 release) | RRID:SCR_011978 | |
Software, algorithm | Flybase | Flybase | RRID:SCR_006549 | |
Software, algorithm | STRING v11 | STRING | RRID:SCR_005223 | |
Additional files
-
Supplementary file 1
The sampling scheme.
Trophallactic fluid (TF) sampled for proteomics analysis. Date (field) indicates when the colony extract was collected from the field site; Date (TF sampling) indicates the date of the trophallactic fluid collection. Volume and ants indicate, for each sample, the volume collected and the number of ants it was collected from. The lab2019 colonies were used for single individual trophallactic fluid samples. For the two mature colonies that were sampled six times, * marks the sample that was used in the main datasets.
- https://cdn.elifesciences.org/articles/74005/elife-74005-supp1-v3.xlsx
-
Supplementary file 2
All 135 significantly differing proteins.
This supplementary file combines into a single sheet the results and additional information for all of the significantly differing proteins in our four comparisons (Young vs. Mature, Nurse vs. Forager, Field vs. Lab, East vs. West), by all three statistical methods (classical, empirical Bayes, machine learning). Protein accession numbers, presence in colony and individual datasets, abundance when present, fold changes by comparison, and significance both by comparison and by model are provided.
- https://cdn.elifesciences.org/articles/74005/elife-74005-supp2-v3.xlsx
-
Supplementary file 3
Full frequentist statistical results.
Statistical results for the classical frequentist models; the imputed data are also shared.
- https://cdn.elifesciences.org/articles/74005/elife-74005-supp3-v3.xlsx
-
Supplementary file 4
Full empirical Bayes statistical results.
For the empirical Bayes LIMMA models, results are shared as raw output tables.
- https://cdn.elifesciences.org/articles/74005/elife-74005-supp4-v3.xlsx
-
Supplementary file 5
Full random forest statistical results.
Accuracy, seed, and mean feature importances for each gene are reported for each model trained for the random forest analyses.
- https://cdn.elifesciences.org/articles/74005/elife-74005-supp5-v3.xlsx
-
Transparent reporting form
- https://cdn.elifesciences.org/articles/74005/elife-74005-transrepform1-v3.docx