Broad functional profiling of fission yeast proteins using phenomics and machine learning

María Rodríguez-López
Nicola Bordin
Jon Lees
Harry Scholes
Shaimaa Hassan
Quentin Saintain
Stephan Kamrad
Christine Orengo author has email address
Jürg Bähler author has email address

University College London, Institute of Healthy Ageing and Department of Genetics, Evolution & Environment, London, United Kingdom
University College London, Institute of Structural and Molecular Biology, London, United Kingdom

https://doi.org/10.7554/eLife.88229.2

Open access
Copyright information

Figures and data

Schematic overview of study design.
Experimental and computational tasks along with the relationships between the different aspects of the work to generate new information on protein function.

Colony-based phenotyping of deletion mutants.
(A) Volcano plot of mutant colony sizes for priority unstudied genes (green) and all other genes (grey) growing in rich medium. The dashed vertical and horizontal lines show the 5% difference in effect size and significance threshold, respectively. Strains with lower fitness (smaller colonies) are <0 on the x-axis, and those with higher fitness are >0. We applied a significance threshold of 0.05 after Benjamini–Hochberg correction for multiple testing and a difference in fitness of abs(log2(mutant/wild type))> log2(0.05) to call hits based on colony size (n = 60).
(B) Distributions of significant hits per gene for priority unstudied mutants (right) and all other mutants (left) showing faster (green) or slower (purple) colony growth compared to wild-type cells.
(C) Distributions of significant hits per condition for all mutants showing faster (green) or slower (purple) colony growth compared to wild-type cells.
(D) Fitness of mutant cells growing in 2% glycerol+0.01% glucose in this study, plotted as in (A). Blue: mutants identified as slow growing in the same condition in our previous screen (32); red: mutants annotated as ‘increased cell population growth on glycerol carbon source’ (FYPO:0004167) (62); green rings: mutants in priority unstudied genes. Seventy-nine mutants were slow growing in both screens (p = 1×10^-39; Fisher’s exact test, FDR corrected for multiple testing).

Mutant phenotype signatures.
(A) Clustering of discretized data for 3,449 deletion mutants (rows) in 98 conditions (columns), following recommendations from the R package microbialPhenotypes. For discretizing the phenotype data, we classed mutants as either resistant (yellow), sensitive (blue), or similar (white) relative to their fitness in the corresponding control condition, applying an effect-size threshold of ≥10% and, for the two benign conditions, also a significance threshold of p <0.05 (Supplementary file 1). To improve the clustering, we removed conditions and mutants that produced <4 phenotypes, the 3 conditions that produced >2400 phenotypes, and genes with >50% of missing values. Different types of conditions are colour-coded on top. Indicated at left are the 8 main clusters, calculated using k-medoids in R (pam function) and a distance of 1-Pearson correlation. Using the function fviz_nbclust (R package factoextra), we determined that 8 is an optimum number of clusters.
(B) Functional enrichments of the 8 gene clusters in (A) using GO slim categories (51) for Cellular Component (CC), Biological Process (BP) and Molecular Function (MF) (left, using SystempipeR) as well as KEGG pathways (right, using g:Profiler) (77).

NET-FF predictions of GO terms for S. pombe proteins.
(A) GOGO semantic similarity scores for GO terms predicted by NET-FF and GO terms annotated by PomBase, both for real (orange) and randomized (blue) datasets from NET-FF. Left: Biological Process data, right: Molecular function data.
(B) Information content distribution for GO terms predicted by NET-FF and GO terms annotated by PomBase, both for GO Biological Process (orange) and GO Molecular Function (blue) terms. Left: dataset for all genes, right: subset of priority unstudied genes.

Experimental validation of new ageing-related gene reflecting its NET-FF GO predictions and phenotype-correlation network.
(A) Maximal CLS for eight selected deletion mutants (grey) of genes with NET-FF GO terms related to ageing/cell death and DNA-damage repair, measured using a high-throughput CLS assay (36). Of the eight mutants, we have recently shown that ubc14 and htr12 are short- and long-lived, respectively (36). As controls, the maximal CLS of known short-lived (alg14, green) and long-lived (pac3, purple) mutants (36) and wild-type cells (wt, red) are included. Three independent biological repeats were performed with actual datapoints shown as dots. Orange band: mutant in priority unstudied gene (SPAC23C4.09c) showing strong longevity phenotype.
(B) Cluster from Cytoscape network (see Figure 6) representing phenotype correlations between the SPAC23C4.09c gene-deletion mutant and ten other mutants. Orange edges show positive phenotype correlations, and red circles indicate genes with previously known CLS phenotypes (36,83). Details discussed in main text.

Example of phenotype-correlation clusters from Cytoscape network.
Positive and negative phenotype correlations across the 131 conditions are indicated as orange and blue edges, respectively. Gene pairs with high GO similarities based on PomBase annotations or NET-FF predictions are indicated as hatched pink and bright blue edges, respectively. (A) Cluster 13 consisting of a branch with all members of the PAN complex (red) and a branch containing six genes involved in cAMP signalling (blue), along with two genes involved in RNA interference (green).
(B) Cluster 31 enriched for genes involved in vacuolar and endosomal transport (red) and Cluster 22 enriched for genes with peroxisome functions (green).
(C) Three clusters linking members of the same protein complexes, involved in protein maturation (NatC, green), regulation of transcription elongation (CTDK-1, red), and histone deacetylation (Clr6, black). The data reflect different Clr6 sub-complexes, with png2 and cti6 being positively correlated (Clr6 complex I’) but png2 and png3 being negatively correlated (Clr6 complex I”) (87).
(D) Two networks containing priority unstudied genes (green), including those with implicated mitochondrial functions (green with purple ellipses), together with their nearest neighbours based on phenotype correlation, which include genes with known mitochondrial functions (grey with purple ellipses). For simplicity, high GOGO pairs are not indicated in these networks.

Colony-based phenotyping of mutants in benign conditions.
(A) Left: Volcano plot of mutant colony sizes for priority unstudied genes (green) and all other genes (grey) growing in minimal medium. The dashed vertical and horizontal lines show the 5% difference in effect size and significance threshold, respectively (statistics as in Fig. 2A; n = 8). Right: Visualisation of overlapping genes among different sets as indicated using Metascape (1). We classified the mutants as fast- or slow-growing relative to their growth compared to the wild-type strain in minimal (EMM) and rich (YES) media and separately show the mutants in priority unstudied genes.
(B) Top GO and Reactome terms enriched for fast- and slow-growth mutants in minimal (EMM) and rich (YES) medium. The different mutant classes showing phenotypes were subjected to GO enrichment analysis using Metascape.

Our phenotype association matrix has much higher concordance with STRING than expected by chance.
Number of edges obtained from STRING protein interactions that are concordant with phenotype correlations (red vertical line) compared to 100 randomisations of the phenotype data (blue curve). For each STRING protein pair with a STRING-SCORE ≥750 the association was kept and the STRING-SCORE was multiplied by the phenotype-correlation score for the corresponding protein pair (obtained by measuring Pearson correlation of phenotype profiles for protein pair). In total, 425 protein pairs (edges) had a combined product of STRING-SCORE and phenotype-correlation score >0.5 (red hatched line). This number was never obtained by repeating the process 100 times after randomising the genes in the phenotype matrix. A maximum of only 79 protein pairs (edges) was obtained for the random process (blue curve). This result indicates that the STRING data show above random concordance with the phenotype data.

Clusters of mutant phenotype profiles for oxidative-stress conditions.
Hierarchical clustering of discretized data for 3449 deletion mutants (rows) in 6 conditions (columns). For discretizing the phenotype data, we classed mutants as either resistant (yellow), sensitive (blue), or similar (white) relative to their fitness in the corresponding control condition (as in Figure 3A). Basic clustering was performed in R using heatmap.2, with euclidean measure to obtain distance matrix and complete agglomeration method.

Filtering of predictions by information content and probability value.
Illustration of the criteria for filtering the predictions (green area) by information content >5 (left) and by probability value >0.7 (right).

Performance of NET-FF predictor by source data.
Precision-recall curves plotted for each repeat (translucent curves) and micro-averaged data (solid curves). FF: FunFam homology data; NET: network data; NET-FF: combined FF and NET information; numbers indicate corresponding micro-averaged AUPR.

GO association networks show higher concordance with our phenotype data than expected by chance.
(A) For each gene pair with a predicted GOGO BP (score >0.5) the GOGO score was multiplied by the PHEPHE score of the same gene pair (obtained from Pearson correlation of phenotype profiles across conditions). If the resulting combined product of GOGO and PHEPHE scores was >0.5, we added the gene pair link to a graph (network). For our dataset, this resulted in a graph of 31 edges (red line above). Graphs of this size were never obtained by repeating the process 100 times through randomising the PHEPHE matrix column gene labels before the lookup on the PHEPHE score (blue data). A maximum graph size of 17 edges was found through 100 randomisations. This result indicates that the GO predictions show above random concordance with the phenotype data.
(B) For each gene pair with a PomBase GOGO MF or BP score >0.7, the GOGO score was multiplied by the PHEPHE score of the same gene pair (obtained from Pearson correlation of phenotype profiles across conditions). If the resulting combined product of PomBase GOGO and PHEPHE scores was >0.5, we added the gene pair link to a graph (network). For our dataset, this resulted in a graph of over 2100 edges (red line above). Graphs of this size were never obtained by repeating the process 100 times through randomising the PHEPHE matrix column gene labels before the lookup on the PHEPHE score (blue data). This result indicates that the PomBase GO associations show above random concordance with the phenotype data.

Enrichment of high GOGO gene pair edges in nine clusters from the Cytoscape clustering of phenotype correlations, containing at least 15 genes. Six showed more high GOGO gene pairs based on NET-FF predictions than expected by chance.

Phenotype-correlation clusters from entire Cytoscape network.
Examples are shown in Fig. 6. The data can be interactively explored here: https://www.ndexbio.org/#/network/d4e109d2-05e9-11ed-ac45-0ac135e8bacf?accesskey=46cbcedb7360ef936f5cd68fb551ef8f9cd7c6bf068f279bc2c61e3e94e4a490

Sign up for email alerts