Generalized linear mixed-effects model (GLMM) identifies protein complexes and pathways up-regulated in response to mutational load in human tumors.

(A) Overview of the GLMM used to measure the association of mutational load with gene expression while controlling for potential covariates (purity and cancer type). Genes with a significant, positive β1 regression coefficient and false discovery rate (FDR) < 0.05 are used for gene set enrichment analysis. (B-C) Bar plots of protein complexes from the CORUM database (left) and pathways from the KEGG database (right) that are significantly enriched (p < 0.05) in response to mutational load. Bar length denotes the negative log10 of the adjusted p-value, and colors denote broad functional groups enriched in both databases.
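
The gene selection step in panel A can be illustrated with a minimal sketch. It assumes the FDR is controlled with the Benjamini-Hochberg procedure, which the legend does not specify; the function names and the toy values are hypothetical and are not taken from the study.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def select_upregulated(results, fdr=0.05):
    """Keep genes with beta1 > 0 and an adjusted p-value below `fdr`.

    `results` maps a gene name to a (beta1, raw_p) tuple.
    """
    genes = list(results)
    adj = benjamini_hochberg([results[g][1] for g in genes])
    return [g for g, q in zip(genes, adj) if results[g][0] > 0 and q < fdr]


# Hypothetical toy values (assumption: BH is the FDR method).
toy = {
    "GENE_A": (0.30, 0.001),
    "GENE_B": (-0.20, 0.002),
    "GENE_C": (0.10, 0.200),
    "GENE_D": (0.25, 0.010),
}
selected = select_upregulated(toy)
```

Only genes that pass both filters (positive coefficient and adjusted p-value below the threshold) would be carried forward to enrichment analysis in this sketch.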

Gene silencing is elevated in high mutational load tumors likely through the coupling of intron retention with mRNA decay.

(A) Counts of the number of under-expressed transcripts with intron retention events, relative to counts of all intron retention events in tumors binned by the total number of protein-coding mutations. Intron retention events with PSI > 80% are counted. Error bars are 95% confidence intervals determined by bootstrap sampling. (B) Bar plot of significant protein complexes in the CORUM database (in red) and Reactome pathway database (in blue) with more (bottom) and fewer (top) intron retention events in high mutational load tumors compared to low mutational load tumors.
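
As an illustration of how bootstrap error bars of this kind can be obtained, the sketch below computes a percentile bootstrap interval for the fraction of intron retention events that fall in under-expressed transcripts within one mutational load bin. The percentile method, the number of resamples, and the toy counts are assumptions; the legend only states that bootstrap sampling was used.

```python
import random


def bootstrap_ci(flags, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of a 0/1 list."""
    rng = random.Random(seed)
    n = len(flags)
    means = []
    for _ in range(n_boot):
        # Resample events with replacement and record the fraction of 1s.
        resample = [flags[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


# Hypothetical bin: 1 = retention event in an under-expressed transcript.
events = [1] * 30 + [0] * 70
point = sum(events) / len(events)
low, high = bootstrap_ci(events)
```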

Protein folding, degradation, and synthesis are regulated in both high mutational load tumors (TCGA) and cell lines (CCLE).

Box plots of β1 regression coefficients (top panels) and negative log10 adjusted p-values (bottom panels) measuring the association of mutation load and the expression of individual genes in chaperone (purple), proteasome (yellow), and ribosome (green) complexes. Shown are regression coefficients from human tumors (TCGA) on the left and cell lines (CCLE) on the right. Percentages and grey lines on top panels show the quantile distribution of regression coefficients measuring the association of mutational load and expression for all genes in the genome within each dataset. Vertical grey line on bottom panels shows threshold of significance (p = 0.05).

Viability in high mutational load cell lines decreases when proteostasis machinery is disrupted.

(A) Heatmap of β1 regression coefficients jointly measuring the association of mutational load and cell viability after expression knockdown of individual genes in proteostasis complexes. (B) Heatmap of β1 regression coefficients measuring the association of mutational load and cell viability after inhibition of proteostasis machinery via drugs. Both panels show how stable the regression estimates are when including all cancer types (‘All Cancers’, shown in black boxes) and when removing each individual cancer type on the y-axis. Colors denote a positive (blue), zero (grey), or negative (red) relationship of mutational load and cell viability after expression knockdown or drug inhibition. Stars denote whether the relationship is significant (* = p < 0.05; ** = p < 0.005; *** = p < 0.0005).
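
The star annotation follows directly from the thresholds stated above. A minimal sketch of that mapping, with a hypothetical function name and example p-values:

```python
def significance_stars(p_value):
    """Map a p-value to the star code given in the legend."""
    if p_value < 0.0005:
        return "***"
    if p_value < 0.005:
        return "**"
    if p_value < 0.05:
        return "*"
    return ""  # not significant: no star


# Hypothetical p-values, one per significance tier.
labels = [significance_stars(p) for p in (0.2, 0.03, 0.004, 0.0001)]
```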

Targeting proteostasis machinery is a key vulnerability in high mutational load cell lines.

(A) Bar plot of the number of drugs in the PRISM database significantly (black) and not significantly (grey) associated with mutational load and cell viability using a simple generalized linear model (GLM). (B) Fraction of drugs in broad functional categories significantly negatively associated with mutational load and cell viability from the GLM. Confidence intervals were determined by randomly sampling 50 drugs in each functional category 100 times. Dashed line is the median of randomly sampled drugs across all categories.
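
The resampling described for panel B can be sketched as follows. This assumes that drugs are drawn without replacement and that the interval is read from the percentiles of the 100 draws; neither detail is stated in the legend, and the category composition is hypothetical.

```python
import random
import statistics


def subsample_fractions(is_hit, n_draw=50, n_rep=100, seed=0):
    """Fraction of hits in each of `n_rep` random draws of `n_draw` drugs."""
    rng = random.Random(seed)
    fractions = []
    for _ in range(n_rep):
        draw = rng.sample(is_hit, n_draw)  # assumption: without replacement
        fractions.append(sum(draw) / n_draw)
    return fractions


# Hypothetical category: 200 drugs, 60 significantly negatively associated.
category = [1] * 60 + [0] * 140
fractions = sorted(subsample_fractions(category))
low = fractions[2]     # approximately the 2.5th percentile of 100 draws
high = fractions[97]   # approximately the 97.5th percentile
median = statistics.median(fractions)
```

Drawing a fixed number of drugs per category keeps the interval widths comparable between categories that contain very different numbers of drugs.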

No collinearity of point mutations and copy number alterations in human tumors (TCGA) and cancer cell lines (CCLE).

Heatmap of Pearson’s correlation coefficients between different classes of mutations in A. CCLE (cancer cell lines) and B. TCGA (human tumors). Colors denote magnitude of correlation coefficients and whether the relationship is positive (red), negative (blue) or negligible (white). CNAs are defined as the combined number of amplifications and deletions, while SNVs are the combined number of all point mutations.
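
For reference, each cell of such a heatmap is a Pearson correlation between two per-sample mutation counts. The sketch below computes one such coefficient, with CNAs defined as amplifications plus deletions as in the legend; the sample counts are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical per-sample counts, not data from TCGA or CCLE.
snvs = [10, 20, 30, 40]
amplifications = [3, 1, 4, 2]
deletions = [2, 5, 1, 3]
cnas = [a + d for a, d in zip(amplifications, deletions)]
r_snv_cna = pearson_r(snvs, cnas)
```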

Genes significantly expressed from the transcriptional screen mostly fall into the upper quartile of effect sizes, which are enriched for proteostasis complexes.

A. Volcano plot of positive β1 regression coefficients and negative log10 adjusted p-values measuring the association of mutational load and the expression of individual genes from the transcriptional screen in Fig. 1A. Colors denote the lower (Q1 in red), median (Q2 in green) and upper (Q3 in red) quartiles of each positive beta coefficient from the regression model. Genes that are significantly expressed from the transcriptional screen mostly fall into the upper quartile. B. Bar plot of significant protein complexes in the CORUM database identified using gene set enrichment analysis only on genes that fall into the upper quartile of effect sizes. The upper quartile of effect sizes contains fewer than half of the genes previously identified as significant (n=2,152 vs n=5,330), yet still identifies protein degradation, translation and chaperones as the top significant protein complexes.

Intron retention events that overlap with mutations do not account for the association of gene silencing in high mutational load tumors.

A. Counts of the number of intron retention events filtered (in red) due to overlap with a mutation present in the same gene (and thus corresponding to potential eQTLs) compared with the number of remaining alternative splicing events with no overlap with a mutation (in blue). Filtered alternative splicing events represent ∼1% of all alternative splicing events across all tumors. B-C. Counts of the number of under-expressed transcripts with intron retention events, relative to counts of all intron retention events in tumors binned by the total number of protein-coding mutations. Shown are trends when (B) alternative splicing events are not filtered due to overlap with mutations and (C) when events are filtered (same as Fig. 2A). Intron retention events with PSI > 80% are counted. Error bars are 95% confidence intervals determined by bootstrap sampling. These results further support the prediction that gene silencing is elevated in high mutational load tumors and likely mediated by the coupling of intron retention with mRNA decay.

The number of under-expressed transcripts increases with the mutational load of tumors for different PSI value thresholds and alternative splicing events.

A. Counts of the number of under-expressed transcripts with intron retention events, relative to counts of all intron retention events in tumors binned by the total number of protein-coding mutations. Intron retention events with different PSI thresholds are shown in different colors. B. Counts of the number of under-expressed transcripts that contain different classes of alternative splicing events, relative to counts of all alternative splicing events of the same class in tumors binned by the total number of protein-coding mutations. Alternative splicing events of different classes are shown in different colors (AA=Alternate Acceptor Sites, AD=Alternate Donor Sites, AP=Alternate Promoter, AT=Alternate Terminator, ES=Exon Skip, ME=Mutually Exclusive Exons, RI=Retained Intron). Error bars are 95% confidence intervals determined by bootstrap sampling.

Association between expression in proteostasis complexes and mutational load is not driven by a single cancer type in TCGA.

Box plots of regression coefficients from the GLMM measuring the association of the expression of each individual gene with the mutational load of tumors in TCGA colored by different proteostasis complexes. Shown are regression estimates after removing each individual cancer type (x-axis) and re-running the GLMM.
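
The leave-one-cancer-type-out check can be sketched as below. For brevity, an ordinary least-squares slope is swapped in for the GLMM, so this illustrates only the resampling scheme and not the model used in the figure; the cancer type labels and values are hypothetical.

```python
def ols_slope(x, y):
    """Slope of a simple least-squares fit of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy / sxx


def leave_one_type_out(samples):
    """Refit the slope after dropping each cancer type in turn.

    `samples` is a list of (cancer_type, mutational_load, expression).
    """
    slopes = {}
    for held_out in sorted({t for t, _, _ in samples}):
        kept = [(m, e) for t, m, e in samples if t != held_out]
        slopes[held_out] = ols_slope([m for m, _ in kept],
                                     [e for _, e in kept])
    return slopes


# Hypothetical samples in which expression = 0.5 * load + 1 for every type.
samples = [(t, load, 0.5 * load + 1.0)
           for t, loads in (("T1", (1, 2)), ("T2", (3, 4)), ("T3", (5, 6)))
           for load in loads]
slopes = leave_one_type_out(samples)
```

If the estimates stay close to each other across held-out types, as they do by construction in this toy example, no single cancer type drives the association.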

Linear regression analysis within cancer types in TCGA captures similar expression responses to mutational load across proteostasis complexes.

Heatmap of β1 regression coefficients measuring the effect of mutational load on gene expression in proteostasis complexes while controlling for tumor purity within cancer types that have enough samples to accurately measure effect sizes (N > 150) and a sufficiently large mutational load to potentially generate a proteostasis response (median protein-coding mutations > 25). ‘MutLoad’ shows log10 of the median number of protein-coding mutations for each cancer type.

Association between the expression in proteostasis complexes and mutational load is not driven by patient age.

Box plots of regression coefficients from the GLMM measuring the association of the expression of each individual gene with the mutational load of tumors from TCGA colored by different proteostasis complexes. Shown are regression coefficients when running the GLMM on tumors stratified by different age groups (x-axis).

Association between the expression in proteostasis complexes and mutational load is not driven by a single cancer type in CCLE.

Box plots of regression coefficients from the GLM measuring the association of the expression of each individual gene with the mutational load of cell lines colored by different proteostasis complexes. Shown are regression estimates after removing each cancer type in CCLE (x-axis) and re-running the GLM.

Similar patterns of expression and protein abundances in response to mutational load in CCLE within genes that regulate protein folding, degradation, and synthesis.

Box plots of β1 regression coefficients measuring the association of mutational load and protein abundance (right) or gene expression (left) of individual genes in chaperone (purple), proteasome (yellow), and ribosome (green) complexes. Shown are regression coefficients from cancer cell lines (CCLE), the largest available dataset of RNA (n=1377) and protein (n=373) abundances harmonized across samples. Percentages and grey lines on top panels show the quantile distribution of regression coefficients measuring the association of mutational load and expression for all genes in the genome within each dataset.