Introduction

Most eukaryotic mRNAs are modified by the addition of a 5’-terminal 7-methylguanosine cap, splicing of intronic sequences, and 3’-terminal cleavage and polyadenylation. The cleavage and polyadenylation reactions are tightly coupled with transcription termination and the release of newly synthesized transcripts from RNA polymerase II. Therefore, precise cleavage and polyadenylation are critical for the production of mature mRNAs. Mechanistically, these reactions require the co-transcriptional assembly of a multisubunit protein complex at the corresponding cis-regulatory sequences near the pre-mRNA cleavage site (CS) 1–3. A key cis-element guiding the assembly of the cleavage and polyadenylation machinery is the polyadenylation signal (PAS). The most common PAS sequences in mammals are AATAAA and ATTAAA hexamers, although their single nucleotide-substituted variants may function in some cases 4, 5.

Earlier studies have emphasized the importance of pre-mRNA cleavage/polyadenylation in the context of human diseases. For example, alternative cleavage/polyadenylation has been proposed to modulate the expression of oncogenes and tumour suppressors in different types of cancer 6, 7. Germline mutations affecting polyadenylation signals can play a role in genetic disorders 8–10 and increase cancer susceptibility 11, 12. Notably, although the role of somatic mutations affecting polyadenylation signals has been investigated for individual genes in a limited number of tumour samples 13, 14, a systematic characterization of the role of this type of mutation in cancer has not been carried out.

Large-scale genome sequencing studies have identified numerous cancer driver and driver-like mutations within protein-coding sequences 15. Such mutations have also been mapped to noncoding regions; however, existing research has primarily focused on promoters, enhancers, and splicing signals 16–20, rather than sequences regulating pre-mRNA cleavage and polyadenylation.

Here, we conducted a systematic genome-wide analysis of somatic single-nucleotide variants (SNVs) affecting the PAS elements in mRNA 3’ untranslated regions (3’UTRs) in cancer cells. Using a large tumour whole-genome sequencing dataset, the Pan-Cancer Analysis of Whole Genomes (PCAWG) 15, we found that strong and evolutionarily conserved cleavage/polyadenylation signals are often disrupted by cancer-specific SNVs. Strikingly, such mutations are significantly enriched in tumour suppressor genes. We further provide evidence that such mutations can substantially decrease the expression of tumour suppressor genes in cancer cells. Overall, our work identifies a novel class of noncoding somatic mutations with driver-like properties in cancer.

Results

Somatic mutations often disrupt cleavage and polyadenylation sequences in cancer

We first analysed SNVs neighbouring annotated human cleavage and polyadenylation positions (paSNVs) in 3’UTRs from the PolyA_DB3 database 21. We considered two distinct cohorts: “Normal” paSNVs from a healthy human population (the 1000 Genomes phase 3 data 22) and “Cancer” paSNVs from the whole-genome sequencing of cancer samples (PCAWG) 15.

For each paSNV, we calculated the change in the cleavage/polyadenylation efficiency predicted by the APARENT2 neural network model 23 and assessed the loss and gain of the two strongest polyadenylation signals, AATAAA and ATTAAA (referred to as AWTAAA throughout this study; Fig. 1A). As expected, paSNVs predicted to have a strong impact on cleavage/polyadenylation were often situated immediately upstream of a CS, with most of them affecting AWTAAA hexamers (Fig. 1B).

Figure 1. paSNVs disrupting cleavage/polyadenylation signals are depleted in the normal population.

(A) Bioinformatics workflow used to analyse the effect of paSNVs on pre-mRNA cleavage and polyadenylation.

(B) Top, effects of UP- and DOWN-paSNVs on the APARENT2 score (mean±SEM) as a function of their position with respect to annotated pre-mRNA cleavage sites (CSs). Bottom, combined distribution of AWTAAA-affecting paSNVs in both datasets.

(C) Box plot showing that paSNVs disrupting polyadenylation signals are significantly less frequent than control groups of events in the normal population.

(D) paSNVs disrupting polyadenylation signals are enriched for singletons, consistent with purifying selection against such events in the normal population.

We then categorized all paSNVs into three groups: (1) upregulating cleavage/polyadenylation (UP-paSNVs), defined as events with a ≥1 increase in the APARENT2 score (log odds ratio, or LOR) and creating an AWTAAA hexamer; (2) downregulating cleavage/polyadenylation (DOWN-paSNVs; LOR≤-1 and disrupting an AWTAAA); and (3) the remaining annotated cleavage position-adjacent paSNVs (Fig. 1B). The latter group served as a background control (BG-paSNVs) in our subsequent analyses.
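The three-way grouping above can be sketched in a few lines of Python. This is a minimal illustration rather than the pipeline used in the study: the function names are hypothetical, the LOR is taken as a precomputed input, and "creating" or "disrupting" an AWTAAA hexamer is approximated here as a net change in the number of AWTAAA matches between the wild-type and mutant sequences.

```python
import re

# AWTAAA covers the two canonical hexamers AATAAA and ATTAAA (W = A or T).
# A lookahead is used so that overlapping hexamers are all counted.
AWTAAA = re.compile(r"(?=A[AT]TAAA)")


def count_awtaaa(seq: str) -> int:
    """Count AWTAAA hexamers in a DNA sequence (case-insensitive)."""
    return len(AWTAAA.findall(seq.upper()))


def classify_pasnv(wt_seq: str, mut_seq: str, lor: float,
                   threshold: float = 1.0) -> str:
    """Assign a paSNV to the UP, DOWN or BG group.

    ASSUMPTION: hexamer gain/loss is approximated by the net change in
    AWTAAA counts; the study may define gain/loss differently.
    """
    delta = count_awtaaa(mut_seq) - count_awtaaa(wt_seq)
    if lor >= threshold and delta > 0:
        return "UP"      # stronger predicted signal and a new AWTAAA
    if lor <= -threshold and delta < 0:
        return "DOWN"    # weaker predicted signal and a lost AWTAAA
    return "BG"          # everything else serves as background


# Toy sequences and LOR values, for illustration only.
print(classify_pasnv("CCAATAAAGG", "CCGATAAAGG", lor=-1.5))
```

Note that under this count-based approximation an SNV converting AATAAA to ATTAAA leaves the hexamer count unchanged and therefore falls into the background group regardless of its LOR.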

Consistent with earlier studies 23–25, we observed a pronounced negative selection against the DOWN-paSNVs in the Normal dataset (Fig. 1C-D). This category showed significantly decreased allele frequencies in comparison to the BG-paSNVs and was enriched for singletons (unique variants in the analysed dataset). This effect was more evident when considering changes in both the score and the hexamer composition (Fig. S1).

Notably, comparison of the Normal and Cancer datasets showed that cancer somatic mutations, on average, had a stronger effect on the polyadenylation efficiency in both the UP- and DOWN-paSNV groups (Fig. 1B; LOR sample variance 0.115 in cancer vs 0.0876 in normal). DOWN-paSNVs were significantly enriched in cancer compared to the normal population data (Fig. 2A). We also observed that mutations disrupting AWTAAA hexamers in 3’UTRs tended to occur near annotated cleavage sites in cancer (Fig. 2B).

Figure 2. Cancer somatic mutations tend to disrupt functional cleavage/polyadenylation signals.

(A) Bar plot showing enrichment of paSNVs disrupting polyadenylation signals among cancer somatic mutations.

(B) Bar plot showing enrichment of SNVs affecting AWTAAA sequences in 3’UTRs close to annotated cleavage sites (CSs) among cancer somatic mutations.

(C) Box plot showing that somatic mutations disrupt stronger cleavage/polyadenylation signals in cancer.

(D) paSNVs disrupting polyadenylation signals occur in more evolutionarily conserved regions in cancer (mean PhastCons score in a 15-nt window centred on the SNV).

Interestingly, cancer-specific DOWN-paSNVs affected cleavage/polyadenylation signals with higher APARENT2 scores (Fig. 2C). Furthermore, DOWN-paSNVs tended to affect more evolutionarily conserved sequences in the Cancer dataset compared to the Normal control (Fig. 2D and Fig. S2). In total, we identified 1614 distinct somatic DOWN-paSNVs in the cancer dataset affecting 1570 cleavage/polyadenylation events in 1460 genes in 610 tumours, i.e. ∼23% of all tumour samples in PCAWG.

We concluded that mutations disrupting functional cleavage/polyadenylation signals are abundant in cancer cells despite being subject to strong purifying selection in a healthy population.

Cancer-specific mutations in cleavage and polyadenylation sequences are enriched in tumour suppressor genes

There are two possible explanations for the enrichment of DOWN-paSNV events in cancer: (1) an increase in the overall mutation load and (2) positive selection for such mutations. Since the latter possibility may increase the incidence of mutations in cancer driver genes, we analysed the distribution of DOWN-paSNVs within genes from the Cancer Gene Census 26. This revealed a remarkable over-representation of the DOWN-paSNVs in tumour suppressor genes, with the magnitude of this effect being greater than the corresponding enrichment of nonsense mutations (SNVs creating a premature translation termination codon) (Fig. 3A and Fig. S3A-B).

Figure 3. Somatic cancer mutations often disrupt cleavage/polyadenylation signals in tumour suppressor genes.

(A-B) Overrepresentation of (A) tumour suppressors but not (B) oncogenes among genes with cancer somatic DOWN-paSNVs, as compared to genes with cancer somatic BG-paSNVs. Fractions of tumour suppressors and oncogenes are also shown for all genes and genes containing cancer somatic nonsense (premature stop codons), missense (altered amino acid residues) and synonymous (synonymous codons) mutations. Note that the enrichment of tumour suppressors is stronger for DOWN-paSNVs compared to nonsense mutations.

(C-D) Enrichment of different groups of cancer somatic SNVs in (C) tumour suppressors and (D) oncogenes calculated using DigDriver relative to genes not listed in Cancer Census (non-Census) and presented with 95% confidence intervals. Note that DOWN-paSNVs and nonsense mutations are enriched in tumour suppressors but not in oncogenes. In contrast, oncogenes are often affected by missense mutations, as expected.

(E) Cancer somatic DOWN-paSNVs co-occur in the same tumour with non-synonymous damaging SNVs, a group of somatic mutations defined in 20, more often than BG-paSNVs. Note that the co-occurrence is particularly high for tumour suppressors.

(F) The overall frequency of non-synonymous damaging SNVs is significantly higher in the DOWN-paSNV-containing group compared to the DOWN-paSNV-lacking group of tumour suppressor genes.

Overall, DOWN-paSNVs were found to affect 38 tumour suppressor genes, i.e. 14.3% of all genes in this category in the Census dataset. Consistent with tumour suppressors being a major target of DOWN-paSNVs, genes with this type of mutation were strongly enriched for apoptosis-related functions (Fig. S3C). DOWN-paSNVs were not enriched in the oncogenes (Fig. 3B), in line with the disruptive nature of such mutations under normal conditions (Fig. 1). Conversely, oncogenes but not tumour suppressors showed enrichment for UP-paSNVs (Fig. S3A-B).

To independently confirm the functional impact of DOWN-paSNVs in cancer, we compared the mutational excess of different types of somatic mutations using DigDriver 18, a neural network-based method that accounts for cancer-specific mutation rates. This analysis revealed a significantly higher observed-to-expected mutation rate for DOWN-paSNV events in cancer compared to the BG-paSNV group (Fig. S4). DOWN-paSNVs tended to be enriched in tumour suppressor genes, consistent with positive selection for these events in cancer (Fig. 3C). No such enrichment was detected in oncogenes (Fig. 3D).

Of note, our analysis of wild-type sequences showed that tumour suppressor 3’UTRs are characterized by stronger cleavage/polyadenylation signals compared to oncogenes and non-cancer genes (Fig. S5A-B). Moreover, tumour suppressors associated with hallmarks of cancer 27 in the Census dataset had stronger cleavage/polyadenylation signals than the rest of tumour suppressor genes (Fig. S5C).

According to the classical two-hit hypothesis 28, both alleles of tumour suppressor genes may acquire distinct damaging mutations in cancer. With this in mind, we analysed the co-occurrence of paSNVs with damaging non-synonymous mutations from the PCAWG collection (Non-syn. variants from the binarised gene-centric table in 20). DOWN-paSNV-containing tumour suppressors showed a markedly increased incidence of such additional somatic mutations in the same tumour compared to the BG-paSNV control (Fig. 3E). Furthermore, the overall frequency of damaging non-synonymous mutations in tumour suppressors affected by DOWN-paSNVs in at least one sample was significantly higher than in the DOWN-paSNV-negative tumour suppressor group (Fig. 3F).

Taken together, these data suggest that somatic mutations disrupting cleavage and polyadenylation can facilitate the inactivation of tumour suppressors in cancer.

Somatic mutations in cleavage and polyadenylation signals can decrease the expression of tumour suppressor genes

Genetic inactivation of functional cleavage/polyadenylation sequences may negatively affect gene expression (see e.g. ref. 8). To explore this possibility, we turned to the colorectal adenocarcinoma subset of PCAWG, as it contained most of the DOWN-paSNVs in tumour suppressors and the corresponding gene expression information 20. We shortlisted detectably expressed tumour suppressors that contained DOWN-paSNVs and no other damaging mutations in specific cancer samples, and were wild type in other samples. Seven genes passing these filters were involved in various aspects of tumour biology, including cell survival and DNA repair (CASP9, NDRG1, and XPA), mTOR signalling (TSC1), and transcription and RNA processing (ETV6, ISY1 and SMAD2).

Plotting pairwise gene-specific expression differences for the aggregated tumour suppressor set, we observed a significant bias towards downregulation in the samples containing DOWN-paSNVs compared to the wild-type controls (Fig. 4A; median downregulation of 1.25-fold). Remarkably, similar negative biases were detected for all seven individual genes, with median downregulation values ranging from 1.1- to 3.2-fold (Fig. 4B).

Figure 4. Somatic cancer DOWN-paSNVs are sufficient to downregulate tumour suppressor genes.

(A-B) Gene-specific expression differences between DOWN-paSNV-containing and wild-type samples (ΔLog2 of copy number variation-normalized FPKM values; see Materials and Methods) reveal a consistently negative effect of DOWN-paSNV on tumour suppressor mRNA abundance in colorectal cancers. Box plots are shown for (A) an aggregated set of qualifying tumour suppressors and (B) individual genes from this set. Outliers are omitted for clarity.

(C) Wild-type and mutated sequences of the XPA tumour suppressor gene cleavage/polyadenylation signal. The PAS hexamer is enclosed within a box.

(D) Top, XPA cleavage site read-through minigenes and corresponding primers used for RT-qPCR analyses. Bottom, RT-qPCR data showing stronger read-through (weaker polyadenylation) in the mutant minigene.

(E) Top, luciferase expression minigenes. Bottom, luciferase assay revealing that the cancer-specific PAS mutation dampens the expression of the reporter gene.

To validate the effect of DOWN-paSNVs on gene expression, we focused on a somatic mutation that disrupts the cleavage/polyadenylation signal in the tumour suppressor XPA. This gene has been shown to promote apoptosis in response to DNA damage, in addition to its role in nucleotide excision repair 29. Moreover, downregulation of XPA has been associated with decreased patient survival in colorectal cancer 30.

The XPA mutation identified by our bioinformatics analyses alters the canonical AATAAA PAS hexamer to GATAAA near the terminal CS and significantly reduces the APARENT2 score (Fig. 4C). To experimentally assess the effect of this mutation on the efficiency of pre-mRNA cleavage and polyadenylation, we prepared minigene constructs where the wild-type or mutant sequences were inserted upstream of a recombinant CS (Fig. 4D).

We used the wild-type and mutant minigenes to transfect the human colorectal cancer cell line HCT-116. An RT-qPCR assay measuring the efficiency of cleavage/polyadenylation as a ratio between the CS-read-through and CS-upstream signals revealed a significant decrease in cleavage/polyadenylation efficiency in response to the XPA DOWN-paSNV (Fig. 4D).

To directly assess the effect of defective cleavage/polyadenylation on gene expression, the wild-type or the mutant 3’UTR sequences were inserted downstream of a luciferase reporter gene. Following the transfection of HCT-116 cells, we detected a significantly reduced production of luciferase protein from the mutant construct compared to the wild-type control (Fig. 4E).

Thus, somatic mutations disrupting polyadenylation signals in tumour suppressor genes can reduce the abundance of functional mRNA transcripts.

Discussion

We interrogated whole-genome mutation data using recently developed machine learning approaches to systematically characterize the impact of SNVs on 3’UTR polyadenylation signals (PAS) in cancer. Our analyses confirm that germline SNVs disrupting PAS are likely deleterious, as they are subjected to strong negative selection in the normal population (Fig. 1 and Fig. S1). Intriguingly, we found that somatic mutations affecting such cis-elements in cancer are more prevalent, tend to occur near stronger CSs, and target more evolutionarily conserved PAS hexamers (Fig. 2 and Fig. S2).

Importantly, these cancer somatic SNVs disrupt PAS sequences in tumour suppressor genes with a similar enrichment pattern to well-known deleterious SNVs in protein-coding regions, such as nonsense mutations (Fig. 3A-D). Additionally, wild-type tumour suppressors have stronger cleavage/polyadenylation signals than other groups of genes (Fig. S5), pointing to the importance of the corresponding steps of pre-mRNA processing for their expression.

Consistent with the two-hit hypothesis 28, we found that tumour suppressors with disrupted cleavage/polyadenylation signals (i.e. containing DOWN-paSNVs) are more likely to acquire other damaging somatic mutations in the same tumour (Fig. 3E-F). However, it is possible that DOWN-paSNVs can contribute to tumour progression even in the absence of other mutations. Indeed, tumour suppressors containing only DOWN-paSNVs are consistently expressed at lower levels compared to their wild-type counterparts (Fig. 4A-B). Moreover, it is currently thought that partial inactivation of many tumour suppressors can be sufficient to promote tumorigenesis 31, 32.

Using the tumour suppressor gene XPA as an example, we directly show that a cancer-specific single-nucleotide mutation disrupting the PAS hexamer is sufficient to block pre-mRNA cleavage/polyadenylation and dampen the expression of mature mRNA (Fig. 4C-E). These results support our bioinformatics analyses and argue that SNVs targeting polyadenylation signals can have a profound effect on gene expression in cancer. Our data are also consistent with previous reports showing similar gene expression effects of PAS-specific germline SNVs 8, 10.

It is expected that mutation of cleavage/polyadenylation signals should lead to the appearance of abnormal read-through transcripts that may be destabilized by either nuclear or cytoplasmic RNA quality control mechanisms 33. Alternatively, a decrease in cleavage/polyadenylation activity might dampen transcription initiation, as these two processes are known to be interconnected 34. Differentiating between these possibilities will be an important next step in understanding the molecular mechanisms that may link compromised cleavage/polyadenylation and gene expression defects in cancer. Furthermore, although we focused on annotated 3’UTR CSs in this work, similar analyses of SNVs occurring in other noncoding parts of mammalian genes (e.g. introns) might reveal an even wider impact of the loss and gain of PAS-like sequences in cancer.

In conclusion, our study reveals that the genetic inactivation of cleavage and polyadenylation in tumour suppressor genes constitutes a prevalent, yet previously overlooked category of somatic cancer mutations with driver properties. These findings emphasize the importance of pre-mRNA processing in the biology of cancer and underscore the need for improved functional annotation of single nucleotide variants in noncoding regions of the human genome.

Materials and Methods

Source data sets

Pre-mRNA cleavage site (CS) positions and the corresponding metadata were obtained from the PolyA_DB3 database 21 (release 3.2; https://exon.apps.wistar.org/PolyA_DB/v3/). The phase 3 1000 Genomes VCF files were downloaded from the International Genome Sample Resource (https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000_genomes_project/release/). Cancer somatic SNVs and indels from whole-genome sequencing of 2,583 unique tumours (PCAWG) were downloaded from the International Cancer Genome Consortium (ICGC) data portal (https://dcc.icgc.org/) and the database of Genotypes and Phenotypes (dbGaP) (project code: phs000178). Only bona fide SNVs that differed from the reference genome at a single-nucleotide position were included in the analysis. The v97 release of the Cancer Gene Census was downloaded from https://cancer.sanger.ac.uk/cosmic/download.

Data processing

CSs located in 3’UTRs according to PolyA_DB3 were extended by 102 nt on both sides to generate 205-nt intervals. All SNVs from the 1000 Genomes and PCAWG datasets mapping to these intervals were kept for further analyses (paSNVs). FASTA files corresponding to the wild-type and mutant 205-nt intervals were analysed with APARENT2 23. For each variant, we estimated the log odds ratio (LOR) of the mutant (mut) isoform abundance with respect to the wild-type (wt) abundance (abundances were calculated by summing all cleavage probabilities mapping to the 205-nt interval) as follows:
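The expression itself does not appear in this version of the text. A standard log odds ratio consistent with the description above is given below as a reconstruction under stated assumptions, not as the authors' exact formula. Here $p_{\mathrm{wt}}$ and $p_{\mathrm{mut}}$ denote the isoform abundances, i.e. the summed cleavage probabilities over the 205-nt interval for the wild-type and mutant sequences. The index set $I$ and the per-position probabilities $c_i$ are notation introduced for this sketch, and the base of the logarithm is assumed to be natural.

```latex
\[
  p_{\mathrm{wt}} = \sum_{i \in I} c_i^{\mathrm{wt}}, \qquad
  p_{\mathrm{mut}} = \sum_{i \in I} c_i^{\mathrm{mut}}, \qquad
  \mathrm{LOR} = \log\!\left(
    \frac{p_{\mathrm{mut}} / (1 - p_{\mathrm{mut}})}
         {p_{\mathrm{wt}} / (1 - p_{\mathrm{wt}})}
  \right)
\]
```

Under this definition, a negative LOR indicates that the mutation is predicted to weaken cleavage/polyadenylation at the annotated site, and the LOR≤-1 threshold used for DOWN-paSNVs corresponds to a reduction in the odds by a factor of at least $e$ if natural logarithms are used.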

Incidence of PAS hexamers was quantified using the vcountPattern function from the Biostrings R/Bioconductor package (doi:10.18129/B9.bioc.Biostrings). Evolutionary conservation was calculated for either exact SNV positions or 15-nt SNV-centred windows using the GenomicScores 35 and the phastCons100way.UCSC.hg19 36 R/Bioconductor packages. Only unique SNV entries were kept for further analysis. In cases where a single SNV was located near more than one distinct CS, the strongest effect on cleavage/polyadenylation was used for further analyses. GO terms enrichment was analysed using the ClusterProfiler R/Bioconductor package 37.
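The rule for SNVs located near more than one CS can be illustrated with a short Python sketch. It is not the authors' code: the record layout and identifiers are hypothetical, and "strongest effect" is assumed here to mean the largest absolute LOR.

```python
from typing import Dict, Iterable, Tuple

# One record per (SNV, cleavage site) pair: (snv_id, cs_id, lor).
# The identifiers are hypothetical placeholders.
Record = Tuple[str, str, float]


def strongest_effect_per_snv(records: Iterable[Record]) -> Dict[str, Record]:
    """Keep, for every SNV, the record with the largest absolute LOR.

    ASSUMPTION: "strongest effect" is interpreted as max |LOR|.
    """
    best: Dict[str, Record] = {}
    for rec in records:
        snv_id, _cs_id, lor = rec
        current = best.get(snv_id)
        if current is None or abs(lor) > abs(current[2]):
            best[snv_id] = rec
    return best


# Toy example: snv1 lies near two cleavage sites.
toy = [
    ("snv1", "csA", -0.4),
    ("snv1", "csB", -1.6),
    ("snv2", "csC", 0.2),
]
print(strongest_effect_per_snv(toy))
```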

To analyse changes in polyadenylation scores of all mutations affecting AWTAAA sequences in 3’UTRs in Fig. 2B, APARENT2 scores were calculated for all SNV-centred 205-nt intervals from both datasets located within canonical UCSC 3’UTRs. SNVs disrupting an AWTAAA sequence with LOR≤-1 within 100-nt intervals centred around PolyA_DB3 CSs were considered “annotated”.

Cancer Census gene enrichment

Enrichment of different types of SNVs in Cancer Census genes was calculated using two-tailed Fisher’s exact test. Somatic SNVs in protein-coding sequences were classified as “Nonsense”, “Missense”, or “Synonymous” based on the information provided in PCAWG maf files (“Variant_Classification” column). Tumour suppressors were defined as genes labelled as “TSG” but not “Oncogene” in the Census dataset. A similarly stringent approach was used to define oncogenes. Genes annotated as both “Tumour suppressors” and “Oncogenes” were excluded (most analyses), analysed as “Both” (Fig. S3A-B), or combined with tumour suppressors to form the extended “Tumour suppressor+” group (Fig. S5B).
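For readers unfamiliar with the test, a two-tailed Fisher's exact test on a 2×2 table can be written with the Python standard library alone. The sketch below is illustrative and not the code used in the study (the analyses were performed in R). The table layout is an assumption: rows would be genes with DOWN-paSNVs versus genes with BG-paSNVs, and columns tumour suppressors versus all other genes.

```python
from math import comb


def fisher_exact_two_tailed(a: int, b: int, c: int, d: int) -> float:
    """Two-tailed Fisher's exact P-value for the table [[a, b], [c, d]].

    The P-value is the total probability of all tables with the same
    margins that are no more probable than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total_tables = comb(row1 + row2, col1)

    def table_prob(x: int) -> float:
        # Hypergeometric probability of x counts in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    p_observed = table_prob(a)
    x_min, x_max = max(0, col1 - row2), min(row1, col1)
    # Small relative tolerance so that ties with the observed table count.
    p_value = sum(
        p
        for p in (table_prob(x) for x in range(x_min, x_max + 1))
        if p <= p_observed * (1 + 1e-7)
    )
    return min(1.0, p_value)


# Toy counts, for illustration only (not data from the study).
print(fisher_exact_two_tailed(3, 1, 1, 3))
```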

DigDriver enrichment analysis

We used the “Analyzing new mutation sets” mode of DigDriver to process different functional categories of somatic SNVs. Functional annotation was taken from DigPreprocess.py annotMutationFile output files. Enrichment/excess of mutations of Census cancer gene category was calculated as:

To calculate the 95% Confidence Interval (CI) of this enrichment, we performed bootstrap resampling of tumour suppressors, oncogenes and non-cancer genes in each mutation class for 1000 iterations. In each iteration, the enrichment/excess of mutations was calculated as described above. The 2.5th and 97.5th percentiles of the resampled distribution were used as the 95% confidence interval boundaries.
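The percentile bootstrap described above can be sketched as follows. Because the enrichment formula is not reproduced in this text, the sketch assumes that enrichment is the observed-to-expected mutation ratio of a gene category divided by the same ratio for non-Census genes. That definition, the function names, and the per-gene (observed, expected) layout are all assumptions made for illustration.

```python
import random
from typing import List, Sequence, Tuple

Gene = Tuple[float, float]  # (observed mutations, expected mutations)


def obs_exp_ratio(genes: Sequence[Gene]) -> float:
    """Pooled observed-to-expected ratio for a set of genes."""
    return sum(g[0] for g in genes) / sum(g[1] for g in genes)


def enrichment(category: Sequence[Gene], background: Sequence[Gene]) -> float:
    # ASSUMPTION: enrichment = (obs/exp in category) / (obs/exp in background).
    # Requires at least one observed mutation in the background set.
    return obs_exp_ratio(category) / obs_exp_ratio(background)


def percentile(sorted_vals: List[float], q: float) -> float:
    """Linear-interpolation percentile of an already sorted list."""
    pos = (len(sorted_vals) - 1) * q / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(sorted_vals) - 1)
    frac = pos - lower
    return sorted_vals[lower] * (1 - frac) + sorted_vals[upper] * frac


def bootstrap_ci(category: Sequence[Gene], background: Sequence[Gene],
                 n_iter: int = 1000, seed: int = 0) -> Tuple[float, float]:
    """95% percentile bootstrap CI, resampling genes with replacement."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_iter):
        cat = rng.choices(category, k=len(category))
        bg = rng.choices(background, k=len(background))
        stats.append(enrichment(cat, bg))
    stats.sort()
    return percentile(stats, 2.5), percentile(stats, 97.5)


# Toy per-gene counts, for illustration only.
tsg = [(4, 2.0), (6, 3.0), (2, 1.0)]
non_census = [(1, 1.0), (2, 2.0), (3, 3.0)]
print(enrichment(tsg, non_census), bootstrap_ci(tsg, non_census))
```

Genes are resampled with replacement within each category, so the interval reflects gene-to-gene variability rather than sampling noise in individual mutation counts.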

Gene expression analysis

To analyse the possible effect of DOWN-paSNVs on transcript abundance, we selected tumour suppressor genes from the published colorectal adenocarcinoma study 20, which contained DOWN-paSNVs and no other damaging SNVs (i.e. Non-syn. variants from the binarised gene-centric table in 20) in some samples, and no mutations in other samples. We normalized the available gene expression data (FPKM) to account for gene copy number variation and log2-transformed them to obtain Log2(nFPKM) values. Gene-specific Log2(nFPKM) values for the wild-type samples were then subtracted from the corresponding Log2(nFPKM) values for the DOWN-paSNV samples to obtain distributions of gene expression differences (ΔLog2(nFPKM)). A one-tailed Wilcoxon signed-rank test was used to analyse the significance of a negative shift of ΔLog2(nFPKM) distributions compared to 0.
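The ΔLog2(nFPKM) calculation can be illustrated with the following Python sketch. It is a simplified reading of the procedure, not the authors' implementation. The copy-number normalisation is assumed to divide FPKM by the copy number relative to a diploid baseline, differences are assumed to be taken for every mutant/wild-type sample pair of the same gene, and expression values are assumed to be strictly positive.

```python
from math import log2
from statistics import median
from typing import List, Sequence


def log2_nfpkm(fpkm: float, copy_number: float, ploidy: float = 2.0) -> float:
    """Log2 of copy-number-normalised FPKM.

    ASSUMPTION: normalisation divides FPKM by copy_number / ploidy.
    """
    return log2(fpkm / (copy_number / ploidy))


def pairwise_deltas(mutant: Sequence[float],
                    wild_type: Sequence[float]) -> List[float]:
    """All mutant-minus-wild-type differences for one gene.

    ASSUMPTION: every mutant sample is compared with every wild-type sample.
    """
    return [m - w for m in mutant for w in wild_type]


def median_fold_downregulation(deltas: Sequence[float]) -> float:
    """Convert the median ΔLog2 value into a fold downregulation."""
    return 2 ** (-median(deltas))


# Toy values, for illustration only.
mut = [log2_nfpkm(8.0, 4.0)]                          # one DOWN-paSNV sample
wt = [log2_nfpkm(8.0, 2.0), log2_nfpkm(16.0, 2.0)]    # two wild-type samples
deltas = pairwise_deltas(mut, wt)
print(deltas, median_fold_downregulation(deltas))
```

A negative median ΔLog2(nFPKM) corresponds to a fold downregulation above 1; for instance, a median of -1 is a 2-fold decrease.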

DNA constructs

To generate read-through XPA minigenes, 431-nt gBlock fragments (Integrated DNA Technologies) containing the human XPA 3’UTR in its natural context (chr9: 100436867-100437297; GRCh37/hg19) and either the wild-type or mutated PAS were cloned into the pEGFP-N3 plasmid (Clontech) at the BsrGI and NotI sites. To generate luciferase reporter plasmids, the entire XPA 3’UTR (chr9: 100437071-100437680; GRCh37/hg19) was amplified from HCT-116 genomic DNA using KAPA HiFi DNA polymerase HotStart ReadyMix (Roche, cat# KK2601) with MLO4220 (5’-AACGCTAGCAAATAAAGGAAATTTAGATTGGTCCT-3’) and MLO4221 (5’-ATCGGTCGACTCAACAATCAGATAGTCAACCATGA-3’) primers. The PCR product was gel-purified and cloned into the pGL3-control plasmid (Promega) at the XbaI and SalI sites. The cancer-specific PAS mutation was introduced using a modified QuikChange site-directed mutagenesis protocol with the KAPA HiFi DNA polymerase HotStart ReadyMix (Roche, cat# KK2601) and the MLO4159 (5’-GCCCTAATAGCAGAGATAAACATTGAGTTG-3’) and MLO4160 (5’-CAACTCAATGTTTATCTCTGCTATTAGGGC-3’) primers. All constructs were verified by Sanger sequencing. Plasmid maps are available on request.

Minigene experiments

HCT-116 cells (ATCC® CCL-247) were cultured in a humidified incubator at 37°C, 5% CO2, in DMEM containing 4.5 g/L glucose, GlutaMAX and 110 mg/L sodium pyruvate (Thermo Fisher Scientific, cat# 11360070) supplemented with 10% FBS (Hyclone, cat# SV30160.03) and 100 units/ml PenStrep (Thermo Fisher Scientific, cat# 15140122). For passaging, cells were washed with 1 × PBS and dissociated in 0.05% Trypsin-EDTA (Thermo Fisher Scientific, cat# 15400054) for 10 min at 37°C.

For read-through minigene transfection experiments, cells were typically seeded overnight in 1 mL of culture medium at 1–2 × 10^5 cells per well of a 12-well plate. The next morning, 1 µg of plasmid DNA was mixed with 2.5 µl of Jetprime transfection reagent in 150 µl of Jetprime transfection buffer (Polyplus, cat# 101000015), incubated for 10 min at RT and added drop-wise to the cells. Total RNA was extracted from cells 24 hours post-transfection using TRIzol (Thermo Fisher Scientific, cat# 15596026) with an additional acidic phenol–chloroform (1:1) extraction step. The aqueous phase was precipitated with an equal volume of isopropanol, washed with 70% ethanol, and dissolved in 80 µl of nuclease-free water (Thermo Fisher Scientific, cat# AM9939). RNA samples were then treated with 4–6 units of Turbo DNase (Thermo Fisher Scientific, cat# AM2238) at 37 °C for 30 min to remove the bulk of plasmid DNA contamination, extracted with an equal volume of acidic phenol–chloroform (1:1), precipitated with 3 volumes of 100% ethanol and 0.1 volume of 3 M sodium acetate (pH 5.2), washed with 70% ethanol and rehydrated in nuclease-free water. To remove any remaining traces of DNA, the RNA samples were additionally pre-treated with 2 units of RQ1 DNase (Promega, cat# M6101) per 1 µg of RNA at 37 °C for 30 min. RQ1 DNase was inactivated by adding the Stop Solution as recommended and the RNAs were immediately reverse-transcribed using SuperScript IV (Thermo Fisher Scientific, cat# 18090050) and random decamer (N10) primers at 50 °C for 30 min. cDNA samples were analysed by qPCR using a LightCycler® 96 Real-Time PCR System (Roche) and qPCR BIO SyGreen Master Mix (PCR Biosystems, cat# PB20.11-51). The RT-qPCR signals downstream of the XPA cleavage site (MLO944, 5’-GGCCGCGACTCTAGATCATAA-3’ and MLO358, 5’-GTAACCATTATAAGCTGCAATAAACAAG-3’) were normalized to those obtained using upstream primers (MLO775, 5’-AGAACGGCATCAAGGTGAAC-3’ and MLO776, 5’-TGCTCAGGTAGTGGTTGTCG-3’).
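The normalisation of the read-through signal to the upstream signal can be expressed as a simple calculation on quantification cycle (Ct) values. The sketch below is a generic illustration, not the authors' analysis script: it assumes that both amplicons amplify with the same efficiency (a value of 2, i.e. perfect doubling per cycle, by default), and the function names are hypothetical.

```python
def readthrough_ratio(ct_downstream: float, ct_upstream: float,
                      efficiency: float = 2.0) -> float:
    """Read-through signal relative to the upstream (CS-upstream) signal.

    ASSUMPTION: equal amplification efficiency for both primer pairs.
    A higher ratio means more read-through, i.e. weaker cleavage and
    polyadenylation at the tested site.
    """
    return efficiency ** (ct_upstream - ct_downstream)


def relative_readthrough(mutant_ratio: float, wild_type_ratio: float) -> float:
    """Fold change in read-through of the mutant versus the wild type."""
    return mutant_ratio / wild_type_ratio


# Toy Ct values, for illustration only.
wt = readthrough_ratio(ct_downstream=25.0, ct_upstream=22.0)
mut = readthrough_ratio(ct_downstream=24.0, ct_upstream=22.0)
print(wt, mut, relative_readthrough(mut, wt))
```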

For luciferase minigene transfection experiments, cells were typically seeded overnight in 100 µl of culture medium at 5 × 10^3 cells per well of a 96-well plate. The next morning, 70 ng of a firefly luciferase reporter construct containing XPA sequences and 30 ng of the Renilla luciferase control (pRL-TK; Promega) were mixed with 0.2 µl of Jetprime transfection reagent in 10 µl of Jetprime transfection buffer (Polyplus, cat# 101000015), incubated for 10 min at RT and added drop-wise to the cells. Following a 24-hour incubation, transfected cells were analysed using a Dual-Glo® Luciferase Assay System kit (Promega, cat# E2920) as recommended by the manufacturer. Luminescence was measured using a Berthold Mithras LB940 plate reader.

Statistics

Unless stated otherwise, all statistical procedures were performed in R. Data were averaged from at least three experiments and shown as box plots, with box bounds representing the first and the third quartiles and whiskers extending from the first and the third quartile to the lowest and highest data points or, if there are outliers, to 1.5× the interquartile range. Data obtained from RT-qPCR and luciferase assays were compared using the two-tailed Student’s t-test assuming unequal variances. Genome-wide data were analysed using the Wilcoxon rank sum test or Fisher’s exact test (two-tailed if not stated otherwise). Specific tests used and the P-values obtained are indicated in the figures and/or figure legends.
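The t-test assuming unequal variances (Welch's test) used for the RT-qPCR and luciferase data can be summarised by its test statistic and degrees of freedom. The following standard-library Python sketch is illustrative only, since the analyses themselves were run in R. It stops short of the P-value, which requires the t-distribution and is best obtained from a statistics package.

```python
from math import sqrt
from statistics import mean, variance
from typing import Sequence, Tuple


def welch_t(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    # Squared standard errors of the two sample means.
    se2_x = variance(x) / len(x)
    se2_y = variance(y) / len(y)
    t = (mean(x) - mean(y)) / sqrt(se2_x + se2_y)
    df = (se2_x + se2_y) ** 2 / (
        se2_x ** 2 / (len(x) - 1) + se2_y ** 2 / (len(y) - 1)
    )
    return t, df


# Toy triplicate measurements, for illustration only.
print(welch_t([1.0, 2.0, 3.0], [2.0, 3.0, 4.0]))
```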

Figure S1. Distribution of cleavage/polyadenylation signal-disrupting mutations in the normal population (1000 Genomes dataset).

(A) Box plot comparison of normal-population allele frequencies of cleavage/polyadenylation signal-disrupting mutations defined by considering only AWTAAA gain/loss, only APARENT2 score changes, or both (DOWN-paSNVs).

(B) Bar plot comparison of normal-population fractions of singletons for cleavage/polyadenylation-disrupting mutations defined by considering only AWTAAA gain/loss, only APARENT2 score changes, or both (DOWN-paSNVs).

Figure S2. Cancer somatic DOWN-paSNVs often occur in evolutionarily conserved regions. The plot is generated similarly to Fig. 2D except that the conservation was calculated for the exact SNV position.

Figure S3. Cancer somatic DOWN-paSNVs often reside in genes with tumour suppressive functions.

(A) Stacked bar plot showing enrichment of SNVs disrupting polyadenylation signals (DOWN-paSNVs) in tumour suppressors in cancer.

(B) The data in (A) normalized for Census genes only. Note that nonsense mutations show an enrichment in tumour suppressors, but not oncogenes, similar to that of DOWN-paSNVs. Conversely, UP-paSNVs are enriched in oncogenes but not tumour suppressors.

(C) Top 10 GO terms enriched in genes with cancer somatic DOWN-paSNVs. Note the enrichment of apoptosis- and cell death-related functions.

Figure S4. Cancer somatic DOWN-paSNVs are enriched for statistically significant DigDriver events (BH-adjusted P<0.01), suggesting that they may be under positive selection in cancer.

Figure S5. Wild-type tumour suppressor genes tend to have efficient cleavage/polyadenylation signals.

(A) Box plot showing that wild-type tumour suppressors have stronger cleavage/polyadenylation signals than oncogenes and non-Census genes.

(B) All Census genes classifiable as tumour suppressors (“Tumour suppressors+”; see Materials and Methods) have stronger cleavage/polyadenylation signals compared to oncogenes and non-Census genes.

(C) Tumour suppressors associated with “hallmarks of cancer” have stronger cleavage/polyadenylation signals than “non-hallmark” tumour suppressor genes.