Serum RNAs can predict lung cancer up to 10 years prior to diagnosis

  1. Sinan U Umu (corresponding author)
  2. Hilde Langseth
  3. Verena Zuber
  4. Åslaug Helland
  5. Robert Lyle
  6. Trine B Rounge (corresponding author)
  1. Department of Research, Cancer Registry of Norway, Norway
  2. Department of Epidemiology and Biostatistics, Imperial College London, United Kingdom
  3. Department of Oncology, Oslo University Hospital, Norway
  4. Institute for Cancer Research, Oslo University Hospital, Norway
  5. Institute of Clinical Medicine, University of Oslo, Norway
  6. Department of Medical Genetics, Oslo University Hospital and University of Oslo, Norway
  7. Centre for Fertility and Health, Norwegian Institute of Public Health, Norway
  8. Department of Informatics, University of Oslo, Norway

Abstract

Lung cancer (LC) prognosis is closely linked to the stage of disease when diagnosed. We investigated the biomarker potential of serum RNAs for the early detection of LC in smokers at different prediagnostic time intervals and histological subtypes. In total, 1061 samples from 925 individuals were analyzed. RNA sequencing with an average of 18 million reads per sample was performed. We generated machine learning models using normalized serum RNA levels and found that smokers diagnosed with LC within the following 10 years can be robustly separated from healthy controls regardless of histology with an average area under the ROC curve (AUC) of 0.76 (95% CI, 0.68–0.83). Furthermore, the strongest models that took both time to diagnosis and histology into account successfully predicted non-small cell LC (NSCLC) between 6 and 8 years before diagnosis, with an AUC of 0.82 (95% CI, 0.76–0.88), and small-cell LC (SCLC) between 2 and 5 years before diagnosis, with an AUC of 0.89 (95% CI, 0.77–1.0). The most important separators were microRNAs, miscellaneous RNAs, isomiRs, and tRNA-derived fragments. We have shown that LC can be detected years before diagnosis and manifestation of disease symptoms independently of histological subtype. However, the highest AUCs were achieved for specific subtypes and time intervals before diagnosis. The collection of models may therefore also predict the severity of cancer development and its histology. Our study demonstrates that serum RNAs can be promising prediagnostic biomarkers in an LC screening setting, from early detection to risk assessment.

Editor's evaluation

This work has generated valuable data demonstrating the potential utility of serum RNA for lung cancer detection.

https://doi.org/10.7554/eLife.71035.sa0

Introduction

Lung cancer (LC) continues to be the leading cause of cancer-related deaths despite declining smoking prevalence (Bray et al., 2018; Wild et al., 2020). Non-small-cell (NSCLC) and small-cell (SCLC) are the two major subtypes of LC. The symptoms generally occur at a late stage and the prognosis is poor. Stage at diagnosis typically determines patient survival (Aberle et al., 2011; Bach et al., 2012; Brustugun et al., 2018). Screening with low-dose computed tomography (LDCT) can be effective for early detection (Bach et al., 2012; Peled and Ilouze, 2015) and reduce LC mortality up to 20% in high-risk groups (de Koning et al., 2020; Hanash et al., 2018; Seijo et al., 2019). However, LDCT has limitations such as high false-positive rates, risk of overdiagnosis, and high costs (Gopal et al., 2010; Peled and Ilouze, 2015). Annual CT scans also cause harmful radiation exposure (Bach et al., 2012; Hanash et al., 2018). Robust biomarkers can help stratify high-risk groups and increase accuracy in patient inclusion criteria for LDCT-based screening programs (Hanash et al., 2018).

Liquid biopsies quantifying molecular biomarkers in circulation, such as tumor-derived DNAs, proteins, and RNAs, can be used to detect cancer (Hanash et al., 2018; Ko et al., 2018; Sandfeld-Paulsen et al., 2016). MicroRNAs (miRNAs), a class of short RNAs ~21 nucleotides long, have been widely investigated for their biomarker potential (Fehlmann et al., 2020; Keller and Meese, 2016; Pichler and Calin, 2015; Tian et al., 2019). They can be found both in serum (Keller and Meese, 2016; Murillo et al., 2019; Umu et al., 2018) and in plasma (Freedman et al., 2016; Keller and Meese, 2016; Murillo et al., 2019) as cell-free circulating RNAs, which may originate from dying cells or be actively secreted (Zaporozhchenko et al., 2018). Some of them are bound to proteins or enclosed in exosomes, which can protect them from degradation (Fritz et al., 2016). MiRNAs can function as tumor suppressors or oncomiRs and regulate tumor traits such as cell growth, angiogenesis, immune evasion, and metastasis (Pichler and Calin, 2015; Svoronos et al., 2016). The search for RNA biomarkers is not limited to miRNAs. Aberrant expression of other RNA classes, such as protein-coding mRNAs, tRNAs, piwi-interacting RNAs (piRNAs), and long non-coding RNAs (lncRNAs), has been associated with cancer (Kim et al., 2017; Slack and Chinnaiyan, 2019). Despite the immense potential of cell-free RNAs, the promise of non-invasive RNA biomarkers of cancer has not yet been fulfilled.

One explanation for the limited clinical use of circulating RNAs is our limited understanding of the prediagnostic dynamics of cell-free RNAs, since studies are usually based on samples taken at or after diagnosis. Carcinogenesis is a multistep process that turns cell functions from normal to malignant (Hanahan and Weinberg, 2000). It can cause temporal changes in RNA levels linked to cellular processes driven by the hallmarks of cancer (Gutschner and Diederichs, 2012; Hanahan and Weinberg, 2000). We have shown that prediagnostic RNA levels in serum are highly dynamic in LC patients, which may signal early carcinogenesis (Umu et al., 2020). Similar results were observed in breast cancer (Lund et al., 2016) and testicular cancer patients (Burton et al., 2020). A lack of reproducibility among studies is also a problem, caused by technical and biological factors such as storage time, sampling procedure, age, sex, and smoking history (Rounge et al., 2018). It is therefore important to control for these factors.

In the present study, our objective was to identify serum RNA-based biomarkers for early diagnosis of LC using prediagnostic samples. We identified the optimal machine learning (ML) algorithm for RNA biomarker modeling. Optimization of prediction models was done with an ML workflow, including cross-validation and testing, which was repeated five times to increase the generalizability of our results. We also investigated the biological relevance of the best RNA separators in the context of cancer biomarkers.

Results

Patient characteristics and RNA-seq profiles

In this study, we selected 400 patients with prediagnostic serum samples, including multiple samples from the same patients. We also included 525 individuals as controls. After excluding failed or low-input samples, we obtained RNA-seq data from 1061 serum samples. Samples from individuals without any smoking history (i.e. never smokers) or with missing smoking information were excluded from the initial analyses. This resulted in 535 case and 263 control samples from 645 current or former smokers for modeling and testing (Table 1 and Figure 1A). The non-smoker set consisted of 7 case and 256 control samples from 260 individuals (Supplementary file 1). We used non-smokers only as a leave-out set to test our final models and to calculate relative risk (RR).

Figure 1
Consort diagram of the study and our model training and testing workflow.

(A) The sample selection is summarized by the flow chart. Non-smokers were excluded from model building. (B) We randomly created five different training and testing datasets for each group (e.g. standard, histology-specific, or prediagnostic models).

Table 1
Clinical and histological characteristics of samples used in modeling.

| Stage | Early (localized) | Locally advanced (regional) | Advanced (distant) | Unknown | Controls |
|---|---|---|---|---|---|
| Histology | | | | | |
| NSCLC | 84 | 99 | 167 | 11 | - |
| SCLC | 9 | 35 | 76 | 4 | - |
| Others | 10 | 5 | 31 | 4 | - |
| Sex | | | | | |
| Male | 78 | 104 | 178 | 12 | 185 |
| Female | 25 | 35 | 96 | 7 | 78 |
| Age at donation, years, mean (SD) | 54.3 (7.33) | 54.9 (9.08) | 53.5 (8.25) | 51.8 (6.53) | 49.9 (10.9) |
| Age at diagnosis, years, mean (SD) | 59.8 (7.67) | 60.6 (8.89) | 59.4 (8.31) | 58.6 (6.05) | - |
| Prediagnostic sampling time, years, mean (SD) | 5.52 (2.81) | 5.63 (2.79) | 5.91 (2.66) | 6.75 (2.18) | - |
| Total samples | 103 | 139 | 274 | 19 | 263 |
| Individuals | 79 | 102 | 189 | 16 | 263 |
| Total individuals | 645 (smokers*) | | | | |

* See supplementary document for non-smokers (Supplementary file 1).

After filtering out low-count transcripts, 3306 RNAs were selected as candidate features and used in the models: 202 miRNAs, 1137 isomiRs, 89 miscellaneous RNAs (miscRNAs), 380 piRNAs, 119 small nucleolar RNAs, 530 tRFs, 790 mRNAs, and 59 lncRNAs.

ML algorithms can differentiate between prediagnostic cases and controls regardless of prediagnostic time

We first evaluated the classification performance of the ML algorithms in terms of average AUCs on test datasets, created by five random repeats as explained in Materials and methods.

All samples were included in algorithm evaluation regardless of their stage at diagnosis and prediagnostic time which were regarded as full-time standard models (Figure 2—figure supplement 1). The average AUC of all algorithms was 0.67 (95% CI, 0.66–0.69) for all histologies, 0.67 (95% CI, 0.65–0.69) for NSCLC and 0.64 (95% CI, 0.62–0.66) for SCLC on the test datasets. The XGBoost algorithm produced a higher AUC than the average, 0.71 (95% CI, 0.68–0.73). The XGBoost models also performed better when the samples were stratified by histologies: NSCLC, 0.70 (95% CI, 0.65–0.75) and SCLC, 0.71 (95% CI, 0.68–0.74) (Figure 2).
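The evaluation described above, one AUC per random test split averaged over five repeats and reported with a 95% CI, can be sketched as follows. This is only an illustration: the rank-based AUC is the standard definition, but the normal-approximation interval, the simulated scores, and the names `roc_auc`, `mean_auc_with_ci`, and `simulate_repeat` are assumptions and not the procedure from Materials and methods.

```python
import random
import statistics


def roc_auc(labels, scores):
    """AUC as the chance that a random case scores above a random control (ties count 0.5)."""
    cases = [s for y, s in zip(labels, scores) if y == 1]
    controls = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((c > k) + 0.5 * (c == k) for c in cases for k in controls)
    return wins / (len(cases) * len(controls))


def mean_auc_with_ci(aucs, z=1.96):
    """Mean AUC over repeats with a normal-approximation 95% CI (assumed method)."""
    mean = statistics.mean(aucs)
    half_width = z * statistics.stdev(aucs) / len(aucs) ** 0.5
    return mean, (mean - half_width, mean + half_width)


def simulate_repeat(seed, n_cases=50, n_controls=50):
    """Hypothetical stand-in for one train/test repeat: returns test labels and model scores."""
    rng = random.Random(seed)
    labels = [1] * n_cases + [0] * n_controls
    # Simulated scores: cases are shifted upward relative to controls.
    scores = [rng.gauss(0.8 if y == 1 else 0.0, 1.0) for y in labels]
    return labels, scores


# Five random repeats, mirroring the number of repeats described in the text.
aucs = [roc_auc(*simulate_repeat(seed)) for seed in range(5)]
mean_auc, (ci_low, ci_high) = mean_auc_with_ci(aucs)
print(f"mean AUC = {mean_auc:.2f} (95% CI, {ci_low:.2f}-{ci_high:.2f})")
```

In a real analysis, `simulate_repeat` would be replaced by training a classifier on one random training split and scoring the matching test split.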

Figure 2
Each ROC curve is based on the prediction results of a randomly created testing dataset (in total five).

Area under the ROC curve (AUC) values show the average of these predictions. The most important features of the classifiers were sorted on their average feature importance and are shown in the lower panels. A detailed list of biomarkers with their feature importance is available in supplementary (Supplementary file 2). We did not perform any feature selection while training these models (see also Figure 2—source data 1).

Although the models of all algorithms had comparable performances in terms of average AUCs, they differ in terms of total number of non-zero features (i.e. different model complexity). For example, random forest (RF) selected more than 3000 non-zero features while the lasso model selected fewer than 25 features. However, the profiles of the top features, ranked in terms of feature importance, usually consisted of similar RNAs (e.g. miRNAs or tRFs).

Since XGBoost produced the most predictive full-time models, we used it for the remaining analyses. We also investigated the best predictors of the XGBoost models and ranked them based on their importance (Supplementary file 2). The top three features were an isomiR of hsa-miR-486–5p (iso-23-8YUYFYKSY), piR-hsa-28723, and INTS10 for all histologies; Y-RNA, piR-hsa-28723, and GBP3 for NSCLC; and tRF-BS68BFD2, RN7SL724P, and tRF-947673FE5 for SCLC. An in-depth investigation of the features selected by other algorithms also showed common RNAs. For example, Y-RNA and the iso-23-8YUYFYKSY isomiR were among the top predictors of the RF, elastic-net, SGL, and lasso models for NSCLC, as was tRF-BS68BFD2 for SCLC. We also performed KEGG pathway enrichment analysis based on the common miRNA, mRNA, and isomiR features. The results showed that many cancer-related pathways were significantly (p < 0.01) enriched, such as MAPK signaling, mTOR signaling, and AMPK signaling.

We evaluated the classification performance of the XGBoost algorithm by selecting one sample per patient rather than using all samples from the same individuals. Our results showed comparable performance in terms of AUCs for all models (Figure 2—figure supplement 2). The SCLC models performed slightly worse than the others. This discrepancy can be explained by the relatively small sample size of this group. Therefore, we decided to use all samples from the same individuals.

MiscRNA- and miRNA-only models are more accurate than the others

We produced XGBoost models that included only a single RNA class (e.g. miRNA, isomiR, etc.) to further investigate important features/classes. This method showed that miscRNA-only and miRNA-only models achieved better classification performance than the other RNA classes regardless of histology and stage at diagnosis (Table 2). The best separators of these models included hsa-miR-99a-5p, hsa-miR-1908–5p, hsa-miR-3925–5p, and Y-RNA-related transcripts (i.e. RNY1P5 and RNY4P30). When we took histology into account, miRNAs and isomiRs for NSCLC and miscRNAs for SCLC produced better models (Table 2). The most important features of histology-dependent models included hsa-miR-629–5p, hsa-miR-99a-5p, hsa-miR-486–5p isomiR (iso-23-8YUYFYKSY), hsa-miR-151a-3p isomiR (iso-22-B0NKZK1JN) for NSCLC; 7SL RNA-related transcripts and vault-RNA for SCLC (Supplementary file 2).

Table 2
Averages of area under the ROC curves (AUCs), accuracies (acc), sensitivities (sn), and specificities (sp) of the XGBoost algorithm models on test datasets when prediagnostic time was not included. Columns are grouped by the histologies of the model: All (including others), NSCLC, and SCLC.

| Features included | All: AUC (95% CI) | All: Av. # of features* | All: Av. % acc/sn/sp | NSCLC: AUC (95% CI) | NSCLC: Av. # of features | NSCLC: Av. % acc/sn/sp | SCLC: AUC (95% CI) | SCLC: Av. # of features | SCLC: Av. % acc/sn/sp |
|---|---|---|---|---|---|---|---|---|---|
| All RNAs | 0.71 (0.68–0.73) | 301 | 69/73/62 | 0.70 (0.65–0.75) | 373 | 67/70/64 | 0.71 (0.68–0.74) | 213 | 70/69/71 |
| Lasso-selected features | 0.78 (0.74–0.82) | 149 | 73/75/71 | 0.78 (0.75–0.82) | 56 | 73/73/72 | 0.74 (0.69–0.80) | 58 | 72/61/83 |
| Univariate significant features | 0.70 (0.66–0.73) | 76 | 67/75/58 | 0.69 (0.64–0.73) | 51 | 67/71/64 | 0.70 (0.65–0.76) | 11 | 68/69/68 |
| miRNA only | 0.72 (0.68–0.76) | 168 | 69/76/61 | 0.73 (0.70–0.75) | 199 | 69/74/64 | 0.65 (0.62–0.69) | 20 | 67/74/60 |
| isomiR only | 0.70 (0.65–0.74) | 204 | 67/68/67 | 0.73 (0.69–0.77) | 215 | 71/75/66 | 0.65 (0.60–0.70) | 108 | 66/65/67 |
| tRF only | 0.69 (0.65–0.73) | 314 | 65/77/53 | 0.67 (0.65–0.69) | 314 | 66/64/67 | 0.68 (0.65–0.71) | 23 | 66/69/63 |
| MiscRNA only | 0.72 (0.69–0.74) | 83 | 69/73/65 | 0.68 (0.63–0.74) | 87 | 66/73/59 | 0.69 (0.64–0.75) | 76 | 70/78/61 |

* Average number of non-zero features selected by the models. Note: Detailed information on all selected features is in Supplementary file 2.
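The acc/sn/sp columns of Table 2 follow from a confusion matrix of predicted versus true labels. The sketch below shows the arithmetic on made-up data; the 0.5 probability cutoff and the function name `classification_metrics` are assumptions, since the cutoff used for the reported values is not stated in this section.

```python
def classification_metrics(y_true, y_prob, threshold=0.5):
    """Accuracy, sensitivity, and specificity at a probability cutoff (0.5 is an assumption)."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # cases called positive
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))  # controls called negative
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # controls called positive
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # cases called negative
    return {
        "acc": (tp + tn) / len(y_true),
        "sn": tp / (tp + fn),
        "sp": tn / (tn + fp),
    }


# Toy example: 4 cases (label 1) and 4 controls (label 0) with made-up probabilities.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_prob = [0.9, 0.8, 0.6, 0.3, 0.2, 0.4, 0.7, 0.55]
metrics = classification_metrics(y_true, y_prob)
print(metrics)
```

In this toy example three of four cases and two of four controls are classified correctly, so sensitivity is 75%, specificity is 50%, and accuracy is 62.5%.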

Feature selection improves model performance and reduces model complexity

Single RNA class models also implied that feature selection can further improve model performances. Thus, we tested two feature selection methods. The results showed that lasso feature selection improved AUC values and reduced complexity (Table 2). The most important features of lasso-selected models included hsa-miR-423–5p isomiR (iso-20-5KP25HFF), GBP3, and piR-hsa-28723 for all histologies; Y-RNA, hsa-miR-423–5p isomiR (iso-20-5KP25HFF), and LINC01362 for NSCLC; HIST1H4E, PTCH2, and tRF-R29P4P9L5HJVE for SCLC (Supplementary file 2). Moreover, univariate significant feature selection greatly reduced model complexity with an acceptable performance (Table 2). For example, SCLC models only included 11 RNAs. The most important features were GBP3, LINC01362, and hsa-miR-30a-5p for all histologies; LINC01362, GBP3, and tRF-9MV47P596V for NSCLC; piR-hsa-7001 and tRF-7343R × 6NMH3 for SCLC (Supplementary file 2).

Histology-specific prediagnostic models can improve prediction performance

We previously demonstrated that RNA levels are dynamic and histology-specific in prediagnostic samples (Umu et al., 2020). We therefore trained and tested models stratified by prediagnostic time which were selected by a sliding window approach as explained in Materials and methods.

The results showed that inclusion of prediagnostic time and histological subtype together creates better models for specific time intervals (Figure 3). For example, SCLC models restricted to samples from 2 to 5 years prior to diagnosis had an average AUC of 0.84 (95% CI, 0.77–0.9) (Figure 3). Another model of SCLC samples that only utilized miRNAs restricted to 3–5 years prior to diagnosis had an average AUC of 0.85 (95% CI, 0.76–0.93) on the test datasets. Both SCLC models selected the same miRNAs as their most important features such as hsa-miR-30a-5p, hsa-miR-339–3p, hsa-miR-215–5p. Besides miRNAs, an isomiR of hsa-miR-451a and RN7SL181P were the most important features of prediagnostic SCLC models. Enrichment analysis of the most important features identified signaling pathways, such as MAPK, PI3K-Akt, RAS, and other pathways like choline metabolism, cellular senescence, and PD-L1 expression and PD-1 checkpoint. Similarly, NSCLC models restricted to 6–8 years prior to diagnosis had an average AUC of 0.81 (95% CI, 0.75–0.86). The most important RNAs of this period were tRF-YP9L0N4V3, an isomiR of hsa-miR-484 (iso-23-8K4P8R8SDE) and tRF-9MV47P596V. More than 70 pathways were enriched such as endocytosis, MAPK, RAS, choline metabolism, and neurotrophin signaling pathway.

Figure 3
Sliding-window analysis showed that models restricted to prediagnostic samples from specific time intervals performed better, such as the small-cell lung cancer (SCLC) models restricted to samples from 2 to 5 years prior to diagnosis (see the first and the second panel, red dots).

Each color represents a different set of histologies: black and red include only non-small cell lung cancer (NSCLC) and SCLC samples, respectively, while blue includes all histologies including others (Figure 3—source data 1).

As an alternative to sliding windows, we also performed a fixed window approach and trained models using samples from up to 2 years (0–2), up to 5 years (0–5), and up to 8 years (0–8) before diagnosis. The results showed slight improvement in model performance compared to full-time models (see Supplementary file 3). However, sliding windows models performed better on specific time intervals.
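The difference between the two windowing schemes can be sketched as follows. The fixed windows (0–2, 0–5, 0–8 years) are taken from the text; the sliding-window widths, the one-year step, the inclusive bounds, and the names `sliding_windows`, `fixed_windows`, and `select_samples` are assumptions for illustration, since the exact scheme is given in Materials and methods.

```python
def sliding_windows(max_years=10, widths=(2, 3), step=1):
    """Enumerate (start, end) windows in years before diagnosis (widths and step assumed)."""
    windows = []
    for width in widths:
        for start in range(0, max_years - width + 1, step):
            windows.append((start, start + width))
    return windows


def fixed_windows(ends=(2, 5, 8)):
    """Fixed windows that all start at diagnosis: 0-2, 0-5, and 0-8 years."""
    return [(0, end) for end in ends]


def select_samples(samples, window):
    """Keep all controls plus the cases sampled inside the window (bounds inclusive)."""
    start, end = window
    return [
        s for s in samples
        if s["years_to_diagnosis"] is None  # controls have no diagnosis date
        or start <= s["years_to_diagnosis"] <= end
    ]


# Made-up samples: three cases and one control.
samples = [
    {"id": "case_a", "years_to_diagnosis": 1.5},
    {"id": "case_b", "years_to_diagnosis": 4.0},
    {"id": "case_c", "years_to_diagnosis": 7.2},
    {"id": "control_a", "years_to_diagnosis": None},
]
for window in [(2, 5), (6, 8)]:
    ids = [s["id"] for s in select_samples(samples, window)]
    print(window, ids)
```

A separate model would then be trained and tested on the samples selected for each window, which is what allows performance to be compared across time intervals.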

Frequent features can create simple and accurate models

We created models by compiling the best features from the full-time models. Our results showed improved prediction performance for these models despite inclusion of leave-out datasets into the test set (see Materials and methods). In the test datasets including only smokers, AUC for all histologies was 0.76 (95% CI, 0.68–0.83); NSCLC model was 0.78 (95% CI, 0.70–0.85); SCLC model was 0.88 (95% CI, 0.83–0.94) (Table 3). However, when non-smokers were also included in the test set, the model performance dropped to 0.68 (0.63–0.72) for all histologies and 0.68 (0.63–0.73) for NSCLC. Remarkably, the SCLC model still had AUC of 0.84 (0.79–0.9) when including non-smokers.
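One plausible way to compile a shortlist of frequent features is to count how often each RNA appears among the top-ranked features of the repeated models. The sketch below illustrates that idea only: the `top_k` and `min_models` thresholds, the importance values, and the function name `frequent_features` are hypothetical, and the actual selection criteria are those given in Materials and methods.

```python
from collections import Counter


def frequent_features(importances_per_model, top_k=3, min_models=2):
    """Return features ranked in the top_k of at least min_models models (assumed criteria)."""
    counts = Counter()
    for importances in importances_per_model:
        ranked = sorted(importances, key=importances.get, reverse=True)
        counts.update(ranked[:top_k])
    return sorted(name for name, n in counts.items() if n >= min_models)


# Made-up feature importances for three repeated models (values are not from the study).
models = [
    {"GBP3": 0.30, "LINC01362": 0.25, "INTS10": 0.20, "TLN1": 0.05},
    {"GBP3": 0.28, "piR-hsa-28723": 0.22, "LINC01362": 0.21, "INTS10": 0.10},
    {"piR-hsa-28723": 0.35, "GBP3": 0.20, "DST": 0.15, "LINC01362": 0.02},
]
shortlist = frequent_features(models)
print(shortlist)
```

A final model retrained on such a shortlist has far fewer inputs than the full-time models, which is the simplification this section describes.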

Table 3
All selected features, performance, and relative risk (RR) of XGBoost models.

| Models | All* | NSCLC | SCLC |
|---|---|---|---|
| Features | iso-20-5KP25HFF, GBP3, hsa-miR-30a-5p, INTS10, LINC01362, piR-hsa-28723, RNU1-8P, iso-23-BQ8DQWM4Z, CTD-3252C9.4, DST, HBA2, HIST2H2AC, hsa-miR-99b-3p, LATS1, piR-hsa-28391, piR-hsa-28394, RN7SL181P, RN7SL8P, RNU2-27P, iso-23-8YUYFYKSY, TLN1, tRF-V47P59D9, tRF-86V8WPMN1EJ3, tRF-6SXMSL73VL4Y, tRF-QKF1R3WE8RO8IS | LINC01362, Y-RNA, iso-23-B0NKZ01J0D, iso-22-MKJIJLJ2Q, iso-21-N2NBQRZ00, GBP3, iso-20-RNUW92OI, GNAS, hsa-miR-30a-3p, NHSL2, piR-hsa-28488, RC3H2, RN7SL181P, RNU2-19P, RNY4P27, iso-23–909 U247N04, tRF-I89NJ4S2, tRF-9MV47P596VE, tRF-86J8WPMN1EJ3, tRF-86V8WPMN1EJ3, tRF-Q1Q89P9L8422E | AC113404.1, C6orf223, HIST1H4E, hsa-miR-30a-5p, hsa-miR-574–5p, ODC1, PTCH2, PTMA, RN7SL181P, tRF-22-947673FE5, AKAP9, MIGA1, RAP1B, RN7SL724P, RUFY2, iso-23-X3749W540L, tRF-BS68BFD2, tRF-R29P4P9L5HJVE, tRF-ZRS3S3R × 8HYVD |
| Total features | 25 | 21 | 19 |
| Total test samples (total leave-out size) (non-smokers) | 640 (535) (263) | 465 (360) (262) | 444 (395) (256) |
| AUC on test (95% CI) (only smokers**) | 0.76 (0.68–0.83) | 0.78 (0.70–0.85) | 0.88 (0.83–0.94) |
| AUC on test (95% CI) (both smokers and non-smokers**) | 0.68 (0.63–0.72) | 0.68 (0.63–0.73) | 0.84 (0.79–0.9) |
| RR on test (95% CI) (only smokers**) | 2.37 (1.54–3.7), p = 1.15 × 10⁻⁷ | 2.36 (1.52–3.66), p = 2.83 × 10⁻⁶ | 2.48 (2.06–3), p = 3.32 × 10⁻⁹ |
| RR on test (95% CI) (both smokers and non-smokers**) | 1.84 (1.7–2.01), p = 1.25 × 10⁻⁶ | 1.52 (1.27–1.83), p = 2.67 × 10⁻⁵ | 2.04 (1.85–2.25), p = 8.8 × 10⁻⁸ |

* Including other histologies.
** Includes samples previously not used (leave-out samples).

The RRs and their associated p-values on the test dataset, with and without non-smokers, are reported in Table 3. A positive test in smokers suggests a more than twofold higher risk of receiving an LC diagnosis in the future.
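For readers unfamiliar with the measure, a relative risk compares the proportion of later-diagnosed individuals among test-positives with the proportion among test-negatives. The sketch below uses the common log-scale (Katz) confidence interval on made-up counts; the interval method, the counts, and the function name `relative_risk` are assumptions, as this section does not state how the reported RRs and p-values were computed.

```python
import math


def relative_risk(pos_cases, pos_controls, neg_cases, neg_controls, z=1.96):
    """RR of being a case given a positive vs a negative test, with a log-scale 95% CI."""
    n_pos = pos_cases + pos_controls
    n_neg = neg_cases + neg_controls
    rr = (pos_cases / n_pos) / (neg_cases / n_neg)
    # Standard error of log(RR) for a 2x2 table (Katz method, assumed here).
    se_log = math.sqrt(1 / pos_cases - 1 / n_pos + 1 / neg_cases - 1 / n_neg)
    low = math.exp(math.log(rr) - z * se_log)
    high = math.exp(math.log(rr) + z * se_log)
    return rr, (low, high)


# Made-up counts: 30 of 50 test-positives and 10 of 50 test-negatives are later cases.
rr, (ci_low, ci_high) = relative_risk(30, 20, 10, 40)
print(f"RR = {rr:.2f} (95% CI, {ci_low:.2f}-{ci_high:.2f})")
```

An interval whose lower bound stays above 1, as in Table 3, indicates that a positive test is associated with elevated risk.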

We also investigated the prediagnostic models, using the ML workflow, and selected two pairs of models for NSCLC and SCLC, which showed high performance before and after 5 years prior to diagnosis (see Supplementary file 4). We found that NSCLC models restricted to 0–2 and 6–8 years before diagnosis had average AUCs of 0.89 (95% CI, 0.84–0.96) and 0.82 (95% CI, 0.76–0.88), respectively; SCLC models restricted to 2–5 and 8–10 years before diagnosis had average AUCs of 0.89 (95% CI, 0.77–1.0) and 0.83 (95% CI, 0.69–0.97), respectively. We reported other model metrics and the best features in the supplementary document (Supplementary file 4).

Discussion

In this study, we showed that ML models of prediagnostic serum RNA levels can be used to predict LC years before diagnosis and manifestation of disease symptoms. Our models achieved clinically relevant performance in terms of AUC, accuracy, sensitivity, and specificity (Tables 2 and 3). The model performance was further increased for specific prediagnostic time windows and histologies making it feasible to develop them as biomarkers for LC screening (Figure 3). A collection of the best models (and predictors) (Table 3 and Supplementary file 4) can predict risk for developing LC, which histologies to look for and indicate the level of cancer progression. The time window of the high-performance models may be a first indication of how often to screen for LC (Figure 4). Our study is unique in including serum samples collected up to 10 years prior to LC diagnosis and a large set of control samples.

Figure 4
Suggested clinical uses of RNA biomarkers in lung cancer (LC) screening.

A positive test from full-time models shows elevated risk (at least two times). They can detect cancer-related RNA signals up to 10 years before diagnosis. Prediagnostic models have higher accuracy, sensitivity, and specificity which can potentially assist full-time models and improve specificity (Supplementary file 4).

Figure 4—source data 1

https://cdn.elifesciences.org/articles/71035/elife-71035-fig4-data1-v2.docx

We previously reported that prediagnostic circulating RNA signals are highly dynamic in LC patients and they can be histology and stage dependent (Umu et al., 2020). In the present study, ML models using all samples regardless of stage, histology, or prediagnostic time successfully separated LC patients from controls. All the tested algorithms consistently produced acceptable AUC values (Figure 2A). The best algorithm, XGBoost, resulted in an average accuracy of 69% without feature selection. An analysis of the features showed a large panel of selected RNAs: more than 300 out of available 3306 (with no feature selection implemented). This may be interpreted as a general shift in the levels of RNAs during cancer development, consistent with our previous study that showed hundreds of RNAs were differentially expressed up to 10 years before diagnosis (Umu et al., 2020).

We found that some features were considerably more important (and frequent) separators than others with or without feature selection. The list includes piRNAs (e.g. piR-hsa-28723), miRNAs (e.g. hsa-miR-574–5p, hsa-miR-30a-5p, hsa-miR-106b-5p), isomiRs (e.g. isomiR of hsa-miR-423–5p (iso-20-5KP25HFF), hsa-miR-486–5p (iso-23-8YUYFYKSY)), and miscRNAs (e.g. RN7SL181P). Some of them were particularly interesting since they were associated with cancer or proposed as cancer biomarkers. Hsa-miR-30a-5p is a tumor suppressor and downregulated in LC tissues (Yanaihara et al., 2006). It regulates oncogenes such as RAB38 and RAB27B (The RNAcentral Consortium, 2019). Another notable example is hsa-miR-574–5p which promotes metastasis in NSCLC by targeting PCP2 in tumor tissues (Zhou et al., 2016) and has been proposed as an early stage NSCLC serum biomarker (Foss et al., 2011). Hsa-miR-574–5p was among the most important features in lasso-selected and miRNA-only histology-specific SCLC models. It was also one of the most important features in our histology-specific NSCLC models. There were also isomiRs among the most important features such as hsa-miR-486–5p canonical form, which was the best separator for all histologies. Hsa-miR-486–5p targets PIK3R1 to suppress cell growth. Its overexpression inhibits cell proliferation and invasion and it was significantly downregulated in both tissue and serum (Tian et al., 2019). Hsa-miR-486–5p was proposed as a diagnostic and prognostic biomarker for NSCLC (ElKhouly et al., 2020; Tian et al., 2019).

Besides miRNAs and isomiRs, RNAs of other classes were noteworthy and linked to carcinogenesis. For example, 7SL, a member of miscRNAs, is upregulated in tumor cells. It binds to TP53 mRNA at the 3'UTR region and downregulates its expression (Abdelmohsen et al., 2014). 7SL-related transcripts (e.g. RN7SL181P) were among the most important separators in the all-histology, NSCLC-specific, and SCLC-specific models. Another example is Y-RNA: we found that Y-RNA and related genes (e.g. RNY4P30) were among the most important features for NSCLC models. Y-RNA was also chosen as an important feature by the lasso-selected NSCLC models. Y-RNA-derived small RNAs function as tumor suppressors in NSCLC. They inhibit cell proliferation and were proposed as circulating RNA biomarkers since they were upregulated in NSCLC EVs (Li et al., 2018).

Inclusion of both prediagnostic time and histology produced better models in certain time windows (e.g. 2–5 years before diagnosis for SCLC) (Figure 3). This can be explained by the dynamic nature of prediagnostic RNA levels (Lund et al., 2016; Umu et al., 2020). Important features of these models can also be linked to early carcinogenesis and some were specific to these models. For example, hsa-miR-339–3p was among the most important features of SCLC prediagnostic models. Hsa-miR-339–3p is a tumor suppressor and was proposed as a serum biomarker of LC (Yu et al., 2019). We retrained some of these prediagnostic models using the most frequent features and achieved higher prediction performance than the full-time models in specific time intervals. We reported these models in supplementary (Supplementary file 3).

The most important features of histology-specific models also showed associations with carcinogenesis-related KEGG pathways, which were common or specific to histology. The common ones include well-known signaling (e.g. RAS, PI3K-Akt, MAPK, ErbB) and cancer-related pathways (e.g. proteoglycans in cancer and pathways in cancer). Choline metabolism in cancer pathway was one of the common ones and enriched in some histology-specific prediagnostic models. Altered choline profiles are characteristics of tumor tissues (Glunde et al., 2006). Moreover, a lipidome serum profiling study on early stage NSCLC patients proposed choline-containing phospholipids as potential LC biomarkers (Klupczynska et al., 2019). Enrichment of choline metabolism pathway years before diagnosis (i.e. NSCLC 6–8 and SCLC 2–5) supports this conclusion. We also reported enrichment of this pathway for all histologies before diagnosis in our previous study (Umu et al., 2020).

A strength of our study is the large sample size from prediagnostic cases and a large control group of cancer-free individuals from the same cohort. We have detailed information on histological subtype and stage at diagnosis from the Cancer Registry of Norway (CRN) and smoking history from survey data. We also accounted for other potential confounders (i.e. age, sex, and blood donor group [BDg]) (Rounge et al., 2018). Some of our potential biomarkers (e.g. hsa-miR-30a-5p, hsa-miR-339–3p, 7SL) were already associated with carcinogenesis or proposed as biomarkers, which shows that our results are consistent with the current literature. Further, we found potential biomarkers from overlooked RNA classes, which adds important new knowledge to the field. We shared the average feature importance of all RNAs as supplementary tables (Supplementary file 2). We investigated the performance of different algorithms, which showed consistent results in terms of AUCs and features. We compiled shortlists from the most important features and tested their performance in a leave-out dataset on both smokers and non-smokers. We also found that smokers with a positive test had a more than twofold higher risk of receiving an LC diagnosis in the future (Table 3).

There are some weaknesses in our study that we need to address. First, an independent cohort should replicate our results. However, only a few cohorts include prediagnostic samples that can be used for discovery and validation. We tried to overcome this issue by using training-testing repeats for assessing generalizability. We also reported our results with and without feature selection since some feature selection methods (e.g. lasso and univariate) can cause overfitting. Second, using more than one sample from the same individual can potentially cause overfitting. However, we did not detect any effect related to this issue (Figure 2—figure supplement 2). Third, our study focused only on smokers (since case samples are mostly from smokers). However, our results show acceptable performance when non-smokers are included in the test dataset as well. Fourth, reuse of the same data for frequent biomarker models (as reported in Table 3) can also result in overfitting. We tried to overcome this issue by adding a leave-out dataset (which was never used in training) to the test set and reporting performance on it. Lastly, since our samples have been stored long term, some unstable RNA molecules may have degraded over the years, though we have already shown that this effect is negligible (Umu et al., 2018). In addition, we matched cases and controls for BDg, which includes the effect of storage time (see Materials and methods).

In LC screening programs, RNA biomarkers can be used as a tool for initial assessment or combined with LDCT for early detection (Hanash et al., 2018). We found that smokers with a positive test had a higher risk of receiving an LC diagnosis in the future (Table 3). We also found that our biomarkers can potentially be used on non-smokers, especially the SCLC biomarkers. However, we do not have enough non-smoker cases to further validate this interpretation. The dynamic nature of the prediagnostic signal for cancer may pose challenges for the performance of modeling and biomarker development. However, using a set of models specific for histology and time might provide additional information useful in evaluating LC risk (Figure 4). Our proposed use of RNA biomarkers starts with risk assessment using standard full-time models, which can be used for an initial assessment in smokers when the disease is undetectable. A positive signal (i.e. high probability of being in the LC group) classifies those individuals into an elevated risk group. Since prediagnostic models have a 2-year peak performance, testing every second year with these models can provide confirmation of preneoplasia or an early stage tumor for individuals with elevated risk and selection criteria for CT monitoring. Prediagnostic models had higher overall specificity (more than 80%), which can help to determine the histology of a future diagnosis. However, this requires further research. We selected two sets of histology-specific diagnostic models for early/late NSCLC and SCLC diagnosis and reported these in the supplementary document (Supplementary file 4). RNA biomarkers can prevent unnecessary use of LDCT while improving the chance of diagnosing LC at an early stage. This hypothesis can be investigated in screening programs for validation.

Conclusion

We have shown that LC can be detected in both smokers and non-smokers years before diagnosis and the manifestation of symptoms, regardless of histological subtype. We also proposed a model of how RNA biomarkers can be utilized in clinical settings. Our top-performing models can produce AUCs of up to 0.9 before diagnosis, suggesting great potential for early prediction of LC.

Materials and methods

Study population and data sources

We used the population-based Janus Serum Bank (JSB) cohort, which contains prediagnostic serum samples (Langseth et al., 2017). The study participants were identified by linking the JSB to the CRN. We restricted our analyses to patients diagnosed with LC up to 10 years after blood donation and to control samples from individuals who remained cancer-free (except for non-melanoma skin cancer) for at least 10 years after sample collection. We matched cases and controls on confounders (see Bioinformatic analyses). Smoking status, collected from health survey data, was classified as current, former, or never smoker (Hjerkind et al., 2017). Since we have previously shown that smoking significantly affects RNA expression levels, we only included smokers (i.e. current and former) in the initial analyses and model building. However, non-smokers and samples not included by the frequency matching were used as an additional independent leave-out dataset to assess the level of overfitting.

Tumor staging

Detailed cancer information was obtained from the CRN, which has systematically collected mandatory notifications of cancer occurrence for the Norwegian population since 1952 (Larsen et al., 2009). The cases were classified into histological subtypes: NSCLC, SCLC, and others, the latter referring to less well-defined or multiple histologies. Stage at diagnosis was encoded with the TNM system: early (localized – stage I), locally advanced (regional – stages II and III), advanced or metastatic (distant – stage IV), and unknown (Cancer Registry of Norway, 2020).

Laboratory processing

We extracted RNA from 400 µL of serum using phenol-chloroform and the miRNeasy Serum/Plasma kit (Qiagen, Valencia, CA). We performed size selection using a 3% Agarose Gel Cassette (Cat. No CSD3010) on a Pippin Prep (Sage Science) with a cut size optimized to cover RNA molecules from 17 to 47 nt in length. Libraries were prepared with the NEBNext Small RNA kit (NEB, Ipswich, MA) and sequenced on a HiSeq 2500 platform (Illumina, San Diego, CA) to an average of 18 million sequences per sample.

Bioinformatic analyses

Our bioinformatics workflow includes quality control, adapter trimming, read mapping, read counting, and creation of count tables. We used a large annotation dataset containing several RNA classes available in serum (Umu et al., 2018), including miRBase (v22.1) for miRNAs (Kozomara et al., 2019), piRBase (v1.0) for piRNAs (Zhang et al., 2014), and GENCODE (v26) for other RNA classes (Harrow et al., 2012). We used the AdapterRemoval tool for adapter trimming (Schubert et al., 2016) and Bowtie2 (Langmead and Salzberg, 2012) for mapping reads to the human genome (hg38), with an average mapping ratio of 70%. The SeqBuster tool was used for miRNA annotation counts and isomiR calling (Pantano et al., 2010). We kept only RNAs with more than five mapped reads in at least 80% of the samples. All isomiRs that passed this expression threshold were regarded as bona fide isomiRs. We used the variance stabilizing normalization function of DESeq2 (Love et al., 2014) to normalize the identified RNA counts. We used the optmatch (v0.9–11) R package (Hansen and Klopfer, 2006) to select appropriately matched controls while building models, matching LC samples and controls on sex, age at donation, and BDg. BDg is a technical confounder combining the effect of sample treatment at donation and storage time (Rounge et al., 2018). We used the R function kegga from the limma package for KEGG pathway enrichment analysis of selected RNA features if they were miRNAs, isomiRs, or mRNAs. The miRNA and isomiR targets were extracted from miRDB (v5.0) predictions (Wong and Wang, 2015) (score cutoff >60). p-Values were adjusted using the false discovery rate (FDR) (using the p.adjust function of R).
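
The expression filter in this section can be illustrated with a minimal Python sketch. It assumes the threshold means that an RNA is kept when it has more than five reads in at least 80% of the samples, which is how the threshold is worded in the author response. The function names, the dictionary layout, and the toy counts are hypothetical and are not taken from the published workflow, which relied on R packages.

```python
def passes_expression_threshold(counts, min_reads=5, min_fraction=0.8):
    """Return True if an RNA has more than `min_reads` reads in at least
    `min_fraction` of the samples (assumed reading of the threshold)."""
    expressed = sum(1 for c in counts if c > min_reads)
    return expressed >= min_fraction * len(counts)


def filter_count_table(count_table, min_reads=5, min_fraction=0.8):
    """Keep only the RNAs that pass the expression threshold.

    `count_table` maps an RNA identifier to its list of per-sample raw counts.
    """
    return {
        rna: counts
        for rna, counts in count_table.items()
        if passes_expression_threshold(counts, min_reads, min_fraction)
    }


# Toy example with five samples; identifiers and counts are made up.
toy_counts = {
    "rna_a": [12, 30, 8, 6, 25],  # more than 5 reads in 5 of 5 samples
    "rna_b": [0, 7, 2, 1, 9],     # more than 5 reads in 2 of 5 samples
    "rna_c": [6, 6, 6, 6, 5],     # more than 5 reads in 4 of 5 samples (80%)
}
kept = filter_count_table(toy_counts)
print(sorted(kept))
```

In this toy table, `rna_b` is dropped because it is expressed above the read cutoff in only two of the five samples.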

ML classification algorithms and training/testing workflow

High dimensionality is often a problem in modeling RNA-seq data. Our preliminary analysis showed that ML algorithms with regularization produced successful models. Therefore, we selected five ML algorithms to create our initial models: lasso, elastic-net, sparse group lasso (SGL), RF, and extreme gradient boosting (XGBoost) algorithms. We used fivefold cross-validation (if available) to tune hyperparameters for model training. For the SGL models, RNAs were classified by type.

R implementations of these algorithms were used: the caret (v6.0–84) and glmnet (v2.0–18) packages for elastic-net and the lasso, sglfast (v0.10) and msgl (v2.3.9) for the SGL models, and xgboost (v1.0.0.2) for XGBoost. Classifications were performed according to histology and time to diagnosis (for details see Histology and prediagnostic models) using an automated ML workflow. In the ML workflow, the datasets were split into training (70%) and test (30%) sets (Figure 1B). We repeated this step five times using designated seed numbers to select five different training and test datasets, which were balanced for case/control numbers and also matched for confounders (i.e. sex, age, and BDg). Model optimization, including hyperparameter tuning, was done by grid search with fivefold cross-validation on the training sets. The test datasets were used only for testing, to guard against overfitting and to assess true performance. The performance of the classifiers was mainly evaluated by the area under the ROC curve (AUC). We also calculated accuracy, sensitivity, and specificity. Confidence intervals were calculated from the metrics on the test datasets.
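
The repeated training/test splitting and the AUC evaluation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the seed values 1 to 5 are placeholders for the designated seeds, the split balances only case/control labels and does not reproduce the matching on sex, age, and BDg, and the AUC is computed with the rank-based (Mann-Whitney) identity rather than with any of the R packages named above.

```python
import random


def stratified_split(labels, train_fraction=0.7, seed=0):
    """Split sample indices into training and test sets, class by class,
    so that both sets keep the case/control balance."""
    rng = random.Random(seed)
    train, test = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        cut = round(train_fraction * len(idx))
        train.extend(idx[:cut])
        test.extend(idx[cut:])
    return sorted(train), sorted(test)


def auc(labels, scores):
    """Area under the ROC curve: the fraction of case/control pairs in which
    the case receives the higher score (ties count as one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Ten hypothetical cases (1) and ten hypothetical controls (0).
labels = [1] * 10 + [0] * 10

# Five repeats with placeholder seeds; each repeat gives a 70/30 split.
splits = [stratified_split(labels, seed=s) for s in (1, 2, 3, 4, 5)]
for train_idx, test_idx in splits:
    print(len(train_idx), len(test_idx))

# AUC on made-up scores: three of the four case/control pairs are ordered correctly.
toy_auc = auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
print(toy_auc)
```

Fixing the seed per repeat makes each split reproducible, so a model can be refit later on exactly the same training and test samples.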

Histology and prediagnostic models

We refer to models for all histologies, NSCLC, and SCLC that do not take time to diagnosis into account as standard full-time models (Figure 1B). Prediagnostic models were created using a sliding-window approach and a fixed-time approach to find optimal time-to-diagnosis intervals. We first selected three different window sizes (2, 3, and 4 years), which were moved over the 10 years prior to diagnosis. We then built models based on the samples captured by these sliding windows. Fixed-time windows were 0–2, 0–5, and 0–8 years before diagnosis. We used the workflow described above to train and test both standard and prediagnostic models.
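
The window definitions can be made concrete with a short Python sketch. The window widths and the fixed-time intervals come from the text; the 1-year step between consecutive sliding windows and the inclusive treatment of window boundaries are assumptions made for illustration, and the function names are hypothetical.

```python
def sliding_windows(width, horizon=10, step=1):
    """Return (start, end) intervals, in years before diagnosis, obtained by
    moving a window of fixed width over the prediagnostic horizon."""
    return [(s, s + width) for s in range(0, horizon - width + 1, step)]


def samples_in_window(years_to_diagnosis, window):
    """Indices of case samples whose time to diagnosis falls in the window
    (both boundaries are treated as inclusive here)."""
    start, end = window
    return [i for i, t in enumerate(years_to_diagnosis) if start <= t <= end]


# Sliding windows of 2, 3, and 4 years over the 10 years before diagnosis.
windows = {width: sliding_windows(width) for width in (2, 3, 4)}
for width, intervals in windows.items():
    print(width, intervals)

# Fixed-time windows, all starting at diagnosis.
fixed_windows = [(0, 2), (0, 5), (0, 8)]

# Made-up times to diagnosis (years) for four case samples.
toy_times = [0.5, 3.2, 6.9, 9.1]
print(samples_in_window(toy_times, (6, 8)))
```

Under these assumptions, each window defines its own case set, and a separate model is trained and tested on the samples that window captures.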

Feature selection methods

We implemented feature selection methods to improve model performance, including single-RNA class selection, lasso selection, and significance selection. In the single-RNA class method, we dropped all RNA types except one. In lasso selection, all features with non-zero coefficients in the lasso classification models were pooled. Next, we trained new classification models that were restricted to these features. In significance selection, a univariate regression analysis was done per feature, and significant features (after multiple-testing adjustment) were used to train classification models.
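
The pooling step of lasso selection can be illustrated as follows. This is a minimal sketch that assumes each fitted lasso model is available as a mapping from feature name to coefficient; the feature names and coefficient values are invented for the example and do not come from the study.

```python
def pool_nonzero_features(models):
    """Union of the features that have a non-zero coefficient in any model.

    `models` is a list of dicts mapping feature name -> fitted coefficient.
    """
    pooled = set()
    for coefficients in models:
        pooled.update(name for name, w in coefficients.items() if w != 0.0)
    return sorted(pooled)


def restrict_to_features(sample, features):
    """Keep only the pooled features of one sample (feature name -> value)."""
    return {name: sample[name] for name in features}


# Coefficients from three hypothetical lasso fits (e.g. three repeats).
toy_models = [
    {"feat_1": 0.8, "feat_2": 0.0, "feat_3": 0.0},
    {"feat_1": 0.5, "feat_2": 0.0, "feat_3": -0.2},
    {"feat_1": 0.0, "feat_2": 0.0, "feat_3": -0.4},
]
pooled = pool_nonzero_features(toy_models)
print(pooled)

# A new model would then be trained on the reduced feature set only.
toy_sample = {"feat_1": 7.1, "feat_2": 3.3, "feat_3": 5.0}
print(restrict_to_features(toy_sample, pooled))
```

Here `feat_2` is never selected by any of the toy models, so it is excluded before retraining.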

Frequent feature models, independent leave-out test, and RR calculations

We created models for each histology that use the most frequent features identified in the standard full-time models. To assess overfitting and to get a better estimate of the performance of these models, we split the datasets into training (80%) and test (20%) sets. To the test sets we also added non-smokers and samples from smokers that had not previously been used in frequency matching (the numbers of samples are reported in Table 3). These unmatchable samples and non-smokers, which we refer to as leave-out sets, had never been used for model building or evaluation. We did not repeat this analysis five times as in the automated ML workflow. RRs were calculated using the test sets. The optimal threshold was identified in cross-validation. We used the R packages cutpointr (v1.0.1) and epitools (v0.5–10.1) to calculate RRs.
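
For readers unfamiliar with the RR calculation, the sketch below computes a relative risk from the 2×2 table of test results (positive or negative at the chosen threshold) against outcome (case or control). The text does not state which confidence interval method was used, so the log-scale Wald interval shown here is an assumption, and the counts are made up; the function is a plain Python illustration, not the cutpointr or epitools implementation.

```python
import math


def relative_risk(pos_cases, pos_controls, neg_cases, neg_controls, z=1.96):
    """Relative risk of being a case for test-positive versus test-negative
    individuals, with a log-scale Wald confidence interval (assumed method)."""
    n_pos = pos_cases + pos_controls
    n_neg = neg_cases + neg_controls
    risk_pos = pos_cases / n_pos
    risk_neg = neg_cases / n_neg
    rr = risk_pos / risk_neg
    # Standard error of log(RR) for two independent proportions.
    se = math.sqrt(1 / pos_cases - 1 / n_pos + 1 / neg_cases - 1 / n_neg)
    lower = math.exp(math.log(rr) - z * se)
    upper = math.exp(math.log(rr) + z * se)
    return rr, lower, upper


# Made-up test set: 40 test-positive and 40 test-negative individuals.
rr, lower, upper = relative_risk(
    pos_cases=30, pos_controls=10, neg_cases=10, neg_controls=30
)
print(round(rr, 2), round(lower, 2), round(upper, 2))
```

In this toy table, 75% of test-positive and 25% of test-negative individuals are cases, so the relative risk is 3.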

Data accessibility

The datasets generated for this article are not readily available because of the principles and conditions set out in articles 6 (1) (e) and 9 (2) (j) of the General Data Protection Regulation (GDPR). National legal basis as per the Regulations on population-based health surveys and ethical approval from the Norwegian Regional Committee for Medical and Health Research Ethics (REC) is also required. Requests to access the datasets should be directed to the corresponding authors. Our scripts and bioinformatics workflow files can be accessed from our GitHub repo (https://github.com/sinanugur/LCscripts, Umu, 2022 copy archived at swh:1:rev:26bccc86a551f71284559db11bb74230f5d00cc4).

Data availability

The datasets generated for this manuscript are not readily available because of the principles and conditions set out in articles 6 (1) (e) and 9 (2) (j) of the General Data Protection Regulation (GDPR). A national legal basis as per the Regulations on population-based health surveys and ethical approval from the Norwegian Regional Committee for Medical and Health Research Ethics (REC) are also required. Requests to access the datasets should be directed to the corresponding authors with a project proposal. Please refer to our project website for the latest information on data sharing (kreftregisteret.no/en/janusrna). Our scripts, plot data, and bioinformatics workflow files can be accessed from our GitHub repo (https://github.com/sinanugur/LCscripts; copy archived at swh:1:rev:26bccc86a551f71284559db11bb74230f5d00cc4).

The following data sets were generated
    1. Umu SU (2021) GitHub. ID 439cf34. Lung Cancer analyses scripts.

References

  1. Cancer Registry of Norway (2020) Cancer in Norway 2019: Cancer Incidence, Mortality, Survival and Prevalence in Norway.
  2. Wild CP, Weiderpass W, Stewart BW (2020) World Cancer Report: Cancer Research for Cancer Prevention. International Agency for Research on Cancer.

Decision letter

  1. YM Dennis Lo
    Senior and Reviewing Editor; The Chinese University of Hong Kong, Hong Kong
  2. Shenglin Huang
    Reviewer; Fudan University, China

In the interests of transparency, eLife publishes the most substantive revision requests and the accompanying author responses.

Decision letter after peer review:

Thank you for submitting your article "Serum RNAs can predict lung cancer up to 10 years prior to diagnosis" for consideration by eLife. Your article has been reviewed by 2 peer reviewers, and the evaluation has been overseen by Y M Dennis Lo as the Senior Editor/Reviewing Editor. The following individual involved in review of your submission has agreed to reveal their identity: Shenglin Huang (Reviewer #2).

The reviewers have seen each other's reviews, and the Reviewing Editor has drafted this to help you prepare a revised submission.

Essential revisions:

1. The analysis and result of serum RNA sequencing were not clearly presented. RNA sequencing with an average of 18 million reads per sample was performed. How about the mapped rate? Since the samples are long-term stored and may be at different times of collection, it is important that the sequencing results should be consistent. How about the sample correlations? In addition, why many of the RNA candidates (1137 of 3306) are isomiRs? They are derived from miRs, or degradation?

2. The authors showed that the candidate biomarkers were involved in the cancer-related pathways. This result is somewhat overestimated because these RNAs are small RNAs and the KEGG pathway enrichment analysis is based on the targets.

3. A strength of this study is the large sample size from prediagnostic cases and a large control group from cancer-free individuals from the same cohort. The authors obtained RNA-seq data from 1061 serum samples. However, only samples from a total of 645 individuals with smoking history were analyzed. It would be helpful to include the samples without smoking to investigate the specificity of the potential biomarkers.

4. There are concerns regarding multiple samples from the same cancer subject. If samples from one subject go to both the training and testing set, there would be issues of overfitting. In the revision, the authors need to remove the duplicate samples from the analysis.

5. Calculate the relative risk for future cancer development for individuals with positive and negative test results.

6. Instead of using sliding windows of fixed duration, use windows of increasing duration.

7. Address the issue of over-fitting for the analysis from line 338 onwards.

8. The proposed use of the analysis (Figure 5 and the discussion) does not appear to be realistic. The authors need to calculate the percentage of subjects falling into the test positive and negative groups based on their analysis. Then, for each group, what percentage of subjects will eventually develop LC.

Reviewer #2:

This study by Umu et al., investigated the biomarker potential of serum RNAs for the early detection of lung cancer (LC) in smokers at different prediagnostic time intervals and histological subtypes. They observed that smokers later diagnosed with LC can be robustly separated from healthy controls by using machine learning models with serum RNAs profiles. This work demonstrated that serum RNAs could be promising prediagnostic biomarkers in a LC screening setting. This finding is interesting but not convincing because only one sample cohort was evaluated in current study and the datasets generated for this manuscript are not available. Also there are some concerns as follows.

1. The analysis and result of serum RNA sequencing were not clearly presented. RNA sequencing with an average of 18 million reads per sample was performed. How about the mapped rate? Since the samples are long-term stored and may be at different times of collection, it is important that the sequencing results should be consistent. How about the sample correlations? In addition, why many of the RNA candidates (1137 of 3306) are isomiRs? They are derived from miRs, or degradation?

2. The authors showed that the candidate biomarkers were involved in the cancer-related pathways. This result is somewhat overestimated because these RNAs are small RNAs and the KEGG pathway enrichment analysis is based on the targets.

3.A strength of this study is the large sample size from prediagnostic cases and a large control group from cancer-free individuals from the same cohort. The authors obtained RNA-seq data from 1061 serum samples. However, only samples from a total of 645 individuals with smoking history were analyzed. It would be helpful to include the samples without smoking to investigate the specificity of the potential biomarkers.

Reviewer #3:

The authors analyzed archived serum samples which were collected before the development of lung cancer (LC) for different species of RNA.

They used machine learning to investigate if the combination of the RNA markers would be useful for the prediction of the future development of LC. In the machine learning analysis, they used 70% samples for training and 30% samples for testing. The process was repeated 5 times so that the samples can be assigned to both training and testing groups.

The strength of the study is that they include a large cohort of samples collected years before the development of LC.

The weaknesses of the study include

1. the questionable biological basis of this approach. As the samples were collected years before cancer diagnosis, the amount of tumor-derived RNA is expected to be very low. In this regard, the aberrations in serum RNA, if any, are likely to reflect the background risk of the body as a whole.

2. multiple samples from the same cancer subject. If samples from one subject go to both the training and testing set, there would be issues of overfitting. The authors would need to remove the duplicate samples from the analysis.

3. the low diagnostic performance. The sensitivity and specificity of the models were both around 70%. This would not be useful clinically. In fact, the use of relative risk for individuals with positive and negative test results would be more useful for predictive models.

4. The authors used a sliding window of fixed duration for the analysis. This analysis is counter-intuitive. If a marker is useful for predicting cancer at 3-5 years before cancer, what is the reason why it cannot predict cancers that occur within 3 years. It would be better to use windows with flexible size e.g. prediction of cancer in the next 2, then next 3, then next 5 years, etc.

5. For the analysis on lines 338 onwards, the authors further select features identified in the previous machine learning models for evaluation. What is the testing group in this analysis? Did they use all the samples without separating training and testing groups? If so, the over-fitting problem would be significant.

6. The proposed use of the analysis (Figure 5 and the discussion) does not appear to be realistic. The authors need to calculate the percentage of subjects falling into the test positive and negative groups based on their analysis. Then, for each group, what percentage of subjects will eventually develop LC.

7. Remove duplicated samples from the LC group.

8. Calculate the relative risk for future cancer development for individuals with positive and negative test results.

9. Instead of using sliding windows of fixed duration, use windows of increasing duration.

10. Address the issue of over-fitting for the analysis from line 338 onwards.

11. To address the issue of clinical utility raised in point 6 in the previous section.

https://doi.org/10.7554/eLife.71035.sa1

Author response

Essential revisions:

1. The analysis and result of serum RNA sequencing were not clearly presented. RNA sequencing with an average of 18 million reads per sample was performed. How about the mapped rate? Since the samples are long-term stored and may be at different times of collection, it is important that the sequencing results should be consistent. How about the sample correlations? In addition, why many of the RNA candidates (1137 of 3306) are isomiRs? They are derived from miRs, or degradation?

We have now improved the presentation of the analysis and results throughout the manuscript. The manuscript builds on previous publications showing feasibility and profiles of serum small RNA sequencing, confounding, long-term stability and differential expression in cancer in this biobank (Rounge et al., CEBP 2015, Umu et al., RNA Biology, 2018, Rounge et al., Sci Rep, 2018, Umu et al., Mol Oncol, 2020 and Burton et al., Front in Oncol, 2020).

However, we agree with the reviewer that more information about the characteristics of the RNA profiles would be helpful to the reader. To answer the specific questions: first, the average mapping rate was around 70%. We revised the Materials and methods section on laboratory processing and bioinformatic analyses to include information on mapping tools and rates, and created a separate section called “Bioinformatic analyses” (see pages 4 and 5).

Regarding long-term storage, we already showed that long-term stored samples are a viable source for small RNA analysis (Rounge et al., 2015). Small RNAs (e.g. miRNA) are stable during long-term archiving (Umu et al., 2018). We showed that some traits such as sex, age of patients and blood donor group (BDg) affect RNA profiles. Therefore, these traits must be matched between cases and controls. BDg refers to a technical confounder combining the effect of sample treatment at donation and storage time (Rounge et al., 2018). In this revision, we further explained this and added these sentences into Materials and methods:

“Therefore, we frequency matched LC samples and controls on sex, age at donation, and blood donor group (BDg). BDg is a technical confounder combining the effect of sample treatment at donation and storage time (Rounge et al., 2018).” (see page 4)

In Umu et al., 2018, we also investigated whether RNAs were degradation products or not. We found that RNA fragments, such as tRNA derived fragments, had specific expression patterns, which can be explained by RNA biogenesis. We concluded that many RNA fragments in serum must be biological products, including tRFs and isomiRs. Therefore, in our submitted manuscript, we used the SeqCluster (miraligner) tool and we regarded all isomiRs which passed the expression threshold as bona fide isomiRs. In this version, we also included mirTOP identifiers of candidate isomiRs.

Our definition of expression threshold is:

“any RNAs with more than 5 reads mapped in 80% of the samples”.

This is a high threshold to detect robust signals of expression. We now cited SeqCluster and mirTOP and included our expression detection threshold in Materials and methods, which was missing in our initial submission. (see page 4)

2. The authors showed that the candidate biomarkers were involved in the cancer-related pathways. This result is somewhat overestimated because these RNAs are small RNAs and the KEGG pathway enrichment analysis is based on the targets.

Thank you for pointing this out. miRNAs can target multiple genes. Unfortunately there is no easy way to do pathway enrichment analysis for other important small RNAs (e.g. miscRNAs), which are underrepresented. We believe it is a plausible assumption to expect miRNAs will disrupt a large set of biological pathways including cancer related ones since they often regulate multiple mRNAs.

While predicting miRNA targets, we used a high threshold to reduce the total number of false positives. We also performed a multiple testing correction to filter out less significant enrichments. Our literature search (as discussed in the results and discussion) also showed that many biomarker candidates, such as miRNAs and some other sncRNAs, were associated with cancer-related pathways.

3. A strength of this study is the large sample size from prediagnostic cases and a large control group from cancer-free individuals from the same cohort. The authors obtained RNA-seq data from 1061 serum samples. However, only samples from a total of 645 individuals with smoking history were analyzed. It would be helpful to include the samples without smoking to investigate the specificity of the potential biomarkers.

As the reviewers suggested, we included non-smokers for AUC estimation and relative risk calculation. This also addressed the robustness and overfitting issues raised by the reviewers (please refer to question 7). Non-smokers were not used for training and were only used to evaluate frequent biomarkers. We added a table of non-smoker characteristics to the supplementary document (Table S1).

4. There are concerns regarding multiple samples from the same cancer subject. If samples from one subject go to both the training and testing set, there would be issues of overfitting. In the revision, the authors need to remove the duplicate samples from the analysis.

To answer this question, we repeated the XGBoost modelling described in the Results section “ML algorithms can differentiate between prediagnostic cases and controls regardless of prediagnostic time” without multiple samples. We selected only one sample per patient. We used the identical modelling workflow. Our results showed similar performance in terms of AUCs. We therefore concluded that multiple samples do not cause overfitting or any bias (see supplementary document, Figure S2).

We discussed this result briefly in the main manuscript by writing:

“Second, using more than one sample from the same individual can potentially cause overfitting. However, we did not detect any effect related to this issue (Figure S2).” (see page 18)

5. Calculate the relative risk for future cancer development for individuals with positive and negative test results.

We addressed this issue and calculated the relative risk (Table 3). We explained how we calculate the relative risk in Materials and methods. (Please also see revision answer 7)

6. Instead of using sliding windows of fixed duration, use windows of increasing duration.

To address this question, we used a fixed duration approach as suggested by the reviewers. We used three different time windows 0 to 2 years, 0 to 5 years and 0 to 8 years before diagnosis. These points were selected because the results showed slight improvement in the model performance in some time windows. The performance is not as good as sliding window models.

However, we included the results from fixed-duration models to the supplementary document as a separate section (Table S3) and briefly discussed the results in main manuscript by adding this paragraph:

“As an alternative to sliding windows, we also performed a fixed window approach and trained models using samples from up to 2 years (0-2), up to 5 years (0-5) and up to 8 years (0-8) before diagnosis. The results showed slight improvement in model performance compared to full time models (see supplementary document and Table S1). However, sliding windows models performed better on specific time intervals.”.

7. Address the issue of over-fitting for the analysis from line 338 onwards.

We agree with the reviewer that overfitting might be an issue. We tried to overcome this by including an independent dataset that was never used for testing or training.

Our machine learning workflow balances case/control numbers and also matches samples in the training and test sets on “age”, “sex” and “BDg” (please see revision answer 1 and page 4 of the manuscript). Therefore, we previously discarded unmatchable samples for each stratum, and these samples were never used. We have now utilized them to assess model performance and overfitting by adding them to the testing group for the section “Frequent features can create simple and accurate models”.

Furthermore, we added non-smokers into the testing dataset. We measured evaluation metrics on test datasets which returned a better estimate on the model performance.

In short, the training group contains 80% of the samples, and the test group contains the remaining 20% plus smokers that had never been used for training (i.e. unmatchable samples) and non-smokers that we did not use in our original submission. We refer to these samples as the independent leave-out dataset. We showed in Table 3 how many of the test set samples were never used. Overfitting might still be an issue, but we wanted to find and report a shortlist of the best biomarkers for future research.

8. The proposed use of the analysis (Figure 5 and the discussion) does not appear to be realistic. The authors need to calculate the percentage of subjects falling into the test positive and negative groups based on their analysis. Then, for each group, what percentage of subjects will eventually develop LC.

This is a good suggestion, and to accommodate the reviewers' comments, we have updated Figure 4 and the discussion. Please see the highlighted text (page 19).

Reviewer #2:

This study by Umu et al., investigated the biomarker potential of serum RNAs for the early detection of lung cancer (LC) in smokers at different prediagnostic time intervals and histological subtypes. They observed that smokers later diagnosed with LC can be robustly separated from healthy controls by using machine learning models with serum RNAs profiles. This work demonstrated that serum RNAs could be promising prediagnostic biomarkers in a LC screening setting. This finding is interesting but not convincing because only one sample cohort was evaluated in current study and the datasets generated for this manuscript are not available. Also there are some concerns as follows.

Unfortunately the datasets generated for this article are not readily available because of the principles and conditions set out in articles 6 (1) (e) and 9 (2) (j) of the General Data Protection Regulation (GDPR). However, in a collaborative setting, data is available and can be shared in a secure environment.

1. The analysis and result of serum RNA sequencing were not clearly presented. RNA sequencing with an average of 18 million reads per sample was performed. How about the mapped rate? Since the samples are long-term stored and may be at different times of collection, it is important that the sequencing results should be consistent. How about the sample correlations? In addition, why many of the RNA candidates (1137 of 3306) are isomiRs? They are derived from miRs, or degradation?

Please see the essential revisions.

2. The authors showed that the candidate biomarkers were involved in the cancer-related pathways. This result is somewhat overestimated because these RNAs are small RNAs and the KEGG pathway enrichment analysis is based on the targets.

Please see the essential revisions.

3.A strength of this study is the large sample size from prediagnostic cases and a large control group from cancer-free individuals from the same cohort. The authors obtained RNA-seq data from 1061 serum samples. However, only samples from a total of 645 individuals with smoking history were analyzed. It would be helpful to include the samples without smoking to investigate the specificity of the potential biomarkers.

Please see the essential revisions.

Reviewer #3:

The authors analyzed archived serum samples which were collected before the development of lung cancer (LC) for different species of RNA.

They used machine learning to investigate if the combination of the RNA markers would be useful for the prediction of the future development of LC. In the machine learning analysis, they used 70% samples for training and 30% samples for testing. The process was repeated 5 times so that the samples can be assigned to both training and testing groups.

The strength of the study is that they include a large cohort of samples collected years before the development of LC.

The weaknesses of the study include

1. the questionable biological basis of this approach. As the samples were collected years before cancer diagnosis, the amount of tumor-derived RNA is expected to be very low. In this regard, the aberrations in serum RNA, if any, are likely to reflect the background risk of the body as a whole.

We agree with the reviewer: serum RNAs can originate from different tissues, including immune cells or tumor cells. After diagnosis, the size of the tumor can still be too small to detect any originating RNAs. It is plausible that we detect RNAs originating from any tissues which signal cancer progression. We included a few sentences and citations that discuss the origin of circulating RNAs as well (see introduction, second paragraph).

As the reviewer suggested, we calculated the RR of a positive prediction.

2. multiple samples from the same cancer subject. If samples from one subject go to both the training and testing set, there would be issues of overfitting. The authors would need to remove the duplicate samples from the analysis.

As the reviewer suggested, we tested this effect and added the results into the Supplementary documents. Please also see the essential revisions answer 4.

3. the low diagnostic performance. The sensitivity and specificity of the models were both around 70%. This would not be useful clinically. In fact, the use of relative risk for individuals with positive and negative test results would be more useful for predictive models.

Please see the essential revisions.

4. The authors used a sliding window of fixed duration for the analysis. This analysis is counter-intuitive. If a marker is useful for predicting cancer at 3-5 years before cancer, what is the reason why it cannot predict cancers that occur within 3 years. It would be better to use windows with flexible size e.g. prediction of cancer in the next 2, then next 3, then next 5 years, etc.

Please see the essential revisions.
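To make the distinction in point 4 concrete, the sketch below contrasts fixed-width sliding windows with cumulative windows ("diagnosed within the next h years"). It is an illustrative assumption about how samples could be grouped, not the study's code; the function names, window width, step, and the toy times to diagnosis are all hypothetical.

```python
def fixed_windows(years, width, step, max_years):
    """Map each (start, end) window to sample indices with start <= t < end."""
    windows = {}
    start = 0
    while start + width <= max_years:
        end = start + width
        windows[(start, end)] = [i for i, t in enumerate(years) if start <= t < end]
        start += step
    return windows


def cumulative_windows(years, horizons):
    """Map each horizon h to sample indices diagnosed within h years (t <= h)."""
    return {h: [i for i, t in enumerate(years) if t <= h] for h in horizons}


# Hypothetical times from blood draw to diagnosis, in years.
years_to_diagnosis = [0.5, 1.8, 2.4, 3.9, 4.2, 6.5, 9.0]
fixed = fixed_windows(years_to_diagnosis, width=3, step=1, max_years=10)
cumulative = cumulative_windows(years_to_diagnosis, horizons=[2, 3, 5])
```

In the cumulative scheme each larger horizon contains every sample from the smaller ones, whereas a fixed window such as 3 to 6 years excludes samples taken closer to diagnosis.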

5. For the analysis on lines 338 onwards, the authors further select features identified in the previous machine learning models for evaluation. What is the testing group in this analysis? Did they use all the samples without separating training and testing groups? If so, the over-fitting problem would be significant.

Please see the essential revisions.

6. The proposed use of the analysis (Figure 5 and the discussion) does not appear to be realistic. The authors need to calculate the percentage of subjects falling into the test positive and negative groups based on their analysis. Then, for each group, what percentage of subjects will eventually develop LC?

Please see the essential revisions.
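The quantities requested in points 3 and 6 follow directly from a 2×2 table of test result against eventual diagnosis. The sketch below is a worked illustration only; the function name `relative_risk` is hypothetical and the counts are invented, not results from the study.

```python
def relative_risk(tp, fp, fn, tn):
    """Summarize a 2x2 table of test result versus eventual diagnosis.

    tp/fp: test-positive subjects who did / did not develop LC.
    fn/tn: test-negative subjects who did / did not develop LC.
    """
    total = tp + fp + fn + tn
    positive_fraction = (tp + fp) / total   # share of subjects testing positive
    risk_if_positive = tp / (tp + fp)       # share of positives developing LC
    risk_if_negative = fn / (fn + tn)       # share of negatives developing LC
    rr = risk_if_positive / risk_if_negative
    return positive_fraction, risk_if_positive, risk_if_negative, rr


# Hypothetical counts chosen for easy arithmetic, not study data.
frac_pos, risk_pos, risk_neg, rr = relative_risk(tp=70, fp=30, fn=30, tn=70)
```

Note that in a case-control sample these proportions depend on the ratio of cases to controls, so they would need to be rescaled to the population incidence before being read as absolute risks.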

https://doi.org/10.7554/eLife.71035.sa2

Article and author information

Author details

  1. Sinan U Umu

    Department of Research, Cancer Registry of Norway, Oslo, Norway
    Contribution
    Formal analysis, Investigation, Methodology, Software, Writing – original draft
    For correspondence
    sinan.ugur.umu@kreftregisteret.no
    Competing interests
    No competing interests declared
    ORCID iD: 0000-0001-8081-7819
  2. Hilde Langseth

    1. Department of Research, Cancer Registry of Norway, Oslo, Norway
    2. Department of Epidemiology and Biostatistics, Imperial College London, London, United Kingdom
    Contribution
    Conceptualization, Funding acquisition, Project administration, Writing - review and editing
    Competing interests
    No competing interests declared
  3. Verena Zuber

    Department of Epidemiology and Biostatistics, Imperial College London, London, United Kingdom
    Contribution
    Methodology, Writing - review and editing
    Competing interests
    No competing interests declared
  4. Åslaug Helland

    1. Department of Oncology, Oslo University Hospital, Oslo, Norway
    2. Institute for Cancer Research, Oslo University Hospital, Oslo, Norway
    3. Institute of Clinical Medicine, University of Oslo, Oslo, Norway
    Contribution
    Writing - review and editing
    Competing interests
    No competing interests declared
  5. Robert Lyle

    1. Department of Medical Genetics, Oslo University Hospital and University of Oslo, Oslo, Norway
    2. Centre for Fertility and Health, Norwegian Institute of Public Health, Oslo, Norway
    Contribution
    Resources, Writing - review and editing
    Competing interests
    No competing interests declared
  6. Trine B Rounge

    1. Department of Research, Cancer Registry of Norway, Oslo, Norway
    2. Department of Informatics, University of Oslo, Oslo, Norway
    Contribution
    Conceptualization, Funding acquisition, Project administration, Writing – original draft, Writing - review and editing
    For correspondence
    trine.rounge@kreftregisteret.no
    Competing interests
    No competing interests declared
    ORCID iD: 0000-0003-2677-2722

Funding

The Research Council of Norway (Human Biobanks and Health Data [229621/H10])

  • Hilde Langseth
  • Trine B Rounge

The Research Council of Norway (Human Biobanks and Health Data [248791/H10])

  • Hilde Langseth
  • Trine B Rounge

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

This work was supported by The Norwegian Research Council’s Programme ‘Human Biobanks and Health Data’ [229621/H10, 248791/H10]. Disclosure of invention was accepted by the technology transfer office, Invent2 (DOFI: 19010). We would like to acknowledge Cecilie Bucher-Johannessen, Marianne Lauritzen, and Magnus Leithaug for performing lab and coordination tasks. We also acknowledge Matthew D Whitaker and Marc Chadeau-Hyam from Imperial College London for discussions on ML model training and testing. We acknowledge the Norwegian Institute of Public Health for access to survey data in this study. The sequencing service was provided by the Norwegian Sequencing Centre (https://www.sequencing.uio.no), a national technology platform hosted by Oslo University Hospital and the University of Oslo and supported by the Research Council of Norway and the Southeastern Regional Health Authority.

Ethics

Human subjects: This study was approved by the Norwegian Regional Committee for medical and health research ethics (REC no: 19892 previous 2016/1290) and was based on broad consent from participants in the Janus cohort. The work has been carried out in compliance with the standards set by the Declaration of Helsinki.

Senior and Reviewing Editor

  1. YM Dennis Lo, The Chinese University of Hong Kong, Hong Kong

Reviewer

  1. Shenglin Huang, Fudan University, China

Publication history

  1. Received: June 6, 2021
  2. Accepted: February 9, 2022
  3. Accepted Manuscript published: February 11, 2022 (version 1)
  4. Version of Record published: February 28, 2022 (version 2)

Copyright

© 2022, Umu et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Cite this article

  1. Sinan U Umu
  2. Hilde Langseth
  3. Verena Zuber
  4. Åslaug Helland
  5. Robert Lyle
  6. Trine B Rounge
(2022)
Serum RNAs can predict lung cancer up to 10 years prior to diagnosis
eLife 11:e71035.
https://doi.org/10.7554/eLife.71035
