Serum RNAs can predict lung cancer up to 10 years prior to diagnosis

  1. Sinan U Umu (corresponding author)
  2. Hilde Langseth
  3. Verena Zuber
  4. Åslaug Helland
  5. Robert Lyle
  6. Trine B Rounge (corresponding author)
  1. Department of Research, Cancer Registry of Norway, Norway
  2. Department of Epidemiology and Biostatistics, Imperial College London, United Kingdom
  3. Department of Oncology, Oslo University Hospital, Norway
  4. Institute for Cancer Research, Oslo University Hospital, Norway
  5. Institute of Clinical Medicine, University of Oslo, Norway
  6. Department of Medical Genetics, Oslo University Hospital and University of Oslo, Norway
  7. Centre for Fertility and Health, Norwegian Institute of Public Health, Norway
  8. Department of Informatics, University of Oslo, Norway
4 figures, 3 tables and 5 additional files


Consort diagram of the study and our model training and testing workflow.

(A) The sample selection is summarized by the flow chart. Non-smokers were excluded from model building. (B) We randomly created five different training and testing datasets for each group (e.g. standard, histology-specific, or prediagnostic models).
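The repeated random splitting described in panel (B) can be sketched as follows. This is a minimal illustration only: the function name `make_random_splits`, the 20% test fraction, and the fixed seed are assumptions made for the example, not details given in the caption, and the sketch splits by sample rather than by individual.

```python
import random


def make_random_splits(sample_ids, n_splits=5, test_fraction=0.2, seed=0):
    """Return n_splits independent random (train, test) partitions.

    test_fraction and seed are illustrative assumptions, not values
    reported by the study.
    """
    rng = random.Random(seed)
    ids = list(sample_ids)
    n_test = max(1, round(len(ids) * test_fraction))
    splits = []
    for _ in range(n_splits):
        shuffled = ids[:]          # copy so every split starts from the full set
        rng.shuffle(shuffled)
        test = shuffled[:n_test]   # first n_test shuffled ids form the test set
        train = shuffled[n_test:]  # the remainder forms the training set
        splits.append((train, test))
    return splits


# Hypothetical sample identifiers 0..99
splits = make_random_splits(range(100))
```

Each of the five partitions would then be used to train one model and evaluate it on its own held-out test set.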

Figure 2 with 2 supplements
Each ROC curve is based on the prediction results of a randomly created testing dataset (in total five).

Area under the ROC curve (AUC) values show the average of these predictions. The most important features of the classifiers were sorted by their average feature importance and are shown in the lower panels. A detailed list of biomarkers with their feature importance is available in Supplementary file 2. We did not perform any feature selection while training these models (see also Figure 2—source data 1).
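As a rough illustration of how an average AUC over several test datasets can be computed, the sketch below uses the rank-based (Mann-Whitney) definition of the AUC in plain Python. The function names and the toy labels and scores are hypothetical; the study's own pipeline is not shown on this page.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random case outscores a random control.

    Ties between a case and a control count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def mean_auc(test_sets):
    """Average the AUC over several (labels, scores) test datasets."""
    aucs = [roc_auc(labels, scores) for labels, scores in test_sets]
    return sum(aucs) / len(aucs)


# Toy data: two hypothetical test sets (1 = case, 0 = control)
toy_sets = [
    ([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]),
    ([0, 1, 0, 1], [0.2, 0.9, 0.3, 0.7]),
]
average = mean_auc(toy_sets)
```

The pairwise loop is quadratic in the number of samples, which is acceptable for a sketch but would normally be replaced by a sorted-rank implementation for large datasets.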

Figure 2—figure supplement 1
Each boxplot shows performances of an algorithm measured by area under the ROC curves (AUCs).

The analyses were done for all histologies combined and for specific histologies (i.e. separately for non-small cell lung cancer [NSCLC] and small-cell LC [SCLC]), regardless of prediagnostic time. The dashed lines represent the combined average performances of all tested algorithms. XGBoost produced an above-average prediction performance (Figure 2—figure supplement 1—source data 1).

Figure 2—figure supplement 2
ROC curves of various types of models with/without serial samples.

(A) The performance of the XGBoost algorithm with all samples. (B) When one sample per individual was selected, the classification performance was comparable for all models (Figure 2—figure supplement 2—source data 1).

Figure 2—figure supplement 2—source data 1

Source data of ROC plots without multiple samples from same individuals (Figure 2—figure supplement 2).
Sliding-window analysis identified better-performing models that use prediagnostic samples from specific time intervals, such as the small-cell lung cancer (SCLC) models restricted to samples taken 2 to 5 years prior to diagnosis (see the first and second panels, red dots).

Each color represents a different histology group: black includes only non-small cell lung cancer (NSCLC) samples, red includes only SCLC samples, and blue includes all histologies, including others (Figure 3—source data 1).
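The idea of restricting models to samples from a given prediagnostic interval can be sketched as below. The window width of 3 years, the step of 1 year, the 0 to 10 year range, and the `years_to_diagnosis` field are all assumptions chosen for illustration; the caption does not state the parameters actually used.

```python
def sliding_windows(start, end, width, step):
    """List (lower, upper) year windows that fit entirely inside [start, end]."""
    windows = []
    lower = start
    while lower + width <= end:
        windows.append((lower, lower + width))
        lower += step
    return windows


def samples_in_window(samples, window):
    """Keep samples whose prediagnostic time falls in [lower, upper)."""
    lower, upper = window
    return [s for s in samples if lower <= s["years_to_diagnosis"] < upper]


# Hypothetical samples with their time between donation and diagnosis
samples = [
    {"id": "s1", "years_to_diagnosis": 1.5},
    {"id": "s2", "years_to_diagnosis": 2.0},
    {"id": "s3", "years_to_diagnosis": 4.9},
    {"id": "s4", "years_to_diagnosis": 7.2},
]

windows = sliding_windows(start=0, end=10, width=3, step=1)
subset_2_to_5 = samples_in_window(samples, (2, 5))
```

A separate model would then be trained and tested on each window's subset, and the per-window performances compared.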

Suggested clinical uses of RNA biomarkers in lung cancer (LC) screening.

A positive test from full-time models indicates an elevated risk (at least twofold). These models can detect cancer-related RNA signals up to 10 years before diagnosis. Prediagnostic models have higher accuracy, sensitivity, and specificity, and can potentially complement full-time models and improve specificity (Supplementary file 4).

Figure 4—source data 1

Suggested clinical uses of RNA biomarkers in lung cancer (LC) screening.


Table 1
Clinical and histological characteristics of samples used in modeling.
| | Early (localized) | Locally Advanced (regional) | Advanced (distant) | Unknown | Controls |
|---|---|---|---|---|---|
| Age at donation, years, mean (SD) | 54.3 (7.33) | 54.9 (9.08) | 53.5 (8.25) | 51.8 (6.53) | 49.9 (10.9) |
| Age at diagnosis, years, mean (SD) | 59.8 (7.67) | 60.6 (8.89) | 59.4 (8.31) | 58.6 (6.05) | - |
| Prediagnostic sampling time, years, mean (SD) | 5.52 (2.81) | 5.63 (2.79) | 5.91 (2.66) | 6.75 (2.18) | - |
| Total samples | 103 | 139 | 274 | 19 | 263 |
| Total individuals | 645 (smokers*) | | | | |

* See supplementary document for non-smokers (Supplementary file 1).

Table 2
Averages of area under the ROC curves (AUCs), accuracies (acc), sensitivities (sn), and specificities (sp) of the XGBoost algorithm models on test datasets when prediagnostic time was not included.
Histologies of model: All (including others), NSCLC, SCLC

| Features included | All: AUC (95% CI) | All: Av. # of features* | All: Av. % acc/sn/sp | NSCLC: AUC (95% CI) | NSCLC: Av. # of features | NSCLC: Av. % acc/sn/sp | SCLC: AUC (95% CI) | SCLC: Av. # of features | SCLC: Av. % acc/sn/sp |
|---|---|---|---|---|---|---|---|---|---|
| All RNAs | 0.71 (0.68–0.73) | 301 | 69/73/62 | 0.70 (0.65–0.75) | 373 | 67/70/64 | 0.71 (0.68–0.74) | 213 | 70/69/71 |
| Lasso-selected features | 0.78 (0.74–0.82) | 149 | 73/75/71 | 0.78 (0.75–0.82) | 56 | 73/73/72 | 0.74 (0.69–0.80) | 58 | 72/61/83 |
| Univariate significant features | 0.70 (0.66–0.73) | 76 | 67/75/58 | 0.69 (0.64–0.73) | 51 | 67/71/64 | 0.70 (0.65–0.76) | 11 | 68/69/68 |
| miRNA only | 0.72 (0.68–0.76) | 168 | 69/76/61 | 0.73 (0.70–0.75) | 199 | 69/74/64 | 0.65 (0.62–0.69) | 20 | 67/74/60 |
| isomiR only | 0.70 (0.65–0.74) | 204 | 67/68/67 | 0.73 (0.69–0.77) | 215 | 71/75/66 | 0.65 (0.60–0.70) | 108 | 66/65/67 |
| tRF only | 0.69 (0.65–0.73) | 314 | 65/77/53 | 0.67 (0.65–0.69) | 314 | 66/64/67 | 0.68 (0.65–0.71) | 23 | 66/69/63 |
| MiscRNA only | 0.72 (0.69–0.74) | 83 | 69/73/65 | 0.68 (0.63–0.74) | 87 | 66/73/59 | 0.69 (0.64–0.75) | 76 | 70/78/61 |
* Average number of non-zero features selected by the models. Note: Detailed information on all selected features is in Supplementary file 2.
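For readers less familiar with the three summary metrics reported in Table 2, the sketch below shows how accuracy (acc), sensitivity (sn), and specificity (sp) follow from binary predictions. The labels and predictions are made up for illustration, and the function name is hypothetical.

```python
def classification_metrics(labels, predictions):
    """Return (accuracy, sensitivity, specificity) for binary labels.

    1 denotes a case and 0 a control.
    """
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)  # cases called positive
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)  # cases missed
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)  # controls called negative
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)  # controls called positive
    accuracy = (tp + tn) / len(pairs)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return accuracy, sensitivity, specificity


# Toy example: 4 cases and 6 controls
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predictions = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]
acc, sn, sp = classification_metrics(labels, predictions)
```

Multiplying each value by 100 gives percentages in the acc/sn/sp form used in the table.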

Table 3
All selected features, performance, and relative risk (RR) of XGBoost models.
| | All* | NSCLC | SCLC |
|---|---|---|---|
| Features | iso-20-5KP25HFF, GBP3, hsa-miR-30a-5p, INTS10, LINC01362, piR-hsa-28723, RNU1-8P, iso-23-BQ8DQWM4Z, CTD-3252C9.4, DST, HBA2, HIST2H2AC, hsa-miR-99b-3p, LATS1, piR-hsa-28391, piR-hsa-28394, RN7SL181P, RN7SL8P, RNU2-27P, iso-23-8YUYFYKSY, TLN1, tRF-V47P59D9, tRF-86V8WPMN1EJ3, tRF-6SXMSL73VL4Y, tRF-QKF1R3WE8RO8IS | LINC01362, Y-RNA, iso-23-B0NKZ01J0D, iso-22-MKJIJLJ2Q, iso-21-N2NBQRZ00, GBP3, iso-20-RNUW92OI, GNAS, hsa-miR-30a-3p, NHSL2, piR-hsa-28488, RC3H2, RN7SL181P, RNU2-19P, RNY4P27, iso-23-909U247N04, tRF-I89NJ4S2, tRF-9MV47P596VE, tRF-86J8WPMN1EJ3, tRF-86V8WPMN1EJ3, tRF-Q1Q89P9L8422E | AC113404.1, C6orf223, HIST1H4E, hsa-miR-30a-5p, hsa-miR-574-5p, ODC1, PTCH2, PTMA, RN7SL181P, tRF-22-947673FE5, AKAP9, MIGA1, RAP1B, RN7SL724P, RUFY2, iso-23-X3749W540L, tRF-BS68BFD2, tRF-R29P4P9L5HJVE, tRF-ZRS3S3RX8HYVD |
| Total features | 25 | 21 | 19 |
| Total test samples (total leave-out size) (non-smokers) | 640 (535) (263) | 465 (360) (262) | 444 (395) (256) |
| AUC on test (95% CI) (only smokers**) | 0.76 (0.68–0.83) | 0.78 (0.70–0.85) | 0.88 (0.83–0.94) |
| AUC on test (95% CI) (both smokers and non-smokers**) | 0.68 (0.63–0.72) | 0.68 (0.63–0.73) | 0.84 (0.79–0.9) |
| RR on test (95% CI) (only smokers**) | 2.37 (1.54–3.7), p = 1.15 × 10⁻⁷ | 2.36 (1.52–3.66), p = 2.83 × 10⁻⁶ | 2.48 (2.06–3), p = 3.32 × 10⁻⁹ |
| RR on test (95% CI) (both smokers and non-smokers**) | 1.84 (1.7–2.01), p = 1.25 × 10⁻⁶ | 1.52 (1.27–1.83), p = 2.67 × 10⁻⁵ | 2.04 (1.85–2.25), p = 8.8 × 10⁻⁸ |

* Including other histologies.
** Includes samples previously not used (leave-out samples).
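To make the relative risk (RR) figures easier to interpret, the sketch below computes an RR and a 95% confidence interval from a 2×2 table of test result against outcome, using the standard log-scale (Katz) approximation. This page does not state how the RR and its interval were derived, so the method is an assumption, and the counts are invented for illustration.

```python
import math


def relative_risk(pos_cases, pos_controls, neg_cases, neg_controls, z=1.96):
    """Return (RR, lower, upper) comparing test-positive to test-negative groups.

    Uses the log-scale normal approximation for the confidence interval;
    z=1.96 corresponds to a 95% interval.
    """
    n_pos = pos_cases + pos_controls
    n_neg = neg_cases + neg_controls
    risk_pos = pos_cases / n_pos   # proportion of cases among test-positives
    risk_neg = neg_cases / n_neg   # proportion of cases among test-negatives
    rr = risk_pos / risk_neg
    se_log_rr = math.sqrt(
        1 / pos_cases - 1 / n_pos + 1 / neg_cases - 1 / n_neg
    )
    lower = math.exp(math.log(rr) - z * se_log_rr)
    upper = math.exp(math.log(rr) + z * se_log_rr)
    return rr, lower, upper


# Hypothetical counts: 30 of 100 test-positives and 15 of 100 test-negatives are cases
rr, ci_low, ci_high = relative_risk(30, 70, 15, 85)
```

An RR above 1 with a lower confidence bound above 1 suggests that a positive test is associated with elevated risk in the tested sample.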

Additional files

Supplementary file 1

Clinical and histological characteristics of non-smoker samples of leave-out dataset.
Supplementary file 2

Detailed feature importance tables for all trained models.
Supplementary file 3

Fixed-time model performance on different histologies.
Supplementary file 4

Selected prediagnostic models, metrics, and their feature importance tables.
Transparent reporting form


  1. Sinan U Umu
  2. Hilde Langseth
  3. Verena Zuber
  4. Åslaug Helland
  5. Robert Lyle
  6. Trine B Rounge
Serum RNAs can predict lung cancer up to 10 years prior to diagnosis
eLife 11:e71035.