Serum RNAs can predict lung cancer up to 10 years prior to diagnosis

Lung cancer (LC) prognosis is closely linked to the stage of disease when diagnosed. We investigated the biomarker potential of serum RNAs for the early detection of LC in smokers at different prediagnostic time intervals and histological subtypes. In total, 1061 samples from 925 individuals were analyzed. RNA sequencing with an average of 18 million reads per sample was performed. We generated machine learning models using normalized serum RNA levels and found that smokers later diagnosed with LC within 10 years can be robustly separated from healthy controls regardless of histology, with an average area under the ROC curve (AUC) of 0.76 (95% CI, 0.68–0.83). Furthermore, the strongest models that took both time to diagnosis and histology into account successfully predicted non-small cell LC (NSCLC) between 6 and 8 years, with an AUC of 0.82 (95% CI, 0.76–0.88), and small-cell LC (SCLC) between 2 and 5 years, with an AUC of 0.89 (95% CI, 0.77–1.0), before diagnosis. The most important separators were microRNAs, miscellaneous RNAs, isomiRs, and tRNA-derived fragments. We have shown that LC can be detected years before diagnosis and manifestation of disease symptoms independently of histological subtype. However, the highest AUCs were achieved for specific subtypes and time intervals before diagnosis. The collection of models may therefore also predict the severity of cancer development and its histology. Our study demonstrates that serum RNAs can be promising prediagnostic biomarkers in an LC screening setting, from early detection to risk assessment.


Introduction
Lung cancer (LC) continues to be the leading cause of cancer-related deaths despite declining smoking prevalence (Bray et al., 2018;Wild et al., 2020). Non-small-cell (NSCLC) and small-cell (SCLC) are the two major subtypes of LC. The symptoms generally occur at a late stage and the prognosis is poor. Stage at diagnosis typically determines patient survival (Aberle et al., 2011;Bach et al., 2012;Brustugun et al., 2018). Screening with low-dose computed tomography (LDCT) can be effective for early detection (Bach et al., 2012;Peled and Ilouze, 2015) and reduce LC mortality up to 20% in high-risk groups (de Koning et al., 2020;Hanash et al., 2018;Seijo et al., 2019). However, LDCT has limitations such as high false-positive rates, risk of overdiagnosis, and high costs (Gopal et al., 2010;Peled and Ilouze, 2015). Annual CT scans also cause harmful radiation exposure (Bach et al., 2012;Hanash et al., 2018). Robust biomarkers can help stratify high-risk groups and increase accuracy in patient inclusion criteria for LDCT-based screening programs (Hanash et al., 2018).
Liquid biopsies quantifying molecular biomarkers in circulation, such as tumor-derived DNAs, proteins, and RNAs, can be used to detect cancer (Hanash et al., 2018;Ko et al., 2018;Sandfeld-Paulsen et al., 2016). MicroRNAs (miRNAs), a class of short RNAs about 21 nucleotides long, have been widely investigated for their biomarker potential (Fehlmann et al., 2020;Keller and Meese, 2016;Pichler and Calin, 2015;Tian et al., 2019). They can be found both in serum (Keller and Meese, 2016;Murillo et al., 2019;Umu et al., 2018) and in plasma (Freedman et al., 2016;Keller and Meese, 2016;Murillo et al., 2019) as cell-free circulating RNAs, which may originate from dying cells or be actively secreted (Zaporozhchenko et al., 2018). Some of them are bound to proteins or confined in layered exosomes, which can protect them from degradation (Fritz et al., 2016). MiRNAs can function as tumor suppressors or oncomiRs and regulate tumor traits such as cell growth, angiogenesis, immune evasion, and metastasis (Pichler and Calin, 2015;Svoronos et al., 2016). The search for RNA biomarkers is not limited to miRNAs. Aberrant expression of other RNA classes, such as protein-coding mRNAs, tRNAs, piwi-interacting RNAs (piRNAs), and long non-coding RNAs (lncRNAs), has been associated with cancer (Kim et al., 2017;Slack and Chinnaiyan, 2019). Despite the immense potential of cell-free RNAs, the promise of non-invasive RNA biomarkers of cancer has not yet been fulfilled.
One explanation for the lack of circulating RNAs used in clinical settings is our limited understanding of the prediagnostic dynamics of cell-free RNAs, since studies are usually based on samples taken at or after diagnosis. Carcinogenesis is a multistep process that turns cell functions from normal to malignant (Hanahan and Weinberg, 2000). It can cause temporal changes in RNA levels linked to cellular processes driven by the hallmarks of cancer (Gutschner and Diederichs, 2012;Hanahan and Weinberg, 2000). We have shown that prediagnostic RNA levels in serum are highly dynamic in LC patients, which may signal early carcinogenesis. A similar result was observed in breast cancer (Lund et al., 2016) and testicular cancer patients (Burton et al., 2020). A lack of reproducibility among studies is also a problem, caused by technical and biological factors such as storage time, sampling procedure, age, sex, and smoking history. It is therefore important to control for these factors.
In the present study, our objective was to identify serum RNA-based biomarkers for early diagnosis of LC using prediagnostic samples. We identified the optimal machine learning (ML) algorithm for RNA biomarker modeling. Optimization of prediction models was done with an ML workflow, including cross-validation and testing, which was repeated five times to increase the generalizability of our results. We also investigated the biological relevance of the best RNA separators in the context of cancer biomarkers.

Patient characteristics and RNA-seq profiles
In this study, we selected 400 patients with prediagnostic serum samples including multiple samples from the same patients. We also included 525 individuals as controls. After excluding failed or low input samples, we obtained RNA-seq data from 1061 serum samples. However, samples from individuals without any smoking history (i.e. never smokers) or missing information were excluded from initial analyses. This resulted in 535 cases and 263 control samples from 645 current or former smokers for modeling and testing (Table 1 and Figure 1A). Non-smokers consist of 7 cases and 256 control samples from 260 individuals (Supplementary file 1). We used non-smokers in a leave-out set only to test our final models and to calculate relative risk (RR).

ML algorithms can differentiate between prediagnostic cases and controls regardless of prediagnostic time
We first evaluated the classification performance of the ML algorithms in terms of average AUCs on test datasets, created by five random repeats as explained in Materials and methods.
Although the models of all algorithms had comparable performances in terms of average AUCs, they differ in terms of total number of non-zero features (i.e. different model complexity). For example, random forest (RF) selected more than 3000 non-zero features while the lasso model selected fewer than 25 features. However, the profiles of the top features, ranked in terms of feature importance, usually consisted of similar RNAs (e.g. miRNAs or tRFs).
Since XGBoost produced the most predictive full-time models, we used it for the remaining analyses. We also investigated the best predictors of the XGBoost models and ranked them based on their importance (Supplementary file 2). The top three features were an isomiR of hsa-miR-486-5p (iso-23-8YUYFYKSY), piR-hsa-28723, and INTS10 for all histologies; Y-RNA, piR-hsa-28723, and GBP3 for NSCLC; and tRF-BS68BFD2, RN7SL724P, and tRF-947673FE5 for SCLC. An in-depth investigation of the features selected by the other algorithms also showed common RNAs. For example, Y-RNA and the iso-23-8YUYFYKSY isomiR were among the top predictors of the RF, elastic-net, SGL, and lasso models for NSCLC, as was tRF-BS68BFD2 for SCLC. We also performed KEGG pathway enrichment analysis based on the common miRNA, mRNA, and isomiR features. The results showed that many cancer-related pathways were significantly (p < 0.01) enriched, such as MAPK signaling, mTOR signaling, and AMPK signaling.
We evaluated the classification performance of the XGBoost algorithm by selecting one sample per patient rather than using all samples from the same individuals. Our results showed comparable performance in terms of AUCs for all models (Figure 2-figure supplement 2). The SCLC models performed slightly worse than the others. This discrepancy can be explained by the relatively small sample size of this group. Therefore, we decided to use all samples from the same individuals.

Feature selection improves model performance and reduces model complexity
Single RNA class models also implied that feature selection can further improve model performance. Thus, we tested two feature selection methods. The results showed that lasso feature selection improved AUC values and reduced complexity (Table 2). The most important features of lasso-selected models included hsa-miR-423-5p isomiR (iso-20-5KP25HFF), GBP3, and piR-hsa-28723 for all histologies; Y-RNA, hsa-miR-423-5p isomiR (iso-20-5KP25HFF), and LINC01362 for NSCLC; HIST1H4E, PTCH2, and tRF-R29P4P9L5HJVE for SCLC (Supplementary file 2). Moreover, univariate significant feature selection greatly reduced model complexity with an acceptable performance (Table 2). For example, SCLC models only included 11 RNAs. The most important features were GBP3, LINC01362, and hsa-miR-30a-5p for all histologies; LINC01362, GBP3, and tRF-9MV47P596V for NSCLC; piR-hsa-7001 and tRF-7343RX6NMH3 for SCLC (Supplementary file 2).
Figure 3. Sliding windows analysis showed better models when prediagnostic samples were restricted to specific time intervals, such as small-cell lung cancer (SCLC) models restricted to samples from 2 to 5 years prior to diagnosis (see the first and the second panel, red dots). Each color represents a different histology: black and red include only non-small cell lung cancer (NSCLC) and SCLC samples, respectively, while blue includes all histologies, including others (Figure 3-source data 1).
The online version of this article includes the following source data for figure 3: Source data 1. Source data of all the panels for Figure 3.

Histology-specific prediagnostic models can improve prediction performance
We previously demonstrated that RNA levels are dynamic and histology-specific in prediagnostic samples. We therefore trained and tested models stratified by prediagnostic time, with intervals selected by a sliding window approach as explained in Materials and methods. The results showed that including prediagnostic time and histological subtype together creates better models for specific time intervals (Figure 3). For example, SCLC models restricted to samples from 2 to 5 years prior to diagnosis had an average AUC of 0.84 (95% CI, 0.77-0.9) (Figure 3). Another model of SCLC samples that only utilized miRNAs and was restricted to 3-5 years prior to diagnosis had an average AUC of 0.85 (95% CI, 0.76-0.93) on the test datasets. Both SCLC models selected the same miRNAs as their most important features, such as hsa-miR-30a-5p, hsa-miR-339-3p, and hsa-miR-215-5p. Besides miRNAs, an isomiR of hsa-miR-451a and RN7SL181P were the most important features of prediagnostic SCLC models. Enrichment analysis of the most important features identified signaling pathways, such as MAPK, PI3K-Akt, and RAS, and other pathways such as choline metabolism, cellular senescence, and PD-L1 expression and PD-1 checkpoint. Similarly, NSCLC models restricted to 6-8 years prior to diagnosis had an average AUC of 0.81 (95% CI, 0.75-0.86). The most important RNAs of this period were tRF-YP9L0N4V3, an isomiR of hsa-miR-484 (iso-23-8K4P8R8SDE), and tRF-9MV47P596V. More than 70 pathways were enriched, such as endocytosis, MAPK, RAS, choline metabolism, and the neurotrophin signaling pathway.
Table 3. All selected features, performance, and relative risk (RR) of XGBoost models. Columns: All*, NSCLC, SCLC.
As an alternative to sliding windows, we also performed a fixed window approach and trained models using samples from up to 2 years (0-2), up to 5 years (0-5), and up to 8 years (0-8) before diagnosis. The results showed a slight improvement in model performance compared to full-time models (see Supplementary file 3). However, sliding windows models performed better on specific time intervals.

Frequent features can create simple and accurate models
We created models by compiling the best features from the full-time models. Our results showed improved prediction performance for these models despite the inclusion of leave-out datasets in the test set (see Materials and methods). In the test datasets including only smokers, the AUC was 0.76 (95% CI, 0.68-0.83) for the all-histologies model, 0.78 (95% CI, 0.70-0.85) for the NSCLC model, and 0.88 (95% CI, 0.83-0.94) for the SCLC model (Table 3). However, when non-smokers were also included in the test set, the AUC dropped to 0.68 (0.63-0.72) for all histologies and 0.68 (0.63-0.73) for NSCLC. Remarkably, the SCLC model still had an AUC of 0.84 (0.79-0.9) when non-smokers were included.
The RRs and their associated p-values on the test dataset, with and without non-smokers, are reported in Table 3. A positive test in smokers suggests more than two times higher risk of getting LC diagnosis in future.
We also investigated the prediagnostic models, using the ML workflow, and selected two pairs of models for NSCLC and SCLC, which showed high performance before and after 5 years prior to diagnosis (see Supplementary file 4). We found that NSCLC models restricted to 0-2 and 6-8 years before diagnosis had average AUCs of 0.89 (95% CI, 0.84-0.96) and 0.82 (95% CI, 0.76-0.88), respectively; SCLC models restricted to 2-5 and 8-10 years before diagnosis had average AUCs of 0.89 (95% CI, 0.77-1.0) and 0.83 (95% CI, 0.69-0.97), respectively. We reported other model metrics and the best features in the supplementary document (Supplementary file 4).

Discussion
In this study, we showed that ML models of prediagnostic serum RNA levels can be used to predict LC years before diagnosis and manifestation of disease symptoms. Our models achieved clinically relevant performance in terms of AUC, accuracy, sensitivity, and specificity (Tables 2 and 3). The model performance was further increased for specific prediagnostic time windows and histologies making it feasible to develop them as biomarkers for LC screening (Figure 3). A collection of the best models (and predictors) ( Table 3 and Supplementary file 4) can predict risk for developing LC, which histologies to look for and indicate the level of cancer progression. The time window of the high-performance models may be a first indication of how often to screen for LC (Figure 4). Our study is unique in including serum samples collected up to 10 years prior to LC diagnosis and a large set of control samples.
We previously reported that prediagnostic circulating RNA signals are highly dynamic in LC patients and that they can be histology and stage dependent. In the present study, ML models using all samples regardless of stage, histology, or prediagnostic time successfully separated LC patients from controls. All the tested algorithms consistently produced acceptable AUC values (Figure 2A). The best algorithm, XGBoost, resulted in an average accuracy of 69% without feature selection. An analysis of the features showed a large panel of selected RNAs: more than 300 of the 3306 available (with no feature selection implemented). This may be interpreted as a general shift in the levels of RNAs during cancer development, consistent with our previous study that showed hundreds of RNAs were differentially expressed up to 10 years before diagnosis.
Besides miRNAs and isomiRs, RNAs of other classes were noteworthy and linked to carcinogenesis. For example, 7SL, a member of the miscRNAs, is upregulated in tumor cells. It binds to TP53 mRNA at the 3'UTR region and downregulates its expression (Abdelmohsen et al., 2014). 7SL-related transcripts (e.g. RN7SL181P) were among the most important separators in the all-histology, NSCLC-, and SCLC-specific models. Another example is Y-RNA: we found that Y-RNA and related genes (e.g. RNY4P30) were among the most important features for NSCLC models. Y-RNA was also chosen as an important feature by the lasso-selected NSCLC models. Y-RNA-derived small RNAs function as tumor suppressors in NSCLC. They inhibit cell proliferation and were proposed as circulating RNA biomarkers since they were upregulated in NSCLC EVs (Li et al., 2018).
Inclusion of both prediagnostic time and histology produced better models in certain time windows (e.g. 2-5 years before diagnosis for SCLC) (Figure 3). This can be explained by the dynamic nature of prediagnostic RNA levels (Lund et al., 2016;Umu et al., 2020). Important features of these models can also be linked to early carcinogenesis and some were specific to these models. For example, hsa-miR-339-3p was among the most important features of SCLC prediagnostic models. Hsa-miR-339-3p is a tumor suppressor and was proposed as a serum biomarker of LC (Yu et al., 2019). We retrained some of these prediagnostic models using the most frequent features and achieved higher prediction performance than the full-time models in specific time intervals. We reported these models in the supplementary material (Supplementary file 3).
The most important features of histology-specific models also showed associations with carcinogenesis-related KEGG pathways, which were common or specific to histology. The common ones include well-known signaling pathways (e.g. RAS, PI3K-Akt, MAPK, ErbB) and cancer-related pathways (e.g. proteoglycans in cancer and pathways in cancer). The choline metabolism in cancer pathway was one of the common ones and was enriched in some histology-specific prediagnostic models. Altered choline profiles are characteristic of tumor tissues (Glunde et al., 2006). Moreover, a serum lipidome profiling study on early stage NSCLC patients proposed choline-containing phospholipids as potential LC biomarkers (Klupczynska et al., 2019). Enrichment of the choline metabolism pathway years before diagnosis (i.e. NSCLC 6-8 and SCLC 2-5) supports this conclusion. We also reported enrichment of this pathway for all histologies before diagnosis in our previous study.
A strength of our study is the large sample size from prediagnostic cases and a large control group of cancer-free individuals from the same cohort. We have detailed information on histological subtype and stage at diagnosis from the Cancer Registry of Norway (CRN) and smoking history from survey data. We also accounted for other potential confounders (i.e. age, sex, and blood donor group [BDg]). Some of our potential biomarkers (e.g. hsa-miR-30a-5p, hsa-miR-339-3p, 7SL) were already associated with carcinogenesis or proposed as biomarkers, which shows consistency with the current literature. Further, we found potential biomarkers from overlooked RNA classes, which adds important new knowledge to the field. We shared the average feature importance of all RNAs as supplementary tables (Supplementary file 2). We investigated the performance of different algorithms, which showed consistent results in terms of AUCs and features. We compiled shortlists from the most important features and tested their performance in a leave-out dataset on both smokers and non-smokers. We also found that smokers with a positive test had more than two times higher risk of receiving an LC diagnosis in the future (Table 3).
There are some weaknesses in our study that we need to address. First, an independent cohort should replicate our results. However, only a few cohorts include prediagnostic samples that can be used for discovery and validation. We tried to overcome this issue by using training-testing repeats for assessing generalisability. We also reported our results with and without feature selection since some feature selection methods (e.g. lasso and univariate) can cause overfitting. Second, using more than one sample from the same individual can potentially cause overfitting. However, we did not detect any effect related to this issue (Figure 2-figure supplement 2). Third, our study focused only on smokers (since case samples are mostly from smokers). However, our results show acceptable performance when non-smokers are included as a test dataset as well. Fourth, reuse of the same data for frequent biomarker models (as reported in Table 3) can also result in overfitting. We tried to overcome this issue by including a leave-out dataset (which was never used for model building) in the test set and reported the resulting performance. Lastly, since our samples are long-term stored, some unstable RNA molecules may have been degraded over the years, though we have already shown that this effect is negligible. Nevertheless, we matched cases and controls for BDg, which includes the effect of storage time (see Materials and methods).
In LC screening programs, RNA biomarkers can be used as a tool of initial assessment or combined with LDCT for early detection (Hanash et al., 2018). We found that smokers with a positive test had higher risk of receiving an LC diagnosis in the future (Table 3). We also found that our biomarkers can potentially be used on non-smokers, especially SCLC biomarkers. However, we do not have enough non-smoker cases to further validate this interpretation. The dynamic nature of the prediagnostic signal for cancer may pose challenges for the performance of modeling and biomarker development. However, using a set of models specific for histology and time might provide additional information useful in evaluating LC risk (Figure 4). Our proposed use of RNA biomarkers starts with risk assessment using standard full-time models, which can be used for an initial assessment in smokers when the disease is undetectable. A positive signal (i.e. high probability of being in the LC group) classifies those individuals into an elevated risk group. Since prediagnostic models have a 2-year peak performance, testing every second year with these models can provide confirmation of preneoplasia or an early stage tumor for individuals with elevated risk, as well as selection criteria for CT monitoring. Prediagnostic models had higher overall specificity (more than 80%), which can help to determine the histology of a future diagnosis. However, this requires further research. We selected two sets of histology-specific diagnostic models for early/late NSCLC and SCLC diagnosis and reported these in the supplementary document (Supplementary file 4). RNA biomarkers can prevent unnecessary use of LDCT while improving the chance of diagnosing LC at an early stage. This hypothesis can be investigated in screening programs for validation.

Conclusion
We have shown that LC can be detected in both smokers and non-smokers years before diagnosis and the manifestation of symptoms regardless of histological subtype. We also proposed a model on how RNA biomarkers can be utilized in clinical settings. Our top performing models can produce AUCs up to 0.9 before diagnosis suggesting a great potential for LC early prediction.

Study population and data sources
We used the population-based Janus Serum Bank (JSB) cohort containing prediagnostic serum samples. The study participants were identified by linking the JSB to the CRN. We restricted our analyses to patients diagnosed with LC up to 10 years after blood donation and control samples from individuals who were cancer-free (except for non-melanoma skin cancer) for at least 10 years after sample collection. We matched cases and controls on confounders (see Bioinformatic analyses). Smoking status, collected from health survey data, was classified as current, former, or never smoker (Hjerkind et al., 2017). Since we have previously shown that smoking significantly affects RNA expression levels, we only included smokers (i.e. current and former) in the initial analyses and model building. However, non-smokers and samples not included by the frequency matching were used as an additional independent leave-out dataset to assess the level of overfitting.

Tumor staging
Detailed cancer information was selected from the CRN, which has systematically collected mandatory notifications on cancer occurrence for the Norwegian population since 1952 (Larsen et al., 2009). The cases were classified into histological subtypes: NSCLC, SCLC, and others, the latter referring to other less defined or multiple histologies. Stage at diagnosis was encoded with the TNM system: early (localized, stage I), locally advanced (regional, stages II and III), advanced or metastatic (distant, stage IV), and unknown (Cancer Registry of Norway, 2020).

Laboratory processing
We extracted RNA from 400 µL serum using phenol-chloroform and miRNeasy Serum/Plasma kit (Qiagen, Valencia, CA). We performed size selection using a 3% Agarose Gel Cassette (Cat. No CSD3010) on a Pippin Prep (Sage Science) with a cut size optimized to cover RNA molecules from 17 to 47 nt in length. Libraries were prepared with the NEBNext Small RNA kit (NEB, Ipswich, MA) and sequenced on a HiSeq 2500 platform to on average 18 million sequences per sample (Illumina, San Diego, CA).

Bioinformatic analyses
Our bioinformatics workflow includes quality control, adapter trimming, read mapping, read counting, and creation of count tables. We used a large annotation dataset containing several RNA classes available in serum, including miRBase (v22.1) for miRNAs (Kozomara et al., 2019), piRBase (v1.0) for piRNAs (Zhang et al., 2014), and GENCODE (v26) for other RNA classes (Harrow et al., 2012). We used the AdapterRemoval tool for adapter trimming (Schubert et al., 2016) and Bowtie2 (Langmead and Salzberg, 2012) for mapping reads to the human genome (hg38), with an average mapping ratio of 70%. The SeqBuster tool was used for miRNA annotation counts and isomiR calling (Pantano et al., 2010). We filtered out the RNAs with fewer than five reads in less than 80% of the samples. All isomiRs that passed the expression filter were regarded as bona fide isomiRs. We used DESeq2's (Love et al., 2014) variance stabilizing normalization function to normalize identified RNA counts. The optmatch (v0.9-11) R package (Hansen and Klopfer, 2006) was used to select appropriately matched controls while building models. We matched LC samples and controls on sex, age at donation, and BDg. BDg is a technical confounder combining the effect of sample treatment at donation and storage time. We used the R function kegga from the limma package for KEGG pathway enrichment analysis of selected RNA features that were miRNAs, isomiRs, or mRNAs. The miRNA and isomiR targets were extracted from miRDB (v5.0) predictions (Wong and Wang, 2015) (score cutoff >60). p-Values were adjusted using the false discovery rate (FDR) (using the p.adjust function of R).
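The FDR adjustment mentioned above can be illustrated with a minimal sketch of the Benjamini-Hochberg procedure, which is what R's p.adjust computes when its method is set to "fdr". The Python function name and the example p-values below are illustrative assumptions, not values from this study.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Visit p-values from largest to smallest so a running minimum
    # enforces monotonicity of the adjusted values.
    order = sorted(range(m), key=lambda i: p_values[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for position, idx in enumerate(order):
        rank = m - position  # 1-based rank in ascending order
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Made-up p-values for five hypothetical pathways.
raw_p = [0.001, 0.008, 0.039, 0.041, 0.6]
adj_p = benjamini_hochberg(raw_p)
significant = [p < 0.01 for p in adj_p]
```

With these made-up inputs only the first pathway stays below an adjusted threshold of 0.01, which shows how the correction tightens the nominal cutoff.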

ML classification algorithms and training/testing workflow
High dimensionality is often a problem in modeling RNA-seq data. Our preliminary analysis showed that ML algorithms with regularization produced successful models. Therefore, we selected five ML algorithms to create our initial models: lasso, elastic-net, sparse group lasso (SGL), RF, and extreme gradient boosting (XGBoost) algorithms. We used fivefold cross-validation (if available) to tune hyperparameters for model training. For the SGL models, RNAs were classified by type.
R implementations of these algorithms were used: caret (v6.0-84) and glmnet (v2.0-18) packages for elastic-net and the lasso, sglfast (v0.10) and msgl (v2.3.9) for the SGL models, and xgboost (v1.0.0.2) for XGBoost. Classifications were performed according to histology and time to diagnosis (for details see next paragraph) using an automated ML workflow. In the ML workflow, the datasets were split into training (70%) and test (30%) sets (Figure 1B). We repeated this step five times using designated seed numbers to select five different training and test datasets, which were balanced for case/control numbers and also matched for confounders (i.e. sex, age, and BDg). Model optimization, including hyperparameter tuning, was done by a grid search approach followed by fivefold cross-validation using the training sets. The test datasets were only used for testing, to avoid overfitting and assess true performance. The performance of the classifiers was mainly evaluated by the area under the ROC curve (AUC). We also calculated accuracy, sensitivity, and specificity. Confidence interval calculations were done using metrics of the test datasets.
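The repeated split-and-test loop described above can be sketched as follows. This is a simplified illustration under stated assumptions: the split is stratified by case/control label only (it does not match on sex, age, or BDg), the seed values and scores are placeholders, and no model is actually fitted. The helper names are hypothetical and do not come from the study's scripts.

```python
import random


def stratified_split(labels, test_fraction=0.3, seed=0):
    """Split sample indices into train/test, keeping the class ratio."""
    rng = random.Random(seed)
    train, test = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test.extend(idx[:n_test])
        train.extend(idx[n_test:])
    return sorted(train), sorted(test)


def auc(labels, scores):
    """AUC as the probability that a case outranks a control (ties = 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data: 10 cases (1) and 10 controls (0) with made-up prediction scores.
labels = [1] * 10 + [0] * 10
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.45, 0.4, 0.35, 0.3,
          0.65, 0.5, 0.42, 0.3, 0.25, 0.2, 0.15, 0.1, 0.05, 0.02]

test_aucs = []
for seed in (1, 2, 3, 4, 5):  # five repeats; seed values are placeholders
    train_idx, test_idx = stratified_split(labels, 0.3, seed)
    # A real workflow would tune and fit a model on train_idx here.
    test_aucs.append(auc([labels[i] for i in test_idx],
                         [scores[i] for i in test_idx]))
mean_auc = sum(test_aucs) / len(test_aucs)
```

Keeping the test indices out of any tuning step is the point of the design: the averaged test AUC then reflects performance on samples the model never saw.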

Histology and prediagnostic models
We refer to models for all histologies, NSCLC, and SCLC that do not take time to diagnosis into account as standard full-time models (Figure 1B). Prediagnostic models were created using a sliding windows approach and a fixed-time approach to find optimal time-to-diagnosis intervals. We first selected three different window sizes, 2, 3, and 4 years, which were moved over the 10 years prior to diagnosis. We then built models based on samples captured by these sliding windows. Fixed-time windows were 0-2, 0-5, and 0-8 years before diagnosis. We used the workflow described above to train and test both standard and prediagnostic models.
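The sliding windows described above can be enumerated with a short sketch. The 1-year step size and the inclusive window boundaries are assumptions made for illustration; the text specifies only the window sizes (2, 3, and 4 years) and the 10-year horizon. Function names are hypothetical.

```python
def sliding_windows(sizes=(2, 3, 4), horizon=10, step=1):
    """List (start, end) windows, in years before diagnosis."""
    windows = []
    for size in sizes:
        start = 0
        while start + size <= horizon:
            windows.append((start, start + size))
            start += step
    return windows


def samples_in_window(years_to_diagnosis, window):
    """Indices of samples whose time to diagnosis falls in the window."""
    lo, hi = window
    return [i for i, t in enumerate(years_to_diagnosis) if lo <= t <= hi]


windows = sliding_windows()
# Made-up times to diagnosis (years) for four hypothetical case samples.
selected = samples_in_window([0.5, 2.0, 4.9, 7.2], (2, 5))
```

Under these assumptions there are 24 candidate windows, and each one defines the case samples used to train and test one prediagnostic model.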

Feature selection methods
We implemented feature selection methods to improve model performance, including single-RNA class, lasso selection, and significance selection. In the single-RNA class method, we dropped all RNA types except one. In lasso selection, all non-zero features selected by the lasso classification models were pooled. Next, we retrained new classification models which were restricted to use only these features. In significance selection, a univariate regression analysis was done per feature and significant features (adjusted for multiple testing) were used to train classification models.
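The pooling step of lasso selection can be sketched as a set union over the non-zero coefficients of several fitted models. The sketch assumes that pooling is done across repeated lasso fits; the coefficient values below are made up, and only the RNA names are taken from the text.

```python
def pool_nonzero_features(coefficient_sets):
    """Union of feature names with a non-zero coefficient in any model."""
    pooled = set()
    for coefficients in coefficient_sets:
        pooled.update(name for name, weight in coefficients.items()
                      if weight != 0.0)
    return sorted(pooled)


# Made-up coefficients from three hypothetical lasso fits.
lasso_fits = [
    {"GBP3": 0.42, "LINC01362": 0.0, "hsa-miR-30a-5p": -0.10},
    {"GBP3": 0.35, "LINC01362": 0.08, "hsa-miR-30a-5p": 0.0},
    {"GBP3": 0.0, "LINC01362": 0.0, "piR-hsa-28723": 0.21},
]
pooled_features = pool_nonzero_features(lasso_fits)

# Restrict one sample's expression profile to the pooled features
# before retraining a classifier on the reduced feature set.
sample = {"GBP3": 7.1, "LINC01362": 3.3, "hsa-miR-30a-5p": 9.8,
          "piR-hsa-28723": 5.0, "INTS10": 2.2}
reduced_sample = {name: sample[name] for name in pooled_features}
```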

Frequent feature models, independent leave-out test, and RR calculations
We created models for each histology which utilize the most frequent features identified in the standard full-time models. To assess overfitting and to get a better estimate of the performance of these models, we split the datasets into training (80%) and test (20%) sets. To the test sets we also added non-smokers and samples from smokers that had not previously been selected by frequency matching (number of samples reported in Table 3). Neither the unmatched samples nor the non-smokers were ever used for model building or evaluation; we refer to them as leave-out sets. We did not repeat this analysis five times as in the automated ML workflow. RRs were calculated using the test sets. The optimal threshold was identified in cross-validation. We used the R packages cutpointr (v1.0.1) and epitools (v0.5-10.1) to calculate RRs.
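As a worked illustration of the RR calculation, the sketch below computes the risk ratio from a 2x2 table of test result (positive or negative at the chosen threshold) against outcome (case or control). The Wald-type confidence interval on the log scale is an assumption added for illustration and is not confirmed to be the setting used with epitools in this study; the counts are made up.

```python
import math


def relative_risk(pos_cases, pos_controls, neg_cases, neg_controls, z=1.96):
    """Risk ratio of being a case given a positive vs a negative test,
    with a Wald-type interval on the log scale (an assumed choice)."""
    risk_pos = pos_cases / (pos_cases + pos_controls)
    risk_neg = neg_cases / (neg_cases + neg_controls)
    rr = risk_pos / risk_neg
    se_log = math.sqrt(1 / pos_cases - 1 / (pos_cases + pos_controls)
                       + 1 / neg_cases - 1 / (neg_cases + neg_controls))
    lower = math.exp(math.log(rr) - z * se_log)
    upper = math.exp(math.log(rr) + z * se_log)
    return rr, lower, upper


# Made-up counts: 30 of 50 test-positive and 10 of 50 test-negative
# individuals are later diagnosed.
rr, lower, upper = relative_risk(30, 20, 10, 40)
```

Here the risk is 0.6 among test-positive and 0.2 among test-negative individuals, so the RR is about 3; an interval that excludes 1 would indicate an elevated risk after a positive test.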

Data accessibility
The datasets generated for this article are not readily available because of the principles and conditions set out in articles 6 (1) (e) and 9 (2) (j) of the General Data Protection Regulation (GDPR). National legal basis as per the Regulations on population-based health surveys and ethical approval from the Norwegian Regional Committee for Medical and Health Research Ethics (REC) is also required.
Requests to access the datasets should be directed to the corresponding authors. Our scripts and bioinformatics workflow files can be accessed from our GitHub repo (https://github.com/sinanugur/LCscripts, Umu, 2022; copy archived at swh:1:rev:26bccc86a551f71284559db11bb74230f5d00cc4).
• Supplementary file 3. Fixed-time model performance on different histologies.
• Supplementary file 4. Selected prediagnostic models, metrics, and their feature importance tables.

Data availability
The datasets generated for this manuscript are not readily available because of the principles and conditions set out in articles 6 (1) (e) and 9 (2) (j) of the General Data Protection Regulation (GDPR). National legal basis as per the Regulations on population-based health surveys and ethical approval from the Norwegian Regional Committee for Medical and Health Research Ethics (REC) is also required.
Requests to access the datasets should be directed to the corresponding authors with a project proposal. Please refer to our project website for the latest information on data sharing (kreftregisteret.no/en/janusrna). Our scripts, plot data, and bioinformatics workflow files can be accessed from our GitHub repo (https://github.com/sinanugur/LCscripts; copy archived at swh:1:rev:26bccc86a551f71284559db11bb74230f5d00cc4).