Introduction

Nasopharyngeal carcinoma (NPC), which originates from the nasopharyngeal mucosal epithelium, is a malignant tumor with distinct epidemiological, histopathological, clinical, and therapeutic features among head and neck tumors [13]. According to the most recent global cancer statistics, a total of 133,354 new cases of nasopharyngeal cancer were reported, accounting for only 0.7% of all new cancer cases that year [4]. Although the incidence of nasopharyngeal cancer is relatively low compared with other cancer types, it has a unique geographic distribution. About 70% of new nasopharyngeal cancer cases occur in Eastern and Southeastern Asia, while the rest are mainly in South-Central Asia, North Africa, and East Africa [5]. Currently, radiotherapy is used as the primary curative treatment for patients with nasopharyngeal carcinoma. With the continuous advancement of radiotherapy technology, recent studies have shown that hyperfractionated intensity-modulated radiotherapy increases the 3-year overall survival rate in these patients from 55.0% to 74.6% compared to conventional fractionated intensity-modulated radiotherapy [6]. However, approximately 20-30% of patients experience locoregional recurrence of the tumor due to radiation resistance [79]. Studies have shown that radiotherapy resistance is the main reason for treatment failure in nasopharyngeal cancer patients [10, 11]. Given the importance of radiotherapy in nasopharyngeal cancer treatment, it is imperative to investigate the molecular mechanisms of radiotherapy resistance and develop strategies to enhance the radiosensitivity of nasopharyngeal cancer cells.

In recent years, artificial intelligence (AI) has been deeply integrated into the medical field [12]. Machine learning (ML), an important branch of AI, has the ability to process nonlinear data, discover new patterns, and generate predictive models after learning from existing data. Machine learning plays a crucial role in developing predictive markers for tumor treatment and prognosis. Recent studies have focused on utilizing machine learning techniques to construct predictive models for these purposes. These studies employ various methods and algorithms, including integrated learning, immune-derived feature extraction, and AI network-guided signatures. For instance, one study utilized machine learning through tumor MR image feature mapping for patient stratification in adjuvant chemotherapy for locally advanced nasopharyngeal carcinoma [13]. Another study employed machine learning based on MRI/DWI radiomics signatures to predict the prognosis of nasopharyngeal carcinoma [14]. Additionally, machine learning-based algorithms have been developed to create predictive models for identifying lncRNAs associated with tumor-infiltrating immune cells, aiming to improve prognosis and immunotherapy response in cancer patients [15, 16]. These studies aim to enhance patient outcomes and prognosis while providing superior predictive power for personalized clinical treatment strategies. However, there is still a lack of machine learning-based biomarkers for predicting radiotherapy sensitivity in nasopharyngeal carcinoma.

In this study, a combination of multiple machine learning methods and transcriptomic data from nasopharyngeal cancer patients was used to construct a biomarker that can predict the sensitivity to radiotherapy in nasopharyngeal cancer patients, termed the Nasopharyngeal Carcinoma Radiosensitivity Score (NPC-RSS). In addition, we explored the biological mechanisms underlying the relationship between NPC-RSS and radiotherapy response in nasopharyngeal cancer patients. We anticipate that the construction of NPC-RSS will guide the selection of radiotherapy strategies for nasopharyngeal carcinoma patients, thereby improving the clinical outcomes and prognosis of these patients.

Materials and Methods

1. NPC patient data sources and data processing

We downloaded chemotherapy-related gene expression data from the Gene Expression Omnibus (GEO) database for 20 nasopharyngeal carcinoma (NPC) patients (GSE32389). Additionally, we collected samples from 34 NPC patients treated at Zhujiang Hospital of Southern Medical University between March 2017 and July 2021 (in-house cohort). The clinical characteristics of the NPC patients in the in-house cohort are detailed in Supplementary Table 1. Written informed consent was obtained from the NPC patients whose nasopharyngeal carcinoma tissues were collected, and the study was approved by the Ethics Committee of Zhujiang Hospital of Southern Medical University. In this study, the in-house cohort was used as the training set, and GSE32389 was used as the validation set. The raw transcriptome sequencing data from the in-house cohort underwent quality control and preprocessing, which included data quality assessment, removal of low-quality reads, and removal of adapter or primer sequences using tools such as Trimmomatic and FastQC. Subsequently, the pre-processed reads were aligned to a reference genome or transcriptome using the alignment software HISAT2 to identify and localize RNA sequences. The aligned reads were then quantified using featureCounts to obtain the number of reads per gene or transcript. Gene expression levels were then quantified by normalizing the read counts to the length of each gene and the total number of reads, calculating the Fragments Per Kilobase Million (FPKM) values. The data for the GEO database were obtained from the Affymetrix® GPL570 platform. Microarray data from the Affymetrix platform were background corrected, quantile normalized, and log2 transformed using the R package "affy" with the RMA algorithm [17].

2. Construction of NPC-RSS

In the training set, we performed differential analysis of the gene expression data between nasopharyngeal carcinoma patients who were sensitive or resistant to radiotherapy using the limma R package. This analysis identified 71 differentially expressed genes (DEGs) (P < 0.05 & |LogFC| > 0.585, Supplementary Table 2) [18]. Next, we used these DEGs to develop a binary classification model using machine learning algorithms to predict radiotherapy sensitivity in nasopharyngeal carcinoma patients. In the data preprocessing stage, we performed robust multi-array average (RMA) normalization on the gene expression data to ensure that it met the input requirements of the machine learning algorithms, which assume a normal distribution.

(1) First, we integrated a combination of 12 machine learning algorithms, including Least Absolute Shrinkage and Selection Operator (LASSO), Ridge, Elastic Net (Enet), stepwise generalized linear models (Stepglm), support vector machines (SVM), gradient boosting with component-wise linear models (glmBoost), linear discriminant analysis (LDA), partial least squares generalized linear models (plsRglm), random forest (RF), gradient boosting machine (GBM), extreme gradient boosting (XGBoost), and naive Bayes. Among these, RF, LASSO, glmBoost, Stepglm with both forward and backward selection (Stepglm[both]), and Stepglm with backward selection (Stepglm[backward]) can perform feature selection and dimensionality reduction. We combined these algorithms with the others to form 113 combinations of machine learning algorithms. In each of these 113 combinations, one algorithm was used for feature selection, while the other was used to construct a binary classification model. The performance of these models was evaluated using cross-validation to ensure robustness. (2) Next, in the training set, we used these 113 combinations to construct predictive models using the expression data of the 71 DEGs. A total of 48 models were successfully constructed. (3) In the validation set, we used these predictive models to calculate a radiotherapy sensitivity score for each nasopharyngeal carcinoma (NPC) patient. (4) We evaluated the predictive performance of these models in the training and validation sets using receiver operating characteristic (ROC) curves and the average area under the curve (AUC). The final model was named the Nasopharyngeal Carcinoma Radiosensitivity Score (NPC-RSS).

3. Validating key genes of NPC-RSS at the cellular level

3.1 Cell culture

The CNE2 nasopharyngeal carcinoma cell line, previously preserved by the Pearl River Hospital Center for Translational Medicine, was cultured in RPMI 1640 medium supplemented with 10% fetal bovine serum and 1% penicillin/streptomycin. The cells were maintained in a humidified incubator at 37°C with 5% CO2. Cell passaging and division ratios were carefully monitored and maintained within optimal ranges to ensure cell health and experimental consistency. Regular cell line authentication was performed to ensure the use of the correct cell line. Stringent quality control measures were implemented for the culture medium and fetal bovine serum to minimize experimental variability. Precise addition of penicillin/streptomycin was ensured to maintain bacterial resistance while minimizing potential adverse effects on cell growth and viability. Appropriate cryopreservation and thawing protocols were followed to ensure optimal cell preservation and recovery. Strict adherence to aseptic techniques was maintained throughout the cell culture process to prevent bacterial and fungal contamination. Continuous monitoring of the incubator’s internal environment was performed to maintain stable CO2 concentration and temperature. Meticulous execution of these steps is crucial for maintaining the stability and reliability of the cell culture system.

3.2 Cell irradiation and construction of CNE2-R radiation-resistant cells

A 6 MV photon beam was generated at room temperature using a linear accelerator (Elekta, SYNERGY PL, Switzerland) and delivered to the cells at a 180° gantry angle with a dose rate of approximately 600 cGy/min. The bottom of the cell culture vessel was covered with a 1.5 cm thick tissue-equivalent bolus. The source-to-surface distance (SSD) was 100 cm.

Each T25 culture flask was seeded with approximately 2x105 cells in the exponential growth phase and sequentially irradiated with increasing doses (2, 4, 6, 8, and 10 Gy) twice, followed by a single 10 Gy dose, for a total of 9 irradiations and a cumulative dose of 50 Gy. Following the initial 2 Gy irradiation, the cells were cultured and passaged twice before being irradiated with another 2 Gy. The surviving cells were then subjected to the aforementioned irradiation schedule, and the resulting population was designated as the radiation-resistant cell line CNE-2-RS (Fig 2E). Parental cells were used as controls and underwent the same treatment regimen without irradiation.

To maintain radioresistance, the CNE-2-RS cells used in the experiments were exposed to a low dose of approximately 1-2 Gy per month. Cells within 15 passages following the low-dose irradiation were used for subsequent experiments, and all experiments were conducted using cells in the exponential growth phase. The radioresistance of CNE-2-RS cells was subsequently validated using clonogenic assays, with results consistently confirming the maintained radioresistant phenotype.

3.3 RT-qPCR

Total RNA was extracted from CNE2 cells using TRIzol reagent (Invitrogen). Following the measurement of RNA concentration with a Nanodrop 2000 instrument (Thermo Scientific), cDNA synthesis was performed using the Color Reverse Transcription Kit (EZBioscience, USA). The relative mRNA expression levels of each gene were calculated using the 2^(-ΔΔCT) method. The primer sequences of the genes are presented in Supplementary Table 3.

4. Annotation of immune-related features of NPC-RSS key genes

The CIBERSORT algorithm was applied to analyze the expression data of nasopharyngeal cancer patients to infer the relative proportions of 22 immune-infiltrating cells. Additionally, a correlation analysis was performed between gene expression and immune cell content using Spearman’s method. Results with a P value < 0.05 were considered statistically significant. After grouping the nasopharyngeal cancer data based on the previously extracted NPC-RSS, the CIBERSORT method was applied to quantify the immune cell infiltration. The objective was to thoroughly investigate whether there was a significant difference in immune infiltration between the sensitive and resistant groups after stratifying nasopharyngeal cancer patients according to their NPC-RSS. To further elucidate the potential association of immune features in nasopharyngeal cancer, we also investigated the correlation of NPC-RSS key genes with immune-related feature genes obtained from the TISIDB database, including chemokine-related genes, immunosuppressant-related genes, MHC-related genes, immune stimulation-related genes, and receptor-related genes [19].

5. Further validation of NPC-RSS at the single-cell level

The single-cell RNA sequencing (scRNA-seq) data were processed and quantified using the Cell Ranger (version 2.0.1) pipeline. Initially, reads obtained from 10x Genomics were aligned to the human reference genome hg19.The FASTQ sequence files of four samples (Supplementary Table 4) were aligned to the human reference genome hg19 using STAR software and the Cell Ranger "count" module, generating files containing barcode tables and gene tables. The raw gene expression matrices for each sample, generated by the CellRanger software, were merged in R and converted to Seurat objects for downstream analysis using the Seurat R package. Subsequently, cells with the following characteristics were removed: (1) number of unique molecular identifiers (UMIs) less than 201, (2) fewer than 100 or more than 6,000 expressed genes (potential cellular duplicates), and (3) percentage of mitochondrial genes greater than 20%. Additionally, genes expressed in fewer than three cells were removed from the dataset. After quality control, 28,957 high-quality nasopharyngeal carcinoma cells were retained for further analysis. Subsequently, the gene expression matrix was normalized to the total cell reads and mitochondrial reads using linear regression with Seurat’s RegressOut function. Following data normalization, highly variable genes were identified across all individual cells while controlling for the relationship between mean expression and dispersion. Prior to the joint analysis of the four-sample scRNA-seq dataset, we employed the R package Harmony to mitigate batch effects and minimize their impact on downstream analyses [20]. Furthermore, we utilized Seurat’s "CellCycleScoring" function to identify the expression of cell cycle-related genes across various cell subpopulations. We employed a previously defined core set of 43 G1/S and 54 G2/M cell cycle genes [21]. Cells were categorized based on the maximum average expression ("cycle score") across both gene sets. Following cell cycle analysis, no bias attributable to cell cycle genes was observed.

6. Cell clustering analysis and identification of marker genes for cellular subpopulations

Normalization, clustering, differential gene expression analysis, and visualization were performed using the Seurat 3.1 R package. To reduce the dimensionality of the dataset, principal component analysis (PCA) was performed on the 2,000 variably expressed genes. The top 20 principal components were then used as input for the t-distributed stochastic neighbor embedding (t-SNE) algorithm, which was run using the default settings of the RunTSNE function in Seurat. Cell clustering was performed using the ’FindClusters’ function in Seurat with a resolution of 0.3, resulting in the identification of 17 distinct cell clusters (clusters 0-16). Subsequently, the FindMarkers function was employed to identify specific marker genes for each cell subpopulation. The process was conducted as follows: (1) Differentially expressed genes (DEGs) for each cell subpopulation were identified by comparing the cells of that subpopulation to all other cells using the Wilcoxon rank-sum test; (2) Bonferroni-corrected p-values less than 0.05 were used as the cut-off for determining statistically significant DEGs; (3) Genes with an average expression level more than 2-fold higher in a specific cell subpopulation compared to other subpopulations were selected as marker genes. Cell type annotation for each subpopulation was performed based on marker genes reported in previous studies [22], the CellMarker database, and the R package ’SingleR’ [23].

7. GSVA and GSEA analysis of NPC-RSS key genes

GSVA and GSEA analysis of key NPC-RSS genes Based on the expression levels of key NPC-RSS genes, we stratified the patients into high and low expression groups and thoroughly investigated the signaling pathway differences between these groups using Gene Set Enrichment Analysis (GSEA). We utilized the Molecular Signatures Database (MSigDB) gene sets as the annotated gene sets for the subtype pathways to conduct differential expression analysis between subtypes. Gene consistency was assessed using a concordance score calculated by sorting the genes according to their LogFC (adjusted P < 0.05). Furthermore, to investigate the association of NPC-RSS with known radiosensitivity-related genes, we searched for "Radiosensitivity" in the GeneCards database (https://www.genecards.org/) to obtain radiosensitization-related genes, and examined the expression levels of the top 20 genes based on their Relevance scores and their correlation with the top-5 genes in NPC-RSS.

8. Statistical Analysis

The differentially expressed genes between NPCs of the sensitive and resistant groups in the dataset were identified using the R package "limma". For normally distributed variables, statistical differences between two groups were determined by two-tailed t-test, while one-way ANOVA was used for multiple group comparisons. For non-normally distributed variables, statistical differences between two groups were determined by the Wilcoxon rank-sum test, while the Kruskal-Wallis test was used for multiple group comparisons. All statistical analyses were conducted using R software (version 4.1.2). Each experiment was performed independently in triplicate unless otherwise specified. Spearman’s rank correlation analysis was employed to assess the correlation between two variables.

In the single-cell sequencing data processing section, comparisons between two groups were performed using the unpaired two-tailed t-test (implemented in the R package limma) or the Mann-Whitney U test. All statistical analyses and graph visualizations were conducted using the R programming language.

For all experiments, samples from the same patient were processed in parallel. Cells from each patient sample were subjected to single-cell RNA sequencing (10x Genomics) simultaneously but in different passages and vials. Boxplots were generated using the R base software package with default parameters. The boxes span the interquartile range (IQR; from the 25th to the 75th percentile), with the center line representing the median. The lower whiskers extend to the smallest value no further than 1.5×IQR from the 25th percentile. The upper whiskers extend to the largest value no further than 1.5×IQR from the 75th percentile. Due to the large number of data points represented in the box-and-whisker plots, individual data points were not shown to maintain clarity in the overall distribution.

Results

1. Construction of the NPC-RSS Model

As shown in Fig.1, 48 models were successfully constructed, and the area under the curve (AUC) of each model was calculated in both the training and validation sets. The best model combination was found to be glmBoost+NaiveBayes after sample weighting: the glmBoost algorithm was used to screen out the most valuable gene features, and then the NaiveBayes algorithm was employed to identify the most reliable model (Fig. 2A). The model identified 18 genes, which were used as candidate genes for subsequent studies. The NPC-RSS was constructed using logistic regression, and the model prediction formula after secondary modeling was: P(Y=1) = -15.3906 + 2.7369 * C1QTNF3 + 3.8292 * CA11 + 7.6856 * CD9 - 15.9445 * CDK5RAP3 - 11.2475 * CLDN1 + 14.3984 * DMC1 - 6.2474 * EYA1 + 1.6897 * IFI44L + 10.0473 * KNG1 + 0.9464 * KREMEN1 - 6.5371 * NT5DC2 - 10.4192 * NTRK3 + 8.1306 * PSG4 - 0.5755 * RFX4 - 2.1728 * RHOBTB3 - 7.1597 * SLC1A2 + 22.6640 * SMARCA2 - 0.3858 * TRIM58. The top 5 genes characterized in the NPC-RSS, in order of their contribution to radiotherapy sensitivity and weighting coefficients, are SMARCA2, DMC1, CD9, PSG4, and KNG1. The genes with absolute values of weight coefficients greater than 10 were SMARCA2, DMC1, KNG1, NTRK3, CLDN1, and CDK5RAP3 (Fig. 2B).

Flowchart for constructing a predictive model for NPC radiotherapy sensitivity

Differentially expressed genes obtained from local NPC transcriptome data, grouped according to radiosensitivity and radioresistance, were used to predict radiotherapy sensitivity scores (NPC-RSS) of NPC patients using 12 machine learning algorithms, including Lasso, Ridge, Enet, Stepglm, SVM, glmBoost, LDA, plsRglm, RandomForest, GBM, XGBoost, and NaiveBayes. Additionally, 48 other combinations of validated frameworks were constructed to predict the radiotherapy sensitivity score (NPC-RSS) of nasopharyngeal carcinoma patients. The most effective NPC-RSS was finally constructed based on the combination of glmBoost+NaiveBayes, which yielded the best AUC. The role and biological significance of NPC-RSS in NPC radiotherapy sensitivity were comprehensively explored through tumor immune microenvironment analysis, pathway enrichment analysis, and single-cell transcriptomic analysis.

Consensus NPC-RSS construction and validation using an integrated machine learning approach

(A) Based on 48 combined validation frameworks, including 12 machine learning algorithms (Lasso, Ridge, Enet, Stepglm, SVM, glmBoost, LDA, plsRglm, RandomForest, GBM, XGBoost, and NaiveBayes), the area under the curve (AUC) of each model was calculated for both the training and validation sets. Heatmaps were generated based on the sample-weighted AUC values. (B) Demonstration of NPC-RSS genes with absolute values of weight coefficients greater than 10, based on the glmBoost and NaiveBayes combination. (C) Receiver operating characteristic (ROC) curves for the training and validation sets. (D) Volcano plot depicting differentially expressed genes between the sensitive and resistant groups of the CNE-2 cell line. (E) Timeline of resistance strain induction in the CNE-2 cell line. (F) Expression of the top 5 weighted genes in the NPC-RSS signature for the CNE-2 cell line. (G) Heatmap of differentially expressed genes between the sensitive and resistant groups of the CNE-2 cell line. (H) Heatmap of z-scores for NPC-RSS genes in the CNE-2 cell line. (I) Analysis of differences in sensitivity scores among CNE-2 cell lines.

The prediction model constructed using these 18 genes demonstrated good diagnostic efficacy, with an area under the receiver operating characteristic (ROC) curve (AUC) of 0.996 for the training set. The GSE32389 dataset was employed as an external validation set, yielding an AUC of 0.823, which demonstrated the model’s robustness. The overall sample-weighted AUC for the combined training and validation sets was 0.932 (Fig. 2C). The genes used to construct the model were derived from our center’s NPC transcriptome expression data and underwent differential expression analysis, resulting in 2,932 differentially expressed genes (DEGs) (Fig. 2D).

Our previous analysis revealed that 18 gene features were consistently expressed in the NPC dataset (Supplementary Fig. 1). Furthermore, these 18 gene features exhibited significant differential expression between the sensitive and resistant groups (Supplementary Fig. 1).

2. In vitro validation of NPC-RSS

The induction process of CNE2-RS radiation-resistant cells is shown in Fig. 2E. RNA was transcribed into cDNA, sequenced, and analyzed upstream by a high-throughput sequencer to obtain count data, which was further transformed into FPKM values for subsequent analysis. Differential analysis was performed between the parental and resistant groups using the limma package. The analysis results showed that SMARCA2 and CD9, among the key genes of NPC-RSS, were significantly highly expressed in the sensitive group (p < 0.001, t-test), which was consistent with the expression trend of the key genes in the previously constructed NPC-RSS (Fig. 2F). The heatmap of the differential analysis is shown in Fig. 2G. We also clustered these data based on the 18 genes in NPC-RSS after Z-score normalization for the heatmap. The results were consistent with the expression trend in NPC-RSS (Fig. 2H). Meanwhile, we calculated the radiotherapy sensitivity scores of three pairs of replicated CNE2-P and CNE2-RS samples using NPC-RSS and used the mean values for statistical testing. The results showed that the sensitivity scores of the sensitive group were significantly higher than those of the resistant group, further confirming the reliability of NPC-RSS (Fig. 2I).

3. Annotation of key NPC-RSS genes and immune-related features

To thoroughly explore the potential mechanism by which the NPC-RSS predicts radiosensitivity, we stratified the local NPC data into groups based on the NPC-RSS developed in this study and examined the association between these groups and immune infiltration. The proportion of immune cell content in each patient and the correlations among immune cells were visualized using various formats. The results demonstrated a significant elevation in the levels of T follicular helper cells, regulatory T cells (Tregs), and activated NK cells in the samples from the radiosensitive group compared to those from the radioresistant group (Fig. 3A), suggesting that the radiosensitive group exhibited a more immunologically active infiltration profile.

Annotation analysis of in-house NPC cohorts based on NPC-RSS predictive grouping with immune-related features. (A) Comparison of immune cell infiltration between NPC-sensitive and resistant tissue groups. *P < 0.05, P < 0.01. (B) Bubble plot depicting the correlation between the top 5 weight-ranked gene features in the NPC-RSS (SMARCA2, DMC1, CD9, PSG4, KNG1) and tumor-immune-infiltrating cells in the radiotherapy-sensitive group. Bubble size represents the proximity of the P-value to zero, with orange and blue colors indicating the strength of positive and negative correlations, respectively. (C) Analysis of interactions among 22 different immune cell types in patients from the NPC-sensitive group. *P < 0.05, P < 0.01, *P < 0.001. (D) Correlation analysis of the top 5 weight-ranked genes in the NPC-RSS (SMARCA2, DMC1, CD9, PSG4, KNG1) with functionally diverse immune genes in the radiotherapy-sensitive group. *P < 0.05, P < 0.01, ***P < 0.001.

In this study, we further explored the relationship between NPC-RSS key genes and immune cells. We found that several key genes were highly correlated with immune cells: SMARCA2 was significantly positively correlated with resting CD4+ memory T cells; DMC1 was significantly positively correlated with resting NK cells; CD9 was significantly positively correlated with resting CD4+ memory T cells; and KNG1 was significantly negatively correlated with M1 macrophages, γδ T cells, monocytes, eosinophils, neutrophils, and other immune cell types (Fig. 3B). Furthermore, all these immune cell types were significantly correlated with each other (Fig. 3C). Correlations between these five key genes and various immune factors, including immunomodulators, chemokines, and cellular receptors, were obtained from the TISIDB database (Fig. 3D). These analyses suggest that the NPC-RSS key genes are closely associated with the level of immune cell infiltration and play a crucial role in the immune microenvironment.

4. Validation of NPC-RSS at the single-cell level

The expression patterns of key NPC-RSS genes at the single-cell level exhibited a stronger correlation with model predictions. We analyzed the single-cell transcriptome profiles of four patients. Among the four patients, three were sensitive to radiotherapy, while one was resistant. Following single-cell suspension preparation, library construction, sequencing analysis, and quality control of the obtained data, we acquired transcriptome data for a total of 28,957 cells. Among these, 22,796 cells were derived from the radiotherapy-sensitive group, and 6,161 cells were from the radiotherapy-resistant group. Unsupervised clustering was performed on the 28,957 cells obtained from sequencing to identify cell subpopulations based on the similarity of their gene expression profiles. As illustrated in Figure 4, 17 cell subpopulations were identified through clustering. By examining the expression profiles of cell type-specific marker genes, we determined that these 17 cell subpopulations belonged to 7 distinct cell types: T cells, myeloid cells, B cells, epithelial cells, neutrophils, fibroblasts, and mast cells (Figure 4A).

Biological characterization of key NPC-RSS genes at the single-cell level. (A) Clustered UMAP plot of three radiotherapy-sensitive and one radiotherapy-resistant sample with a total of 28,957 cells. Each color represents a cellular subpopulation (see cellular subpopulation annotations on the right). (B) Myeloid cells, epithelial cells, fibroblasts, and mast cells were significantly more abundant in samples from the radiotherapy-sensitive group compared to the radiotherapy-resistant group. (C) NPC-RSS scores displayed on all cells, with redder colors indicating higher scores. (D) NPC-RSS model gene expression in all cell subpopulations. Redder colors indicate higher expression, while bluer colors indicate lower expression. (E) Histogram showing the percentage of seven cell subpopulations in the radiotherapy-sensitive and radiotherapy-resistant groups. Different colors indicate different cell subpopulations. (F) Expression of marker genes used to annotate various cell subpopulations.

T cells expressed CD3D, CD8A, and CD4 genes. Myeloid cells expressed C1QB, LYZ, and AIF1 genes. B cells expressed MS4A1 and CD79A genes. Epithelial cells specifically expressed KRT5 and EPCAM genes. Neutrophils expressed specific marker genes. Fibroblasts expressed DCN and COL1A1 genes. Mast cells expressed CPA3, TPSAB1, and MS4A2 as marker genes. We used NPC-RSS to calculate the radiotherapy sensitivity scores of each cell. The predicted scores of cells in the sensitive group were significantly higher than those in the resistant group, especially in epithelial cells, myeloid cells, fibroblasts, and mast cells (Fig. 4B). The sensitivity scores of all cells are shown in Fig. 4C. Moreover, most of the 18 genes in the NPC-RSS were related to immune cells (Fig. 4D). Interestingly, the radiotherapy-sensitive group showed significant dominance in terms of immune cells, whereas the radiotherapy-resistant group was mainly composed of malignant epithelial and stromal cells (Fig. 4E). It is noteworthy that T cells, myeloid cells, epithelial cells, mast cells, and fibroblasts consisted of multiple cell subpopulations (Fig. 4F), suggesting that these cell types may be heterogeneous.

5. Pathway enrichment analysis findings

Next, we investigated the specific signaling pathways associated with the key genes of NPC-RSS and explored the potential molecular mechanisms by which these genes influence radiosensitivity in nasopharyngeal carcinoma. GSVA results demonstrated that high expression of SMARCA2 enriched signaling pathways such as WNT_BETA_CATENIN_SIGNALING, PI3K_AKT_MTOR_SIGNALING, TGF_BETA_SIGNALING, and NOTCH_SIGNALING (Fig5A). Similarly, high expression of DMC1 enriched INTERFERON_GAMMA_RESPONSE and IL2_STAT5_SIGNALING pathways (Fig5B).

Moreover, GSEA results showed that SMARCA2 was enriched in pathways such as the JAK-STAT signaling pathway, NF-kappa B signaling pathway, and T cell receptor signaling (Fig. 5C). DMC1 was enriched in pathways such as cell adhesion molecules, Hedgehog signaling pathway, and phosphatidylinositol signaling system (Fig. 5D). These findings suggest that NPC-RSS key genes may influence radiotherapy sensitivity in nasopharyngeal carcinoma patients through these pathways.

GSVA and GSEA analyses of NPC-RSS key genes and their correlation with radiosensitization genes. (A) GSVA of SMARCA2. (B) GSVA of DMC1. (C) GSEA of SMARCA2. (D) GSEA of DMC1. (E) Pearson correlation bubble plot of the top five NPC-RSS genes (SMARCA2, DMC1, CD9, PSG4, and KNG1) with radiosensitization-related genes. The larger the bubble, the closer the p-value is to zero; the more orange the color, the stronger the positive correlation; the more green the color, the stronger the negative correlation.

6. Correlation between NPC-RSS key genes and radiosensitization genes

Radiosensitivity-related genes were obtained by searching the GeneCards database (https://www.genecards.org/) for the term "Radiosensitivity." The local NPC data were grouped and analyzed based on NPC-RSS. The expression levels of the top 20 genes, ranked by their Relevance scores, were correlated. The results, presented in Supplementary Figure 2, showed significant differences in the expression of several radiosensitivity-related genes between the radiotherapy-resistant and sensitive groups.

Spearman correlation analysis indicated that the expression of SMARCA2 was positively correlated with high expression levels of TP53, EGFR, BRCA2, and ATM. Similarly, DMC1 expression was positively correlated with high expression levels of BRCA1, APTX, ATM, and CCR6, while CD9 expression was positively correlated with high expression levels of EGFR, TP53, ERBB2, and RAD51. The expression of KNG1 was correlated with high expression levels of HULC and significantly negatively correlated with high expression levels of BADA (Fig. 5E).

Discussion

Nasopharyngeal carcinoma (NPC), a malignant tumor, is prevalent in East and Southeast Asia, particularly in South China. Radiotherapy is the primary treatment modality for NPC; however, treatment failure occurs in approximately 20-30% of patients [1]. Radiotherapy sensitivity is a crucial factor influencing treatment outcomes in NPC patients. Thus, it is imperative to comprehensively investigate the mechanisms underlying radiotherapy sensitivity in NPC and identify effective therapeutic targets. Such investigations will facilitate the elucidation of therapeutic resistance mechanisms and provide a foundation for personalized treatment strategies. In the present study, we developed an integrative approach to establish a consistent nasopharyngeal carcinoma radiotherapy sensitivity signature (NPC-RSS) using NPC samples obtained from Zhujiang Hospital of Southern Medical University (Figure 1). In summary, we fitted 113 models to the training dataset using the leave-one-out cross-validation (LOOCV) framework. Further validation using the independent dataset GSE32389 from the Gene Expression Omnibus (GEO) database demonstrated that the glmBoost+NaiveBayes model exhibited the best performance.

The integrated procedure’s advantage lies in its utilization of various machine learning algorithms and their combinations to fit models for NPC radiotherapy sensitivity, achieving a consensus performance. Furthermore, the combination of algorithms can reduce the dimensionality of the variables, resulting in a more simplified and easily translatable model. The predictive validity of NPC-RSS has been further validated using NPC cell lines generated at our local center. Moreover, we performed a related validation on local single-cell data, which revealed that the radiotherapy-sensitive group exhibited a significant predominance of immune cells, whereas the radiotherapy-resistant group primarily comprised malignant epithelial and stromal cells.

Previous studies have demonstrated that the core genes of NPC-RSS, including SMARCA2, DMC1, CD9, PSG4, and KNG1, are associated with tumor radiosensitization to varying degrees [2428]. The protein encoded by the SMARCA2 gene plays a crucial role in various cellular processes, including cytokine response, DNA repair, regulation of cell proliferation, lineage specification and development, and cell adhesion [2933]. These processes are intimately linked to the radiosensitivity of tumor cells. Cell proliferation and DNA repair are critical determinants of tumor cell response to radiotherapy, and the involvement of SMARCA2 may influence this response [3438]. Furthermore, SMARCA2 is implicated in processes such as cell adhesion and cytokine response, which are tightly coupled to the interactions of tumor cells within the microenvironment. Cell adhesion influences the low-dose hyper-radiosensitivity of a broad spectrum of tumor cells, potentially further modulating their sensitivity to radiotherapy [39].

Radiation is capable of inducing DNA double-strand breaks, and cells must repair this damage through DNA repair mechanisms to maintain genomic stability. DMC1, a key factor in homologous recombination, is involved in the repair process of DNA double-strand breaks [4042]. Consequently, abnormal function or altered expression levels of DMC1 may influence cellular sensitivity to radiation and, consequently, the efficacy of radiation therapy. CD9 is believed to play a role in various stages of cancer development [43]. As a cell surface protein, CD9 is involved in regulating cell-cell adhesion and interactions, which may affect the response of tumor cells to radiation therapy. Moreover, CD9 may impact tumor sensitivity to radiation therapy by modulating the metastatic and invasive properties of tumor cells [4447]. It has been shown that all human PSGs initiate TGF-β1. In vivo, PSGs may increase the availability of active TGF-β1 in both soluble and matrix-bound forms of this potential cytokine [48]. Furthermore, TGF-β1 levels accurately predict radiosensitization in elderly patients with unresectable NSCLC undergoing 3D-CRT [49].

Pregnancy-specific glycoproteins (PSGs) may influence radiosensitivity by regulating transforming growth factor-beta 1 (TGF-β1) activity. In radiation therapy, radiation can induce an inflammatory response in tissues and affect angiogenesis. Kininogen-1 (KNG1), a factor associated with coagulation and inflammatory response [50], may play a regulatory role in radiosensitivity and the effects of radiation therapy. Furthermore, we found that the core genes of the nasopharyngeal carcinoma radiosensitivity signature (NPC-RSS), namely SMARCA2, DMC1, CD9, PSG4, and KNG1, had significant correlations with previously published genes related to radiosensitization (Fig. 5E). The roles of these genes in regulating radiosensitization further support the importance and reliability of the NPC-RSS developed in this study for predicting radiosensitization in nasopharyngeal carcinoma.

Radiotherapy plays a pivotal role in cancer treatment, and there is growing evidence of the potential for synergistic effects between radiation therapy and components of the immune response [51, 52]. However, there are limited data depicting the tumor immune microenvironment (TIME) associated with radiotherapy sensitivity. Indeed, recent preclinical studies have demonstrated that irradiated tumors can act as adjuvants to the immune system, promoting local tumor control and inducing regression of distant tumor deposits through abscopal effects [53]. In addition to potential systemic effects, radiation significantly alters the distribution of tumor antigens, recruits anti-tumor T cells, and induces a distinct inflammatory microenvironment [54]. Some studies have shown that tumors with high immune cell infiltration are more sensitive to radiation [55]; however, the relationship between radiation and immune response remains complex and context-dependent, and the optimal approach to integrating radiation and immunotherapy remains uncertain.

Studies have demonstrated that over one-third of tumor types in the radiotherapy-sensitive group of solid tumors exhibit an enrichment of CD8+ T cells, M1 macrophages, and NK+ cells [55]. To further investigate this phenomenon, we analyzed the immune infiltration between different subgroups of NPC-RSS and annotated the key NPC-RSS genes with immune-related features. The results demonstrated that the NPC radiosensitized group exhibited a higher level of immune infiltration and immune initiation, characterized by a greater abundance of activated NK cells, regulatory T cells (Tregs), and follicular helper T cells. The NPC-RSS key genes also exhibited a significant correlation and synergism with immune-related features. Notably, this synergistic correlation was most prominent in chemokines and chemokine receptor families, the major histocompatibility complex (MHC), immune cell interactions, and signaling pathways. These findings suggest that the radiosensitivity estimated by the NPC-RSS may be associated with a distinct tumor immune microenvironment, which could potentially be exploited through a combinatorial immunotherapy approach.

Notably, the results obtained from the collected single-cell data were consistent with the predicted trends of our constructed model. The radiotherapy-sensitive group exhibited a significant predominance of immune cells, while the radiotherapy-resistant group primarily comprised malignant epithelial and stromal cells. Furthermore, utilizing NPC-RSS to calculate the radiotherapy sensitivity score for each cell revealed that the predicted scores of cells in the sensitized group were significantly higher than those in the resistant group, particularly in epithelial cells, myeloid cells, fibroblasts, and mast cells. This finding suggests that the presence of immune cells might have influenced the tumor’s responsiveness to treatment to some extent prior to radiation therapy. Tumors with a predominance of immune cells may be more likely to exhibit a positive response to radiation therapy, either due to the involvement of immune cells in tumor surveillance and clearance or because the responsiveness of tumors to radiation therapy is modulated by immune cells.

Furthermore, pathway enrichment analysis suggested that the Wnt/β-catenin signaling pathway may be associated with the radiosensitivity of nasopharyngeal carcinoma (NPC). Gene Set Variation Analysis (GSVA) and Gene Set Enrichment Analysis (GSEA) indicated that the differential expression levels of SMARCA2, DMC1, CD9, PSG4, and KNG1 in the core of the NPC radiosensitivity signature (NPC-RSS) might influence various signaling pathways associated with disease progression, involving multiple biological processes such as DNA damage repair, cell death, immune regulation, cell proliferation, metabolism, tumor microenvironment regulation, and cell survival. For instance, the SMARCA2 high-expression group, which was the most significant, was enriched in the UV response, Wnt/β-catenin signaling pathway, NOTCH signaling pathway, TGFβ signaling pathway, PI3K-AKT-mTOR signaling pathway, Interferon Alpha response, Interferon Gamma response, and other signaling pathways. Previous studies have demonstrated that proteins related to the Wnt/β-catenin signaling pathway (DKK-3, β-catenin, and c-MYC) are involved in the development of NPC, and that knockdown of FOXO3a promotes radioresistance in NPC through the Wnt/β-catenin signaling pathway [56, 57]. Therefore, we hypothesized that SMARCA2 may regulate the radiosensitivity of NPC via the Wnt/β-catenin signaling pathway.

The novelty of this study lies in the utilization of nasopharyngeal carcinoma expression data collected from Zhujiang Hospital of Southern Medical University, combined with multiple machine learning models, to develop NPC-RSS, a tool capable of effectively identifying the patient population more sensitive to radiotherapy. The tool was validated using single-cell data and cell lines from the in-house cohort, with the validation results aligning with the predictions of our constructed model. However, this study has several limitations. Firstly, due to the relatively limited datasets in the public database of nasopharyngeal carcinoma that met our inclusion criteria, we utilized a relatively small sample size. Consequently, there was insufficient validation and support for the generalization performance and stability of the model with additional data. Secondly, we induced only a single nasopharyngeal carcinoma cell line and observed results consistent with the trend of gene contributions to our predictive model. In future studies, if available, additional cell lines could be incorporated into our research to provide further validation and support. These limitations should be addressed and the scope of the study expanded in future research.

Conclusion

In this study, we developed a robust Nasopharyngeal Carcinoma Radiotherapy Susceptibility Score (NPC-RSS) using multiple bioinformatics and machine learning algorithms. The expression of key genes SMARCA2, DMC1, CD9, and KNG1 was validated to be consistent with the NPC-RSS in our in-house cell lines. Compared to the resistant group, the sensitive group classified by NPC-RSS exhibited a more abundant and active state of immune infiltration. Furthermore, in single-cell samples, NPC-RSS was elevated in the radiotherapy-sensitive group, where immune cells played a predominant role. Additional analysis uncovered substantial correlations between key NPC-RSS genes and genes previously demonstrated to be involved in the regulation of radiosensitization. Single-gene pathway enrichment analysis revealed that the top 2 NPC-RSS genes were associated with several signaling pathways, including the Wnt/β-catenin signaling pathway, JAK-STAT signaling pathway, NF-kappa B signaling pathway, and T cell receptor signaling pathway. These pathways may influence the immune microenvironment of nasopharyngeal carcinoma, thereby affecting its susceptibility to radiotherapy. Our findings provide a novel perspective for understanding NPC radiosensitivity.

Author Contributions

KL wrote the article; PL,KL and JZ designed the research; KL,JL,NL,JF and XZ performed the research; KL,JY, NL,JF,XZ, PL, and JZ Writing-review and editing.

Acknowledgements

None.

Conflicts of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Ethics Statement

The patients/participants provided their written informed consent to participate in this study and the research presented here has been performed in accordance with the Declaration of Helsinki and has been approved by the ethics committee of the Zhujiang Hospital of Southern Medical University.

Funding

None

Data avaliability

The datasets generated and/or analysed during the current study are available from the corresponding author on reasonable request.