1. Cancer Biology
  2. Computational and Systems Biology

Identifying prostate cancer and its clinical risk in asymptomatic men using machine learning of high dimensional peripheral blood flow cytometric natural killer cell subset phenotyping data

  1. Simon P Hood
  2. Georgina Cosma (corresponding author)
  3. Gemma A Foulds
  4. Catherine Johnson
  5. Stephen Reeder
  6. Stéphanie E McArdle
  7. Masood A Khan
  8. A Graham Pockley (corresponding author)
  1. John van Geest Cancer Research Centre, School of Science and Technology, Nottingham Trent University, United Kingdom
  2. Department of Computer Science, Loughborough University, United Kingdom
  3. Centre for Health, Ageing and Understanding Disease (CHAUD), School of Science and Technology, Nottingham Trent University, United Kingdom
  4. Department of Urology, University Hospitals of Leicester NHS Trust, United Kingdom
Research Article
Cite this article as: eLife 2020;9:e50936 doi: 10.7554/eLife.50936

Abstract

We demonstrate that prostate cancer can be identified by flow cytometric profiling of blood immune cell subsets. Herein, we profiled natural killer (NK) cell subsets in the blood of 72 asymptomatic men with Prostate-Specific Antigen (PSA) levels < 20 ng ml-1, of whom 31 had benign disease (no cancer) and 41 had prostate cancer. Statistical and computational methods identified a panel of eight phenotypic features (CD56dimCD16high, CD56+DNAM-1-, CD56+LAIR-1+, CD56+LAIR-1-, CD56brightCD8+, CD56+NKp30+, CD56+NKp30-, CD56+NKp46+) that, when incorporated into an Ensemble machine learning prediction model, distinguished between the presence of benign prostate disease and prostate cancer. The machine learning model was then adapted to predict the D’Amico Risk Classification using data from 54 patients with prostate cancer and was shown to accurately differentiate between the presence of low-/intermediate-risk disease and high-risk disease without the need for additional clinical data. This simple blood test has the potential to transform prostate cancer diagnostics.

eLife digest

With an estimated 1.8 million new cases in 2018 alone, prostate cancer is the fourth most common cancer in the world. Catching the disease early increases the chances of survival, but this cancer remains difficult to detect.

The best diagnostic test currently available measures the blood level of a protein called the prostate-specific antigen (PSA for short). Heightened amounts of PSA may mean that the patient has cancer, but 15% of individuals with prostate cancer have normal levels of the protein, and many healthy people can have high amounts of PSA. This blood test is therefore not widely accepted as a reliable diagnostic tool.

Other methods exist to detect prostate cancer, yet their results are limited. A small piece of the prostate can be taken for analysis, but results from this invasive procedure are often incorrect. Scans can help to spot a tumor, but they are not accurate enough to be conclusive on their own. New tests are therefore urgently needed.

Prostate cancer is often associated with changes in the immune system that can be detected through a blood test. In particular, the appearance of a type of white blood (immune) cells called natural killer cells may be altered. Yet, it was unclear whether measurements based on these cells could help to detect prostate cancer and assess the severity of the disease.

Here, Hood, Cosma et al. collected and examined the natural killer cells of 72 participants with slightly elevated PSA levels and no other symptoms. Amongst these, 41 individuals had prostate cancer and 31 had benign (non-cancerous) prostate disease. These biological data were then used to produce computer models that could detect the presence of the disease, as well as assess its severity. The algorithms were developed using machine learning, in which previous patient information is used to make predictions on new data. This work resulted in a detection tool that was 12.5% more accurate than the PSA test in detecting prostate cancer, and in a tool that was 99% accurate in predicting the risk of the disease (in terms of clinical significance) in individuals with prostate cancer.

Although these new approaches first need to be validated in the clinic before being deployed, they could ultimately improve the detection and diagnosis of prostate cancer, saving lives and reducing the need for further tests.

Introduction

Early diagnosis and treatment increase curative rates for many cancers. The WHO considers that the burden of cancer on health services can be reduced by early detection, achievable via three integrated steps: 1) awareness and accessing care; 2) clinical evaluation, diagnosis, and staging; 3) access to treatment (http://www.who.int/mediacentre/factsheets/fs297/en/). Although the clinical introduction of the Prostate-Specific Antigen (PSA) test in 1986 increased the early diagnosis of localized prostate cancer (Catalona et al., 1991; Hankey et al., 1999), elevated PSA levels are not necessarily indicative of prostate cancer because PSA levels can be raised by prostatitis, other localised infections, benign hyperplasia and/or factors such as physical stress. Conversely, approximately 15% of men with ‘normal’ PSA levels have prostate cancer, with a further 15% of these cancers being high-grade (https://prostatecanceruk.org/prostate-information/prostate-tests/psa-test). The reliable diagnosis of prostate cancer based on PSA levels alone is therefore not possible, and confirmation using invasive biopsies is currently required. In 2011/12, approximately 32,000 diagnostic biopsies (28,000 TRUS and 4,000 TPTPB) were performed by the NHS in England (NICE, 2014). Although the transrectal ultrasound-guided prostate (TRUS) biopsy is the most commonly used technique, it is limited to taking 10 to 12 biopsies, primarily from the peripheral zone of the prostate, and has a positive detection rate between 26% and 33% (Aganovic et al., 2011; Nafie et al., 2014a; Naughton et al., 2000; Yuasa et al., 2008). The Transperineal Template Prostate Biopsy (TPTPB) is a 36-core technique that samples all regions of the prostate and delivers a better positive detection rate, between 55% and 68% (Dimmen et al., 2012; Nafie et al., 2014b; Pal et al., 2012).
However, invasive biopsies are painful and associated with a significant risk of potentially serious side-effects such as urosepsis and erectile dysfunction (Chang et al., 2013). Given these challenges and risks, considerable interest has developed in the potential of non-invasive blood- or urine-based tests (‘liquid biopsies’) for diagnosing disease (Quandt et al., 2017). Liquid biopsies can provide information about both the tumour (e.g. circulating cells, cell-free and exosomal DNA and RNA) and the immune response (e.g. immune cell composition and their gene, protein, and exosome expression profiles). They are minimally invasive and enable serial assessment and ‘live’ monitoring rapidly and cost-effectively (Quandt et al., 2017).

Based on the reciprocal interaction between cancer and the immune system, we have proposed that immunological signatures within the peripheral blood (the peripheral blood ‘immunome’) can discriminate between men with benign prostate disease and those with prostate cancer, and thereby reduce the dependency of diagnosis on invasive biopsies. To this end, we have previously shown that incorporating a peripheral blood immune phenotyping-based feature set comprising five phenotypic features (CD8+CD45RA-CD27-CD28- (CD8+ Effector Memory cells), CD4+CD45RA-CD27-CD28- (CD4+ Effector Memory cells), CD4+CD45RA+CD27-CD28- (CD4+ Terminally Differentiated Effector Memory cells re-expressing CD45RA), CD3-CD19+ (B cells), and CD3+CD56+CD8+CD4+ (NKT cells)) into a computation-based prediction tool enables better detection of prostate cancer and strengthens the accuracy of the PSA test in asymptomatic men with PSA levels < 20 ng/ml (Cosma et al., 2017). Herein, we extend this approach to determine whether phenotypic profiling of peripheral blood natural killer (NK) cell subsets can also discriminate between the presence of benign prostate disease and prostate cancer in the same cohort of asymptomatic men. We also investigate the potential of the peripheral blood dataset to discriminate between low-/intermediate-risk and high-risk disease in those men with prostate cancer.

Results

Distinguishing between benign prostate disease and prostate cancer: statistical analysis of NK cell phenotypic features and PSA levels

Herein, we consider a ‘feature’ to be a single phenotypic variable (as determined using flow cytometry) or a pre-grouped set of phenotypic variables, as shown in Table 1. It was not possible to discriminate between men with benign prostate disease and men with prostate cancer based on differences between phenotypic features/profiles due to their similarity (Table 1, Figure 1, Figure 2).

Table 1
Descriptive statistics of the dataset.
Each cell reports Benign / Cancer values.

| ID | Feature | Min. | Max. | Mean | Std. | IQR | Range | Diff. |
|---|---|---|---|---|---|---|---|---|
| | PSA | 4.70 / 4.70 | 19.00 / 19.00 | 8.26 / 8.34 | 3.31 / 3.28 | 3.30 / 4.08 | 14.30 / 14.30 | −0.08 |
| | **CD56dim %** | | | | | | | |
| 1 | CD16+ | 83.85 / 73.04 | 96.61 / 96.98 | 90.98 / 90.64 | 3.35 / 5.46 | 4.13 / 5.02 | 12.76 / 23.94 | 0.34 |
| 2 | CD16high | 24.38 / 49.66 | 87.46 / 89.33 | 72.88 / 73.32 | 11.74 / 10.22 | 15.00 / 10.45 | 63.08 / 39.67 | −0.44 |
| 3 | CD16low | 5.17 / 6.57 | 64.22 / 44.00 | 17.74 / 16.84 | 10.40 / 7.45 | 8.76 / 7.66 | 59.05 / 37.43 | 0.90 |
| 4 | CD16- | 1.41 / 1.25 | 11.11 / 18.06 | 4.83 / 4.89 | 2.45 / 3.48 | 2.58 / 2.68 | 9.70 / 16.81 | −0.06 |
| 5 | CD56dim total | 91.29 / 87.24 | 98.70 / 98.70 | 95.81 / 95.53 | 2.02 / 2.58 | 2.96 / 3.02 | 7.41 / 11.46 | 0.28 |
| | **CD56bright %** | | | | | | | |
| 6 | CD16+ | 0.46 / 0.65 | 5.10 / 5.88 | 1.91 / 1.83 | 1.06 / 1.04 | 1.64 / 0.92 | 4.64 / 5.23 | 0.08 |
| 7 | CD16high | 0.09 / 0.12 | 1.97 / 1.15 | 0.60 / 0.47 | 0.44 / 0.25 | 0.50 / 0.40 | 1.88 / 1.03 | 0.13 |
| 8 | CD16low | 0.34 / 0.40 | 3.11 / 4.95 | 1.27 / 1.35 | 0.72 / 0.86 | 0.97 / 0.63 | 2.77 / 4.55 | −0.07 |
| 9 | CD16- | 0.61 / 0.58 | 5.78 / 9.09 | 2.28 / 2.64 | 1.14 / 1.82 | 1.42 / 1.75 | 5.17 / 8.51 | −0.36 |
| 10 | CD56bright total | 1.30 / 1.30 | 8.71 / 12.76 | 4.19 / 4.47 | 2.02 / 2.58 | 2.95 / 3.01 | 7.41 / 11.46 | −0.28 |
| | **CD8 %** | | | | | | | |
| 11 | CD56+CD8+ | 21.88 / 9.20 | 86.70 / 80.47 | 46.43 / 40.71 | 15.64 / 14.66 | 24.03 / 20.05 | 64.82 / 71.27 | 5.72 |
| 12 | CD56+CD8- | 13.30 / 19.53 | 78.12 / 90.80 | 53.57 / 59.29 | 15.64 / 14.66 | 24.03 / 20.05 | 64.82 / 71.27 | −5.72 |
| 13 | CD56dimCD8+ | 19.63 / 8.60 | 82.38 / 77.47 | 45.18 / 39.11 | 15.31 / 14.10 | 24.72 / 19.36 | 62.75 / 68.87 | 6.07 |
| 14 | CD56brightCD8+ | 0.37 / 0.25 | 4.75 / 6.64 | 1.41 / 1.70 | 1.07 / 1.41 | 0.70 / 1.60 | 4.38 / 6.39 | −0.29 |
| | **NKp30 %** | | | | | | | |
| 15 | CD56+NKp30+ | 40.69 / 56.80 | 96.74 / 98.43 | 79.78 / 88.56 | 16.42 / 10.41 | 21.80 / 10.44 | 56.05 / 41.63 | −8.78 |
| 16 | CD56+NKp30- | 3.26 / 1.57 | 58.34 / 44.59 | 20.05 / 11.43 | 16.22 / 10.46 | 20.54 / 10.49 | 55.08 / 43.02 | 8.61 |
| | **NKp46 %** | | | | | | | |
| 17 | CD56+NKp46+ | 38.11 / 45.37 | 86.52 / 95.82 | 62.65 / 69.82 | 13.49 / 11.58 | 23.90 / 12.71 | 48.41 / 50.45 | −7.18 |
| 18 | CD56+NKp46- | 14.02 / 4.32 | 62.97 / 55.68 | 38.40 / 30.87 | 13.58 / 11.64 | 24.89 / 13.44 | 48.95 / 51.36 | 7.53 |
| | **DNAM-1 %** | | | | | | | |
| 19 | CD56+DNAM-1+ | 63.69 / 88.56 | 99.18 / 99.60 | 95.35 / 96.46 | 6.81 / 2.59 | 3.37 / 3.49 | 35.49 / 11.04 | −1.11 |
| 20 | CD56+DNAM-1- | 0.86 / 0.42 | 37.29 / 11.66 | 4.74 / 3.59 | 6.96 / 2.61 | 3.45 / 3.54 | 36.43 / 11.24 | 1.14 |
| | **NKG2D %** | | | | | | | |
| 21 | CD56+NKG2D+ | 85.17 / 80.79 | 98.77 / 98.96 | 93.49 / 94.07 | 4.45 / 4.87 | 6.81 / 3.83 | 13.60 / 18.17 | −0.58 |
| 22 | CD56+NKG2D- | 1.22 / 1.03 | 14.76 / 19.12 | 6.44 / 5.84 | 4.36 / 4.76 | 6.80 / 3.96 | 13.54 / 18.09 | 0.60 |
| | **NKp44 %** | | | | | | | |
| 23 | CD56+NKp44+ | 0.43 / 0.28 | 3.71 / 6.77 | 1.16 / 1.34 | 0.82 / 1.20 | 0.78 / 1.25 | 3.28 / 6.49 | −0.18 |
| 24 | CD56+NKp44- | 96.10 / 93.70 | 99.53 / 99.70 | 98.82 / 98.64 | 0.83 / 1.13 | 0.80 / 1.25 | 3.43 / 6.00 | 0.18 |
| | **CD85j %** | | | | | | | |
| 25 | CD56+CD85j+ | 19.53 / 14.21 | 84.73 / 91.59 | 53.37 / 55.10 | 19.04 / 18.34 | 30.49 / 20.23 | 65.20 / 77.38 | −1.74 |
| 26 | CD56+CD85j- | 14.93 / 8.50 | 81.54 / 86.08 | 46.94 / 45.24 | 19.21 / 18.43 | 30.28 / 21.48 | 66.61 / 77.58 | 1.69 |
| | **LAIR-1 %** | | | | | | | |
| 27 | CD56+LAIR-1+ | 94.97 / 21.43 | 99.90 / 99.89 | 99.07 / 97.47 | 1.07 / 12.19 | 0.49 / 0.47 | 4.93 / 78.46 | 1.60 |
| 28 | CD56+LAIR-1- | 0.02 / 0.05 | 5.24 / 78.20 | 0.76 / 2.40 | 1.02 / 12.15 | 0.42 / 0.43 | 5.22 / 78.15 | −1.65 |
| | **NKG2A %** | | | | | | | |
| 29 | CD56+NKG2A+ | 20.43 / 19.01 | 77.57 / 73.01 | 46.14 / 44.24 | 17.41 / 13.73 | 30.82 / 17.47 | 57.14 / 54.00 | 1.90 |
| 30 | CD56+NKG2A- | 22.62 / 27.11 | 79.40 / 80.85 | 54.01 / 55.99 | 17.39 / 13.67 | 30.48 / 17.90 | 56.78 / 53.74 | −1.98 |
| | **2B4 %** | | | | | | | |
| 31 | CD56+2B4+ | 98.41 / 97.06 | 99.99 / 99.96 | 99.53 / 99.50 | 0.39 / 0.59 | 0.32 / 0.33 | 1.58 / 2.90 | 0.02 |
| 32 | CD56+2B4- | 0.01 / 0.05 | 1.59 / 2.95 | 0.48 / 0.50 | 0.39 / 0.59 | 0.31 / 0.34 | 1.58 / 2.90 | −0.02 |
  1. Min. is the minimum value, Max. is maximum value, Mean is the mean or average value, and Std. is Standard Deviation. Range is the difference between the minimum and maximum values. The Interquartile range (IQR) is a measure of data variability and was derived by computing the distance between the Upper Quartile (i.e. top) and Lower Quartile (i.e. bottom) of the boxes illustrated in Figure 1. Difference is computed as diff = mean(Benign)-mean(Cancer).

Figure 1. NK cell phenotypic features in men with benign prostate disease and patients with prostate cancer. Boxplots represent the flow cytometry values of each feature for patients with benign disease and with prostate cancer.

Figure 2. Mean and standard deviation values of flow cytometry features.

These findings highlight the difficulty of identifying combinations of features that can best identify the presence of cancer. The difficulty is compounded by the challenge of finding the best combination of n features, and by the fact that, ideally, features within a combination should not correlate. It is important to evaluate correlations between features because, if two features are highly correlated, only one of them may be needed as a candidate predictor. There may nevertheless be occasions where both features are needed; beyond increasing the dimensionality of the dataset, retaining both has no other negative impact. Furthermore, when two highly correlated features are both important, it may be difficult to decide which to remove. Figure 3 shows the correlations between features, where +1.0 indicates a strong positive correlation between two features and −1.0 indicates a strong negative correlation.

Figure 3. Correlations between features.
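For readers reproducing this step, a correlation matrix of this kind can be computed directly from the feature matrix. The sketch below uses pandas with synthetic values (the feature names are taken from Table 1, but the numbers are illustrative, not study data); the choice of Spearman correlation is an assumption, since the paper does not state which coefficient was used.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the flow cytometry feature matrix
# (rows = patients, columns = NK cell phenotypic features).
rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.uniform(0, 100, size=(71, 4)),
    columns=["CD56dimCD16high", "CD56+DNAM-1-", "CD56+LAIR-1+", "CD56+LAIR-1-"],
)
# Mutually exclusive gates sum to 100%, so they correlate at -1 by construction.
df["CD56+LAIR-1-"] = 100.0 - df["CD56+LAIR-1+"]

# Spearman is a reasonable choice given the non-normal distributions.
corr = df.corr(method="spearman")  # entries lie in [-1.0, +1.0], as in Figure 3
print(corr.round(2))
```

With this construction, the LAIR-1+/LAIR-1- pair shows the kind of strong negative correlation that the feature-selection discussion later has to handle.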

The Kolmogorov-Smirnov and Shapiro-Wilk tests of normality were carried out to determine whether the dataset is normally distributed, as this determines the choice of statistical tests: parametric tests for normally distributed datasets, non-parametric tests otherwise. The results of the normality tests are shown in Table 2. Only 7–8 features (depending on the normality test) were normally distributed (p>0.05); for the remaining features, p<0.05, indicating a statistically significant difference between the distribution of those features' data and the normal distribution. Based on these results, we conclude that the dataset is not normally distributed.

Table 2
Tests of normality results.
Tests of normality (KS = Kolmogorov-Smirnov; SW = Shapiro-Wilk)

| ID | NK cell values | KS statistic | df | Sig. | SW statistic | df | Sig. |
|---|---|---|---|---|---|---|---|
| 1 | CD56dimCD16+ | 0.15 | 71 | 0.00 | 0.85 | 71 | 0.00 |
| 2 | CD56dimCD16high | 0.11 | 71 | 0.03 | 0.89 | 71 | 0.00 |
| 3 | CD56dimCD16low | 0.17 | 71 | 0.00 | 0.79 | 71 | 0.00 |
| 4 | CD56dimCD16- | 0.19 | 71 | 0.00 | 0.82 | 71 | 0.00 |
| 5 | CD56dim total % | 0.15 | 71 | 0.00 | 0.91 | 71 | 0.00 |
| 6 | CD56brightCD16+ | 0.13 | 71 | 0.00 | 0.88 | 71 | 0.00 |
| 7 | CD56brightCD16high | 0.15 | 71 | 0.00 | 0.87 | 71 | 0.00 |
| 8 | CD56brightCD16low | 0.14 | 71 | 0.00 | 0.85 | 71 | 0.00 |
| 9 | CD56brightCD16- | 0.16 | 71 | 0.00 | 0.86 | 71 | 0.00 |
| 10 | CD56bright total | 0.15 | 71 | 0.00 | 0.91 | 71 | 0.00 |
| 11 | CD56+CD8+ | 0.10 | 71 | 0.06 | 0.98 | 71 | 0.17 |
| 12 | CD56+CD8- | 0.10 | 71 | 0.06 | 0.98 | 71 | 0.17 |
| 13 | CD56dimCD8+ | 0.09 | 71 | 0.20* | 0.98 | 71 | 0.24 |
| 14 | CD56brightCD8+ | 0.19 | 71 | 0.00 | 0.82 | 71 | 0.00 |
| 15 | CD56+NKp30+ | 0.21 | 71 | 0.00 | 0.81 | 71 | 0.00 |
| 16 | CD56+NKp30- | 0.21 | 71 | 0.00 | 0.81 | 71 | 0.00 |
| 17 | CD56+NKp46+ | 0.08 | 71 | 0.20* | 0.98 | 71 | 0.52 |
| 18 | CD56+NKp46- | 0.07 | 71 | 0.20* | 0.99 | 71 | 0.57 |
| 19 | CD56+DNAM-1+ | 0.23 | 71 | 0.00 | 0.56 | 71 | 0.00 |
| 20 | CD56+DNAM-1- | 0.23 | 71 | 0.00 | 0.55 | 71 | 0.00 |
| 21 | CD56+NKG2D+ | 0.19 | 71 | 0.00 | 0.84 | 71 | 0.00 |
| 22 | CD56+NKG2D- | 0.18 | 71 | 0.00 | 0.85 | 71 | 0.00 |
| 23 | CD56+NKp44+ | 0.18 | 71 | 0.00 | 0.76 | 71 | 0.00 |
| 24 | CD56+NKp44- | 0.17 | 71 | 0.00 | 0.78 | 71 | 0.00 |
| 25 | CD56+CD85j+ | 0.11 | 71 | 0.05 | 0.96 | 71 | 0.02 |
| 26 | CD56+CD85j- | 0.10 | 71 | 0.07 | 0.96 | 71 | 0.02 |
| 27 | CD56+LAIR-1+ | 0.43 | 71 | 0.00 | 0.14 | 71 | 0.00 |
| 28 | CD56+LAIR-1- | 0.43 | 71 | 0.00 | 0.14 | 71 | 0.00 |
| 29 | CD56+NKG2A+ | 0.09 | 71 | 0.20* | 0.97 | 71 | 0.11 |
| 30 | CD56+NKG2A- | 0.08 | 71 | 0.20* | 0.97 | 71 | 0.10 |
| 31 | CD56+2B4+ | 0.23 | 71 | 0.00 | 0.75 | 71 | 0.00 |
| 32 | CD56+2B4- | 0.23 | 71 | 0.00 | 0.75 | 71 | 0.00 |
  1. * This is a lower bound of the true significance. In the published table, values in bold indicate features whose data are normally distributed.

  2. If p>0.05, we fail to reject the null hypothesis that there is no statistically significant difference between the data and the normal distribution, and we can therefore presume that the data for those features are normally distributed. If p<0.05, we reject the null hypothesis because there is a statistically significant difference between the data and the normal distribution, and we can therefore presume that the data for those features are not normally distributed.
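The normality-testing procedure described above can be reproduced with standard statistical libraries. The following sketch applies scipy's Shapiro-Wilk test to synthetic samples of the study's size (n = 71); the distributions are illustrative, and only the p>0.05 decision rule is taken from the text.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
features = {
    "normal-like": rng.normal(50, 10, size=71),  # symmetric, bell-shaped
    "skewed": rng.lognormal(1.0, 1.0, size=71),  # heavily right-skewed
}

results = {}
for name, x in features.items():
    w, p = stats.shapiro(x)  # Shapiro-Wilk test of normality
    results[name] = p
    verdict = "normally distributed" if p > 0.05 else "not normally distributed"
    print(f"{name}: W = {w:.3f}, p = {p:.3f} -> {verdict}")
```

The heavily skewed sample is rejected as non-normal, mirroring the behaviour seen for most of the 32 features in Table 2.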

Given that most features in the dataset are not normally distributed, the Kruskal-Wallis test (also called the ‘one-way ANOVA on ranks’, a rank-based non-parametric test) was used, rather than its parametric equivalent (one-way analysis of variance, ANOVA), to check for statistically significant differences between the mean ranks of the NK cell phenotypic features in men with benign prostate disease and patients with prostate cancer. Although the Kruskal-Wallis test did not return a significant difference in the mean ranks of the PSA values between individuals with benign disease and those with prostate cancer (χ2=0; p=0.949, Figure 4), statistically significant differences at the α=0.05 level were observed in the mean ranks of the CD56brightCD8+ (ID14, p=0.007), CD56+NKp30+ (ID15, p=0.008), CD56+NKp30- (ID16, p=0.031) and CD56+NKp46+ (ID17, p=0.023) populations (Table 3).
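The group comparison above can be sketched with scipy's Kruskal-Wallis implementation. The values below are hypothetical CD56brightCD8+ percentages loosely modelled on the summary statistics in Table 1; they are NOT the study data, so the resulting statistic will not match Table 3.

```python
import numpy as np
from scipy import stats

# Hypothetical CD56brightCD8+ percentages for the two groups
# (31 benign, 41 cancer, as in the study cohort).
rng = np.random.default_rng(2)
benign = rng.normal(1.41, 1.07, size=31).clip(min=0.1)
cancer = rng.normal(1.70, 1.41, size=41).clip(min=0.1)

h, p = stats.kruskal(benign, cancer)  # rank-based; no normality assumption
print(f"chi-square = {h:.3f}, p = {p:.3f}")
print("significant at alpha = 0.05" if p < 0.05 else "not significant")
```

Because the test operates on ranks rather than raw values, it remains valid for the skewed distributions identified by the normality tests.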

Figure 4. PSA values by group.
Table 3
Results of the Kruskal-Wallis test.
| ID | Feature | Chi-Sq. (χ2) | Asymp. sig. (p value) |
|---|---|---|---|
| | PSA | 0 | 0.949 |
| | **NK cells** | | |
| 1 | CD56dimCD16+ | 0.001 | 0.981 |
| 2 | CD56dimCD16high | 0.069 | 0.793 |
| 3 | CD56dimCD16low | 0.555 | 0.456 |
| 4 | CD56dimCD16- | 0.033 | 0.857 |
| 5 | CD56dim total % | 0.063 | 0.802 |
| 6 | CD56brightCD16+ | 0.836 | 0.361 |
| 7 | CD56brightCD16high | 0.201 | 0.654 |
| 8 | CD56brightCD16low | 0.106 | 0.744 |
| 9 | CD56brightCD16- | 0.030 | 0.861 |
| 10 | CD56bright total | 2.415 | 0.120 |
| 11 | CD56+CD8+ | 2.415 | 0.120 |
| 12 | CD56+CD8- | 2.849 | 0.091 |
| 13 | CD56dimCD8+ | 0.417 | 0.518 |
| 14 | CD56brightCD8+ | 7.230 | 0.007 |
| 15 | CD56+NKp30+ | 7.106 | 0.008 |
| 16 | CD56+NKp30- | 4.638 | 0.031 |
| 17 | CD56+NKp46+ | 5.179 | 0.023 |
| 18 | CD56+NKp46- | 0.001 | 0.981 |
| 19 | CD56+DNAM-1+ | 0.001 | 0.972 |
| 20 | CD56+DNAM-1- | 0.293 | 0.588 |
| 21 | CD56+NKG2D+ | 0.325 | 0.568 |
| 22 | CD56+NKG2D- | 0.033 | 0.857 |
| 23 | CD56+NKp44+ | 0.072 | 0.789 |
| 24 | CD56+NKp44- | 0.049 | 0.825 |
| 25 | CD56+CD85j+ | 0.072 | 0.789 |
| 26 | CD56+CD85j- | 2.135 | 0.144 |
| 27 | CD56+LAIR-1+ | 1.343 | 0.247 |
| 28 | CD56+LAIR-1- | 0.060 | 0.807 |
| 29 | CD56+NKG2A+ | 0.072 | 0.789 |
| 30 | CD56+NKG2A- | 0.879 | 0.348 |
| 31 | CD56+2B4+ | 0.890 | 0.346 |
| 32 | CD56+2B4- | 0.890 | 0.346 |

This initial analysis provided insight into which phenotypic features might be good candidates for distinguishing between the presence of benign disease and prostate cancer. The next step was to examine whether using these as inputs into a machine learning algorithm can achieve this. An Ensemble Subspace kNN classifier was developed for the task at hand. The section which follows explains the approaches that were used to compare the diagnostic accuracy of the classifier when using the subset of features derived from the statistical analysis, and those features which were selected as a combination using the Genetic Algorithm (GA) for feature selection.

Distinguishing between benign prostate disease and prostate cancer: GA

The GA was used to identify a subset of features that, as a combination, provides an NK cell-based immunophenotypic ‘fingerprint’ capable of determining whether an asymptomatic individual with PSA levels below 20 ng ml-1 has benign prostate disease or prostate cancer. This fingerprint, or feature set, would then be used to construct a diagnostic/prediction model. Given that GAs stochastically select individuals (i.e. candidate feature combinations) from the current population based on their ‘fitness’, each run can return different results. A common approach to identifying the best solution(s) is therefore to run the algorithm several times and record the frequency of each solution. Since the aim here is to identify the most commonly occurring subset of NK cell phenotypic predictors, the GA was applied to the dataset repeatedly, and the most frequently returned subset of features was taken as the best and most promising.

Let fc denote the number of times (frequency) a combination c was returned during the n runs; the relative frequency of the combination, Rfc, is calculated using Equation 1:

(1) Rfc = fc / n

Table 4 shows the most frequent feature combinations returned at the end of each of the 30 runs when setting λ to different values. In Table 4, λ is the number of features in a combination; ‘No. different comb.’ is the number of unique combinations returned during the n runs (here n = 30) for a given λ; ‘Comb. with highest freq.’ is the combination returned most frequently during the n runs; ‘Freq. of comb.’ is the frequency of that combination; and ‘Relative freq. (%)’ is computed using Equation 1, expressed as a percentage.
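The frequency bookkeeping of Equation 1 can be sketched as follows. Here `run_ga` is a hypothetical stand-in for a single GA run, biased so that the simulated search usually ‘converges’ to the combination (2, 20, 27, 28) reported in Table 4; the convergence behaviour is simulated, not the study's actual GA.

```python
import random
from collections import Counter

def run_ga(n_features=32, lam=4, seed=None):
    """Hypothetical stand-in for one GA run, returning one lam-sized
    feature combination. The real study evolved combinations with a GA;
    this stub simply 'converges' to (2, 20, 27, 28) most of the time so
    that the frequency bookkeeping can be demonstrated."""
    rng = random.Random(seed)
    if rng.random() < 0.9:
        return (2, 20, 27, 28)
    return tuple(sorted(rng.sample(range(1, n_features + 1), lam)))

n = 30  # number of runs, as in the study
counts = Counter(run_ga(seed=i) for i in range(n))
best_comb, f_c = counts.most_common(1)[0]
relative_freq = f_c / n  # Equation 1: Rf_c = f_c / n
print(best_comb, f"{100 * relative_freq:.1f}%")
```

Running the search n times and reporting the most frequent combination, rather than trusting a single stochastic run, is exactly the rationale given in the text.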

Table 4
Results of the Genetic Algorithm when searching for the best subset of features.
| λ | No. different comb. | Comb. with highest freq. | Freq. of comb. | Relative freq. (%) |
|---|---|---|---|---|
| 2 | 3 | 17, 28 | 16 | 53.3 |
| 3 | 2 | 17, 27, 29 | 23 | 76.7 |
| 4 | 1 | 2, 20, 27, 28 | 30 | 100.0 |
| 5 | 2 | 3, 20, 27, 28, 32 | 29 | 96.7 |
| 6 | 2 | 3, 7, 20, 27, 28, 32 | 26 | 86.7 |
| 7 | 3 | 3, 7, 20, 23, 27, 28, 32 | 24 | 80.0 |
| 8 | 4 | 3, 7, 20, 22, 23, 27, 28, 32 | 19 | 63.3 |
| 9 | 3 | 3, 7, 19, 20, 22, 23, 27, 28, 32 | 24 | 80.0 |
| 10 | 3 | 2, 3, 7, 19, 20, 22, 23, 27, 28, 32 | 21 | 70.0 |

As the optimum number of features is not known, the GA was run for λ = 2, 3, …, n, where n is the total number of features in the dataset. Table 4 shows the results for λ = 2 to 10. The results indicate that the combination comprising four features is the most promising in terms of its ability to discriminate between benign prostate disease and prostate cancer on NK cell phenotypic data alone. Features 2, 20, 27 and 28 were returned in all 30 runs when searching for the best combination of four features. Furthermore, features 20, 27 and 28 were returned together in all combinations comprising more than three features (see the feature IDs in combinations λ=4 to λ=10 in Table 4). These results strongly suggest that these features are good predictors when grouped, and the fact that the same four-feature combination was returned in all 30 runs is a strong indicator that it is the most reliable for distinguishing between the presence of benign prostate disease and prostate cancer. The statistical analysis presented in Table 3 determined that features ID14: CD56brightCD8+, ID15: CD56+NKp30+, ID16: CD56+NKp30-, and ID17: CD56+NKp46+ were the only ones whose values differed significantly between the two groups at α=0.05 (i.e. p<0.05); however, none of these features were returned by the GA when searching for the best combination for discriminating between benign prostate disease and prostate cancer. The features selected by the GA were ID2: CD56dimCD16high, ID20: CD56+DNAM-1-, ID27: CD56+LAIR-1+, and ID28: CD56+LAIR-1-. Referring back to Figure 3, the correlation values show that the selected features (2, 20, 27, 28, 14, 15, 16, 17) do not have strong positive correlations with one another. There is a strong negative correlation between features 27 and 28, but we decided to keep both since they were selected by the feature selection method.

The next step in the analysis involves evaluating the predictive performance of the feature subsets returned by the statistical test and by the GA. The features identified from the statistical and GA approaches were input into the proposed Ensemble Subspace kNN classifier to determine whether it can learn these features and discriminate between the presence of benign prostate disease and prostate cancer. For transparency of the machine learning model, it was important to keep the predictor selection and machine learning processes separate. The feature selection algorithm identified a set of novel NK cell phenotypic features for diagnosing the presence of prostate cancer which will be used to construct a transparent prediction tool.

Distinguishing between benign prostate disease and prostate cancer: machine learning

This section describes the outcome of experiments that were performed to determine the predictive performance of various feature subsets using the Ensemble Subspace kNN model, which was designed for the task. Machine learning classifiers that are constructed using small training sets have a large variance, which means that the estimate of the target function will change if different training data are used (Skurichina and Duin, 2002). It is therefore expected, and normal, that classifiers will exhibit some variance, and that small changes in input variable values can result in very different classification rules. To confirm that the proposed approach does not suffer from high variance, we evaluated the performance of the classifier using 10-fold cross-validation repeated 30 times, recording the average and standard deviation of each run. Multiple runs of 10-fold cross-validation are performed using different partitions (i.e. folds), and the validation results are averaged over the runs to estimate a final predictive model. Each run of the cross-validation randomly partitions the data into complementary subsets, one used for training and the other for validation. This random partitioning limits overfitting and provides insight into how the model will generalise to an independent dataset not previously seen by the model. A random seed generator, reseeded from the current time, was used to generate a different sequence of fold assignments each time the k-fold procedure was run. It is normal for a classifier to return a different validation accuracy in each fold and run, since it trains and validates on different samples. The aim is to create a low variance classifier, meaning that the results of each validation test are close together: the closer the results, the more robust the classifier. To evaluate the predictive performance of the feature subsets derived from the computational and statistical feature selection approaches, each subset was input into an Ensemble Subspace kNN classifier. Applying 10-fold cross-validation partitioned the dataset into folds of approximately 64 randomly selected training samples and 7 randomly selected validation samples (one partition comprising 63 training cases and 8 validation cases, and nine partitions comprising 64 training cases and 7 validation cases). All samples were used for validation at some point during the evaluations. We consider 10-fold cross-validation suitable given the small size of the dataset and the need for sufficient samples during training.
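The repeated cross-validation scheme above can be sketched with scikit-learn. The data here are synthetic stand-ins for the 71-sample dataset, and a plain kNN classifier replaces the full ensemble for brevity; the fold-size arithmetic (one 8-sample fold, nine 7-sample folds) is taken from the text.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the 71-sample dataset (8 GA+STAT features);
# the real study data are not reproduced here.
X, y = make_classification(n_samples=71, n_features=8, random_state=0)
clf = KNeighborsClassifier(n_neighbors=5)

run_means = []
for run in range(30):  # 10-fold CV repeated 30 times, reseeded each run
    cv = KFold(n_splits=10, shuffle=True, random_state=run)
    sizes = sorted(len(test) for _, test in cv.split(X))
    assert sizes == [7] * 9 + [8]  # 71 samples: nine 7-sample folds, one 8-sample fold
    run_means.append(cross_val_score(clf, X, y, cv=cv).mean())

print(f"mean accuracy over 30 runs: {np.mean(run_means):.3f} "
      f"(std {np.std(run_means):.3f})")
```

The standard deviation across the 30 run-averages is the variability measure the authors use to argue that the classifier has low variance.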

Table 5 shows the results of the comparison when running the 10-fold cross-validation 30 times using six sets of features: 1) the four features selected by the GA; 2) the four features returned by the Kruskal-Wallis statistical test (STAT); 3) the combined features selected by the GA and the statistical test (GA+STAT); 4) PSA values combined with the features selected by the GA and the statistical test (PSA+GA+STAT); 5) PSA values alone (PSA); and 6) all 32 features (All features). The averages of the Area Under the Curve (AUC), the Optimal ROC Point (ORP) False Positive Rate (FPR), the ORP True Positive Rate (TPR), and the Accuracy (ACC) of each fold are provided. The last column of Table 5 shows the rank of each model, where 1 is the best and 6 is the worst. The results of each k-fold run were averaged, and these average values are plotted in the box plots shown in Figure 5. As shown in Table 5, combining the features selected by the GA (ID2: CD56dimCD16high, ID20: CD56+DNAM-1-, ID27: CD56+LAIR-1+, ID28: CD56+LAIR-1-) with the four features returned by the Kruskal-Wallis test as statistically significant between individuals with benign prostate disease and patients with prostate cancer (ID14: CD56brightCD8+, ID15: CD56+NKp30+, ID16: CD56+NKp30-, ID17: CD56+NKp46+) yielded the highest classification accuracy, with AUC = 0.818, ORP FPR = 0.201, ORP TPR = 0.836 and Accuracy = 0.821. PSA values input into the classifier resulted in weak classification performance (AUC = 0.698, ORP FPR = 0.217, ORP TPR = 0.609, Accuracy = 0.692); although PSA is used in clinical practice as a screening test for prostate cancer, it was the weakest of all the predictors. Importantly, predictive accuracy improved when PSA was combined with the GA+STAT flow cytometry features (PSA+GA+STAT: AUC = 0.812, ORP FPR = 0.208, ORP TPR = 0.832, ACC = 0.815). Combining PSA with the NK cell phenotypic fingerprint increased accuracy by 0.123 points compared with using PSA alone.

Table 5
Naming of the models includes the feature selection method (GA) combined with the proposed Ensemble Subspace kNN classifier. Validation results are presented for k = 10-fold cross-validation.

Results of 10-fold cross validation over 30 runs

| Feature set | | AUC | ORP FPR | ORP TPR | ACC | Mean Std. | Rank |
|---|---|---|---|---|---|---|---|
| GA | Mean | 0.776 | 0.296 | 0.833 | 0.781 | | 4 |
| | Std. | 0.024 | 0.065 | 0.026 | 0.023 | 0.035 | |
| STAT | Mean | 0.769 | 0.303 | 0.828 | 0.774 | | 5 |
| | Std. | 0.022 | 0.057 | 0.023 | 0.021 | 0.031 | |
| GA+STAT | Mean | 0.818 | 0.201 | 0.836 | 0.821 | | 1 |
| | Std. | 0.021 | 0.027 | 0.021 | 0.020 | 0.022 | |
| PSA+GA+STAT | Mean | 0.812 | 0.208 | 0.832 | 0.815 | | 2 |
| | Std. | 0.020 | 0.031 | 0.018 | 0.019 | 0.022 | |
| PSA | Mean | 0.698 | 0.217 | 0.609 | 0.692 | | 6 |
| | Std. | 0.022 | 0.025 | 0.043 | 0.020 | 0.028 | |
| All features | Mean | 0.812 | 0.213 | 0.836 | 0.815 | | 3 |
| | Std. | 0.022 | 0.035 | 0.021 | 0.021 | 0.025 | |
Figure 5. Boxplots illustrating the performance of the proposed model using various feature sets.

(a) Average AUC values, (b) average Optimal ROC points (TPRs), (c) average Optimal ROC points (FPRs), (d) average Accuracy values. Each box plot contains 30 points, where each point is the average performance evaluation value (i.e. AUC, ORP TPR, ORP FPR, Accuracy) from one 10-fold run using the given feature set.

The closer the standard deviation is to 0, the less spread out the results are across the 30 runs and the lower the classifier variability (see Table 5), that is, the lower the variance of the classifier. A low standard deviation indicates that the data points tend to be close to the mean (the expected value) of the set, whereas a high standard deviation indicates that they are spread over a wider range of values. For each evaluation measure in Table 5 and Figure 5 (AUC, ORP TPR, ORP FPR, Accuracy (ACC)), the aim is a high AUC, low ORP FPR, high ORP TPR, and high Accuracy, each with a low Std. The classifier achieved the best performance when using the GA+STAT input, and across the 30 k-fold runs it returned the lowest mean standard deviation and hence the least variability. These results indicate that the GA+STAT predictors deliver a more reliable classification model for training and validation on new data generated in the future using the prediction model.

Importance of findings

The GA+STAT prediction model achieved the best performance: its ORP FPR was the lowest, and its AUC, ORP TPR, and Accuracy (ACC) were the highest of all the prediction models. The experimental results are promising, and the proposed prediction model is expected to achieve even higher classification accuracy in identifying the presence of prostate cancer in asymptomatic individuals with PSA levels < 20 ng ml-1 based on peripheral blood NK cell phenotypic profiles as more data become available. Table 5 shows the performance of the classifier when using the various feature subsets: with the GA+STAT features, the AUC is higher and the FPR is lower (an important distinction) than when using all features or the alternative subsets. Most importantly, better performance was achieved using a much smaller set of biomarkers (features), indicating that we have identified a fingerprint for detecting the presence of prostate cancer in asymptomatic men with PSA levels < 20 ng ml-1, which is significant from a clinical perspective. Feature selection matters because the fundamental aim of this project is to identify a subset of phenotypic biomarkers, smaller than the original set (i.e. 32 biomarkers in total), which can confidently identify the presence of prostate cancer. Ultimately, the approach will be embedded into a software application for clinicians, and the aim is to create an interface that requires the clinician to input only a few values (i.e. 8 features instead of 32). Importantly, identifying such a small subset of 8 features results in an explainable disease detection and categorisation model. Working with a small set of the most promising biomarkers also provides a better understanding of the disease and allows cancer immunobiologists and clinicians to focus further laboratory evaluations on this specific subset of biomarkers, in a more cost-effective and less time-consuming manner.

Comparing the performance of the proposed ensemble subspace kNN classifier with alternative classifiers

The experiments discussed thus far used a machine learning model comprising an Ensemble of kNN learners (see Section ‘Proposed Ensemble Learning Classifier for the task of Predicting Prostate Cancer’). We then undertook experiments to determine the benefit of the proposed Ensemble method over conventional machine learning classifiers: simple kNN, Support Vector Machine (SVM), and Naive Bayes models. The last column of Table 6 shows the difference in performance between the methods. The proposed method, denoted EkNN, outperformed all of the alternative classifiers. EkNN also returned the lowest standard deviation values, an indicator of a more stable and reliable model, since the individual values cluster closely around the mean. SVM-linear returned the highest ORP TPR; however, its higher ORP FPR, higher Std. values, and lower AUC and Accuracy values indicate that this model is inferior to the proposed EkNN. Naive Bayes was the least effective classifier: although it returned the lowest ORP FPR, it also returned the lowest ORP TPR, AUC, and Accuracy values, and its Std. values were higher than those of the EkNN model.
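The comparison in Table 6 can be reproduced in outline as below. This is a hedged sketch using scikit-learn stand-ins (a bagged kNN for EkNN, SVC for the SVM kernels, GaussianNB for Naive Bayes) on synthetic data in place of the GA+STAT feature set; the subspace size of 6 is an arbitrary illustrative choice, and the printed numbers will not match Table 6.

```python
# Illustrative comparison of the ensemble against conventional classifiers,
# mirroring the structure of Table 6 (synthetic data; results will differ).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# 8 features, matching the size of the GA+STAT subset
X, y = make_classification(n_samples=71, n_features=8, n_informative=5,
                           random_state=0)

classifiers = {
    "EkNN": BaggingClassifier(KNeighborsClassifier(), n_estimators=30,
                              max_features=6, bootstrap=False, random_state=0),
    "kNN (k=5)": KNeighborsClassifier(n_neighbors=5),
    "SVM-linear": SVC(kernel="linear", probability=True, random_state=0),
    "SVM-Gaussian": SVC(kernel="rbf", probability=True, random_state=0),
    "Naive Bayes": GaussianNB(),
}

results = {}
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
for name, clf in classifiers.items():
    aucs = cross_val_score(clf, X, y, cv=cv, scoring="roc_auc")
    results[name] = (aucs.mean(), aucs.std())
    print(f"{name:12s} AUC {aucs.mean():.3f} (Std. {aucs.std():.3f})")
```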

Table 6
Comparing the performance of the proposed Ensemble Subspace kNN model against conventional machine learning models when using the GA+STAT feature set.

Results of 10-fold cross validation over 30 runs.

Proposed Ensemble Subspace kNN (EkNN) model (No. of learners (NL): 30; Subspace Dimension (SD): 16)

Parameters     |      | AUC   | ORP FPR | ORP TPR | ACC
NL: 30, SD: 16 | Mean | 0.818 | 0.201   | 0.836   | 0.821
               | Std. | 0.021 | 0.027   | 0.021   | 0.020

Simple kNN model (Distance: Euclidean)

k  |      | AUC   | ORP FPR | ORP TPR | ACC   | Acc. Diff. (EkNN vs. kNN)
2  | Mean | 0.768 | 0.241   | 0.730   | 0.751 | +0.070
   | Std. | 0.119 | 0.160   | 0.393   | 0.128 | −0.108
5  | Mean | 0.778 | 0.300   | 0.833   | 0.783 | +0.038
   | Std. | 0.107 | 0.265   | 0.103   | 0.103 | −0.083
10 | Mean | 0.753 | 0.371   | 0.845   | 0.758 | +0.063
   | Std. | 0.137 | 0.350   | 0.120   | 0.131 | −0.111

Support Vector Machine models

Kernel   |      | AUC   | ORP FPR | ORP TPR | ACC   | Acc. Diff. (EkNN vs. SVM)
Linear   | Mean | 0.782 | 0.342   | 0.860   | 0.784 | +0.037
         | Std. | 0.126 | 0.352   | 0.110   | 0.120 | −0.100
Gaussian | Mean | 0.808 | 0.353   | 0.876   | 0.799 | +0.022
         | Std. | 0.112 | 0.416   | 0.107   | 0.111 | −0.091

Naive Bayes model

Predictor distribution |      | AUC   | ORP FPR | ORP TPR | ACC   | Acc. Diff. (EkNN vs. Naïve Bayes)
Normal                 | Mean | 0.695 | 0.132   | 0.455   | 0.662 | +0.159
                       | Std. | 0.169 | 0.163   | 0.493   | 0.181 | −0.161

Statistically significant differences in predictive performance when using various feature subsets

The next step in the analysis was to determine whether statistically significant differences exist between the average AUC values of the classifier when using the various feature subsets, for which Friedman’s two-way Analysis of Variance (ANOVA) test was used. It was also important to observe whether including the PSA test values significantly strengthens the diagnostic accuracy. The average k-fold values across the 30 runs for each feature set were computed, and a matrix C was derived in which each column holds the results of the classifier when using one of the five feature subsets. Friedman’s chi-square statistic compares the mean ranks of the columns of matrix C. The test returned a statistically significant difference in AUC predictive performance depending on which feature subset was input into the classifier, χ2(4)=106.55, p=3.968E-22, suggesting that the mean AUC ranks of at least one feature subset differ significantly from the others. The mean ranks were as follows: GA = 12.050, STAT = 10.733, GA+STAT = 20.283, PSA = 3.067, PSA+GA+STAT = 18.867. A post hoc test was run alongside the Friedman test to pinpoint which feature subsets differ from each other, using a Bonferroni correction to reduce the likelihood of erroneously declaring a statistically significant difference due to multiple comparisons (a Type I error). Table 7 shows the results of the multiple comparisons and the adjusted p-values. Statistically significant differences were found for group 8 (GA+STAT vs. PSA; p=0.001) and group 10 (PSA vs. PSA+GA+STAT; p=0.002). We can conclude that GA+STAT returned a significantly higher AUC than PSA, with a difference between their mean ranks of diff = 17.217, and that PSA returned a significantly lower AUC than PSA+GA+STAT, with a difference between their mean ranks of diff = −15.800.
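The statistical procedure (Friedman's test on the per-run AUC matrix C, followed by Bonferroni-corrected pairwise comparisons) can be sketched as follows. This is an illustrative Python/SciPy sketch: random scores stand in for the real 30 x 5 matrix of per-run mean AUCs, and pairwise Wilcoxon signed-rank tests are used as the post hoc comparison in place of the rank-based multiple-comparison procedure reported in Table 7.

```python
# Illustrative sketch: Friedman test across five feature subsets, then
# Bonferroni-corrected pairwise post hoc tests. Random scores stand in for
# the real 30 x 5 matrix of per-run mean AUC values.
import itertools
import numpy as np
from scipy import stats

names = ["GA", "STAT", "GA+STAT", "PSA", "PSA+GA+STAT"]
rng = np.random.default_rng(0)
# C[i, j]: mean AUC of run i when the classifier uses feature subset j
C = rng.normal(loc=[0.79, 0.78, 0.82, 0.70, 0.81], scale=0.02, size=(30, 5))

chi2, p = stats.friedmanchisquare(*C.T)   # compares the columns of C
print(f"Friedman chi2(4) = {chi2:.2f}, p = {p:.3g}")

# Post hoc: pairwise tests with a Bonferroni-adjusted significance threshold
pairs = list(itertools.combinations(range(len(names)), 2))
alpha = 0.05 / len(pairs)  # 10 comparisons
for i, j in pairs:
    _, p_ij = stats.wilcoxon(C[:, i], C[:, j])
    verdict = "significant" if p_ij < alpha else "n.s."
    print(f"{names[i]} vs. {names[j]}: p = {p_ij:.3g} ({verdict})")
```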

Table 7
Post hoc test results.

Post hoc test
   | Group 1 | Group 2     | LL 95%  | Diff. betw. means | UL 95% | P
1  | GA      | STAT        | −12.658 | 1.317             | 15.292 | 1.000
2  | GA      | GA+STAT     | −22.208 | −8.233            | 5.742  | 0.525
3  | GA      | PSA         | −4.992  | 8.983             | 22.958 | 0.344
4  | GA      | PSA+GA+STAT | −20.792 | −6.817            | 7.158  | 1.000
5  | STAT    | GA+STAT     | −23.525 | −9.550            | 4.425  | 0.245
6  | STAT    | PSA         | −6.308  | 7.667             | 21.642 | 0.710
7  | STAT    | PSA+GA+STAT | −22.108 | −8.133            | 5.842  | 0.555
8  | GA+STAT | PSA         | 3.242   | 17.217            | 31.192 | 0.001
9  | GA+STAT | PSA+GA+STAT | −12.558 | 1.417             | 15.392 | 1.000
10 | PSA     | PSA+GA+STAT | −29.775 | −15.800           | −1.825 | 0.002
The first two columns show the groups that are compared. The third and fifth columns show the lower and upper limits of the 95% confidence interval for the true mean difference. The fourth column shows the difference between the estimated group means. The sixth column contains the p-value for testing the hypothesis that the corresponding mean difference is equal to zero.

Comparing the best prediction models over 30 runs

With regard to constructing a model that has the potential to be used in clinical practice, it is necessary to finalise an initial prediction model, since the last experiment returned 30 different variations of each prediction model when using different training and validation data partitions. Those experiments were crucial in determining whether the prediction models (five models, one for each feature subset) exhibit low variance. We then observed the classification performance of each model for each run, to identify the highest performance achieved by a single 10-fold cross validation in any of the runs. This provides a way of comparing the performance of each prediction model as it would be used in the clinical setting. Table 8 provides the results of the highest performing models, ranked from 1 (best model) to 5 (worst model).

Table 8
Results of the best prediction models created during the 30 runs.

Validation results are presented for 10-fold cross validation.

Best prediction model results
Feature set | AUC   | ORP FPR | ORP TPR | Accuracy | Rank
GA          | 0.818 | 0.192   | 0.829   | 0.820    | 3
GA+STAT     | 0.853 | 0.157   | 0.862   | 0.855    | 1
PSA         | 0.734 | 0.218   | 0.685   | 0.730    | 5
PSA+GA+STAT | 0.844 | 0.175   | 0.864   | 0.848    | 2
STAT        | 0.811 | 0.227   | 0.850   | 0.817    | 4
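As a minimal illustration, the ranking in Table 8 corresponds to sorting the feature subsets by the AUC of their best single run; the values below are copied directly from Table 8, not recomputed.

```python
# Rank feature subsets by the AUC of their best-performing run (Table 8 values)
best_auc = {
    "GA": 0.818,
    "GA+STAT": 0.853,
    "PSA": 0.734,
    "PSA+GA+STAT": 0.844,
    "STAT": 0.811,
}
ranking = sorted(best_auc, key=best_auc.get, reverse=True)
for rank, name in enumerate(ranking, start=1):
    print(f"{rank}. {name} (AUC {best_auc[name]:.3f})")
```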

Predicting low-/intermediate risk cancer vs. high-risk cancer

The continuing, significant clinical challenge resides in distinguishing men with low- or intermediate-risk prostate cancer which is unlikely to progress (for both of which ‘active surveillance’ is the most appropriate approach) from men with intermediate-risk disease which is likely to progress and men with high-risk prostate cancer (both of which require treatment). Diagnosing men with low-risk or small volume intermediate-risk prostate cancer as having prostate cancer is unhelpful, as these men will very rarely require treatment. The inappropriate assignment of men to potentially life-threatening invasive procedures and life-long surveillance for prostate cancer has significant psychological, quality of life, financial, and societal consequences. Furthermore, the definitive diagnosis of prostate cancer currently requires painful invasive biopsies, which carry a risk of potentially life-threatening urosepsis in 5% of individuals. We therefore undertook experiments to train the proposed Ensemble Subspace kNN model to predict the D’Amico Risk Classification for patients with prostate cancer (see subsection ‘The cancer patients dataset used for building the risk prediction model’ in Methods), in terms of Low/Intermediate (L/I) risk and High (H) risk disease, using NK cell phenotypic data alone.

The Ensemble model was modified to take all 32 features as input (described in Table 1) and was trained to classify the disease in patients with prostate cancer as L/I or H risk (see Figure 9 in Materials and methods). Given a new patient record comprising 32 inputs, the model therefore predicts whether the patient is D’Amico L/I risk (not clinically significant) or H risk (clinically significant). The flow charts in Figure 6 illustrate the process for detecting the presence and risk of prostate cancer and the patient outcomes. Of the 54 patient records, 10 randomly selected records (5 from the L/I group and 5 from the H group) were withheld from the dataset for use at the testing (mini clinical trial) stage. A rigorous evaluation methodology was adopted: 10-fold cross validation was run over 30 iterations, with each iteration providing an average validation result across the 10 folds. Each iteration consists of 10 different ‘train and validation’ data arrangements, so 300 tests were carried out using different mixes of training and validation records. The 10 withheld test records were then input into each trained model (i.e. each iteration) to evaluate the model when trained and validated on different variations of patient data. The model differentiates between the L/I and H risk groups with high accuracy. The k-fold validation results across the 30 iterations were AUC: 0.98 (±0.03), FPR: 0.03 (±0.05), TPR: 0.99 (±0.01), Accuracy: 0.99 (±0.02); the results using the test set were AUC: 0.98 (±0.03), FPR: 0.03 (±0.05), TPR: 0.99 (±0.01), Accuracy: 0.97 (±0.02). Accuracy was near perfect in all iterations (i.e. using different train and validation data in each iteration).
Figure 7 illustrates the performance of the model across the 30 runs during k-fold cross validation and during independent testing using the 10 patient samples. The results demonstrate that the proposed model predicts the D’Amico Risk Classification (L/I vs. High) with near-perfect accuracy using NK cell phenotypic data alone, without requiring PSA, Gleason, or tumor stage data.
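The evaluation protocol for the risk model (withholding records for testing, then running 10-fold cross validation 30 times on the remainder and scoring each iteration's trained model on the withheld records) can be sketched as follows. This is an illustrative sketch: synthetic data replaces the 54-patient dataset, a bagged kNN with random feature subspaces approximates the Ensemble Subspace kNN model, and a proportional stratified split stands in for the paper's 5-per-group hold-out.

```python
# Illustrative sketch of the risk-model evaluation: withhold 10 records,
# run 10-fold CV 30 times on the remainder, and score each iteration's
# model on the withheld test records. Synthetic data only.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import (StratifiedKFold, cross_val_score,
                                     train_test_split)
from sklearn.neighbors import KNeighborsClassifier

# 54 "patients" with 32 features; class 1 plays the High-risk minority group
X, y = make_classification(n_samples=54, n_features=32, weights=[0.7],
                           random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=10, stratify=y,
                                          random_state=1)

cv_accs, test_accs = [], []
for run in range(30):
    model = BaggingClassifier(KNeighborsClassifier(), n_estimators=30,
                              max_features=16, bootstrap=False,
                              random_state=run)
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=run)
    cv_accs.append(cross_val_score(model, X_tr, y_tr, cv=cv).mean())
    model.fit(X_tr, y_tr)                      # train on all non-withheld data
    test_accs.append(model.score(X_te, y_te))  # accuracy on the 10 test records

print(f"CV accuracy:   {np.mean(cv_accs):.2f} +/- {np.std(cv_accs):.2f}")
print(f"Test accuracy: {np.mean(test_accs):.2f} +/- {np.std(test_accs):.2f}")
```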

Figure 6. Flow charts illustrating the process to detect the presence and risk of prostate cancer and patient outcomes.

Model 1: Distinguishes between men with benign prostate disease and prostate cancer; Model 2: predicts risk (in terms of clinical significance) in men identified as having prostate cancer in Stage 1. Note that Model 1 can detect prostate cancer in men with PSA < 20 ng ml-1.

Figure 7. Each box plot contains 30 points, where each point is the average performance evaluation value (i.e. AUC, FPR, TPR, Accuracy (ACC)) from one 10-fold run during (a) k-fold validation and (b) independent testing (i.e. using the 10 withheld patient records).

The dataset used to identify the biomarker signature (comprising eight features) for detecting the presence of prostate cancer (i.e. benign prostate disease vs. prostate cancer) contained 71 men, and was thus large enough for the combinatorial feature selection task of finding the best subset of features. The GA used for this combinatorial feature selection task is described in Section Computational Methods. Given that detecting the presence of prostate cancer and predicting its risk are two different tasks, the biomarkers for the two tasks are expected to differ, since a different target is given to the GA (the target for the prostate cancer detection model comprises 0 (benign prostate disease) and 1 (prostate cancer) values, whereas the target for the risk prediction model comprises 0 (L/I risk) and 1 (High risk) values). For the L/I vs. H risk task, the dataset was small (n = 54 men; L/I = 38, H = 16) and we could not perform the combinatorial feature selection task with confidence. It was therefore decided to use the entire feature set for the risk prediction task. The results obtained from the risk prediction model were very promising, as shown experimentally, and this provided the confidence to report these preliminary results. The combinatorial feature selection task to identify the best subset of features for the risk prediction task will be performed once a larger dataset is available.

Herein, we demonstrate that all 32 phenotypic features are required to distinguish between low/intermediate risk cancer (L/I) and high risk (H) cancer. However, we expect to be able to identify smaller subset(s) of these features as the datasets increase and the prediction model is retrained on the larger dataset. As indicated above, the generation and delivery of additional datasets is beyond the scope of this paper.

Discussion

The clinical challenge in prostate cancer diagnosis resides in distinguishing men with low- or small volume intermediate-risk prostate cancer which is unlikely to progress (both require ‘active surveillance’) from men with intermediate disease which is likely to progress or high-risk disease (both of which require treatment). It is essential that men with low-risk prostate abnormalities are not diagnosed as having prostate cancer, as those with low-risk/grade disease do not require active treatment. Furthermore, unnecessarily labeling men as having prostate cancer can assign these men to life-long surveillance and have significant psychological, quality of life, financial, and societal consequences. Recent findings from a decade-long study involving 415,000 British men (The Cluster Randomized Trial of PSA Testing for Prostate Cancer (CAP) Randomized Clinical Trial) have not supported single PSA testing for population-based screening and suggest that asymptomatic men should not be routinely tested, in order to avoid unnecessary anxiety and treatment. It is therefore essential that new approaches for enabling more definitive, early detection of prostate cancer are developed. The reliable diagnosis of prostate cancer based on PSA levels alone is not possible, and confirmation using invasive biopsies or other approaches such as MRI is currently required. Although interest in the potential diagnostic capabilities of MRI scanning is growing, MRI cannot currently be used as a sole diagnostic, as a positive MRI can be incorrect in approximately 25% of cases and a negative MRI can be incorrect in approximately 20% of cases (Ahmed et al., 2017).
Although the findings from the CAP study do not support using the PSA test for population-based screening, combining PSA measurements with other approaches that either identify individuals for additional testing or strengthen the capacity to diagnose prostate cancer has significant merit, and it is on this concept that the current study is based. The studies presented herein have focused on asymptomatic men with a PSA < 20 ng/ml, as men with a PSA level > 20 ng/ml are more likely to harbour prostate cancer and are thereby less likely to pose a clinical diagnostic quandary. In contrast, men with a PSA < 20 ng/ml pose a major problem because, although only 30–40% of these men will have prostate cancer, all currently undergo potentially unnecessary invasive prostate biopsies to determine who has the disease. It is therefore this group of men for whom the development of new and more accurate approaches for the early detection of cancer is a clear unmet clinical need, and for whom the benefits of such an approach will be most relevant and significant.

Comparing results to the previous study

We have previously shown that incorporating peripheral blood immune phenotyping-based features into a computation-based prediction tool enables the better detection of prostate cancer and, furthermore, strengthens the accuracy of the PSA test in asymptomatic individuals having PSA levels < 20 ng/ml (Cosma et al., 2017). The phenotypic feature set which was shown to be discriminatory between benign disease and prostate cancer comprised CD8+CD45RA-CD27-CD28- (CD8+ Effector Memory cells), CD4+CD45RA-CD27-CD28- (CD4+ Effector Memory cells), CD4+CD45RA+CD27-CD28- (CD4+ Terminally Differentiated Effector Memory cells re-expressing CD45RA), CD3-CD19+ (B cells), and CD3+CD56+CD8+CD4+ (NKT cells).

Using samples from the same cohort of asymptomatic individuals, herein we have further investigated the phenotype and function of NK cell subsets. Using a combination of statistical and computational feature selection approaches, we have identified a subset of eight phenotypic features (CD56dimCD16high, CD56+DNAM-1-, CD56+LAIR-1+, CD56+LAIR-1-, CD56brightCD8+, CD56+NKp30+, CD56+NKp30-, CD56+NKp46+) which distinguishes between the presence of benign prostate disease and prostate cancer. These features were used to implement a prediction model. The kNN machine learning approach developed in our previous study (Cosma et al., 2017) has been extended to an Ensemble of kNN learners to improve performance in identifying patterns in even more complex data. As was observed in our previous study, flow cytometry predictors significantly outperform the PSA test. The findings presented herein significantly reinforce our previous finding (Cosma et al., 2017) that complementing the PSA prediction model with a subset of flow cytometry-based phenotypic predictors can significantly increase the accuracy of the initial prostate cancer test and reduce misclassification. The performance of the prediction model built using the phenotypic ‘signature’ presented in our previous study (CD8+CD45RA-CD27-CD28-, CD4+CD45RA-CD27-CD28-, CD4+CD45RA+CD27-CD28-, CD3-CD19+, CD3+CD56+CD8+CD4+; Cosma et al., 2017) is similar to that of the model built using the NK cell-based phenotypic signature presented herein (CD56dimCD16high, CD56+DNAM-1-, CD56+LAIR-1+, CD56+LAIR-1-, CD56brightCD8+, CD56+NKp30+, CD56+NKp30-, CD56+NKp46+). Specifically, the prediction model using the five flow cytometry features identified in Cosma et al., 2017 achieved Accuracy: 83.33%, AUC: 83.40%, ORP TPR: 82.93%, ORP FPR: 16.13%, whereas the prediction model presented herein achieved AUC: 85.3%, ORP FPR: 15.7%, ORP TPR: 86.2%, Accuracy: 85.5%.
Across the 30 runs, the average performance of the prediction model presented herein was AUC: 81.8%, ORP TPR: 83.6%, ORP FPR: 20.1%, Accuracy: 82.1%.

The difference in the performance of the model presented in the first study (Cosma et al., 2017) and that described herein is a consequence of different data and prediction models being used in each study. Given that the phenotypic features used to create the prediction models were different, the studies resulted in different prediction models. In particular, the model presented previously (Cosma et al., 2017) was based on a kNN classifier, whereas herein the kNN classifier was extended into an Ensemble Subspace kNN method comprising several kNN classifiers (see Figure 9). The dataset used herein was more complex, and it was therefore necessary to create a more complex classifier. At this point in the studies, it is not possible to determine which set of phenotypic features is better at identifying prostate cancer; however, it is evident that both approaches have significant promise. Since the publication of our previous study (Cosma et al., 2017), the model developed for that study has been used to predict the outcomes of a further 20 new patients whose data were previously unseen by the prediction model. The model correctly identified the presence of prostate cancer in 19 of the 20 patients (data not shown).

Encouragingly, the prediction models generated in the study reported upon herein selected phenotypic features that are associated with the expression of activating receptors NKp30, NKp46, and DNAM-1 by NK cells. Pasero et al., 2015 demonstrated that these activating receptors, in addition to NKG2D, are involved in the recognition of prostate cancer cell lines. Furthermore, they identified that the intensity of NKp30 and NKp46 expression on the surface of NK cells isolated from the peripheral blood of patients with metastatic prostate cancer was predictive of time to hormone (castration) resistance and overall survival. This suggests that our computational analysis is selecting phenotypic features that are of biological/clinical relevance. Thus far, our identification of disease predictive phenotypic immune features has been limited to effector immune populations (T, B, and NK cells). The responsiveness of these cells is known to be influenced by the presence of innate immune cell populations that can be polarized by the tumor toward an immunosuppressive state (Vitale et al., 2014; Anderson et al., 2017). Therefore, future studies will investigate the identification and inclusion of phenotypic features from innate immune subpopulations such as monocytes and neutrophils into prediction models to assess whether their inclusion enhances predictive capability and enables a better assessment of patient prognosis in line with the D’Amico Risk Classification.

The proposed machine learning model was adapted to predict the D’Amico Risk Classification of patients with prostate cancer using NK cell phenotypic data alone. Experiments with data from 54 patients revealed the significant potential of using the proposed machine learning model for determining whether men with prostate cancer are in the low-/intermediate- or high-risk groups, without the need for additional clinical data (i.e. PSA, Gleason, clinical stage data). One limitation of the current study is that the small patient numbers required the low- and intermediate-risk patients to be grouped together. Future work, for which additional sample collections are required, will train the model to predict low-, intermediate-, and high-risk cancer separately, and will conduct further testing of the proposed machine learning models on more patient samples. From a computational perspective, once a larger patient dataset is available we plan to design deep learning models and compare their performance to the conventional machine learning model proposed in this paper.

Potential impact

Currently available screening methods and tests for prostate cancer lack accuracy and reliability, the consequence of which is that many men unnecessarily undergo invasive tests such as biopsy and/or are misdiagnosed as having the disease. Furthermore, a biopsy involves removing samples of tissue from the prostate; it is an extremely uncomfortable procedure which also puts men at risk of developing life-threatening infections. As biopsy results are not definitive, there is significant potential for misdiagnosis and over- and under-treatment. It is therefore essential that new non-invasive approaches, such as blood tests that are more accurate than the Prostate Specific Antigen (PSA) test, are developed to reduce misdiagnosis and unnecessary procedures. Misdiagnosis unnecessarily subjects many men to lifelong monitoring for prostate cancer, which can have undesirable psychological and quality of life side-effects, as well as place a significant financial burden on the NHS and other healthcare systems. This paper proposes a computerised model which detects the presence of prostate cancer in men by analyzing immune system cells in the blood. The model uses the data from the blood tests and artificial intelligence-based computing (machine learning) to more accurately detect the presence of prostate cancer. A preliminary model has also been presented to detect the clinical risk posed by any prostate cancer which is present. The tool has two elements: the first detects whether a man has prostate cancer; if prostate cancer is detected, the second detects the clinical risk of the disease (low, intermediate, high), thereby enabling the clinician to decide whether the patient requires no further investigation/treatment (‘watch and wait’) or whether further investigation and treatment are required.

To our knowledge, these are the first studies to employ computational modeling of peripheral blood NK cell phenotyping data for the early detection of cancer and its clinical significance. They also illustrate the potential for this approach to decipher clinically relevant immune features that can distinguish between benign prostate disease and prostate cancer in asymptomatic individuals for whom the management and treatment strategy is unclear. Of translational importance is that our prediction models are interpretable, can be explained to patients and clinicians and can be continually refined and improved as data are collected.

The novelty of this approach is that it interrogates the immunological response to the tumour, not the tumour itself, and that it requires a simple blood test (liquid biopsy). Based on current practice, we expect that this approach could avoid up to 70% of prostate biopsies, thereby sparing men with benign prostate disease or low-risk prostate cancer from unnecessary invasive procedures which are associated with significant side-effects. Furthermore, more accurate diagnosis would reduce the demands on healthcare provision and resources associated with treatment and continual surveillance, thereby reducing costs and improving healthcare. We envisage that, in the future, men with a mildly elevated PSA will also undergo an immune status test, and those with a suspicion of significant prostate cancer will then undergo an MRI. Although the current study focuses on prostate cancer, its fundamental principles and approaches are highly likely to be applicable across many, if not all, cancer entities.

Materials and methods

Key resources table
Reagent type (species) or resource | Designation | Source or reference | Identifiers | Additional information
Biological Sample | Hyclone fetal bovine serum (FBS) | GE Healthcare Life Sciences | SV30180.03 |
Antibody | Monoclonal mouse IgG1 kappa anti-human DNAM-1 (CD226) (clone 11A8); FITC | BioLegend | 338304 | 5 μl per tube / 10^6 cells
Antibody | Monoclonal mouse IgG1 kappa anti-human NKG2D (CD314) (clone 1D11); PE | eBioscience | 12-5878-42 | 5 μl per tube / 10^6 cells
Antibody | Monoclonal mouse IgG1 kappa anti-human CD56 (clone N901); ECD (PE-Texas Red) | Beckman Coulter | A82943 | 2.5 μl per tube / 10^6 cells
Antibody | Monoclonal mouse IgG1 kappa anti-human CD16 (clone 3G8); PerCP-Cy5.5 | BioLegend | 302028 | 5 μl per tube / 10^6 cells
Antibody | Monoclonal mouse IgG1 kappa anti-human NKp46 (CD335) (clone 9E2); PE-Cy7 | BioLegend | 331916 | 5 μl per tube / 10^6 cells
Antibody | Monoclonal mouse IgG1 kappa anti-human NKp30 (CD337) (clone P30-15); Alexa Fluor 647 | BioLegend | 325212 | 5 μl per tube / 10^6 cells
Antibody | Monoclonal mouse IgG1 kappa anti-human CD3 (clone UCHT1); Alexa Fluor 700 | BioLegend | 300424 | 2 μl per tube / 10^6 cells
Antibody | Monoclonal mouse IgG1 kappa anti-human CD19 (clone HIB19); Alexa Fluor 700 | BioLegend | 302226 | 1 μl per tube / 10^6 cells
Antibody | Monoclonal mouse IgG1 kappa anti-human CD8 (clone SK1); APC-Cy7 | BioLegend | 344714 | 2.5 μl per tube / 10^6 cells
Antibody | Monoclonal mouse IgG2b anti-human CD85j (ILT2) (clone GHI/75); FITC | Miltenyi Biotec | 130-098-437 | 10 μl per tube / 10^6 cells
Antibody | Monoclonal mouse IgG1 kappa anti-human LAIR-1 (CD305) (clone DX26); PE | BD Biosciences | 550811 | 20 μl per tube / 10^6 cells
Antibody | Monoclonal mouse IgG2b anti-human NKG2A (CD159a) (clone Z199); PE-Cy7 (PC7) | Beckman Coulter | B10246 | 20 μl per tube / 10^6 cells
Antibody | Monoclonal mouse IgG1 kappa anti-human NKp44 (CD336) (clone P44-8); Alexa Fluor 647 | BioLegend | 325112 | 5 μl per tube / 10^6 cells
Antibody | Monoclonal mouse IgG1 kappa anti-human 2B4 (CD244.2) (clone C1.7); FITC | BioLegend | 329506 | 5 μl per tube / 10^6 cells
Chemical Compound | LIVE/DEAD Fixable Violet Dead Stain | Thermo Fisher Scientific | L34955 | 1 μl in 1 μl
Chemical Compound | Novagen Benzonase Nuclease | Merck Millipore | 70664 |
Chemical Compound | CTL Wash Solution | Cellular Technology Limited | CTLW-010 |
Chemical Compound | Trypan Blue viability stain | Santa Cruz | sc-216028 |
Chemical Compound | Dimethyl sulfoxide (DMSO) | Santa Cruz | sc-202581 |
Chemical Compound | Calbiochem bovine serum albumin (BSA) | Merck Millipore | 2905-OP |
Chemical Compound | Sigma-Aldrich sodium azide | Merck Millipore | S8032 |
Chemical Compound | Sigma-Aldrich lithium heparin | Merck Millipore | H0878 |
Chemical Compound | Ficoll-Paque | GE Healthcare Life Sciences | 17-1440-03 |
Chemical Compound | Isoton II isotonic buffered saline solution | Beckman Coulter | 844 80 11 |
Chemical Compound | RPMI medium | Lonza | 12-167Q |
Chemical Compound | Phosphate Buffered Saline (PBS) | Lonza | 17-517Q |
Other | Leucosep tubes | Greiner Bio-One International | 227290 |
Software | Kaluza v1.3 | Beckman Coulter | |

Data collection

Request a detailed protocol

Peripheral blood samples were obtained from individuals suspected of having prostate cancer who attended the Urology Clinic at Leicester General Hospital (Leicester, UK) between 24th October 2012 and 15th August 2014. Only patients who had provided informed consent, were biopsy naive, had a benign-feeling Digital Rectal Examination (DRE) with a PSA level of < 20 ng ml-1, and agreed to undergo a simultaneous 12-core TRUS biopsy and a 36-core transperineal template prostate biopsy (TPTPB) were included in the study. Further details regarding the TPTPB technique are provided in Nafie et al., 2014b. A total of 71 males (30 patients diagnosed with benign disease and 41 patients diagnosed with cancer, as confirmed by pathological examination of TPTPB biopsies) met the criteria. Of the 30 patients diagnosed with benign disease, 9 were diagnosed with High Grade Prostatic Intraepithelial Neoplasia (PIN), 10 with Atypia, and 2 with Atypical Small Acinar Proliferation; the remainder were diagnosed with benign disease. Of the men diagnosed with prostate cancer, 16 had Gleason 6 disease, 23 had Gleason 7 disease, and 2 had Gleason 9 disease on biopsy-based evidence. The clinical features of individuals with benign disease and patients with prostate cancer are provided in Table 9.

Table 9
Patient clinical features.
Patient group | Gleason score | Number of patients | Age range (years) | PSA range (ng/ml)
Benign | Benign    | 9  | 64–71 | 5.3–15
Benign | HGPIN     | 9  | 54–70 | 5.1–12
Benign | Atypia    | 10 | 50–76 | 4.7–19
Benign | ASAP      | 2  | 59–60 | 5.3–7.8
Cancer | Gleason 6 | 16 | 55–80 | 4.7–11
Cancer | Gleason 7 | 23 | 53–77 | 4.7–19
Cancer | Gleason 9 | 2  | 65–75 | 6.3–18

The cancer patients dataset used for building the risk prediction model

Request a detailed protocol

Data derived from the 41 individuals with prostate cancer were extracted from the dataset shown in Table 9. All 41 patients had PSA < 20 ng ml-1. However, three of the 41 patients who had a High D’Amico risk were removed because their clinical profiles were very different from those of other high-risk patients: these patients had either a Gleason score of 3+3 or a benign biopsy. In the future, we aim to collect more data from such infrequent patient groups to train the algorithms on patients with such clinical profiles. The remaining 38 patients had PSA levels < 20 ng ml-1 and belonged to the D’Amico L/I risk group.

Data were collected from an additional 16 patients with prostate cancer who were diagnosed as having a D’Amico High risk profile (see Table 10). Thus, the new cancer patient dataset comprised 54 patients with prostate cancer, of whom 38 belonged to the D’Amico L/I risk group and all had PSA < 20 ng ml-1, and 16 belonged to the D’Amico H risk group with PSA levels in the range 4.3 ng ml-1 ≤ PSA ≤ 2617 ng ml-1. The 16 patients were diagnosed with Gleason scores of 4+4 = 8 (n = 2), 5+4 = 9 (n = 2), and 4+5 = 9 (n = 11), and 1 patient was diagnosed with small cell cancer. The combined dataset (i.e. 38 + 16 = 54) comprised 15 patients with Gleason 6 (3+3), 18 patients with Gleason 7 (3+4), 5 patients with Gleason 7 (4+3), 2 patients with Gleason 8 (4+4), 11 patients with Gleason 9 (4+5), 2 patients with Gleason 9 (5+4), and 1 patient with small cell cancer.

Table 10
Dataset used for differentiating between patients with L/I and H cancer.
Patient group | Count | %
L/I | 38 | 70.37
H | 16 | 29.63

Since 11 of those 16 patients had a PSA > 20 ng ml-1, their data could only be utilised for building the prostate cancer risk prediction model, as the detection model focuses on detecting prostate cancer in asymptomatic men with PSA < 20 ng ml-1.

Flow cytometric analysis


Peripheral blood (60 ml) was collected from all patients using standard clinical procedures. Aliquots (30 ml) were transferred into two sterile 50 ml polypropylene (Falcon) tubes containing 300 μl of sterile Lithium Heparin (1000 U/ml, Sigma Aldrich, Merck Millipore). Anti-coagulated samples were transferred to the John van Geest Cancer Research Centre at Nottingham Trent University (Nottingham, UK) and processed immediately upon receipt (always within 3 hr of collection). Peripheral blood (60 ml) was mixed with Phosphate Buffered Saline (PBS, 30 ml, Lonza), layered over Ficoll-Paque (GE Healthcare Life Sciences) in Leucosep tubes (20 ml blood per tube) and then centrifuged at 800 g for 20 min. The peripheral blood mononuclear cell (PBMC) fraction was harvested and washed twice with PBS before being re-suspended in Hyclone fetal bovine serum (FBS, GE Healthcare Life Sciences). Viable cells were counted using trypan blue (0.1% v/v trypan blue, Santa Cruz) and a haemocytometer. Cells were frozen in 90% v/v FBS, 10% v/v DMSO (Santa Cruz) in aliquots of 10 × 106 PBMC/vial and stored in liquid nitrogen until phenotypic analysis. At the time of analysis, one vial from each patient was thawed by mixing with 10 ml ‘thaw’ solution (90% v/v RPMI (Lonza), 10% v/v CTL wash solution (Cellular Technology Limited)) and 10 μl of Novagen Benzonase (Merck Millipore) at room temperature.

PBMCs were centrifuged at 400 g for 5 min followed by resuspension in 1 ml of RPMI (supplemented with 10% v/v FBS, 1% v/v L-glutamine (Lonza)). Cells were rested for 1 hr at 37°C, after which viable cells were counted using trypan blue dye (Santa Cruz) exclusion. For each monoclonal antibody (mAb) panel shown in Table 11, 1 × 106 cells were washed and incubated in 100 μl of Wash Buffer (PBS + 2% w/v Calbiochem bovine serum albumin (BSA, Merck Millipore) + 0.02% w/v sodium azide (Sigma)) containing the relevant mAb cocktail for 15 min, after which cells were washed with 1 ml PBS and then incubated in 1 ml LIVE/DEAD Fixable Violet dead stain (Thermo Fisher Scientific) for 30 min. All incubations were performed at 4°C, protected from light. The cells were washed with PBS and then re-suspended in Beckman Coulter Isoton isotonic buffered saline solution.

Table 11
Antibody panels for measuring the phenotype of Natural Killer cells.
Antibody | Fluorochrome | Clone no. | Supplier
Panel 1
DNAM-1 (CD226) | FITC | 11A8 | BioLegend
NKG2D (CD314) | PE | 1D11 | eBioscience
CD56 | ECD (PE-Texas Red) | N901 | Beckman Coulter
CD16 | PerCP-Cy5.5 | 3G8 | BioLegend
NKp46 (CD335) | PE-Cy7 | 9E2 | BioLegend
NKp30 (CD337) | Alexa Fluor 647 | P30-15 | BioLegend
CD3 | Alexa Fluor 700 | UCHT1 | BioLegend
CD19 | Alexa Fluor 700 | HIB19 | BioLegend
CD8 | APC-Cy7 | SK1 | BioLegend
LIVE/DEAD | Dye (violet) | – | Thermo Fisher Scientific
Panel 2
CD85j (ILT2) | FITC | GHI/75 | Miltenyi Biotec
LAIR-1 (CD305) | PE | DX26 | BD Biosciences
CD56 | ECD (PE-Texas Red) | N901 | Beckman Coulter
CD16 | PerCP-Cy5.5 | 3G8 | BioLegend
NKG2A (CD159a) | PC7 (PE-Cy7) | Z199 | Beckman Coulter
NKp44 (CD336) | Alexa Fluor 647 | P44-8 | BioLegend
CD3 | Alexa Fluor 700 | UCHT1 | BioLegend
CD19 | Alexa Fluor 700 | HIB19 | BioLegend
CD8 | APC-Cy7 | SK1 | BioLegend
LIVE/DEAD | Dye (violet) | – | Thermo Fisher Scientific
Panel 3
2B4 (CD244.2) | FITC | C1.7 | BioLegend
CD56 | ECD (PE-Texas Red) | N901 | Beckman Coulter
CD16 | PerCP-Cy5.5 | 3G8 | BioLegend
CD3 | Alexa Fluor 700 | UCHT1 | BioLegend
CD19 | Alexa Fluor 700 | HIB19 | BioLegend
CD8 | APC-Cy7 | SK1 | BioLegend
LIVE/DEAD | Dye (violet) | – | Thermo Fisher Scientific

Data (on viable cells) were acquired within 1 hr using a 10-color/3-laser Beckman Coulter Gallios flow cytometer and analyzed using Beckman Coulter Kaluza v1.3 data acquisition and analysis software. Controls used a Fluorescence Minus One (FMO) approach. A typical gating strategy for the analyses is presented in Figure 8.

Representative gating strategy for analyzing the expression of activating and inhibitory receptors on peripheral blood natural killer (NK) cells.

Using density plots, the NK cell phenotypic profiles were determined by first gating on ‘live cells’ in the forward scatter (FSc) linear vs side scatter (SSc) linear density plot and then gating on single cells (determined by FSc linear vs FSc time of flight). The expression of activating and inhibitory receptors was determined by gating on CD3-CD19-CD56+ cells using fluorescence minus one (FMO) controls. The expression of each NK cell receptor was measured using the ‘Logical’ setting.

Computational methods


Initially, the genetic algorithm (GA) of Ludwig and Nunes, 2010 was adopted to identify the best subset of features (i.e. predictors), and thereafter a prediction model was constructed using the Ensemble classifier. This section also explains the metrics adopted for evaluating the performance of the prediction model.

GA for selecting the best subset of features


The GA is a metaheuristic that is commonly used to generate solutions to optimization and search problems. Given the large number of possible combinations, a GA was used to select the best subset of flow cytometry features for creating the prediction algorithm. The GA adopted in the experiments was developed by Ludwig and Nunes, 2010. This GA performs combinatorial optimization to identify the subset of features that comprises the optimum feature set, in which the order of features has no relation to their importance. The algorithm works by maximising the mutual information between the target y (where y takes the value 1 for cancer or 0 for benign) and the input features (i.e. the 32 features listed in Table 1). Mutual information measures the mutual dependence between two variables, here an input feature and the target. Adopting a GA avoids the computational effort that would be needed to evaluate all possible combinations of features. The fitness function of the GA (Ludwig and Nunes, 2010) is based on the principle of max-relevance and min-redundancy (mRMR), the objective of which is that the selected features have discriminant power while avoiding redundancy. This principle corresponds to searching for the set of feature indexes that are mutually exclusive and correlated with the target output. Let X=[xij] be an m×n feature-by-patient matrix with m features and n patients, such that the matrix element xij is the flow cytometry value i of patient j. Let y be a vector of size 1×n which holds the diagnosis of each patient (1 for cancer and 0 for benign); hence, each patient x is mapped to a diagnosis y. The GA takes three inputs: 1) the feature-by-patient matrix X; 2) the vector y, which holds the corresponding label for each patient record; and 3) the desired number of features, λ. The GA returns the IDs of the best subset of features, where the subset has size λ.
GAs stochastically select multiple features from the current population, and thus each run of the GA can return different results. Consequently, we proposed an approach to identify the best subset of features by running the algorithm several times and recording the frequency with which each subset was returned.
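The selection procedure can be sketched as follows. This is a simplified, hypothetical illustration of GA-based mRMR feature selection in pure Python, not the Ludwig and Nunes, 2010 implementation; it assumes the feature rows have been discretized (e.g. median-binarized) so that mutual information can be estimated from simple counts, and all function names are illustrative:

```python
import math
import random

def mutual_info(x, y):
    """Mutual information I(X;Y) in bits between two equal-length discrete sequences."""
    n = len(x)
    mi = 0.0
    for xv in set(x):
        px = sum(1 for a in x if a == xv) / n
        for yv in set(y):
            py = sum(1 for b in y if b == yv) / n
            pxy = sum(1 for a, b in zip(x, y) if a == xv and b == yv) / n
            if pxy > 0:
                mi += pxy * math.log2(pxy / (px * py))
    return mi

def fitness(subset, X, y):
    """mRMR-style score: mean relevance to the target minus mean pairwise redundancy."""
    relevance = sum(mutual_info(X[i], y) for i in subset) / len(subset)
    pairs = [(i, j) for i in subset for j in subset if i < j]
    redundancy = sum(mutual_info(X[i], X[j]) for i, j in pairs) / len(pairs) if pairs else 0.0
    return relevance - redundancy

def ga_select(X, y, lam, pop_size=20, generations=30, seed=0):
    """Return the indices of the best feature subset of size `lam` found by a toy GA."""
    rng = random.Random(seed)
    m = len(X)  # X is a feature-by-patient list of rows, as in the paper
    pop = [rng.sample(range(m), lam) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda s: fitness(s, X, y), reverse=True)
        survivors = pop[: pop_size // 2]
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            child = list(set(a) | set(b))  # crossover: merge two parent subsets
            rng.shuffle(child)
            child = child[:lam]
            new = rng.randrange(m)  # occasional point mutation keeps the search exploring
            if rng.random() < 0.2 and new not in child:
                child[rng.randrange(lam)] = new
            children.append(child)
        pop = survivors + children
    return sorted(max(pop, key=lambda s: fitness(s, X, y)))
```

Here `fitness` rewards subsets whose features are individually informative about the diagnosis (max-relevance) while penalising dependence within the subset (min-redundancy); because the search is stochastic, repeated runs can return different subsets, which is why subset frequencies are recorded over several runs.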

Proposed ensemble learning classifier for identifying the presence of prostate cancer


This section discusses the machine learning classifier which was developed for the task of identifying the presence of benign prostate disease or prostate cancer using the identified subset of phenotypic features. The challenge is that a suitable and reliable classifier must be developed using only 72 patient records. Classifiers trained on small sample sizes are likely to be unstable, because small changes in the training set can cause large changes in the classifier. It was for this reason that an Ensemble machine learning classifier was preferred as the approach for developing a more stable and reliable classifier. Ensemble classifiers achieve stability and reliability by constructing many ‘weak’ classifiers (i.e. weak learners) instead of a single classifier, and then combining them to create a more powerful decision rule than that obtained from a single classifier. In clinical applications, it is important to construct prediction models which have a low bias, meaning that the classifier makes fewer assumptions about the form of the target function. Because Ensemble learning makes fewer such assumptions, it was considered a suitable approach for the task. Several techniques exist for combining the classifiers of an Ensemble model, including Boosting, Bagging, and the Random Subspace method.

In the proposed method, the Random Subspace approach was utilised as the strategy for combining kNN classifiers into an Ensemble. In machine learning, the Random Subspace Method (Ho, 1998), also called attribute bagging (Bryll et al., 2003) or feature bagging, is an Ensemble learning method which attempts to reduce the correlation between estimators in an Ensemble by training them on random samples of features instead of the entire feature set. In the Random Subspace method, classifiers are constructed in random subspaces of the data feature space and combined by simple majority voting in the final decision rule; here, the k Nearest Neighbour (kNN) method was used as the base classifier (see Figure 9). In particular, the Random Subspace ensemble-aggregation method was coupled with kNN weak learners to produce an Ensemble of classifiers, and this resulted in a better classification rule. Thus, the Random Subspace method modifies the training data set, builds classifiers on the modified training sets, and then combines them into a final decision rule by simple or weighted majority voting.

Proposed Ensemble Subspace kNN model.

Ensembles combine predictions from different models to generate a final prediction. Because Ensemble approaches aggregate the baseline predictions, they typically perform at least as well as the average baseline model, and often better than the best single model.

Figure 9 provides an overview of the architecture of the proposed kNN Ensemble learner, and the description that follows explains the architecture in more detail. Let m be the number of dimensions (variables) to sample in each learner, d be the number of dimensions in the data (i.e. the number of predictors in the data matrix X), and n be the number of learners in the ensemble. The basic random subspace algorithm performs the following steps:

  1. Choose without replacement a random set of m predictors from the d possible values.

  2. Train a weak learner using just the m chosen predictors.

  3. Repeat steps 1 and 2 until there are n weak learners.

  4. Predict by taking an average of the score prediction of the weak learners and classify the category with the highest average score.
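The four steps above can be sketched as follows. This is a minimal, hypothetical pure-Python illustration of a Random Subspace ensemble of kNN learners, not the implementation used in the study; class and function names are illustrative:

```python
import random

def knn_score(train_X, train_y, x, k=3):
    """Fraction of the k nearest training points (squared Euclidean distance) in class 1."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(row, x)), label)
        for row, label in zip(train_X, train_y)
    )
    return sum(label for _, label in dists[:k]) / k

class RandomSubspaceKNN:
    """Toy Random Subspace ensemble: n kNN learners, each on m randomly chosen features."""

    def __init__(self, n_learners=10, m=3, k=3, seed=0):
        self.n_learners, self.m, self.k = n_learners, m, k
        self.rng = random.Random(seed)

    def fit(self, X, y):
        d = len(X[0])
        # Steps 1 and 3: for each of the n learners, choose m predictors without replacement
        self.subspaces = [self.rng.sample(range(d), self.m) for _ in range(self.n_learners)]
        self.X, self.y = X, y  # step 2: kNN is lazy, so "training" just stores the data
        return self

    def predict(self, x):
        # Step 4: average the weak learners' class-1 scores and take the majority class
        scores = [
            knn_score([[row[i] for i in dims] for row in self.X],
                      self.y, [x[i] for i in dims], self.k)
            for dims in self.subspaces
        ]
        return 1 if sum(scores) / len(scores) >= 0.5 else 0
```

Because each weak learner sees only a random subset of the features, the learners are decorrelated, and averaging their scores yields a decision rule that is more stable than a single kNN classifier fitted on all features.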

Performance evaluation measures


A variety of relevant evaluation metrics were adopted for the task of evaluating the performance of the machine learning prostate cancer presence and risk prediction models.

Prostate cancer presence prediction models: Let |TP| be the total number of patients with cancer who were correctly classified as having cancer; |TN| be the total number of individuals with benign disease who were correctly classified as having benign disease; |FP| be the total number of individuals with benign disease who were incorrectly classified as having cancer; |FN| be the total number of patients with cancer who were incorrectly classified as having benign disease; |P| be the total number of patients with cancer in the dataset, where |P|=|TP|+|FN|; and |N| be the total number of individuals with benign disease in the dataset, where |N|=|FP|+|TN|. The following commonly used evaluation measures can be defined.

(2) Accuracy = (|TP| + |TN|) / (|TP| + |FP| + |FN| + |TN|), ∈ [0, 1].
(3) TPR = |TP| / (|TP| + |FN|), ∈ [0, 1].
(4) TNR = |TN| / (|TN| + |FP|), ∈ [0, 1].
(5) FNR = |FN| / (|TP| + |FN|) = 1 − Sensitivity, ∈ [0, 1].
(6) FPR = |FP| / (|FP| + |TN|) = 1 − Specificity, ∈ [0, 1].

The closer the values of Accuracy, True Positive Rate (TPR, Sensitivity) and True Negative Rate (TNR, Specificity) are to 1.0, the better the classification performance of the system.
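Equations 2–6 translate directly into code; the sketch below is a hypothetical helper (variable names are illustrative) computing all five measures from the four confusion-matrix counts:

```python
def classification_metrics(tp, tn, fp, fn):
    """Equations 2-6 computed from confusion-matrix counts; each value lies in [0, 1]."""
    return {
        "Accuracy": (tp + tn) / (tp + fp + fn + tn),
        "TPR": tp / (tp + fn),   # Sensitivity
        "TNR": tn / (tn + fp),   # Specificity
        "FNR": fn / (tp + fn),   # = 1 - Sensitivity
        "FPR": fp / (fp + tn),   # = 1 - Specificity
    }
```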

The Receiver Operating Characteristic (ROC) curve is an effective means of evaluating the quality of a prediction model’s performance. The ROC curve has an optimal ROC point which comprises two values: the False Positive Rate (FPR) and the True Positive Rate (TPR). The optimal ROC point is computed using Equation 7, which defines the slope S.

(7) S = (Cost(P|N) − Cost(N|N)) / (Cost(N|P) − Cost(P|P)) × |N| / |P|,

where Cost(N|P) is the cost of misclassifying a positive class (i.e. cancer) as a negative class (i.e. benign); Cost(P|N) is the cost of misclassifying a negative class as a positive class; and |P| and |N| are the total instance counts in the cancer and benign classes, respectively. The optimal ROC point is identified by moving a straight line with slope S from the upper left corner of the ROC plot (FPR = 0, TPR = 1) down and to the right until it intersects the ROC curve.
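The sweep described above is equivalent to choosing the ROC point that maximises TPR − S·FPR. A hypothetical sketch of Equation 7 and the optimal-point search, assuming the misclassification costs and class counts are supplied (function and parameter names are illustrative):

```python
def optimal_roc_point(roc_points, n_neg, n_pos,
                      cost_fp=1.0, cost_fn=1.0, cost_tp=0.0, cost_tn=0.0):
    """Return the (FPR, TPR) point on the ROC curve first touched by a line of
    slope S (Equation 7) swept down-right from (0, 1), i.e. the point maximising
    TPR - S * FPR.  Here cost_fp = Cost(P|N), cost_fn = Cost(N|P), etc."""
    s = (cost_fp - cost_tn) / (cost_fn - cost_tp) * (n_neg / n_pos)
    return max(roc_points, key=lambda p: p[1] - s * p[0])
```

With equal unit costs and balanced classes, S = 1 and the chosen point is the one maximising Youden's index (TPR − FPR); raising the cost of false negatives flattens the line and pushes the optimal point toward higher sensitivity.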

The Area Under the ROC Curve (AUC) is another important performance evaluation metric, reflecting the capacity of a model to discriminate between data obtained from individuals with benign disease and from patients with cancer. The larger the AUC, the better the overall capacity of the classification system to correctly identify benign disease and cancer.

Prostate cancer risk prediction models: When applying the above-mentioned measures to evaluate the performance of the risk prediction models, the Positive class, P, was changed to be the High-risk group and the Negative class, N, was changed to be the L/I group.

References

Aganovic D, Prcic A, Kulovac B, Hadziosmanovic O. (2011) Prostate cancer detection rate and the importance of premalignant lesion in rebiopsy. Medicinski Arhiv 65:109–112.
Ho TK. (1998) The random subspace method for constructing decision forests. IEEE Transactions on Pattern Analysis and Machine Intelligence 20:832–844. https://doi.org/10.1109/34.709601

Decision letter

  1. Wilbert Zwart
    Reviewing Editor; Netherlands Cancer Institute, Netherlands
  2. Eduardo Franco
    Senior Editor; McGill University, Canada
  3. Yongsoo Kim
    Reviewer; Amsterdam University Medical Center (UMC)
  4. Hongming Xu
    Reviewer; Cleveland Clinic, United States

In the interests of transparency, eLife publishes the most substantive revision requests and the accompanying author responses.

Acceptance summary:

As Prostate-Specific Antigen (PSA) testing in prostate cancer diagnostics suffers from high false-positive rates, a pressing unmet medical need presents to develop a sensitive and specific biomarker for prostate cancer diagnosis. The current work presents a flow-cytometry based approach to detect prostate cancer that may prove to be of high clinical value.

Decision letter after peer review:

Thank you for sending your article entitled "Identifying prostate cancer using Machine Learning of peripheral blood natural killer cell subset phenotyping data" for peer review at eLife. Your article is being evaluated by two peer reviewers, and the evaluation is being overseen by Reviewing Editor Wilbert Zwart and Eduardo Franco as the Senior Editor.

As you can appreciate from the detailed reviewer comments below, several important concerns were raised. The reviewers' unedited critiques are copied below.

Reviewer #1:

The authors present a machine-learning-based classification of 1) benign/prostate cancer patients and 2) high/low-risk patients. The authors used multiple machine learning algorithms (genetic algorithm and ensemble learning) and indicated their methods perform better than PSA-based classification.

Overall, though the key elements of the manuscript are machine learning and statistics, the description of both data and algorithms are mostly insufficient and poorly written. It is unclear why a particular algorithm is chosen, and how they are implemented. In particular, the genetic algorithm is typically an optimization algorithm, for which objective function needs to be defined. The genetic algorithm can be used for other problems (e.g., clustering) depends on how the objective function is designed. However, the authors just mention genetic algorithm as if this is a classification method, without defining an objective function. Also, the Ensemble learning algorithm takes multiple simple classification methods, but the choice of the simple classifier can vary and thus needs to be specified. In general, it is not clear if these complicated algorithms are really necessary, as the authors do not show the performance from the simpler classifier (i.e. weak classifier). Also, the text requires extensive re-writing. The text is jargon-rich (especially those related to machine learning), and often repetitive.

Specific comments:

1) Comparisons of the features between benign / prostate cancer patients are presented in figures and tables, but they are difficult to read. For instance, Figure 2 has two panels separated by the category (benign and cancer), but comparison per feature can be done if the authors simply put them side-by-side per feature in one boxplot. Also, per feature, a p-value from a simple t-test can help readers to understand if the features are different or not. Also, the authors should match the order of row/columns in Figure 4A and B.

2) It is unclear why the Ensemble learning is used. The authors should show if the simple classifier is sufficient to predict the outcome and then show if the performance improved by combining the simple classifiers.

3) It is not at all clear what genetic algorithm does. It completely depends on how objective function is chosen, and it is not defined in the text. Also, it is again unclear how the Genetic algorithm can perform feature selection.

4) The authors removed 3 patients with high-risk category (D'Amico High risk), but I do not understand why this is justifiable.

5) The authors discuss yet another Ensemble approach, namely the Random subspace dimension approach. However, the analysis is inconclusive and thus not clear why the authors brought this up.

6) It is not helping at all to mention the outcome without any data shown.

Reviewer #2:

This study extends their own research (Cosma et al., 2017) and tries to predict prostate cancer based on flow cytometric profiling of blood immune cell subsets. The computational algorithm is very straight forward. They first generate 32 features using flow cytometry. Then the genetic algorithm and statistical test is used for feature selection. The ensemble learning classifier is finally used with 10-fold cross validation for evaluation. Compared with their previous study, the technical difference is that they replace KNN classifier (used in the previous study) with the ensemble random subspace method used in this study. It is claimed that more patients are included in this study, and that peripheral blood NK cell phenotyping data is first-time used with computational modeling. As a computational scientist, I personally think that this study is limited in technical contribution. But I cannot fairly judge the contribution in medical domain. Overall the paper is well written and easily understood.

My major concerns are listed below:

1) Looks like that the authors first run the genetic algorithm and statistical test to select subsets of features using the whole patient data. They then run 10 fold-cross validation with selected features. Considering that the feature selection is performed on the whole data, the prediction performance might be boosted. The typical procedure is that feature selection is performed on training set during each folder of cross-validation. The selected features are then applied on the corresponding testing set. Under the current evaluation process, the authors are suggested to include more patients for independent testing. The newly testing patients should not be used for feature selection.

2) It is not justified or explained why all features are used to predict L/I risk caner vs H risk cancer on 54 patients, without using feature selection?

3) It is suggested to provide quantitative comparisons with their own study (Cosma et al., 2017) during the experiments. Thus it would be more technical convincing that the new method is better than the previous one.

[Editors' note: further revisions were suggested prior to acceptance, as described below.]

Thank you for submitting your revised article "Identifying prostate cancer and its clinical risk using Machine Learning of blood NK cell subset phenotyping data" for consideration by eLife. Your article has been reviewed by two peer reviewers, and the evaluation has been overseen by a Reviewing Editor and Eduardo Franco as the Senior Editor. The following individuals involved in review of your submission have agreed to reveal their identity: Yongsoo Kim (Reviewer #1); Hongming Xu (Reviewer #2).

The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare another revised submission. We are happy to see the effort you and your co-authors made addressing the first round of reviews. However, our conclusion is that additional work is needed.

As the editors have judged that your manuscript is of interest, but as described below that additional analyses are required before it is published, we would like to draw your attention to changes in our revision policy that we have made in response to COVID-19 (https://elifesciences.org/articles/57162). First, because many researchers have temporarily lost access to the labs, we will give authors as much time as they need to submit revised manuscripts. We are also offering, if you choose, to post the manuscript to bioRxiv (if it is not already there) along with this decision letter and a formal designation that the manuscript is "in revision at eLife". Please let us know if you would like to pursue this option. (If your work is more suitable for medRxiv, you will need to post the preprint yourself, as the mechanisms for us to do so are still in development.)

Overall, the authors revised the manuscript and there is a significant improvement in clarity of technical discussion. In particular, the benefit of the ensemble k-nearest neighbor classifier is well-described in the new version of the manuscript, by comparing it with the weak classifier. However, both reviewers and the reviewing editor felt that a number of issues are not sufficiently addressed at this point, which would be required for the paper to be considered eligible for publication.

Essential revisions:

1) The authors mention the co-linearity issue and found a substantial proportion of features are correlated to each other (376 out of 496 feature pairs; also indicated in Figure 3) in Results paragraph two. The authors claim that the final predictor should not combine features with a high correlation. Are the features sets suggested by the authors (8 features from STAT+GA) indeed not correlated? In Figure 3, we can find a high positive or negative correlation among 15th-18th features. And in statistical analysis, the 14th-18th features are statistically significant (in Table 2), which might correspond to the highly correlated features. If that is the case, it sounds like the authors did something they should not do, according to their own opinion. Please address this.

2) The authors chose λ=4 after examining the stability of GA. I do agree that stability is an important aspect. However, I think it is also important to know how good the performance of the final solution found by GA is. In that regard, it is worth reporting the final mutual information next to the Relative Frequency in Table 3.

3) The authors used the Random subspace dimension approach, which is the best performing. The authors did not present any data to support the claim. This should be provided.

4) The authors claimed that they demonstrated all of 32 features are required. However, the performance of the algorithm needs to be assessed with subsets of the features to make the claim.

5) In the revised version, overall, the authors have tried to address and answer the concerns I listed before. Because of difficulty and time-requirement for collecting more data, only cross-validation was performed in this study. External testing for the presented method is hard for current situation now. So it is suggested to mention this limitation in the paper's discussion part.

[Editors' note: further revisions were suggested prior to acceptance, as described below.]

Thank you for submitting your revised article "Identifying prostate cancer and its clinical risk using Machine Learning of blood NK cell subset phenotyping data" for consideration by eLife. Your article has been reviewed by two peer reviewers, and the evaluation has been overseen by a Reviewing Editor and a Senior Editor. The following individuals involved in review of your submission have agreed to reveal their identity: Yongsoo Kim (Reviewer #1); Hongming Xu (Reviewer #2).

We are happy to see the effort you made at amending the paper to accommodate the concerns and suggestions from the reviewers. Once again, we are unable to accept it in its present form for publication. However, we are willing to consider a new revised version if you can address the additional concerns and suggestions below. The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission.

The authors have addressed the vast majority of issues raised, and all reviewers are in general satisfied with the revised work. One final question that remained unanswered in the previous round would still require attention at this stage.

Major comments

1) My main concern is still the Revision#4 question, which was asked in the second round review. In the experiments of predicting low/intermediate risk cancer vs high risk cancer, the authors used all 32 features to train the ensemble model by arguing due to the small dataset. But usually if the dataset is small, it is better to use a smaller number of features. In the experiments of benign disease and prostate cancer distinction, the author claimed that 8 features selected by GA+STAT provided the best performance. So it is still suggested to use those 8 features for L/I vs high risk distinction. If using those 8 features cannot provide better performance, the authors can provide some insight discussions about this phenomenon. In addition, for the distinction between L/I and H cancer, 16 more high risk cancer patients were included. Are there any reasons why these 16 patients were not used for cancer and benign distinction?

https://doi.org/10.7554/eLife.50936.sa1

Author response

As you can appreciate from the detailed reviewer comments below, several important concerns were raised. The reviewers' unedited critiques are copied below.

Reviewer #1:

The authors present a machine-learning-based classification of 1) benign/prostate cancer patients and 2) high/low-risk patients. The authors used multiple machine learning algorithms (genetic algorithm and ensemble learning) and indicated their methods perform better than PSA-based classification.

Genetic Algorithms (GA) are not machine learning algorithms, they are computational intelligence algorithms (optimisation algorithms), and the output of the GA (selected subset of features) was input into a machine learning algorithm, the Ensemble classifier (these are two separate processes). Please refer to Materials and methods section which explains the algorithms.

The first sentence in Experiment Methodology has been revised in order to point the reader to the section which describes the Genetic Algorithm.

Overall, though the key elements of the manuscript are machine learning and statistics, the description of both data and algorithms are mostly insufficient and poorly written. It is unclear why a particular algorithm is chosen, and how they are implemented. In particular, the genetic algorithm is typically an optimization algorithm, for which objective function needs to be defined. The genetic algorithm can be used for other problems (e.g., clustering) depends on how the objective function is designed. However, the authors just mention genetic algorithm as if this is a classification method, without defining an objective function.

The Genetic Algorithm (GA) performs feature selection and not classification. The section entitled “Genetic Algorithm for Selecting the Best Subset of Features” explains that the GA was implemented for feature selection nowhere in the paper is it stated that the GA was implemented for classification. The features selected by the GA are used to build an Ensemble classifier, as described in the Section “Proposed Ensemble Learning Classifier for the task of Predicting Prostate Cancer”.

Please be more specific with respect to those elements of the manuscript that are considered to be “insufficient” and “poorly written” so that we can specifically address this issue.

Also, the Ensemble learning algorithm takes multiple simple classification methods, but the choice of the simple classifier can vary and thus needs to be specified. In general, it is not clear if these complicated algorithms are really necessary, as the authors do not show the performance from the simpler classifier (i.e. weak classifier).

The algorithms used in the study are necessary and are, in our opinion and practice, relatively uncomplicated. The GA is necessary for finding the best subset of biomarkers because, as described in the “statistical analysis” section, it is not possible to find the best subset of biomarkers which, together as a combination (“feature set”), would make a good predictor of prostate cancer presence using conventional statistical approaches (see Table 6). The best classifier for the data was an Ensemble classifier which was built using the subset of features which were strategically selected using a novel methodology, as described in the Results section “Identifying Prostate Cancer using PSA and Immunophenotyping Data” (Table 5 for results).

Please refer to our response to your Specific Comment (2) where we carried out new experiments to demonstrate that the simple kNN model is not performing as well as the proposed Ensemble model. Please note that we have also carried out experiments with Naïve Bayes, Support Vector Machines, and many other machine learning algorithms (with various settings), but the Ensemble approach performed the best. The paper is already lengthy, and we would like to focus it on discussing the results using the best method. However, to satisfy reviewer 1’s comment we included a section which compares the simple kNN against the more complex approach which is an Ensemble of kNN learners (please refer to our answer to your specific comment (2)). We do not feel that it is relevant to include detailed comparisons of various machine learning approaches (which we can provide if required) as we do not consider that this will add to the findings of the study, but are likely to detract from its key findings and impact.

Also, the text requires extensive re-writing. The text is jargon-rich (especially those related to machine learning), and often repetitive.

We are sorry to hear this. These comments contrast with those of reviewer 2 who stated that “overall the paper is well written and easily understood”. However, we have taken this comment on board and sought input and feedback on presentation and understanding from colleagues who are either experts on machine learning or immunology.

We appreciate the reviewer’s comment and have added additional text to improve understanding for readers who are not familiar with machine learning. For example, at the end of subsection “A comparison of the best prediction models over 30 runs” we added extra text, “Discussion on importance of findings”, to explain the findings in lay terms.

Specific comments:

1) Comparisons of the features between benign / prostate cancer patients are presented in figures and tables, but they are difficult to read. For instance, Figure 2 has two panels separated by the category (benign and cancer), but comparison per feature can be done if the authors simply put them side-by-side per feature in one boxplot.

Figure 2 has now been revised as requested.

Also, per feature, a p-value from a simple t-test can help readers to understand if the features are different or not.

The data were not normally distributed, as a consequence of which the parametric t-test cannot be used. We therefore used the Kruskal-Wallis H test (also called the "one-way ANOVA on ranks", a rank-based non-parametric test), which determines whether there are any statistically significant differences between two or more groups of an independent variable on a continuous or ordinal dependent variable. We used this to determine whether there were any statistically significant differences between the measured NK cell parameters in individuals with benign disease and patients with prostate cancer. Table 4 shows the results of the Kruskal-Wallis test. As indicated in the manuscript, this statistical analysis only identified 4 features which could be potential phenotypic “fingerprints” for distinguishing between the different clinical settings. However, when these 4 features were used to build a predictive model, performance was not satisfactory (see Table 6).

We have extended the sentence on Kruskall-Wallis in the Statistical Analysis section to include justification for using the Kruskal-Wallis test over the parametric alternatives:

“Kruskal-Wallis tests (also called the "one-way ANOVA on ranks"; a rank-based non-parametric test) were utilised to check for statistically significant differences between the mean ranks of the NK cell phenotypic features in individuals with benign disease and patients with prostate cancer. As the data were not normally distributed, the non-parametric Kruskal-Wallis test was used as a suitable alternative to its parametric equivalent, the one-way analysis of variance (ANOVA).”
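For readers unfamiliar with the test, the per-feature comparison described above can be sketched as follows (an illustrative Python example on synthetic, non-normally distributed data; `scipy.stats.kruskal` stands in for whatever statistical software the authors used, and the group sizes simply mirror the cohort):

```python
import numpy as np
from scipy.stats import kruskal

rng = np.random.default_rng(0)
# Synthetic, skewed (non-normal) values for one NK cell phenotypic feature
benign = rng.exponential(scale=1.0, size=31)   # 31 men with benign disease
cancer = rng.exponential(scale=1.6, size=41)   # 41 men with prostate cancer

# Kruskal-Wallis H test on the two independent groups
H, p = kruskal(benign, cancer)
significant = p < 0.05  # feature flagged as a candidate "fingerprint"
```

Repeating this per feature and correcting for multiple comparisons would reproduce the kind of screen summarised in Table 4.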

Also, the authors should match the order of row/columns in Figure 4A and B.

The Spy plot was removed and Figure 5 was increased in size to make it more readable. The Spy plot was not needed, as the correlated features are now clearly shown in Figure 4 (previously Figure 5).

2) It is unclear why the Ensemble learning is used. The authors should show if the simple classifier is sufficient to predict the outcome and then show if the performance improved by combining the simple classifiers.

In the Section “Identifying Prostate Cancer using PSA and Immunophenotyping Data” the following paragraph and Table were added. These experiments compare the performance of the simple kNN vs the proposed Ensemble approach, and we hope this is enough evidence to justify the Ensemble approach:

“Finally, the experiments discussed in this Section thus far utilised a machine learning model comprised of an Ensemble of kNN learners (see Section “Proposed Ensemble Learning Classifier for the task of Predicting Prostate Cancer”). Before ending the discussions in this Section, the results of experiments carried out to determine the impact of using the proposed Ensemble method over the simple kNN classifier are summarised. Table 7 shows the performance of a simple kNN tuned by setting the number of nearest neighbours (k) to 2, 5, and 10, with the Euclidean distance metric. The last column of Table 7 shows the difference in performance between the two methods. The proposed method, denoted as EkNN, returned better performance than all the kNN alternatives, with higher Mean Accuracy values (+) and lower Standard Deviation values (Std.). Lower Standard Deviation values are an indicator of a more stable and reliable model, since the results are clustered closely around the mean.”
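The kNN-versus-ensemble comparison described above can be sketched as follows (illustrative Python on synthetic data; the sample counts mirror the cohort, the subspace size is an assumption, and scikit-learn's `BaggingClassifier` with `bootstrap=False` and `max_features<1.0` is used as a stand-in for the authors' Ensemble Subspace kNN):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the 72-sample, 32-feature immunophenotyping data
X, y = make_classification(n_samples=72, n_features=32, n_informative=8,
                           random_state=0)

single_knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")

# Random-subspace ensemble: each kNN learner sees a random half of the
# features (feature subsets drawn without sample bootstrapping)
ensemble_knn = BaggingClassifier(KNeighborsClassifier(n_neighbors=5),
                                 n_estimators=30, max_features=0.5,
                                 bootstrap=False, random_state=0)

acc_single = cross_val_score(single_knn, X, y, cv=10).mean()
acc_ensemble = cross_val_score(ensemble_knn, X, y, cv=10).mean()
```

On the real data, the analogous comparison (repeated over 30 runs) is what populates Table 7.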

3) It is not at all clear what genetic algorithm does. It completely depends on how objective function is chosen, and it is not defined in the text. Also, it is again unclear how the Genetic algorithm can perform feature selection.

We respectfully refer the reviewer to the Results section “Experiment Methodology”, which explains how the GA was used. An extensive explanation is also presented in Materials and methods – “Genetic Algorithm for Selecting the Best Subset of Features”, which also explains the objective function used: “The fitness function of the Genetic Algorithm is based on the principle of max-relevance and min-redundancy (the well-known mRMR), for which the objective is that the outputs of the selected features present discriminant power, thereby avoiding redundancy. The principle of max-relevance and min-redundancy corresponds to searching the set of feature indexes that are mutually exclusive and totally correlated to the target output.” More details about the GA and the objective function can be found in Ludwig and Nunes, 2010 (to which reference is made in the manuscript).
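To make the mRMR principle behind the GA's fitness function concrete, a minimal sketch follows (synthetic data; the redundancy term here uses mean absolute pairwise correlation as a simple proxy, so this is an illustration of the principle rather than the authors' exact objective function):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

# Synthetic stand-in for the 32-feature immunophenotyping dataset
X, y = make_classification(n_samples=72, n_features=32, n_informative=8,
                           random_state=0)

def mrmr_fitness(subset, X, y):
    """mRMR-style score for a candidate feature subset: mean relevance
    to the class label minus mean pairwise redundancy among the selected
    features (a proxy for the GA's fitness, not its exact definition)."""
    Xs = X[:, subset]
    relevance = mutual_info_classif(Xs, y, random_state=0).mean()
    n = len(subset)
    if n > 1:
        corr = np.corrcoef(Xs, rowvar=False)
        # Average off-diagonal |correlation| as the redundancy penalty
        redundancy = (np.abs(corr).sum() - n) / (n * (n - 1))
    else:
        redundancy = 0.0
    return relevance - redundancy

score = mrmr_fitness([0, 3, 7, 12], X, y)
```

A GA would evolve a population of such subsets, ranking candidates by this fitness and recombining the best ones.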

We have modified section “Identifying Predictors from Immunophenotyping Data using a Genetic Algorithm” and the first sentence to improve clarity of the aim of the Genetic Algorithm. It previously said “the aim of this method”, and this may have been the cause of confusion. We apologise and hope that this clarifies the aim of using a GA.

Revised text as follows: “The aim of the Genetic Algorithm is to identify a subset of features which, as a combination, provide an NK cell-based immunophenotypic “fingerprint” which can identify if an asymptomatic individual with PSA levels below 20 ng ml-1 has benign disease or prostate cancer in the absence of definitive biopsy-based evidence.”

4) The authors removed 3 patients with high-risk category (D'Amico High risk), but I do not understand why this is justifiable.

We thank the reviewer for highlighting this issue, which does indeed require clarification. On inspection, the clinical profiles of the three patients with PSA <20 ng/ml but high D'Amico risk differed from those of the other high-risk patients; they were removed because more examples with similar profiles would need to be added to the dataset before the machine learning algorithm could robustly learn those (more complex) profiles.

We have provided a clearer explanation as to the rationale/reason for excluding data from these 3 patients from the analysis in the revised manuscript. The following justification as to why the three patients were removed has been added in the manuscript in subsection “The cancer patients dataset”:

“However, three of the 41 patients who had a High D'Amico risk were removed because their clinical profiles were very different from those of the other high-risk patients: they either had a Gleason score of 3+3 or a benign biopsy. In the future, we aim to collect more data from such infrequent patient groups in order to train the algorithms on such clinical profiles.”

5) The authors discuss yet another Ensemble approach, namely the Random subspace dimension approach. However, the analysis is inconclusive and thus not clear why the authors brought this up.

The Materials and methods section “Proposed Ensemble Learning Classifier for the task of Predicting Prostate Cancer” discusses the methods used. In the paper we discuss and use only one method, namely the Ensemble Random Subspace approach, which uses an ensemble of kNN learners. We would like to stress that these are not new/additional methods; rather, they are the methods that were described and used in the Results section (and are therefore the same). The ensemble classifier we refer to is also called the Ensemble Random Subspace machine learning classifier, as explained in the section mentioned above. eLife requires that the Materials and methods section is placed after the Results.

6) It is not helping at all to mention the outcome without any data shown.

It is not clear what the reviewer means here. However, we hope that addressing the reviewer’s previous comments has addressed this comment.

Reviewer #2:

This study extends their own research (Cosma et al., 2017) and tries to predict prostate cancer based on flow cytometric profiling of blood immune cell subsets. The computational algorithm is very straightforward. They first generate 32 features using flow cytometry. The genetic algorithm and statistical tests are then used for feature selection. The ensemble learning classifier is finally used with 10-fold cross-validation for evaluation. Compared with their previous study, the technical difference is that they replace the kNN classifier (used in the previous study) with the ensemble random subspace method used in this study. It is claimed that more patients are included in this study, and that this is the first time peripheral blood NK cell phenotyping data have been used with computational modelling. As a computational scientist, I personally think that this study is limited in technical contribution.

We thank the reviewer for these positive comments – they have understood the paper perfectly. We would like to stress that the contribution to the medical domain is significant, as the “holy grail” of prostate cancer diagnosis and management is to be able to clearly distinguish benign prostate disease (no cancer) and non-clinically significant prostate cancer (neither of which require treatment) from clinically significant prostate cancer (which requires further investigation and treatment). This is currently not possible using the PSA test alone and without the use of invasive biopsies that are extremely uncomfortable and associated with a high rate (~5%) of significant and potentially life-threatening side-effects. Furthermore, “standard” biopsies only provide a definitive diagnosis in ~30% of cases. It should also be noted that 15% of men with “normal” PSA levels typically have prostate cancer, with 15% of these cancers being high-grade. Given the poor diagnostic specificity of PSA, PSA-based prostate cancer screening is not currently supported by the NHS or promoted in any other country. Reliable diagnosis of prostate cancer based on PSA levels alone is therefore not possible and must be confirmed using approaches such as invasive biopsies (see above) and/or MRI scans. However, it should also be noted that with respect to MRI scans, ~25% of “positive” MRI scans and ~20% of “negative” MRI scans can be incorrect.

Asymptomatic men with higher than normal PSA levels, but less than 20 ng/ml, pose significant problems to the clinician because although only 30%-40% of these men will have prostate cancer, all currently must undergo potentially unnecessary invasive prostate biopsies. It is therefore essential to develop better approaches for distinguishing benign disease and low-risk/grade or small-volume intermediate-risk prostate cancer (which very rarely require treatment) from clinically significant disease which is likely to progress and requires treatment. Distinguishing men with prostate cancer which is unlikely to progress (for whom “active surveillance” is the most appropriate approach) from men with prostate cancer which is likely to progress and requires treatment is a significant clinical challenge and unmet clinical need. Inappropriate assignment of men to potentially life-threatening invasive procedures and lifelong surveillance for prostate cancer has significant psychological, quality of life, financial and societal consequences.

We expect that our approach to accurately determine the presence of prostate cancer and its clinical significance will avoid the need for up to 70% of prostate biopsies, thereby sparing a significant number of men with benign disease or low risk cancer from unnecessary invasive biopsies and other procedures and also reduce demands of providing healthcare and treatment costs.

But I cannot fairly judge the contribution in medical domain.

The following paragraph has been added to the “Potential Impact” section as a summary for the lay reader:

“Currently available screening methods and tests for prostate cancer lack accuracy and reliability, the consequence of which is that many men unnecessarily undergo invasive tests such as biopsy and/or are misdiagnosed as having the disease. […] If prostate cancer is detected, the second part of the tool will detect the clinical risk of the disease (low, intermediate, high) which will help the clinician decide whether the patient requires no further investigation/treatment (“watch and wait”) or whether further investigation and treatment are required.”

Overall the paper is well written and easily understood.

My major concerns are listed below:

1) It looks like the authors first run the genetic algorithm and statistical tests to select subsets of features using the whole patient dataset. They then run 10-fold cross-validation with the selected features. Considering that the feature selection is performed on the whole dataset, the prediction performance might be boosted. The typical procedure is that feature selection is performed on the training set during each fold of cross-validation. The selected features are then applied to the corresponding test set. Under the current evaluation process, the authors are advised to include more patients for independent testing. The new test patients should not be used for feature selection.

The features were indeed selected by applying a GA on the entire set (and without using the classifier). However, we used the random nature of the GA to our advantage. Given that the GAs returned different solutions in each iteration, we devised a new methodology for selecting the features by running the GA 30 times to find the most frequent and stable subset of features (Table 5). By doing this, we were confident that we had chosen the best feature set. With this limitation in mind, it is important to mention that the selected features have been considered by expert immunologists who are all co-authors of the paper in order to ensure that the selected features “made sense” from an immunological and clinical perspective. We hope that the reviewer accepts our response. Although it would be straightforward to carry out experiments in which we leave out a subset of the cases for testing the features selected, the dataset is comparatively small and we would be less confident in the results compared to those derived using the methodology described in the paper.
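The 30-run frequency-based selection methodology described above can be sketched as follows (illustrative Python; a stochastic mutual-information ranking on random subsamples stands in for the GA, which is far heavier to reproduce, and the "top 8" and majority thresholds are assumptions):

```python
import numpy as np
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

# Synthetic stand-in for the 32-feature immunophenotyping dataset
X, y = make_classification(n_samples=72, n_features=32, n_informative=8,
                           random_state=0)

def one_run(seed, n_keep=8):
    """Stand-in for one stochastic GA run: rank features by mutual
    information on a random 80% subsample and keep the top n_keep."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(y), size=int(0.8 * len(y)), replace=False)
    mi = mutual_info_classif(X[idx], y[idx], random_state=seed)
    return np.argsort(mi)[-n_keep:]

# Count how often each feature is selected across 30 independent runs
counts = Counter(int(f) for seed in range(30) for f in one_run(seed))

# Keep the features selected in the majority of the 30 runs
stable = sorted(f for f, c in counts.items() if c >= 15)
```

Features that recur across runs, as in Table 5, form the stable "fingerprint"; one-off selections are treated as noise of the stochastic search.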

Although the dataset was complex, it was small in comparison to the gene array datasets that are typically analysed, and the use of the approaches described herein to generate meaningful clinical information from the dataset which we have is a unique element of the study and a contribution to knowledge and the literature.

It should also be noted that the dataset on which the manuscript is based is novel, unique and “one of its kind” in the world. The samples were collected by Co-Author Professor Khan (a clinical urologist at the University Hospitals of Leicester) and, as indicated above, the contribution to the medical domain is significant, as the “holy grail” of prostate cancer diagnosis and management is to clearly distinguish benign disease and non-clinically-significant prostate cancer (neither of which needs treatment) from clinically-significant prostate cancer (which requires further investigation and treatment). This is currently not possible based on the PSA test alone and without the use of invasive biopsies that are extremely uncomfortable and associated with a high rate (~5%) of significant and potentially life-threatening side-effects. Inappropriate assignment of men to potentially life-threatening invasive procedures and lifelong surveillance for prostate cancer has significant psychological, quality of life, financial and societal consequences.

2) It is not justified or explained why all features are used to predict L/I risk cancer vs H risk cancer in the 54 patients, without using feature selection.

We have added the following text in the revised manuscript to clarify this issue:

“Although the combination of 8 biomarkers defined in our previous study discussed in subsection “Identifying Prostate Cancer using PSA and Immunophenotyping Data” was suitable for detecting the presence of cancer, a second but key clinical question relates to the clinical significance of any prostate cancer which is present. The work described in this subsection was performed subsequent to our first study and revealed that all 32 phenotypic features are required to distinguish between low/intermediate risk (L/I) and high risk (H) cancer. However, we expect to be able to identify a subset of these features as the datasets increase and the prediction model is retrained on the larger dataset. As indicated above, the generation and delivery of additional datasets is beyond the scope of this paper.”

3) It is suggested to provide quantitative comparisons with their own previous study (Cosma et al., 2017) in the experiments. That would make it more technically convincing that the new method is better than the previous one.

The Discussion section compares the findings of this study with those of the previous study. We have added an extra bold sentence “Comparing results to the previous study:” to make the explanation more apparent.

[Editors' note: further revisions were suggested prior to acceptance, as described below.]

Essential revisions:

1) The authors mention the co-linearity issue and found a substantial proportion of features are correlated to each other (376 out of 496 feature pairs; also indicated in Figure 3) in Results paragraph two. The authors claim that the final predictor should not combine features with a high correlation. Are the features sets suggested by the authors (8 features from STAT+GA) indeed not correlated? In Figure 3, we can find a high positive or negative correlation among 15th-18th features. And in statistical analysis, the 14th-18th features are statistically significant (in Table 2), which might correspond to the highly correlated features. If that is the case, it sounds like the authors did something they should not do, according to their own opinion. Please address this.

The 8 features (features 2, 20, 27, 28, 14, 15, 16, 17) which were used to implement the prediction model are not highly positively correlated (i.e. the darkest red). As shown in Figure 3, in the rows corresponding to each feature, the only pair of features which are highly correlated are features 27 and 28 (negative correlation). There are no other correlations amongst the selected set of features.

The following paragraph has been added to the manuscript in subsection “Distinguishing Between Benign Prostate Disease and Prostate Cancer: Genetic Algorithm”.

“Referring back to Figure 3 and the correlation values between the selected features 2, 20, 27, 28, 14, 15, 16, 17, it is shown that these features do not have a strong positive correlation. Although there is a strong negative correlation between features 27 and 28, we decided to keep both features since these were selected by the feature selection method.”

The wording of a sentence in subsection “Distinguishing Between Benign Prostate Disease and Prostate Cancer: Statistical Analysis of NK Cell Phenotypic Features and PSA levels” has been rephrased to:

“These difficulties are compounded by the challenge of identifying the best combination of predictors which comprise n number of features, and that features within a combination, ideally, should not correlate.”

Please note: there is no major disadvantage to having correlated features (between inputs only) in the feature set used to train the machine learning model. We try to avoid including features which are highly correlated because, when two features are highly correlated, it is likely that only one of them is useful and that the other can be removed. The main advantage of removing one of a highly correlated pair is dimensionality reduction. Our feature set is small and, in this case, there is therefore no harm in keeping both features 27 and 28 until we can evaluate them further using a larger dataset.
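Flagging highly correlated feature pairs, as was done for features 27 and 28, can be sketched as follows (synthetic data; the 0.8 threshold and the engineered negative pair are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic stand-in for the 8 selected features (columns)
X = rng.normal(size=(72, 8))
# Make the last two columns strongly negatively correlated,
# mimicking the feature 27/28 pair discussed above
X[:, 7] = -X[:, 6] + 0.1 * rng.normal(size=72)

corr = np.corrcoef(X, rowvar=False)

# Flag feature pairs (upper triangle only) with |r| above 0.8
i, j = np.where(np.triu(np.abs(corr) > 0.8, k=1))
flagged = list(zip(i.tolist(), j.tolist()))
```

Whether a flagged pair is pruned (for dimensionality reduction) or retained, as the authors chose to do here, is then a modelling decision.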

The following explanation has also been added to the paper in subsection “Distinguishing Between Benign Prostate Disease and Prostate Cancer: Statistical Analysis of NK Cell Phenotypic Features and PSA levels”:

“It is important to evaluate correlations between features because, if two features are highly correlated, only one of them could serve as a candidate predictor. However, there may be occasions where both features are needed and, besides the impact on the dimensionality of the dataset, there is no other negative impact. Furthermore, when two features are highly correlated and both are important, it may be difficult to decide which feature to remove.”

2) The authors chose λ=4 after examining the stability of GA. I do agree that stability is an important aspect. However, I think it is also important to know how good the performance of the final solution found by GA is. In that regard, it is worth reporting the final mutual information next to the Relative Frequency in Table 3.

The concept was to find the most frequent (and hence promising and stable) subset of features over various iterations, using an optimisation method to speed up the search process. For this reason, we utilised the GA proposed by Ludwig and Nunes and “wrapped it around” our experimental methodology to find the best set of features, as described in the paper. As described in Minor Point #1 below, the feature selector of Ludwig and Nunes which we utilised performs combinatorial optimisation using a Genetic Algorithm and is based on the principle of minimum-redundancy/maximum-relevance (mRMR), which maximises mutual information indirectly. The output of the method of Ludwig and Nunes is a vector with the indexes of the features that compose the optimum feature set, in which the order of the features has no relation to their importance. Therefore, as MI was not the only criterion used by the feature selector, it is not appropriate to include MI values in Table 3, as this would suggest that it was the only method used for feature selection, which it was not.

3) The authors used the Random subspace dimension approach, which is the best performing. The authors did not present any data to support the claim. This should be provided.

We have now extended the comparisons to include other machine learning classifiers, each of which were tuned to achieve their highest accuracy for the task.

We have updated section “Comparing the performance of the proposed Ensemble kNN vs a single kNN model”, to include other conventional machine learning classifiers when tuned to achieve their best performance for the task. The results have been included in Table 5 (which is Table 6 in the revised manuscript).

We have retitled the subsection as follows: “Comparing the performance of the proposed Ensemble Subspace kNN classifier with alternative classifiers.”

We have updated the content of the subsection as follows:

“The experiments discussed thus far utilised a machine learning model comprised of an Ensemble of kNN learners (see Section “Proposed Ensemble Learning Classifier for the task of Predicting Prostate Cancer”). […] Naive Bayes was the least efficient classifier, and although it returned the lowest ORP FPR, it also returned the lowest ORP TPR, lowest AUC and Accuracy values; and its Std. values were also higher than those of the EkNN model.”

Sentence “The kNN Ensemble Learning classifier was chosen as being the most suitable for the data and task at hand.” has now been updated to “An Ensemble Subspace kNN classifier was developed for the task at hand.”

4) The authors claimed that they demonstrated all of 32 features are required. However, the performance of the algorithm needs to be assessed with subsets of the features to make the claim.

Although we do see the point the reviewers are making, we would have carried out this analysis if we had a larger dataset. The existing analysis was carried out with a thorough experimental methodology to determine whether there is an underlying pattern that can be detected by the proposed algorithm in predicting whether a cancer patient is in the L/I or H group. Using the full set of features, the proposed model was able to find a pattern and return high performance values during the k-fold and the independent tests, and this is a significant finding. However, the dataset is too small to confidently identify the most promising subset of predictors which, as a combination, could be utilised to build a classifier. Therefore, before excluding any features as predictors of L/I or H disease, we wish to explore them with a larger dataset. For this reason, we performed thorough experiments using all features and established that the proposed machine learning classifier can find a pattern differentiating between patients in the L/I and H groups, but more experiments will be needed with a larger dataset before we start to exclude features from the set of promising predictors of disease stage. As already described in the paper: “Of those 54 patient records, a total of 10 randomly selected records (5 from the L/I group and 5 from the H group) were extracted from the dataset such that they can be used at the testing (mini clinical trial) stage. To ensure thorough experiments, a rigorous methodology was adopted. More specifically, a 10-fold cross validation method was adopted, and the experiments were run in 30 iterations, for which each iteration provided an average test result across 10 folds.”
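The held-out test set plus repeated 10-fold cross validation described above can be sketched as follows (illustrative Python on synthetic data; the kNN base learner and sample counts are assumptions that mirror the 54-patient risk dataset):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import (RepeatedStratifiedKFold,
                                     cross_val_score, train_test_split)
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the 54-patient, 32-feature risk dataset
X, y = make_classification(n_samples=54, n_features=32, random_state=0)

# Hold out 10 records (balanced across classes) for the final
# "mini clinical trial" test, untouched during model development
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=10,
                                          stratify=y, random_state=0)

# 10-fold cross validation repeated over 30 iterations on the remainder
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=30, random_state=0)
scores = cross_val_score(KNeighborsClassifier(3), X_tr, y_tr, cv=cv)
mean_acc, std_acc = scores.mean(), scores.std()
```

The 300 fold-level scores yield the mean and standard deviation reported per run, with the held-out 10 records scored only once at the end.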

We noticed some inconsistency in the usage of the words validation and testing with our previous discussion of results, and we have made small updates to the manuscript to improve consistency in terminology.

5) In the revised version, overall, the authors have tried to address and answer the concerns I listed before. Because of difficulty and time-requirement for collecting more data, only cross-validation was performed in this study. External testing for the presented method is hard for current situation now. So it is suggested to mention this limitation in the paper's discussion part.

We have updated the manuscript to address the reviewer's comment. The changes are outlined below.

We have added the following paragraph to the last paragraph of subsection “Comparing results to the previous study” where future work is mentioned.

“Future work involves collecting more patient samples to conduct further testing of the proposed machine learning models. In terms of future work from a computational perspective, once we have a larger patient dataset we plan to design deep learning models and compare their performance to the conventional machine learning model which was proposed in this paper.”

[Editors' note: further revisions were suggested prior to acceptance, as described below.]

Major comments

1) My main concern is still the Revision #4 question, which was asked in the second-round review. In the experiments predicting low/intermediate-risk cancer vs high-risk cancer, the authors used all 32 features to train the ensemble model, arguing this was due to the small dataset. But usually, if the dataset is small, it is better to use a smaller number of features. In the experiments distinguishing benign disease from prostate cancer, the authors claimed that the 8 features selected by GA+STAT provided the best performance. So it is still suggested to use those 8 features for the L/I vs high-risk distinction. If using those 8 features cannot provide better performance, the authors can provide some insightful discussion of this phenomenon. In addition, for the distinction between L/I and H cancer, 16 more high-risk cancer patients were included.

We thank the reviewers for their comments. The question consists of 2 parts, so we will address those separately.

We have updated the paper by adding the following explanation to address the first comment.

“The dataset that was utilised to identify the biomarker (comprising 8 features) for detecting the presence of prostate cancer (i.e. benign prostate disease vs prostate cancer) comprised 71 men, and was thus large enough to perform the combinatorial feature selection task of finding the best subset of features. […] The combinatorial feature selection task to identify the best subset of features for the risk prediction task will be performed once a larger dataset is available.”

Please note that, in our experiments, the same features that were identified to predict the presence of prostate cancer (i.e. benign prostate disease vs prostate cancer) were not suitable for predicting the risk (L/I vs H) of any prostate cancer that was present. This was expected since the tasks are different: the optimisation algorithm was searching for a set of features to differentiate between benign prostate disease and prostate cancer, not the risk (L/I vs H) of any prostate cancer present, and so it was not appropriate to report those results. This explanation was not added to the revised version of the paper, as the above text justifies why the biomarker based on 8 features was unsuitable.

Are there any reasons why these 16 patients were not used for cancer and benign distinction?

1. We have updated the title of the section “The cancer patients’ dataset” to “The cancer patient dataset used for building the risk prediction model” to improve clarity.

2. Updated the section “The cancer patient dataset used for building the risk prediction model” to include the following explanation, highlighted in blue text in the paper.

3. The 16 patients were diagnosed with Gleason scores of 4+4=8 (n=2), 5+4=9 (n=2) and 4+5=9 (n=11), and 1 patient was diagnosed with small cell cancer.

4. Since 11 of those 16 patients had a PSA >20 ng ml-1, their data could only be utilised for building the prostate cancer risk prediction model, as the detection model focuses on detecting prostate cancer in asymptomatic men with PSA <20 ng ml-1.

https://doi.org/10.7554/eLife.50936.sa2

Article and author information

Author details

  1. Simon P Hood

    John van Geest Cancer Research Centre, School of Science and Technology, Nottingham Trent University, Nottingham, United Kingdom
    Present address
    Cancer Research UK Manchester Institute, University of Manchester, Manchester, United Kingdom
    Contribution
    Data curation, Formal analysis, Validation, Investigation, Visualization, Methodology, Writing - review and editing
    Contributed equally with
    Georgina Cosma and A Graham Pockley
    Competing interests
    No competing interests declared
  2. Georgina Cosma

    Department of Computer Science, Loughborough University, Loughborough, United Kingdom
    Contribution
    Software, Formal analysis, Validation, Investigation, Visualization, Methodology, Writing - original draft, Writing - review and editing
    Contributed equally with
    Simon P Hood and A Graham Pockley
    For correspondence
    g.cosma@lboro.ac.uk
    Competing interests
    Named inventor on filed patent application entitled 'Machine learning models and methods for detecting presence and clinical significance of prostate cancer' (Application Number GB1910689.7).
    ORCID iD: 0000-0002-4663-6907
  3. Gemma A Foulds

    1. John van Geest Cancer Research Centre, School of Science and Technology, Nottingham Trent University, Nottingham, United Kingdom
    2. Centre for Health, Ageing and Understanding Disease (CHAUD), School of Science and Technology, Nottingham Trent University, Nottingham, United Kingdom
    Contribution
    Data curation, Formal analysis, Validation, Investigation, Visualization, Methodology, Writing - review and editing
    Competing interests
    No competing interests declared
  4. Catherine Johnson

    1. John van Geest Cancer Research Centre, School of Science and Technology, Nottingham Trent University, Nottingham, United Kingdom
    2. Centre for Health, Ageing and Understanding Disease (CHAUD), School of Science and Technology, Nottingham Trent University, Nottingham, United Kingdom
    Contribution
    Investigation, Methodology
    Competing interests
    No competing interests declared
  5. Stephen Reeder

    1. John van Geest Cancer Research Centre, School of Science and Technology, Nottingham Trent University, Nottingham, United Kingdom
    2. Centre for Health, Ageing and Understanding Disease (CHAUD), School of Science and Technology, Nottingham Trent University, Nottingham, United Kingdom
    Contribution
    Investigation, Methodology
    Competing interests
    No competing interests declared
  6. Stéphanie E McArdle

    1. John van Geest Cancer Research Centre, School of Science and Technology, Nottingham Trent University, Nottingham, United Kingdom
    2. Centre for Health, Ageing and Understanding Disease (CHAUD), School of Science and Technology, Nottingham Trent University, Nottingham, United Kingdom
    Contribution
    Formal analysis, Investigation, Methodology, Writing - review and editing
    Competing interests
    No competing interests declared
  7. Masood A Khan

    Department of Urology, University Hospitals of Leicester NHS Trust, Leicester, United Kingdom
    Contribution
    Resources, Data curation, Validation, Investigation, Writing - review and editing
    Competing interests
    No competing interests declared
  8. A Graham Pockley

    1. John van Geest Cancer Research Centre, School of Science and Technology, Nottingham Trent University, Nottingham, United Kingdom
    2. Centre for Health, Ageing and Understanding Disease (CHAUD), School of Science and Technology, Nottingham Trent University, Nottingham, United Kingdom
    Contribution
    Conceptualization, Supervision, Funding acquisition, Investigation, Methodology, Writing - original draft, Project administration, Writing - review and editing
    Contributed equally with
    Simon P Hood and Georgina Cosma
    For correspondence
    graham.pockley@ntu.ac.uk
    Competing interests
    Named inventor on filed patent application entitled 'Machine learning models and methods for detecting presence and clinical significance of prostate cancer' (Application Number GB1910689.7).
    ORCID iD: 0000-0001-9593-6431

Funding

The John and Lucille van Geest Foundation (Core / Programme Grant)

  • A Graham Pockley

ERDF (Healthcare and Bioscience iNet Research Grant)

  • A Graham Pockley

PROSTaid Prostate Cancer Charity (Funding Support)

  • Stéphanie E McArdle
  • A Graham Pockley

Nottingham Trent University (PhD Studentship)

  • Simon P Hood
  • A Graham Pockley

Leverhulme Trust (Research Project Grant RPG-2016-252)

  • Georgina Cosma

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

The authors acknowledge the financial support of the John and Lucille van Geest Foundation, the Healthcare and Bioscience iNet, an ERDF funded initiative managed by Medilink East Midlands, PROSTaid, and Nottingham Trent University. This work was also supported by a Nottingham Trent University Vice Chancellor PhD Studentship Bursary to SPH. Dr Cosma acknowledges the financial support of The Leverhulme Trust (Research Project Grant RPG-2016–252). The funders had no role in study design, data collection, and analysis, decision to publish, or preparation of the manuscript.

Ethics

Human subjects: Research Protocols were registered and approved by the National Research Ethics Service (NRES) Committee East Midlands and by the Research and Development Department in the University Hospitals of Leicester NHS Trust. All participants were given information sheets explaining the nature of the study and all provided informed consent. All samples were collected by suitably qualified individuals using standard procedures. Ethical approval for the collection and use of samples from the TPTPB cohort (Project Title: Defining the role of Transperineal Template-guided prostate biopsy) was given by NRES Committee East Midlands- Derby 1 (NREC Reference number: 11/EM/3012; UHL11068).

Senior Editor

  1. Eduardo Franco, McGill University, Canada

Reviewing Editor

  1. Wilbert Zwart, Netherlands Cancer Institute, Netherlands

Reviewers

  1. Yongsoo Kim, Amsterdam University Medical Center (UMC)
  2. Hongming Xu, Cleveland Clinic, United States

Publication history

  1. Received: August 8, 2019
  2. Accepted: June 25, 2020
  3. Version of Record published: July 28, 2020 (version 1)

Copyright

© 2020, Hood et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.


