Identifying prostate cancer and its clinical risk in asymptomatic men using machine learning of high dimensional peripheral blood flow cytometric natural killer cell subset phenotyping data

  1. Simon P Hood
  2. Georgina Cosma (corresponding author)
  3. Gemma A Foulds
  4. Catherine Johnson
  5. Stephen Reeder
  6. Stéphanie E McArdle
  7. Masood A Khan
  8. A Graham Pockley (corresponding author)
  1. John van Geest Cancer Research Centre, School of Science and Technology, Nottingham Trent University, United Kingdom
  2. Department of Computer Science, Loughborough University, United Kingdom
  3. Centre for Health, Ageing and Understanding Disease (CHAUD), School of Science and Technology, Nottingham Trent University, United Kingdom
  4. Department of Urology, University Hospitals of Leicester NHS Trust, United Kingdom
9 figures, 12 tables and 2 additional files

Figures

NK cell phenotypic features in men with benign prostate disease and patients with prostate cancer.

Boxplots represent the flow cytometry values of each feature for patients with benign disease and with prostate cancer.

Mean and standard deviation values of flow cytometry features.
Correlations between features.
PSA values by group.
Boxplots illustrating the performance of the proposed model using various feature sets.

(a) Average AUC values, (b) Average Optimal ROC points (TPRs), (c) Average Optimal ROC points (FPRs), (d) Average Accuracy values. Each box plot contains 30 points, where each point is the average performance evaluation value (i.e. AUC, ORP TPR, ORP FPR, Accuracy) from one 10-fold run using the various feature sets.
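The protocol described in this caption (30 runs of 10-fold cross-validation, with one averaged point per run) can be sketched as below. This is a minimal illustration, not the authors' code: `evaluate` is a hypothetical callback standing in for model training and scoring, and the sample count of 71 is taken from the df column of Table 2.

```python
import random
from statistics import mean

def kfold_indices(n_samples, k, rng):
    """Shuffle sample indices and deal them into k nearly equal folds."""
    indices = list(range(n_samples))
    rng.shuffle(indices)
    return [indices[i::k] for i in range(k)]

def repeated_kfold_means(n_samples, evaluate, k=10, runs=30, seed=0):
    """Return one averaged score per run (30 values by default).

    `evaluate(train_idx, test_idx)` is a hypothetical callback that trains a
    model on train_idx and returns a metric (e.g. AUC) measured on test_idx.
    """
    rng = random.Random(seed)
    run_means = []
    for _ in range(runs):
        folds = kfold_indices(n_samples, k, rng)
        scores = []
        for i, test_idx in enumerate(folds):
            # Every fold except fold i is used for training.
            train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
            scores.append(evaluate(train_idx, test_idx))
        run_means.append(mean(scores))
    return run_means

# Placeholder metric (fraction of samples held out), used only to exercise the loop.
points = repeated_kfold_means(71, lambda train, test: len(test) / 71)
```

Each entry of `points` corresponds to one point in a box plot; a real metric such as AUC would replace the placeholder callback.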

Flow charts illustrating the process to detect the presence and risk of prostate cancer and patient outcomes.

Model 1: distinguishes between men with benign prostate disease and prostate cancer; Model 2: predicts risk (in terms of clinical significance) in men identified as having prostate cancer in Stage 1. Note that Model 1 can detect prostate cancer in men with PSA < 20 ng ml⁻¹.

Each box plot contains 30 points, where each point is the average performance evaluation value (i.e. AUC, FPR, TPR, Accuracy (ACC)) from one 10-fold run during (a) k-fold validation results, and (b) independent testing results (i.e. using 10 patient records).
Representative gating strategy for analyzing the expression of activating and inhibitory receptors on peripheral blood natural killer (NK) cells.

Using density plots, the NK cell phenotypic profiles were determined by first gating on ‘live cells’ in the forward scatter (FSc) linear vs side scatter (SSc) linear density plot and then gating on single cells (determined by FSc Linear vs FS time of flight). The expression of activating and inhibitory receptors was determined by gating on CD3⁻CD19⁻CD56⁺ cells using fluorescence minus one (FMO) controls. The expression of each NK cell receptor was measured using the ‘Logical’ setting.

Proposed Ensemble Subspace kNN model.

Ensembles combine predictions from different models to generate a final prediction. Because Ensemble approaches combine baseline predictions, they are generally expected to perform at least as well as the best baseline model.
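One common way to build such an ensemble is the random subspace method, in which each kNN learner sees only a random subset of the features and the learners vote. The sketch below assumes that reading of "Ensemble Subspace kNN"; the defaults of 30 learners and a subspace dimension of 16 are taken from Table 6, while the per-learner `k`, the squared Euclidean distance, the majority vote, and the toy data are assumptions made for illustration only.

```python
import random
from collections import Counter

def knn_predict(train_X, train_y, x, features, k):
    """Majority vote among the k nearest training points, using only `features`."""
    dists = sorted(
        (sum((row[f] - x[f]) ** 2 for f in features), label)
        for row, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

def subspace_knn_predict(train_X, train_y, x, n_learners=30,
                         subspace_dim=16, k=1, seed=0):
    """Random-subspace ensemble: each learner is a kNN on a random feature subset."""
    rng = random.Random(seed)
    n_features = len(train_X[0])
    dim = min(subspace_dim, n_features)  # cannot sample more features than exist
    votes = Counter()
    for _ in range(n_learners):
        features = rng.sample(range(n_features), dim)
        votes[knn_predict(train_X, train_y, x, features, k)] += 1
    return votes.most_common(1)[0][0]

# Toy data (made up): two well-separated classes with four features each.
toy_X = [[0, 0, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1],
         [9, 9, 9, 9], [8, 9, 8, 9], [9, 8, 9, 8]]
toy_y = ["A", "A", "A", "B", "B", "B"]
prediction = subspace_knn_predict(toy_X, toy_y, [8, 8, 8, 8], subspace_dim=2, k=3)
```

Sampling a different feature subset for each learner is what decorrelates the individual kNN models, so that their combined vote can be more stable than any single one.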

Tables

Table 1
Descriptive statistics of the dataset.
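| No. | Marker group | Feature | Min. (Beni.) | Min. (Canc.) | Max. (Beni.) | Max. (Canc.) | Mean (Beni.) | Mean (Canc.) | Std. (Beni.) | Std. (Canc.) | IQR (Beni.) | IQR (Canc.) | Range (Beni.) | Range (Canc.) | Diff. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | PSA | PSA | 4.70 | 4.70 | 19.00 | 19.00 | 8.26 | 8.34 | 3.31 | 3.28 | 3.30 | 4.08 | 14.30 | 14.30 | −0.08 |
| 1 | CD56dim % | CD16+ | 83.85 | 73.04 | 96.61 | 96.98 | 90.98 | 90.64 | 3.35 | 5.46 | 4.13 | 5.02 | 12.76 | 23.94 | 0.34 |
| 2 | CD56dim % | CD16high | 24.38 | 49.66 | 87.46 | 89.33 | 72.88 | 73.32 | 11.74 | 10.22 | 15.00 | 10.45 | 63.08 | 39.67 | −0.44 |
| 3 | CD56dim % | CD16low | 5.17 | 6.57 | 64.22 | 44.00 | 17.74 | 16.84 | 10.40 | 7.45 | 8.76 | 7.66 | 59.05 | 37.43 | 0.90 |
| 4 | CD56dim % | CD16- | 1.41 | 1.25 | 11.11 | 18.06 | 4.83 | 4.89 | 2.45 | 3.48 | 2.58 | 2.68 | 9.70 | 16.81 | −0.06 |
| 5 | CD56dim % | CD56dimtotal | 91.29 | 87.24 | 98.70 | 98.70 | 95.81 | 95.53 | 2.02 | 2.58 | 2.96 | 3.02 | 7.41 | 11.46 | 0.28 |
| 6 | CD56bright % | CD16+ | 0.46 | 0.65 | 5.10 | 5.88 | 1.91 | 1.83 | 1.06 | 1.04 | 1.64 | 0.92 | 4.64 | 5.23 | 0.08 |
| 7 | CD56bright % | CD16high | 0.09 | 0.12 | 1.97 | 1.15 | 0.60 | 0.47 | 0.44 | 0.25 | 0.50 | 0.40 | 1.88 | 1.03 | 0.13 |
| 8 | CD56bright % | CD16low | 0.34 | 0.40 | 3.11 | 4.95 | 1.27 | 1.35 | 0.72 | 0.86 | 0.97 | 0.63 | 2.77 | 4.55 | −0.07 |
| 9 | CD56bright % | CD16- | 0.61 | 0.58 | 5.78 | 9.09 | 2.28 | 2.64 | 1.14 | 1.82 | 1.42 | 1.75 | 5.17 | 8.51 | −0.36 |
| 10 | CD56bright % | CD56brighttotal | 1.30 | 1.30 | 8.71 | 12.76 | 4.19 | 4.47 | 2.02 | 2.58 | 2.95 | 3.01 | 7.41 | 11.46 | −0.28 |
| 11 | CD8 % | CD56+CD8+ | 21.88 | 9.20 | 86.70 | 80.47 | 46.43 | 40.71 | 15.64 | 14.66 | 24.03 | 20.05 | 64.82 | 71.27 | 5.72 |
| 12 | CD8 % | CD56+CD8- | 13.30 | 19.53 | 78.12 | 90.80 | 53.57 | 59.29 | 15.64 | 14.66 | 24.03 | 20.05 | 64.82 | 71.27 | −5.72 |
| 13 | CD8 % | CD56dimCD8+ | 19.63 | 8.60 | 82.38 | 77.47 | 45.18 | 39.11 | 15.31 | 14.10 | 24.72 | 19.36 | 62.75 | 68.87 | 6.07 |
| 14 | CD8 % | CD56brightCD8+ | 0.37 | 0.25 | 4.75 | 6.64 | 1.41 | 1.70 | 1.07 | 1.41 | 0.70 | 1.60 | 4.38 | 6.39 | −0.29 |
| 15 | NKp30 % | CD56+NKp30+ | 40.69 | 56.80 | 96.74 | 98.43 | 79.78 | 88.56 | 16.42 | 10.41 | 21.80 | 10.44 | 56.05 | 41.63 | −8.78 |
| 16 | NKp30 % | CD56+NKp30- | 3.26 | 1.57 | 58.34 | 44.59 | 20.05 | 11.43 | 16.22 | 10.46 | 20.54 | 10.49 | 55.08 | 43.02 | 8.61 |
| 17 | NKp46 % | CD56+NKp46+ | 38.11 | 45.37 | 86.52 | 95.82 | 62.65 | 69.82 | 13.49 | 11.58 | 23.90 | 12.71 | 48.41 | 50.45 | −7.18 |
| 18 | NKp46 % | CD56+NKp46- | 14.02 | 4.32 | 62.97 | 55.68 | 38.40 | 30.87 | 13.58 | 11.64 | 24.89 | 13.44 | 48.95 | 51.36 | 7.53 |
| 19 | DNAM-1 % | CD56+DNAM-1+ | 63.69 | 88.56 | 99.18 | 99.60 | 95.35 | 96.46 | 6.81 | 2.59 | 3.37 | 3.49 | 35.49 | 11.04 | −1.11 |
| 20 | DNAM-1 % | CD56+DNAM-1- | 0.86 | 0.42 | 37.29 | 11.66 | 4.74 | 3.59 | 6.96 | 2.61 | 3.45 | 3.54 | 36.43 | 11.24 | 1.14 |
| 21 | NKG2D % | CD56+NKG2D+ | 85.17 | 80.79 | 98.77 | 98.96 | 93.49 | 94.07 | 4.45 | 4.87 | 6.81 | 3.83 | 13.60 | 18.17 | −0.58 |
| 22 | NKG2D % | CD56+NKG2D- | 1.22 | 1.03 | 14.76 | 19.12 | 6.44 | 5.84 | 4.36 | 4.76 | 6.80 | 3.96 | 13.54 | 18.09 | 0.60 |
| 23 | NKp44 % | CD56+NKp44+ | 0.43 | 0.28 | 3.71 | 6.77 | 1.16 | 1.34 | 0.82 | 1.20 | 0.78 | 1.25 | 3.28 | 6.49 | −0.18 |
| 24 | NKp44 % | CD56+NKp44- | 96.10 | 93.70 | 99.53 | 99.70 | 98.82 | 98.64 | 0.83 | 1.13 | 0.80 | 1.25 | 3.43 | 6.00 | 0.18 |
| 25 | CD85j % | CD56+CD85j+ | 19.53 | 14.21 | 84.73 | 91.59 | 53.37 | 55.10 | 19.04 | 18.34 | 30.49 | 20.23 | 65.20 | 77.38 | −1.74 |
| 26 | CD85j % | CD56+CD85j- | 14.93 | 8.50 | 81.54 | 86.08 | 46.94 | 45.24 | 19.21 | 18.43 | 30.28 | 21.48 | 66.61 | 77.58 | 1.69 |
| 27 | LAIR-1 % | CD56+LAIR-1+ | 94.97 | 21.43 | 99.90 | 99.89 | 99.07 | 97.47 | 1.07 | 12.19 | 0.49 | 0.47 | 4.93 | 78.46 | 1.60 |
| 28 | LAIR-1 % | CD56+LAIR-1- | 0.02 | 0.05 | 5.24 | 78.20 | 0.76 | 2.40 | 1.02 | 12.15 | 0.42 | 0.43 | 5.22 | 78.15 | −1.65 |
| 29 | NKG2A % | CD56+NKG2A+ | 20.43 | 19.01 | 77.57 | 73.01 | 46.14 | 44.24 | 17.41 | 13.73 | 30.82 | 17.47 | 57.14 | 54.00 | 1.90 |
| 30 | NKG2A % | CD56+NKG2A- | 22.62 | 27.11 | 79.40 | 80.85 | 54.01 | 55.99 | 17.39 | 13.67 | 30.48 | 17.90 | 56.78 | 53.74 | −1.98 |
| 31 | 2B4 % | CD56+2B4+ | 98.41 | 97.06 | 99.99 | 99.96 | 99.53 | 99.50 | 0.39 | 0.59 | 0.32 | 0.33 | 1.58 | 2.90 | 0.02 |
| 32 | 2B4 % | CD56+2B4- | 0.01 | 0.05 | 1.59 | 2.95 | 0.48 | 0.50 | 0.39 | 0.59 | 0.31 | 0.34 | 1.58 | 2.90 | −0.02 |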
  1. Min. is the minimum value, Max. is maximum value, Mean is the mean or average value, and Std. is Standard Deviation. Range is the difference between the minimum and maximum values. The Interquartile range (IQR) is a measure of data variability and was derived by computing the distance between the Upper Quartile (i.e. top) and Lower Quartile (i.e. bottom) of the boxes illustrated in Figure 1. Difference is computed as diff = mean(Benign)-mean(Cancer).
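The summary statistics defined in this footnote can be computed as in the sketch below. It is illustrative only: the two value lists are made up rather than patient data, the sample standard deviation is assumed, and the quartile convention of `statistics.quantiles` may differ from the one behind the published IQR values.

```python
from statistics import mean, quantiles, stdev

def describe(values):
    """Summary statistics in the spirit of Table 1 for one feature in one group."""
    q1, _, q3 = quantiles(values, n=4)  # lower quartile, median, upper quartile
    return {
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "std": stdev(values),           # sample standard deviation (assumption)
        "iqr": q3 - q1,
        "range": max(values) - min(values),
    }

# Made-up PSA-like values for two groups, for illustration only.
benign = [4.7, 6.0, 8.0, 10.0, 19.0]
cancer = [4.7, 7.0, 8.5, 9.5, 19.0]

# diff = mean(Benign) - mean(Cancer), as defined in the footnote.
diff = describe(benign)["mean"] - describe(cancer)["mean"]
```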

Table 2
Tests of normality results.
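| No. | Marker group | NK cell value | Kolmogorov-Smirnov: Statistic | Kolmogorov-Smirnov: df | Kolmogorov-Smirnov: Sig. | Shapiro-Wilk: Statistic | Shapiro-Wilk: df | Shapiro-Wilk: Sig. |
|---|---|---|---|---|---|---|---|---|
| 1 | CD56dim | CD16+ | 0.15 | 71 | 0.00 | 0.85 | 71 | 0.00 |
| 2 | CD56dim | CD16high | 0.11 | 71 | 0.03 | 0.89 | 71 | 0.00 |
| 3 | CD56dim | CD16low | 0.17 | 71 | 0.00 | 0.79 | 71 | 0.00 |
| 4 | CD56dim | CD16- | 0.19 | 71 | 0.00 | 0.82 | 71 | 0.00 |
| 5 | CD56dim | CD56dimtotal% | 0.15 | 71 | 0.00 | 0.91 | 71 | 0.00 |
| 6 | CD56bright | CD16+ | 0.13 | 71 | 0.00 | 0.88 | 71 | 0.00 |
| 7 | CD56bright | CD16high | 0.15 | 71 | 0.00 | 0.87 | 71 | 0.00 |
| 8 | CD56bright | CD16low | 0.14 | 71 | 0.00 | 0.85 | 71 | 0.00 |
| 9 | CD56bright | CD16- | 0.16 | 71 | 0.00 | 0.86 | 71 | 0.00 |
| 10 | CD56bright | CD56brighttotal | 0.15 | 71 | 0.00 | 0.91 | 71 | 0.00 |
| 11 | CD8 | CD56+CD8+ | 0.10 | 71 | 0.06 | 0.98 | 71 | 0.17 |
| 12 | CD8 | CD56+CD8- | 0.10 | 71 | 0.06 | 0.98 | 71 | 0.17 |
| 13 | CD8 | CD56dimCD8+ | 0.09 | 71 | 0.20* | 0.98 | 71 | 0.24 |
| 14 | CD8 | CD56brightCD8+ | 0.19 | 71 | 0.00 | 0.82 | 71 | 0.00 |
| 15 | NKp30 | CD56+NKp30+ | 0.21 | 71 | 0.00 | 0.81 | 71 | 0.00 |
| 16 | NKp30 | CD56+NKp30- | 0.21 | 71 | 0.00 | 0.81 | 71 | 0.00 |
| 17 | NKp46 | CD56+NKp46+ | 0.08 | 71 | 0.20* | 0.98 | 71 | 0.52 |
| 18 | NKp46 | CD56+NKp46- | 0.07 | 71 | 0.20* | 0.99 | 71 | 0.57 |
| 19 | DNAM-1 | CD56+DNAM-1+ | 0.23 | 71 | 0.00 | 0.56 | 71 | 0.00 |
| 20 | DNAM-1 | CD56+DNAM-1- | 0.23 | 71 | 0.00 | 0.55 | 71 | 0.00 |
| 21 | NKG2D | CD56+NKG2D+ | 0.19 | 71 | 0.00 | 0.84 | 71 | 0.00 |
| 22 | NKG2D | CD56+NKG2D- | 0.18 | 71 | 0.00 | 0.85 | 71 | 0.00 |
| 23 | NKp44 | CD56+NKp44+ | 0.18 | 71 | 0.00 | 0.76 | 71 | 0.00 |
| 24 | NKp44 | CD56+NKp44- | 0.17 | 71 | 0.00 | 0.78 | 71 | 0.00 |
| 25 | CD85j | CD56+CD85j+ | 0.11 | 71 | 0.05 | 0.96 | 71 | 0.02 |
| 26 | CD85j | CD56+CD85j- | 0.10 | 71 | 0.07 | 0.96 | 71 | 0.02 |
| 27 | LAIR-1 | CD56+LAIR-1+ | 0.43 | 71 | 0.00 | 0.14 | 71 | 0.00 |
| 28 | LAIR-1 | CD56+LAIR-1- | 0.43 | 71 | 0.00 | 0.14 | 71 | 0.00 |
| 29 | NKG2A | CD56+NKG2A+ | 0.09 | 71 | 0.20* | 0.97 | 71 | 0.11 |
| 30 | NKG2A | CD56+NKG2A- | 0.08 | 71 | 0.20* | 0.97 | 71 | 0.10 |
| 31 | 2B4 | CD56+2B4+ | 0.23 | 71 | 0.00 | 0.75 | 71 | 0.00 |
| 32 | 2B4 | CD56+2B4- | 0.23 | 71 | 0.00 | 0.75 | 71 | 0.00 |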
  1. *. This is a lower bound of the true significance.

    Values in bold are for features whose data are normally distributed.

  2. If p > 0.05, we fail to reject the null hypothesis that the data follow a normal distribution, so we treat the data for those features as normally distributed.

    If p < 0.05, we reject the null hypothesis because the data differ significantly from a normal distribution, so we treat the data for those features as not normally distributed.
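The decision rule in this footnote, together with the quantity that a Kolmogorov-Smirnov test measures, can be sketched as follows. This is an illustration under stated assumptions: the normal distribution is fitted with the sample mean and sample standard deviation, the function names are hypothetical, and no p-value is computed here, since a corrected p-value requires a dedicated statistics package.

```python
from statistics import NormalDist, mean, stdev

def ks_distance_to_normal(values):
    """Largest gap between the empirical CDF and a normal CDF fitted to the data."""
    xs = sorted(values)
    n = len(xs)
    fitted = NormalDist(mean(xs), stdev(xs))
    d = 0.0
    for i, x in enumerate(xs):
        cdf = fitted.cdf(x)
        # Compare the fitted CDF with the empirical CDF just after and just before x.
        d = max(d, (i + 1) / n - cdf, cdf - i / n)
    return d

def treat_as_normal(p_value, alpha=0.05):
    """Footnote rule: p > alpha means normality is not rejected."""
    return p_value > alpha

# Made-up samples: one roughly symmetric, one strongly skewed.
symmetric = [1, 2, 3, 4, 5, 6, 7, 8, 9]
skewed = [1, 1, 1, 1, 1, 1, 1, 1, 50]
d_symmetric = ks_distance_to_normal(symmetric)
d_skewed = ks_distance_to_normal(skewed)
```

A larger distance means a poorer fit to the normal curve, which is consistent with the largest statistics in Table 2 (for example the LAIR-1 features) carrying the smallest p-values.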

Table 3
Results of the Kruskal-Wallis test.
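| No. | Marker group | Feature | Chi-Sq. (χ²) | Asymp. sig. (p value) |
|---|---|---|---|---|
| | PSA | PSA | 0 | 0.949 |
| | NK cells | | | |
| 1 | CD56dim | CD16+ | 0.001 | 0.981 |
| 2 | CD56dim | CD16high | 0.069 | 0.793 |
| 3 | CD56dim | CD16low | 0.555 | 0.456 |
| 4 | CD56dim | CD16- | 0.033 | 0.857 |
| 5 | CD56dim | CD56dimtotal% | 0.063 | 0.802 |
| 6 | CD56bright | CD16+ | 0.836 | 0.361 |
| 7 | CD56bright | CD16high | 0.201 | 0.654 |
| 8 | CD56bright | CD16low | 0.106 | 0.744 |
| 9 | CD56bright | CD16- | 0.030 | 0.861 |
| 10 | CD56bright | CD56brighttotal | 2.415 | 0.120 |
| 11 | CD8 | CD56+CD8+ | 2.415 | 0.120 |
| 12 | CD8 | CD56+CD8- | 2.849 | 0.091 |
| 13 | CD8 | CD56dimCD8+ | 0.417 | 0.518 |
| 14 | CD8 | CD56brightCD8+ | 7.230 | 0.007 |
| 15 | NKp30 | CD56+NKp30+ | 7.106 | 0.008 |
| 16 | NKp30 | CD56+NKp30- | 4.638 | 0.031 |
| 17 | NKp46 | CD56+NKp46+ | 5.179 | 0.023 |
| 18 | NKp46 | CD56+NKp46- | 0.001 | 0.981 |
| 19 | DNAM-1 | CD56+DNAM-1+ | 0.001 | 0.972 |
| 20 | DNAM-1 | CD56+DNAM-1- | 0.293 | 0.588 |
| 21 | NKG2D | CD56+NKG2D+ | 0.325 | 0.568 |
| 22 | NKG2D | CD56+NKG2D- | 0.033 | 0.857 |
| 23 | NKp44 | CD56+NKp44+ | 0.072 | 0.789 |
| 24 | NKp44 | CD56+NKp44- | 0.049 | 0.825 |
| 25 | CD85j | CD56+CD85j+ | 0.072 | 0.789 |
| 26 | CD85j | CD56+CD85j- | 2.135 | 0.144 |
| 27 | LAIR-1 | CD56+LAIR-1+ | 1.343 | 0.247 |
| 28 | LAIR-1 | CD56+LAIR-1- | 0.060 | 0.807 |
| 29 | NKG2A | CD56+NKG2A+ | 0.072 | 0.789 |
| 30 | NKG2A | CD56+NKG2A- | 0.879 | 0.348 |
| 31 | 2B4 | CD56+2B4+ | 0.890 | 0.346 |
| 32 | 2B4 | CD56+2B4- | 0.890 | 0.346 |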
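For readers who want to see how a chi-square statistic and asymptotic p-value of this kind arise, the sketch below computes the Kruskal-Wallis H statistic for two groups, assuming the comparison is benign versus cancer. It is a minimal standard-library illustration with made-up values, not the authors' analysis; with two groups the chi-square approximation has one degree of freedom, so the p-value can be written with `math.erfc`.

```python
import math
from collections import Counter

def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for t in range(i, j + 1):
            ranks[order[t]] = (i + j) / 2 + 1  # mean of ranks i+1 .. j+1
        i = j + 1
    return ranks

def kruskal_wallis_two_groups(a, b):
    """Return (H, p) for two groups; p uses the chi-square approximation, df = 1."""
    pooled = list(a) + list(b)
    n = len(pooled)
    ranks = average_ranks(pooled)
    total = 0.0
    for start, group in ((0, a), (len(a), b)):
        rank_sum = sum(ranks[start:start + len(group)])
        total += rank_sum ** 2 / len(group)
    h = 12 / (n * (n + 1)) * total - 3 * (n + 1)
    # Standard tie correction; skipped when there are no ties or all values tie.
    ties = sum(c ** 3 - c for c in Counter(pooled).values())
    if 0 < ties < n ** 3 - n:
        h /= 1 - ties / (n ** 3 - n)
    h = max(h, 0.0)  # guard against tiny negative values from rounding
    p = math.erfc(math.sqrt(h / 2))  # chi-square survival function for df = 1
    return h, p

# Made-up values for illustration only.
h_stat, p_value = kruskal_wallis_two_groups([1, 2, 3], [4, 5, 6])
```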
Table 4
Results of the Genetic Algorithm when searching for the best subset of features.
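| λ | No. of different combinations | Combination with highest frequency | Frequency of combination | Relative frequency (%) |
|---|---|---|---|---|
| 2 | 3 | 17, 28 | 16 | 53.3 |
| 3 | 2 | 17, 27, 29 | 23 | 76.7 |
| 4 | 1 | 2, 20, 27, 28 | 30 | 100.0 |
| 5 | 2 | 3, 20, 27, 28, 32 | 29 | 96.7 |
| 6 | 2 | 3, 7, 20, 27, 28, 32 | 26 | 86.7 |
| 7 | 3 | 3, 7, 20, 23, 27, 28, 32 | 24 | 80.0 |
| 8 | 4 | 3, 7, 20, 22, 23, 27, 28, 32 | 19 | 63.3 |
| 9 | 3 | 3, 7, 19, 20, 22, 23, 27, 28, 32 | 24 | 80.0 |
| 10 | 3 | 2, 3, 7, 19, 20, 22, 23, 27, 28, 32 | 21 | 70.0 |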
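The frequency columns of this table can be derived by tallying which feature subset each Genetic Algorithm run returned, assuming 30 runs per λ (consistent with the relative frequencies shown). In the sketch below only the winning subset (17, 28) with 16 occurrences and the count of three distinct combinations reflect the λ = 2 row; the other two subsets and their counts are hypothetical fillers.

```python
from collections import Counter

def most_frequent_subset(selected_subsets):
    """Summarise repeated feature-selection runs.

    Returns (number of distinct subsets, most frequent subset,
    its frequency, its relative frequency in percent).
    """
    counts = Counter(tuple(sorted(s)) for s in selected_subsets)
    subset, freq = counts.most_common(1)[0]
    relative = round(100 * freq / len(selected_subsets), 1)
    return len(counts), subset, freq, relative

# 30 runs for lambda = 2; the subsets other than (17, 28) are made up.
runs = [(17, 28)] * 16 + [(15, 28)] * 9 + [(17, 27)] * 5
summary = most_frequent_subset(runs)
```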
Table 5
Naming of the models includes the feature selection method (GA) combined with the proposed Ensemble Subspace kNN classifier.

Validation results are presented at k = 10 fold cross validation.

Results of 10-fold cross validation over 30 runs
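| Feature set | Statistic | AUC | ORP FPR | ORP TPR | ACC | Mean std. | Rank |
|---|---|---|---|---|---|---|---|
| GA | Mean | 0.776 | 0.296 | 0.833 | 0.781 | | 4 |
| GA | Std. | 0.024 | 0.065 | 0.026 | 0.023 | 0.035 | |
| STAT | Mean | 0.769 | 0.303 | 0.828 | 0.774 | | 5 |
| STAT | Std. | 0.022 | 0.057 | 0.023 | 0.021 | 0.031 | |
| GA+STAT | Mean | 0.818 | 0.201 | 0.836 | 0.821 | | 1 |
| GA+STAT | Std. | 0.021 | 0.027 | 0.021 | 0.020 | 0.022 | |
| PSA+GA+STAT | Mean | 0.812 | 0.208 | 0.832 | 0.815 | | 2 |
| PSA+GA+STAT | Std. | 0.020 | 0.031 | 0.018 | 0.019 | 0.022 | |
| PSA | Mean | 0.698 | 0.217 | 0.609 | 0.692 | | 6 |
| PSA | Std. | 0.022 | 0.025 | 0.043 | 0.020 | 0.028 | |
| All features | Mean | 0.812 | 0.213 | 0.836 | 0.815 | | 3 |
| All features | Std. | 0.022 | 0.035 | 0.021 | 0.021 | 0.025 | |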
Table 6
Comparing the performance of the proposed Ensemble Subspace kNN model against conventional machine learning models when using the GA+STAT feature set.

Results of 10-fold cross validation over 30 runs.

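For the proposed ensemble subspace kNN (EkNN) model, NL is the number of learners (30) and SD is the subspace dimension (16). Acc. diff. compares EkNN with the model in that row.

| Model | Setting | Statistic | AUC | ORP FPR | ORP TPR | ACC | Acc. diff. (EkNN vs. model) |
|---|---|---|---|---|---|---|---|
| Proposed ensemble subspace kNN (EkNN) | NL: 30, SD: 16 | Mean | 0.818 | 0.201 | 0.836 | 0.821 | |
| Proposed ensemble subspace kNN (EkNN) | NL: 30, SD: 16 | Std. | 0.021 | 0.027 | 0.021 | 0.020 | |
| Simple kNN (Distance: Euclidean) | k = 2 | Mean | 0.768 | 0.241 | 0.730 | 0.751 | +0.070 |
| Simple kNN (Distance: Euclidean) | k = 2 | Std. | 0.119 | 0.160 | 0.393 | 0.128 | −0.108 |
| Simple kNN (Distance: Euclidean) | k = 5 | Mean | 0.778 | 0.300 | 0.833 | 0.783 | +0.038 |
| Simple kNN (Distance: Euclidean) | k = 5 | Std. | 0.107 | 0.265 | 0.103 | 0.103 | −0.083 |
| Simple kNN (Distance: Euclidean) | k = 10 | Mean | 0.753 | 0.371 | 0.845 | 0.758 | +0.063 |
| Simple kNN (Distance: Euclidean) | k = 10 | Std. | 0.137 | 0.350 | 0.120 | 0.131 | −0.111 |
| Support Vector Machine | Kernel: Linear | Mean | 0.782 | 0.342 | 0.860 | 0.784 | +0.037 |
| Support Vector Machine | Kernel: Linear | Std. | 0.126 | 0.352 | 0.110 | 0.120 | −0.100 |
| Support Vector Machine | Kernel: Gaussian | Mean | 0.808 | 0.353 | 0.876 | 0.799 | +0.022 |
| Support Vector Machine | Kernel: Gaussian | Std. | 0.112 | 0.416 | 0.107 | 0.111 | −0.091 |
| Naive Bayes | Predictor distributions: Normal | Mean | 0.695 | 0.132 | 0.455 | 0.662 | +0.159 |
| Naive Bayes | Predictor distributions: Normal | Std. | 0.169 | 0.163 | 0.493 | 0.181 | −0.161 |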
Table 7
Ad hoc test results.
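| No. | Group 1 | Group 2 | Lower limit (95% CI) | Difference between means | Upper limit (95% CI) | P |
|---|---|---|---|---|---|---|
| 1 | GA | STAT | −12.658 | 1.317 | 15.292 | 1.000 |
| 2 | GA | GA+STAT | −22.208 | −8.233 | 5.742 | 0.525 |
| 3 | GA | PSA | −4.992 | 8.983 | 22.958 | 0.344 |
| 4 | GA | PSA+GA+STAT | −20.792 | −6.817 | 7.158 | 1.000 |
| 5 | STAT | GA+STAT | −23.525 | −9.550 | 4.425 | 0.245 |
| 6 | STAT | PSA | −6.308 | 7.667 | 21.642 | 0.710 |
| 7 | STAT | PSA+GA+STAT | −22.108 | −8.133 | 5.842 | 0.555 |
| 8 | GA+STAT | PSA | 3.242 | 17.217 | 31.192 | 0.001 |
| 9 | GA+STAT | PSA+GA+STAT | −12.558 | 1.417 | 15.392 | 1.000 |
| 10 | PSA | PSA+GA+STAT | −29.775 | −15.800 | −1.825 | 0.002 |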
  1. The first two columns show the groups that are compared. The third and fifth columns show the lower and upper limits for 95% confidence intervals for the true mean difference. The fourth column shows the difference between the estimated group means. The sixth column contains the p-value for testing a hypothesis that the corresponding mean difference is equal to zero.

Table 8
Results of the best prediction models created during the 30 runs.

Validation results are presented at k = 10 fold cross validation.

Best prediction model results
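| Feature set | AUC | ORP FPR | ORP TPR | Accuracy | Rank |
|---|---|---|---|---|---|
| GA | 0.818 | 0.192 | 0.829 | 0.820 | 3 |
| GA+STAT | 0.853 | 0.157 | 0.862 | 0.855 | 1 |
| PSA | 0.734 | 0.218 | 0.685 | 0.730 | 5 |
| PSA+GA+STAT | 0.844 | 0.175 | 0.864 | 0.848 | 2 |
| STAT | 0.811 | 0.227 | 0.85 | 0.817 | 4 |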
Key resources table
| Reagent type (species) or resource | Designation | Source or reference | Identifiers | Additional information |
|---|---|---|---|---|
| Biological Sample | Hyclone fetal bovine serum (FBS) | GE Healthcare Life Sciences | SV30180.03 | |
| Antibody | Monoclonal mouse IgG1 kappa anti human DNAM-1 (CD226) (clone 11A8); FITC | BioLegend | 338304 | 5 μl per tube / 10⁶ cells |
| Antibody | Monoclonal mouse IgG1 kappa anti human NKG2D (CD314) (clone 1D11); PE | eBioscience | 12-5878-42 | 5 μl per tube / 10⁶ cells |
| Antibody | Monoclonal mouse IgG1 kappa anti human CD56 (clone N901); ECD (PE-Texas Red) | Beckman Coulter | A82943 | 2.5 μl per tube / 10⁶ cells |
| Antibody | Monoclonal mouse IgG1 kappa anti human CD16 (clone 3G8); PerCP-Cy5.5 | BioLegend | 302028 | 5 μl per tube / 10⁶ cells |
| Antibody | Monoclonal mouse IgG1 kappa anti human NKp46 (CD335) (clone 9E2); PE-Cy7 | BioLegend | 331916 | 5 μl per tube / 10⁶ cells |
| Antibody | Monoclonal mouse IgG1 kappa anti human NKp30 (CD337) (clone P30-15); Alexa Fluor 647 | BioLegend | 325212 | 5 μl per tube / 10⁶ cells |
| Antibody | Monoclonal mouse IgG1 kappa anti human CD3 (clone UCHT1); Alexa Fluor 700 | BioLegend | 300424 | 2 μl per tube / 10⁶ cells |
| Antibody | Monoclonal mouse IgG1 kappa anti human CD19 (clone HIB19); Alexa Fluor 700 | BioLegend | 302226 | 1 μl per tube / 10⁶ cells |
| Antibody | Monoclonal mouse IgG1 kappa anti human CD8 (clone SK1); APC-Cy7 | BioLegend | 344714 | 2.5 μl per tube / 10⁶ cells |
| Antibody | Monoclonal mouse IgG2b anti human CD85j (ILT2) (clone GHI/75); FITC | Miltenyi Biotec | 130-098-437 | 10 μl per tube / 10⁶ cells |
| Antibody | Monoclonal mouse IgG1 kappa anti human LAIR-1 (CD305) (clone DX26); PE | BD Biosciences | 550811 | 20 μl per tube / 10⁶ cells |
| Antibody | Monoclonal mouse IgG2b anti human NKG2A (CD159a) (clone Z199); PE-Cy7 (PC7) | Beckman Coulter | B10246 | 20 μl per tube / 10⁶ cells |
| Antibody | Monoclonal mouse IgG1 kappa anti human NKp44 (CD336) (clone P44-8); Alexa Fluor 647 | BioLegend | 325112 | 5 μl per tube / 10⁶ cells |
| Antibody | Monoclonal mouse IgG1 kappa anti human 2B4 (CD244.2) (clone C1.7); FITC | BioLegend | 329506 | 5 μl per tube / 10⁶ cells |
| Chemical Compound | LIVE/DEAD Fixable Violet Dead Stain | Thermo Fisher Scientific | L34955 | 1 μl in 1 μl |
| Chemical Compound | Novagen Benzonase Nuclease | Merck Millipore | 70664 | |
| Chemical Compound | CTL Wash Solution | Cellular Technology Limited | CTLW-010 | |
| Chemical Compound | Trypan Blue viability stain | Santa Cruz | sc-216028 | |
| Chemical Compound | Dimethyl sulfoxide (DMSO) | Santa Cruz | sc-202581 | |
| Chemical Compound | Calbiochem bovine serum albumin (BSA) | Merck Millipore | 2905-OP | |
| Chemical Compound | Sigma-Aldrich sodium azide | Merck Millipore | S8032 | |
| Chemical Compound | Sigma-Aldrich lithium heparin | Merck Millipore | H0878 | |
| Chemical Compound | Ficoll-Paque | GE Healthcare Life Sciences | 17-1440-03 | |
| Chemical Compound | Isoton II isotonic buffered saline solution | Beckman Coulter | 844 80 11 | |
| Chemical Compound | RPMI medium | Lonza | 12-167Q | |
| Chemical Compound | Phosphate Buffered Saline (PBS) | Lonza | 17-517Q | |
| Other | Leucosep tubes | Greiner Bio-One International | 227290 | |
| Software | Kaluza v1.3 | Beckman Coulter | | |
Table 9
Patient clinical features.
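| Patient group | Gleason score | Number of patients | Age range (years) | PSA range (ng/ml) |
|---|---|---|---|---|
| Benign | Benign | 9 | 64–71 | 5.3–15 |
| Benign | HGPIN | 9 | 54–70 | 5.1–12 |
| Benign | Atypia | 10 | 50–76 | 4.7–19 |
| Benign | ASAP | 2 | 59–60 | 5.3–7.8 |
| Cancer | Gleason 6 | 16 | 55–80 | 4.7–11 |
| Cancer | Gleason 7 | 23 | 53–77 | 4.7–19 |
| Cancer | Gleason 9 | 2 | 65–75 | 6.3–18 |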
Table 10
Dataset used for differentiating between patients with L/I and H cancer.
| Patient group | Count | % |
|---|---|---|
| L/I | 38 | 70.37 |
| H | 16 | 29.63 |
Table 11
Antibody panels for measuring the phenotype of Natural Killer cells.
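| Panel | Antibody | Fluorochrome | Clone no. | Supplier |
|---|---|---|---|---|
| Panel 1 | DNAM-1 (CD226) | FITC | 11A8 | BioLegend |
| Panel 1 | NKG2D (CD314) | PE | 1D11 | eBioscience |
| Panel 1 | CD56 | ECD (PE-Texas Red) | N901 | Beckman Coulter |
| Panel 1 | CD16 | PerCP-Cy5.5 | 3G8 | BioLegend |
| Panel 1 | NKp46 (CD335) | PE-Cy7 | 9E2 | BioLegend |
| Panel 1 | NKp30 (CD337) | Alexa Fluor 647 | P30-15 | BioLegend |
| Panel 1 | CD3 | Alexa Fluor 700 | UCHT1 | BioLegend |
| Panel 1 | CD19 | Alexa Fluor 700 | HIB19 | BioLegend |
| Panel 1 | CD8 | APC-Cy7 | SK1 | BioLegend |
| Panel 1 | LIVE/DEAD | Dye (violet) | | Thermo Fisher Scientific |
| Panel 2 | CD85j (ILT2) | FITC | GHI/75 | Miltenyi Biotec |
| Panel 2 | LAIR-1 (CD305) | PE | DX26 | BD Biosciences |
| Panel 2 | CD56 | ECD (PE-Texas Red) | N901 | Beckman Coulter |
| Panel 2 | CD16 | PerCP-Cy5.5 | 3G8 | BioLegend |
| Panel 2 | NKG2A (CD159a) | PC7 (PE-Cy7) | Z199 | Beckman Coulter |
| Panel 2 | NKp44 (CD336) | Alexa Fluor 647 | P44-8 | BioLegend |
| Panel 2 | CD3 | Alexa Fluor 700 | UCHT1 | BioLegend |
| Panel 2 | CD19 | Alexa Fluor 700 | HIB19 | BioLegend |
| Panel 2 | CD8 | APC-Cy7 | SK1 | BioLegend |
| Panel 2 | LIVE/DEAD | Dye (violet) | | Thermo Fisher Scientific |
| Panel 3 | 2B4 (CD244.2) | FITC | C1.7 | BioLegend |
| Panel 3 | CD56 | ECD (PE-Texas Red) | N901 | Beckman Coulter |
| Panel 3 | CD16 | PerCP-Cy5.5 | 3G8 | BioLegend |
| Panel 3 | CD3 | Alexa Fluor 700 | UCHT1 | BioLegend |
| Panel 3 | CD19 | Alexa Fluor 700 | HIB19 | BioLegend |
| Panel 3 | CD8 | APC-Cy7 | SK1 | BioLegend |
| Panel 3 | LIVE/DEAD | Dye (violet) | | Thermo Fisher Scientific |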


Simon P Hood, Georgina Cosma, Gemma A Foulds, Catherine Johnson, Stephen Reeder, Stéphanie E McArdle, Masood A Khan, A Graham Pockley (2020) Identifying prostate cancer and its clinical risk in asymptomatic men using machine learning of high dimensional peripheral blood flow cytometric natural killer cell subset phenotyping data. eLife 9:e50936. https://doi.org/10.7554/eLife.50936