Removing unwanted variation with CytofRUV to integrate multiple CyTOF datasets

  1. Marie Trussart (corresponding author)
  2. Charis E Teh
  3. Tania Tan
  4. Lawrence Leong
  5. Daniel HD Gray
  6. Terence P Speed
  1. Bioinformatics Division, Walter and Eliza Hall Institute of Medical Research, Australia
  2. School of Mathematics and Statistics, The University of Melbourne, Australia
  3. The Walter and Eliza Hall Institute of Medical Research, Australia
  4. Department of Medical Biology, The University of Melbourne, Australia
6 figures, 3 tables and 1 additional file

Figures

Figure 1
Visualisation of batch effects on the median protein expression across batches.

(A) Multi-dimensional scaling plot of the 24 samples computed using median protein expression. (B) Heatmap of the median protein expression of 19 lineage proteins and 12 functional proteins across all cells measured for each sample in the dataset.

Figure 2 with 2 supplements
Distribution of BCL-2 expression.

(A) Distributions of BCL-2 expression in one sample from one treated CLL cancer patient, replicated across 8 CyTOF batches, coloured by batch. (B) Distributions of BCL-2 expression in one sample from each of 6 different CLL cancer patients at screening, processed in a single CyTOF batch, coloured by patient.

Figure 2—figure supplement 1
Protein distributions before normalisation for the samples CLL2 and HC1 across batch 1 and batch 2.

(A) Lineage protein expression distributions for the sample CLL2 across batches, before normalisation. (B) Same as (A) with functional proteins. (C) Lineage protein expression distributions for the sample HC1 across batches, before normalisation. (D) Same as (C) with functional proteins.

Figure 2—figure supplement 2
Protein distributions after CytofRUV normalisation with k = 5 for the samples CLL2 and HC1 across batch 1 and batch 2.

(A) Lineage protein expression distributions for the CLL2 sample across batches, after CytofRUV normalisation. (B) Same as (A) with functional proteins. (C) Lineage protein expression distributions for the sample HC1 across batches, after CytofRUV normalisation. (D) Same as (C) with functional proteins.

Figure 3 with 4 supplements
Cell clustering plots show batch effects in cells from the same cancer patient CLL1 sample replicated across 2 CyTOF runs.

(A) Cell clustering identification. t-SNE plot based on the arcsinh-transformed expression of the 19 lineage proteins in the cells. For display purposes, 2000 cells were randomly selected from each of the samples. Cells are coloured according to the 20 clusters obtained using FlowSOM clustering stratified by batch 1 (left) or 2 (right) of the corresponding replicated sample. (B) Same as in (A) selecting only cluster 9 cells but coloured by the batch 1 or 2 of the corresponding replicated sample. (C) Same as in (A) but after CytofRUV normalisation with k = 5. (D) Same as in (B) but after CytofRUV normalisation with k = 5. (E) Linear discriminant analysis applied to data on two cell types from the same sample replicated across two batches, with shape indicating cell type and colour indicating batch. (F) Cluster proportions. Barplot of the relative abundance (percentage) of the cells in clusters 2, 6 and 7 by batch.
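The arcsinh transformation mentioned in panel (A) can be illustrated with a minimal sketch. The cofactor of 5 is an assumption (a value commonly used for CyTOF data); the caption does not state which cofactor was applied, and the function name is hypothetical.

```python
import math


def arcsinh_transform(intensities, cofactor=5.0):
    """Return asinh(x / cofactor) for each raw intensity.

    The default cofactor of 5 is an assumed, commonly used value for
    CyTOF data; it is not taken from the caption.
    """
    return [math.asinh(x / cofactor) for x in intensities]


# Hypothetical raw intensities for one protein in four cells.
raw = [0.0, 5.0, 50.0, 500.0]
transformed = arcsinh_transform(raw)
```

The transformation is close to linear near zero and close to logarithmic for large values, which compresses the long right tail of the raw intensities before clustering or plotting.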

Figure 3—figure supplement 1
Heatmap of the median lineage protein expression across clusters with the associated cluster percentages measured for all cells and all samples in the first dataset of samples from 3 patients with CLL and 9 HC.

Figure 3—figure supplement 2
BCL-2 median expression in the main CLL cluster from the 3 CLL samples replicated across 2 CyTOF runs.

(A) Before and (B) After CytofRUV normalisation with k = 5.

Figure 3—figure supplement 3
Linear discriminant analysis plot to show batch effects in cells from the same cancer patient CLL1 sample replicated across batches after CytofRUV normalisation with k = 5.

Figure 3—figure supplement 4
Boxplot of the differences in median protein expression across batches before and after CytofRUV normalisation (ΔΔ, see Materials and methods) with k = 5 within the main cell subpopulation.

(A) ΔΔ differences were computed on the 3 replicated CLL samples within the main CLL cluster. (B) ΔΔ differences were computed on the 9 replicated HC samples within the main HC cluster.

Figure 4
CytofRUV’s R-Shiny application for the identification of batch effects in cluster proportions across batches.

The user can obtain any of the diagnostic plots by selecting one of the options at the top left corner: Median Protein Expression, Protein Expression Distributions, Clustering Results and Cluster Proportions. The selected option displays barplots of cluster proportions across samples before normalisation, grouped by condition (CLL or HC), on a subsample of the whole dataset. Vertical black boxes contain the same sample replicated across batches 1 and 2.

Figure 5 with 1 supplement
Metrics to assess the effectiveness of the normalisation methods.

In all panels, the colour indicates either the raw data or the method used for normalisation. (A) Boxplots of the Earth Mover's Distances (EMD) between paired protein expression distributions across batches for each CLL sample. (B) Hellinger distances between paired cluster proportions across batches for each CLL sample. (C) Mean Silhouette scores computed for all CLL samples on the cluster types (bio) on the x-axis and on batch (batch) on the y-axis. (D) Same as (A) for the HC samples. (E) Same as (B) for the HC samples. (F) Silhouette scores computed for all HC samples on the cluster types (bio) on the x-axis and on batch (batch) on the y-axis.
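The two distances in panels (A) and (B) can be sketched as follows. This is an illustration under stated assumptions, not the computation used for the figure: the EMD here is the one-dimensional form for two equal-sized samples (no binning), the function names are hypothetical, and the example values are invented.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two discrete distributions.

    Inputs are non-negative weights of equal length (for example
    cluster proportions in two batches); each is rescaled to sum to 1.
    The result lies between 0 (identical) and 1 (disjoint support).
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    total_p, total_q = sum(p), sum(q)
    p = [x / total_p for x in p]
    q = [x / total_q for x in q]
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared / 2.0)


def emd_1d(x, y):
    """Earth Mover's Distance between two equal-sized 1D samples.

    For equal sample sizes this is the mean absolute difference
    between the sorted values (assumed simplification: no binning).
    """
    if len(x) != len(y):
        raise ValueError("this sketch requires equal sample sizes")
    xs, ys = sorted(x), sorted(y)
    return sum(abs(a - b) for a, b in zip(xs, ys)) / len(xs)


# Invented cluster proportions of one replicated sample in two batches.
props_batch1 = [0.5, 0.3, 0.2]
props_batch2 = [0.4, 0.4, 0.2]
h = hellinger(props_batch1, props_batch2)

# Invented expression values of one protein in the same two batches.
shift = emd_1d([0.0, 1.0, 2.0], [1.0, 2.0, 3.0])  # a pure shift of 1 gives 1.0
```

For a replicated sample, both distances would be expected to move towards 0 after an effective normalisation, since the same sample is being compared with itself across batches.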

Figure 5—figure supplement 1
EMD for all the proteins by cluster, before and after CytofRUV normalisation of the CLL2 sample, k = 5.

(A) EMD for lineage proteins by cluster before (black) and after (blue) CytofRUV normalisation. (B) Same as (A), with functional proteins.

Figure 6 with 3 supplements
CytofRUV performance on two other datasets with multiple batches.

(A) Barplot of proportions of clusters across 28 samples from the BatchAdjust dataset (Schuyler et al., 2019) before normalisation, by samples and coloured by cluster. Vertical black boxes contain the same sample (Stimulated or Unstimulated) replicated across 14 batches. (B) Protein expression distribution from the CytoNorm dataset (Van Gassen et al., 2019) before normalisation of all cells from the stimulated samples across 10 batches and coloured by batch. (C) Same as (A) but after CytofRUV normalisation with k = 10. (D) Same as (B) but after CytofRUV normalisation with k = 5.

Figure 6—figure supplement 1
Metrics to assess the effectiveness of the normalisation methods on the BatchAdjust dataset.

In all panels, the colour indicates either the raw data or the method used for normalisation. (A) Boxplots of the Earth Mover's Distance (EMD) between paired protein expression distributions for each batch’s stimulated sample compared to that in the first batch. (B) Hellinger distances between paired cell subpopulation proportions for each batch’s stimulated sample compared to that in the first batch. (C) Mean Silhouette scores computed for all stimulated and unstimulated samples with those for cell subpopulations (bio) on the x-axis and for batch (batch) on the y-axis. (D) Same as (A) for the unstimulated samples. (E) Same as (B) for the unstimulated samples.
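The pair of silhouette scores in panel (C) can be illustrated with a small sketch: the same cells are scored twice, once with cell subpopulation labels (bio) and once with batch labels (batch). The code assumes Euclidean distance and scores singleton groups as 0; the function name and the toy data are hypothetical, not taken from the article.

```python
import math


def mean_silhouette(points, labels):
    """Mean silhouette width over all points, using Euclidean distance."""
    groups = {}
    for index, label in enumerate(labels):
        groups.setdefault(label, []).append(index)
    if len(groups) < 2:
        raise ValueError("at least two distinct labels are required")

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in groups[label] if j != i]
        if not own:
            scores.append(0.0)  # assumed convention for singleton groups
            continue
        # a: mean distance to the other members of the point's own group.
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other group.
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in groups.items()
            if other != label
        )
        largest = max(a, b)
        scores.append((b - a) / largest if largest > 0 else 0.0)
    return sum(scores) / len(scores)


# Toy data: two well-separated cell types, each measured in two batches.
cells = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
cell_type = ["B", "B", "T", "T"]
batch = [1, 2, 1, 2]

bio_score = mean_silhouette(cells, cell_type)  # separation by cell type
batch_score = mean_silhouette(cells, batch)    # separation by batch
```

In this toy example the cells group by type rather than by batch, so the bio score is high and the batch score is low, which is the pattern one would hope to see after batch effects have been removed.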

Figure 6—figure supplement 2
Metrics to assess the effectiveness of the normalisation methods on samples two from the CytoNorm dataset using stimulated samples two as replicated reference samples.

In all panels, the colour indicates either the raw data or the method used for normalisation. (A) Boxplots of the Earth Mover's Distance (EMD) between paired protein expression distributions for each batch’s stimulated sample compared to that in the first batch. (B) Hellinger distances between paired cell subpopulation proportions for each batch’s stimulated sample compared to that in the first batch. (C) Mean Silhouette scores computed for all stimulated and unstimulated samples with those for cell subpopulations (bio) on the x-axis and for batch (batch) on the y-axis. (D) Same as (A) for the unstimulated samples. (E) Same as (B) for the unstimulated samples.

Figure 6—figure supplement 3
Metrics to assess the effectiveness of the normalisation methods on samples two from the CytoNorm dataset using both the stimulated and unstimulated samples one as replicated reference samples.

In all panels, the colour indicates either the raw data or the method used for normalisation. (A) Boxplots of the Earth Mover's Distance (EMD) between paired protein expression distributions for each batch’s stimulated sample compared to that in the first batch. (B) Hellinger distances between paired cell subpopulation proportions for each batch’s stimulated sample compared to that in the first batch. (C) Mean Silhouette scores computed for all stimulated and unstimulated samples with those for cell subpopulations (bio) on the x-axis and for batch (batch) on the y-axis. (D) Same as (A) for the unstimulated samples. (E) Same as (B) for the unstimulated samples.

Tables

Table 1
Sample descriptions.

The first column indicates the sample id, the second the patient condition, either healthy controls (HC) or chronic lymphocytic leukaemia (CLL), the third column indicates the patient id and the last indicates the batch number, 1 or 2.

Sample Id   Condition   Patient Id   Batch
HC1_B1      HC          VBDR996      1
HC2_B1      HC          VBDR1089     1
HC3_B1      HC          VBDR1090     1
HC4_B1      HC          VDBR1098     1
HC5_B1      HC          VDBR1108     1
HC6_B1      HC          VDBR1103     1
HC7_B1      HC          VDBR1105     1
HC8_B1      HC          VDBR1107     1
HC9_B1      HC          VBDR1111     1
CLL1_B1     CLL         DG33-01      1
CLL2_B1     CLL         DG23-01      1
CLL3_B1     CLL         DG27-01      1
HC1_B2      HC          VBDR996      2
HC2_B2      HC          VBDR1089     2
HC3_B2      HC          VBDR1090     2
HC4_B2      HC          VDBR1098     2
HC5_B2      HC          VDBR1108     2
HC6_B2      HC          VDBR1103     2
HC7_B2      HC          VDBR1105     2
HC8_B2      HC          VDBR1107     2
HC9_B2      HC          VBDR1111     2
CLL1_B2     CLL         DG33-01      2
CLL2_B2     CLL         DG23-01      2
CLL3_B2     CLL         DG27-01      2

Table 2
Lineage surface proteins selected.

The first column indicates the transition element isotope (mass number, element name), the second column indicates the antigen selected, and the last two columns indicate the clone name and vendor.

      Metal    Lineage (surface) protein antibody   Clone         Vendor
1     89 Y     CD45                                 HI30          BioLegend
2     115 In   HLA-DR                               L243          BioLegend
3     140 Ce   CD27                                 M-T271        BioLegend
4     141 Pr   CD235a/b                             HIR2          BioLegend
5     142 Nd   CD19                                 HIB19         BioLegend
6     143 Nd   CD5                                  UCHT2         BioLegend
7     144 Nd   CD38                                 HIT2          BioLegend
8     145 Nd   CD4                                  RPA-T4        BioLegend
9     146 Nd   CD8                                  RPA-T8        BioLegend
10    147 Sm   CD20                                 H1            BD
11    148 Nd   CD16                                 3G8           BioLegend
12    151 Eu   CD123                                6H6           BioLegend
13    155 Gd   CD56                                 B159          BioLegend
14    156 Gd   CD14                                 HCD56         BioLegend
15    159 Tb   CD11c                                Bu15          BioLegend
16    169 Tm   CD45RA                               HI100         BioLegend
17    170 Er   CD3                                  UCHT1         BioLegend
18    171 Yb   CD66                                 CD66a-B1.1    DVS
19    209 Bi   CD61                                 VI-PL2        DVS

Table 3
Set of intracellular functional proteins selected.

The first column indicates the transition element isotope (mass number, element name), the second column indicates the antigen selected, and the last two columns indicate the clone name and vendor.

      Metal    Functional (intracellular) protein antibody   Clone       Vendor
1     140 Ce   BAK                                           7D10        WEHI
2     153 Eu   Bcl-xL                                        E18         Abcam
3     154 Sm   Bax                                           1B4         WEHI
4     157 Gd   Bcl-2                                         100         WEHI
5     160 Gd   Mcl-1                                         Y37         Abcam
6     161 Dy   cMyc                                          D84C12      CST
7     163 Dy   BFL-1                                         SP435       Abcam
8     165 Ho   Bim                                           3C5         WEHI
9     166 Er   pRb [S807/811]                                J112-906    BD
10    172 Yb   BCLW                                          16H12       WEHI
11    173 Yb   cCaspase3                                     C92-605     BD
12    174 Yb   p53                                           7F5         CST

Cite this article

Marie Trussart, Charis E Teh, Tania Tan, Lawrence Leong, Daniel HD Gray, Terence P Speed (2020)
Removing unwanted variation with CytofRUV to integrate multiple CyTOF datasets
eLife 9:e59630.
https://doi.org/10.7554/eLife.59630