Avoiding false discoveries in single-cell RNA-seq by revisiting the first Alzheimer’s disease dataset

  1. Alan E Murphy (corresponding author)
  2. Nurun Fancy
  3. Nathan Skene (corresponding author)

  1. UK Dementia Research Institute at Imperial College London, United Kingdom
  2. Department of Brain Sciences, Imperial College London, United Kingdom
3 figures, 4 tables and 1 additional file

Figures

Pseudobulk differential expression results in far fewer dubious disease-related genes.

(a, b) The log2 fold change and -log10 false discovery rate (FDR) of the differentially expressed genes (DEGs) from the authors’ original work (Mathys et al.) and our reanalysis (Our analysis). In (b), the dashed grey line marks an FDR of 5 × 10⁻⁷ to highlight how small the p-values from Mathys et al.’s analysis are. For (a, b), n is the number of DEGs: 26 for our analysis and 23,923 for Mathys et al. (c–g) The Pearson correlation between the cell counts after quality control (QC) and the number of DEGs identified; n is the 6 cell types tested. For the (f, g) analysis, the samples were randomly permuted between case and control patients (n = 100 random permutations). The different cell types are astrocytes (Astro), excitatory neurons (Exc), inhibitory neurons (Inh), microglia (Micro), oligodendrocytes (Oligo), and oligodendrocyte precursor cells (OPC).
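The label permutation described for panels (f, g) can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `permute_labels`, the seed, and the 4-versus-4 design are placeholders chosen for the example, and only the count of 100 permutations comes from the caption.

```python
import random

def permute_labels(labels, n_permutations=100, seed=0):
    """Return shuffled copies of the case/control labels.

    Each copy keeps the same number of cases and controls as the input;
    only the assignment of labels to samples changes.
    """
    rng = random.Random(seed)  # fixed seed (an assumption) for repeatability
    permutations = []
    for _ in range(n_permutations):
        shuffled = list(labels)  # copy so the original order is untouched
        rng.shuffle(shuffled)
        permutations.append(shuffled)
    return permutations

# Placeholder design, NOT the study cohort: 4 case and 4 control samples.
labels = ["case"] * 4 + ["control"] * 4
permutations = permute_labels(labels)
print(len(permutations), permutations[0])
```

Each permuted labelling would then be passed through the same differential expression step as the real labels, so that any DEGs found reflect the method rather than disease status.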

Nuclei that were removed by our quality control approach because their proportion of mitochondrial reads was ≥10%, but that were kept in the authors’ analysis.

(a) shows the proportion of mitochondrial reads across the different cell types. (b) gives the number of removed nuclei which were kept by the authors. The different cell types are astrocytes (Ast), excitatory neurons (Ex), inhibitory neurons (In), microglia (Mic), oligodendrocytes (Oli), and oligodendrocyte precursor cells (Opc).
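The ≥10% threshold in this caption amounts to a simple filter. The sketch below assumes a mapping from nucleus identifier to mitochondrial read proportion; the function name, identifiers, and values are hypothetical and do not come from the dataset.

```python
def split_by_mito_proportion(mito_proportion, threshold=0.10):
    """Split nuclei into kept and removed lists.

    A nucleus is removed when its proportion of mitochondrial reads is
    greater than or equal to the threshold, matching the >=10% rule.
    """
    kept, removed = [], []
    for nucleus_id, proportion in mito_proportion.items():
        if proportion >= threshold:
            removed.append(nucleus_id)
        else:
            kept.append(nucleus_id)
    return kept, removed

# Toy values for illustration only.
toy = {"nucleus_a": 0.02, "nucleus_b": 0.10, "nucleus_c": 0.35}
kept, removed = split_by_mito_proportion(toy)
print(kept, removed)
```

Note that a nucleus sitting exactly on the threshold is removed, because the rule is "greater than or equal to".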

The proportion of cells left after quality control (QC) from the authors’ processing approach (Mathys et al.) and our standardised pipeline approach – scFlow (Our analysis).

Tables

Table 1
Overview of the aggregated number of cells across samples removed at each step of the quality control (QC) as part of scFlow.

Note that cells can fail QC for more than one check, so only the total failed and total passed rows will sum to 100%.

| QC step | Total cells | Percentage (%) |
|---|---|---|
| Pre-QC | 35,389,440 | |
| Total failed | 35,337,874 | 99.85 |
| Minimum library size (n < 200) | 35,307,281 | 99.77 |
| Maximum library size | 4742 | 0.01 |
| Minimum expressed genes (n < 200) | 35,312,434 | 99.78 |
| Maximum library size/expressed genes (MAD > 4) | 2149 | 0.01 |
| Proportion of mitochondrial genes (≥ 0.1) | 1,097,738 | 3.10 |
| Multiplets (pK = 0.0054) | 581 | 0.00 |
| Total passed | 51,566 | 0.15 |
  1. MAD, median absolute deviation.
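The arithmetic behind Table 1 can be checked with a few lines. The counts below are copied from the table; the variable and function names are illustrative only.

```python
# Counts copied from Table 1.
pre_qc = 35_389_440
total_failed = 35_337_874
total_passed = 51_566

failed_by_check = {
    "Minimum library size": 35_307_281,
    "Maximum library size": 4_742,
    "Minimum expressed genes": 35_312_434,
    "Maximum library size/expressed genes": 2_149,
    "Proportion of mitochondrial genes": 1_097_738,
    "Multiplets": 581,
}

def percent_of_pre_qc(count):
    """Express a cell count as a percentage of the pre-QC total."""
    return round(100 * count / pre_qc, 2)

# Failed and passed cells partition the pre-QC total.
print(total_failed + total_passed == pre_qc)
# The per-check counts overlap, so their sum exceeds the total failed.
print(sum(failed_by_check.values()), ">", total_failed)
print(percent_of_pre_qc(total_failed), percent_of_pre_qc(total_passed))
```

The per-check counts sum to far more than the total number of failed cells, which is why only the "Total failed" and "Total passed" rows add up to 100%.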

Table 2
The differentially expressed genes from our reanalysis using the same processed data the authors used and a pseudobulk differential expression approach.
| Cell | logFC | logCPM | LR | p-value | adj_pval | HGNC |
|---|---|---|---|---|---|---|
| Mic | 2.70178913 | 6.99794619 | 26.1418415 | 3.17E-07 | 0.00061349 | ACRBP |
| Mic | 1.48930071 | 8.06240877 | 28.6361217 | 8.73E-08 | 0.00019303 | APOC1 |
| Mic | 1.09327669 | 8.64199769 | 21.5323014 | 3.48E-06 | 0.00336416 | CD81 |
| Mic | -1.4157681 | 7.93884875 | 23.9955467 | 9.66E-07 | 0.00135806 | CD83 |
| Mic | 3.3782727 | 6.86183548 | 32.0804401 | 1.48E-08 | 4.58E-05 | CLEC1B |
| Mic | 2.84072452 | 6.74370542 | 21.7745509 | 3.07E-06 | 0.00316269 | EGF |
| Mic | 2.55769658 | 6.78345087 | 18.0468872 | 2.16E-05 | 0.01699007 | ELOVL7 |
| Mic | -1.2056098 | 8.33197499 | 22.6644045 | 1.93E-06 | 0.00229576 | IFI44L |
| Mic | -1.6616069 | 7.15366639 | 16.4801274 | 4.92E-05 | 0.03306938 | IFI6 |
| Mic | -1.9809425 | 7.00396289 | 17.9180823 | 2.31E-05 | 0.01699007 | IFIT3 |
| Mic | 2.76502672 | 6.72978805 | 20.6543637 | 5.50E-06 | 0.00472825 | ITGA2B |
| Mic | 1.90963403 | 7.01552233 | 16.3200189 | 5.35E-05 | 0.03448474 | MAP1A |
| Mic | -1.8194508 | 8.26208887 | 45.2221008 | 1.76E-11 | 1.36E-07 | NAMPT |
| Mic | 2.0945044 | 7.11048456 | 20.8068524 | 5.08E-06 | 0.00462318 | NEXN |
| Mic | -2.3789762 | 6.93896985 | 22.3912441 | 2.22E-06 | 0.00245752 | NR4A2 |
| Mic | -2.8553462 | 6.73713862 | 22.8029868 | 1.79E-06 | 0.00229576 | NR4A3 |
| Mic | 3.32873829 | 6.84942721 | 30.955327 | 2.64E-08 | 6.81E-05 | PF4 |
| Mic | 3.4213986 | 6.87326383 | 33.2621657 | 8.05E-09 | 3.11E-05 | PKHD1L1 |
| Mic | 3.64525677 | 6.93422174 | 38.661272 | 5.04E-10 | 2.60E-06 | PPBP |
| Mic | 2.30482679 | 8.10570443 | 60.7932697 | 6.34E-15 | 9.81E-11 | PTPRG |
| Mic | -1.0382468 | 8.11450266 | 15.5968273 | 7.84E-05 | 0.04850839 | RORA |
| Mic | 2.54636649 | 6.69202981 | 17.2532606 | 3.27E-05 | 0.02300507 | SDPR |
| Mic | -0.9629617 | 8.8434334 | 17.9319131 | 2.29E-05 | 0.01699007 | SYTL3 |
| Mic | -1.4215374 | 7.99629806 | 25.4736272 | 4.48E-07 | 0.00077092 | TMEM2 |
| Mic | 2.98901596 | 6.77276641 | 24.2100819 | 8.64E-07 | 0.00133637 | TUBB1 |
| Opc | -2.8274718 | 5.03371292 | 22.1334581 | 2.54E-06 | 0.04176231 | EGR1 |
  1. CPM, counts per million; LR, likelihood ratio statistic; HGNC, HUGO Gene Nomenclature Committee.
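The LR and p-value columns of Table 2 can be related to each other under one assumption that the table itself does not state: that LR is a likelihood-ratio statistic compared against a chi-squared distribution with one degree of freedom, as is common for pseudobulk tests of a single coefficient. Under that assumption the upper-tail probability has the closed form erfc(sqrt(LR / 2)), which the sketch below applies to three rows copied from the table. The function name is illustrative.

```python
from math import erfc, sqrt

def chi2_upper_tail_1df(statistic):
    """Upper-tail probability of a chi-squared variable with 1 degree of freedom."""
    return erfc(sqrt(statistic / 2.0))

# (gene, LR, reported p-value) copied from Table 2.
rows = [
    ("ACRBP", 26.1418415, 3.17e-07),
    ("PTPRG", 60.7932697, 6.34e-15),
    ("EGR1", 22.1334581, 2.54e-06),
]

for gene, lr, reported_p in rows:
    recomputed_p = chi2_upper_tail_1df(lr)
    print(f"{gene}: recomputed {recomputed_p:.3g}, reported {reported_p:.3g}")
```

For the rows checked here, the recomputed values should agree with the reported p-values to within rounding, which supports reading LR as a test statistic rather than as a ratio of fold changes.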

Table 3
Pearson correlation between our pseudobulk differential expression analysis and the authors’ pseudoreplication analysis on all genes found to be significant at different adjusted p-value cut-offs from the authors’ pseudoreplication analysis.
| Pseudoreplication adjusted p-value cut-off | Number of genes compared | Pearson correlation |
|---|---|---|
| 0.01 | 20,152 | 0.8646269 |
| 0.05 | 23,903 | 0.8708275 |
| 0.1 | 26,382 | 0.8721126 |
| 0.25 | 32,117 | 0.8764692 |
| 0.5 | 42,022 | 0.8751554 |
| 1 | 84,467 | 0.826248 |
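The comparison summarised in Table 3 can be sketched as below. Two things are assumptions rather than facts from the caption: that the quantity being correlated is the log fold change from each analysis, and that a gene is included when its pseudoreplication adjusted p-value is less than or equal to the cut-off. The dictionary keys, function names, and toy values are hypothetical.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (needs non-zero variance)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def correlation_by_cutoff(genes, cutoffs):
    """Map each cut-off to (number of genes compared, Pearson correlation)."""
    results = {}
    for cutoff in cutoffs:
        subset = [g for g in genes if g["pseudorep_adj_p"] <= cutoff]
        xs = [g["pseudorep_logfc"] for g in subset]
        ys = [g["pseudobulk_logfc"] for g in subset]
        results[cutoff] = (len(subset), pearson(xs, ys))
    return results

# Toy values for illustration only, NOT taken from either analysis.
toy_genes = [
    {"pseudorep_adj_p": 0.001, "pseudorep_logfc": 1.0, "pseudobulk_logfc": 0.8},
    {"pseudorep_adj_p": 0.02, "pseudorep_logfc": -0.5, "pseudobulk_logfc": -0.6},
    {"pseudorep_adj_p": 0.04, "pseudorep_logfc": 0.3, "pseudobulk_logfc": 0.1},
    {"pseudorep_adj_p": 0.3, "pseudorep_logfc": 0.2, "pseudobulk_logfc": -0.4},
    {"pseudorep_adj_p": 0.9, "pseudorep_logfc": -0.1, "pseudobulk_logfc": 0.5},
]
results = correlation_by_cutoff(toy_genes, [0.05, 1])
print(results)
```

Loosening the cut-off adds genes with weaker evidence to the comparison, which is the pattern the "Number of genes compared" column tracks.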
Table 4
The differentially expressed genes from our reanalysis using the reprocessed data and a pseudobulk differential expression approach.
| Cell | logFC | logCPM | LR | p-value | adj_pval | ensembl_id | HGNC |
|---|---|---|---|---|---|---|---|
| OPC | -4.1544663 | 4.92100803 | 21.6911445 | 3.20E-06 | 0.04985906 | ENSG00000166573 | GALR1 |
| Astro | -4.5845276 | 4.7965143 | 22.2367847 | 2.41E-06 | 0.037634 | ENSG00000137959 | IFI44L |
| Micro | -3.7616619 | 7.32875316 | 26.8149688 | 2.24E-07 | 0.00077905 | ENSG00000077238 | IL4R |
| Micro | -2.0681446 | 7.88736441 | 17.5929095 | 2.74E-05 | 0.0346187 | ENSG00000105835 | NAMPT |
| Micro | -1.6757556 | 7.58472506 | 19.1736829 | 1.19E-05 | 0.02076348 | ENSG00000118257 | NRP2 |
| Micro | -3.1556403 | 6.85232653 | 19.2064627 | 1.17E-05 | 0.02076348 | ENSG00000135363 | LMO2 |
| Micro | -3.4339265 | 6.9290472 | 19.5975589 | 9.56E-06 | 0.02076348 | ENSG00000138135 | CH25H |
| Micro | -2.8183109 | 6.77500676 | 16.907959 | 3.92E-05 | 0.04550806 | ENSG00000142408 | CACNG8 |
| Micro | 2.90076647 | 8.34560617 | 45.5144266 | 1.52E-11 | 2.11E-07 | ENSG00000144724 | PTPRG |
| Micro | 3.25867589 | 6.91671013 | 16.5519147 | 4.73E-05 | 0.0490155 | ENSG00000163106 | HPGDS |
| Micro | -2.0290905 | 7.12321166 | 16.4746746 | 4.93E-05 | 0.0490155 | ENSG00000171612 | SLC25A33 |
| Micro | -3.4657301 | 6.93307221 | 19.7883301 | 8.65E-06 | 0.02076348 | ENSG00000172243 | CLEC7A |
| Micro | -4.172807 | 7.16813583 | 34.3515807 | 4.60E-09 | 3.20E-05 | ENSG00000174600 | CMKLR1 |
| Micro | -3.1984588 | 6.87310555 | 18.5335889 | 1.67E-05 | 0.0232342 | ENSG00000227531 | RP11-202G18.1 |
| Micro | 3.40562887 | 6.9381703 | 18.5526502 | 1.65E-05 | 0.0232342 | ENSG00000228058 | RP11-552D4.1 |
| Micro | 4.46073301 | 7.66559163 | 29.7716679 | 4.86E-08 | 0.00022549 | ENSG00000253496 | RP11-13N12.1 |

Cite this article

Alan E Murphy, Nurun Fancy, Nathan Skene (2023)
Avoiding false discoveries in single-cell RNA-seq by revisiting the first Alzheimer’s disease dataset
eLife 12:RP90214.
https://doi.org/10.7554/eLife.90214.3