Overview of the benchmark. (A) The datasets used consist of silver standards generated from single-cell RNA-seq data, gold standards from imaging-based data, and a case study on liver data. Our simulation engine synthspot enables the creation of artificial tissue patterns. (B) We evaluated deconvolution methods on three overall performance metrics (RMSE, AUPR, and JSD) and further examined specific aspects of performance, i.e., how well methods detect rare cell types and how they handle reference datasets from different sequencing technologies. (C) Our benchmarking pipeline is fully accessible and reproducible through the use of Docker containers and Nextflow. (D) To evaluate performance on the liver case study, we leveraged prior knowledge of the localization and composition of cell types in the liver to calculate the AUPR and JSD. We also investigated method performance across three different sequencing protocols.
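As a rough illustration of the three metrics named in (B), the sketch below computes RMSE, JSD, and AUPR from a spots × cell types matrix of known proportions and a matching matrix of predictions. The function names, the use of NumPy/SciPy/scikit-learn, and the exact per-spot averaging are assumptions, not the benchmark's actual implementation.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon
from sklearn.metrics import average_precision_score

def rmse(known, predicted):
    """Root-mean-squared error over all spot-by-cell-type entries."""
    return float(np.sqrt(np.mean((known - predicted) ** 2)))

def mean_jsd(known, predicted):
    """Mean Jensen-Shannon divergence between per-spot proportion vectors."""
    # SciPy's jensenshannon returns the JS distance, i.e. the square root of the divergence
    return float(np.mean([jensenshannon(k, p, base=2) ** 2 for k, p in zip(known, predicted)]))

def aupr(known, predicted):
    """Average precision for detecting whether each cell type is present in a spot."""
    present = (known > 0).astype(int).ravel()  # binary ground truth: is the cell type in the spot?
    return float(average_precision_score(present, predicted.ravel()))
```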

(a) Methods ordered according to the aggregated rankings of performance and scalability. (b) Performance of each method across metrics, artificial abundance patterns in the silver standard, and data sources. The ability to detect rare cell types and robustness to different reference datasets are also included. (c) Average runtime across the silver standards and scalability as the number of spots increases.

Method performance on synthetic datasets, evaluated using the root-mean-squared error (RMSE), area under the precision-recall curve (AUPR), and Jensen-Shannon divergence (JSD). NNLS is the baseline algorithm (shaded). Methods are ordered by summed rank, from best to worst. (a) The rank distribution of each method across all 54 silver standards, where each method is ranked on its median value across the ten replicates of that standard. (b) Gold standards from the seqFISH+ (10,000 genes) and STARmap (400 genes) datasets. Values for the seqFISH+ dataset are averaged over its seven fields of view.
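A minimal sketch of the ranking scheme described in (a), assuming a tidy results table with one row per method, silver standard, and replicate (column names are hypothetical): methods are scored by their median over replicates within each standard, ranked within each standard, and ordered overall by summed rank.

```python
import pandas as pd

results = pd.DataFrame({
    "method":    ["A", "A", "B", "B", "A", "A", "B", "B"],
    "standard":  ["s1", "s1", "s1", "s1", "s2", "s2", "s2", "s2"],
    "replicate": [1, 2, 1, 2, 1, 2, 1, 2],
    "rmse":      [0.10, 0.12, 0.15, 0.14, 0.09, 0.11, 0.08, 0.10],
})

# Median over replicates within each silver standard, then rank methods on that median.
# Lower is better for RMSE/JSD; AUPR would use ascending=False instead.
medians = results.groupby(["standard", "method"])["rmse"].median().reset_index()
medians["rank"] = medians.groupby("standard")["rmse"].rank(ascending=True)

# Summed ranks across standards give the best-to-worst ordering used in the figure.
print(medians.groupby("method")["rank"].sum().sort_values())
```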

Detection of the rare cell type in the two rare-cell-type abundance patterns. (a) Area under the precision-recall curve in six datasets, averaged over ten replicates. Methods generally achieve a better AUPR when the rare cell type is present in all regions than when it is present in only one region. (b) Most methods can detect moderately and highly abundant cell types, but their performance drops for lowly abundant ones.
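One plausible way to compute the AUPR for rare-cell-type detection, sketched with synthetic numbers: spots that truly contain the rare cell type are the positives, and a method's predicted proportion of that cell type serves as the score. The simulated truth and scores below are purely illustrative.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, auc

rng = np.random.default_rng(0)
present = rng.random(200) < 0.05  # rare cell type truly present in roughly 5% of spots
# A hypothetical method's predicted proportions: higher where the cell type is present, plus noise
scores = np.clip(0.2 * present + rng.normal(0.05, 0.03, size=200), 0.0, 1.0)

precision, recall, _ = precision_recall_curve(present.astype(int), scores)
print("AUPR:", auc(recall, precision))
```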

Stability against different reference datasets. For the same synthetic spatial datasets, two different reference datasets were provided, and the Jensen-Shannon divergence (JSD) was computed between the resulting predicted proportions. Methods are ordered by stability, with a lower JSD indicating higher stability.
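A minimal sketch of this stability measure, assuming the two prediction matrices (spots × cell types) share the same cell-type columns; names are illustrative.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def stability_jsd(pred_ref1, pred_ref2):
    """Mean per-spot JSD between predictions obtained with two different references."""
    # A perfectly stable method returns identical proportions with either reference (JSD = 0).
    return float(np.mean([jensenshannon(a, b, base=2) ** 2
                          for a, b in zip(pred_ref1, pred_ref2)]))
```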

Method performance based on detecting portal/central vein endothelial cells in the portal and central veins (AUPR) and on comparing the distributions of all cell types (JSD). The biological variation is the average JSD between four snRNA-seq samples. All reference datasets used here contain nine cell types. Methods are ordered by the summed rank across all data points.
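A hedged sketch of how the two liver evaluations could be set up (toy numbers, not the actual annotations or compositions): the AUPR treats spots annotated as portal/central veins as positives with the corresponding endothelial-cell proportion as the score, while the JSD compares a method's overall predicted composition with an expected composition such as averaged snRNA-seq fractions.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon
from sklearn.metrics import average_precision_score

# (i) AUPR: histology-annotated vein spots are the positives (toy annotation of five spots)
is_vein_spot = np.array([1, 0, 0, 1, 0])
pred_vein_ec = np.array([0.30, 0.05, 0.02, 0.22, 0.08])  # predicted vein endothelial proportion
print("AUPR:", average_precision_score(is_vein_spot, pred_vein_ec))

# (ii) JSD: predicted vs. expected cell-type composition (toy fractions over three types)
predicted = np.array([0.50, 0.30, 0.20])
expected = np.array([0.45, 0.35, 0.20])  # e.g. averaged snRNA-seq fractions
print("JSD:", jensenshannon(predicted, expected, base=2) ** 2)
```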

(a) Runtime over the 540 silver standard datasets, ordered by median runtime. GPU acceleration is used whenever possible (asterisks). Cell2location, stereoscope, DestVI, and STRIDE first run a model-building step for each single-cell reference (red points), which can be reused across 90 synthetic datasets. (b) Scalability of the methods when varying the dimensions of the spatial dataset. For model-based methods, the model-building and fitting times are added together. Methods are ordered by total time.
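A toy cost model of the runtime bookkeeping described above, with made-up numbers: model-based methods pay a one-time model-building cost per single-cell reference, which panel (a) shows separately (shared by 90 synthetic datasets), while panel (b) reports building plus fitting for each run.

```python
# All numbers below are invented for illustration.
build_s = 1800.0  # one-time model building per single-cell reference (model-based methods only)
fit_s = 120.0     # fitting/deconvolution of a single spatial dataset
n_shared = 90     # synthetic datasets that reuse the same reference model in (a)

amortized_per_dataset = build_s / n_shared + fit_s  # effective per-dataset cost when the model is reused
scalability_total = build_s + fit_s                 # building + fitting, as summed in (b)
print(amortized_per_dataset, scalability_total)
```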