Background

The advent of high-throughput single-cell RNA sequencing (scRNA-seq) has enabled the generation of an array of single-cell atlases from pancreatic islets. These findings have expanded our understanding of the major cell types of the pancreas along with how they are implicated in both type 1 diabetes (T1D) and type 2 diabetes (T2D). Notably, these studies have: 1) identified multiple reliable transcriptomic markers for endocrine and exocrine pancreatic cell types; 2) provided insight into novel subtypes of cells; and 3) generated large cellular atlases spurring innovation in the development of single cell methods and analysis. scRNA-seq analysis has enabled a more robust characterization of islet cell heterogeneity which may underlie diversity in diabetes risk and drug response 15. In published comparisons between the islet single-cell transcriptomes of humans versus mice and pigs, differences were observed in the relative proportion of major cell types as well as in many cell type-specific genes 6. These findings further support a focus on integrating human islet transcriptomic data to delineate disease processes and identify therapeutic targets and biomarkers.

As bulk RNA-seq and scRNA-seq human islet datasets have become increasingly available, so too has the need for new computational tools to integrate the transcriptomics and donor metadata. Some sets of islet scRNA-seq data have been combined and made searchable through web portals to browse datasets and compare cell types and marker genes 69. To further utilize these types of integrated transcriptomic data and donor metadata, we previously developed DEGAS as a flexible deep transfer learning framework that can be used to overlay disease status, survival hazard, drug response, and other clinical information directly onto single cells. Machine learning tools like DEGAS have been primarily used in cancer datasets, but not in human pancreatic islets until now.

The reasons why some obese individuals succumb to T2D while other do not likely involves both genetic and environmental factors, but the factors that underlie this transition are incompletely understood10. Analyses of human genomics and islet transcriptomics indicate many β-cell genes have causal roles in T2D 1115. Subsets of β-cells which are more resilient or susceptible to failure under the secretory pressure of insulin resistance may be uncovered by combining transcriptomics with machine learning approaches like DEGAS. Here we have implemented DEGAS to predict T2D- and obesity-associated subclusters of human pancreatic islet β-cells using a combination of publicly available scRNA-seq and bulk RNA-seq human islet data and associated donor metadata. Through this analysis we sought to identify novel and established genes implicated in T2D and obesity, which were up- or down-regulated in subpopulations of β-cells identified by DEGAS, and to validate our findings at the protein level using immunohistochemistry of pancreas tissue from non-diabetic and T2D organ donors. Our current findings applying DEGAS to islet data have implications for β-cell heterogeneity in T2D and obesity. The abundance of T2D-related factors and functional β-cell genes in our analysis validates applying DEGAS to islet data to identify disease associated phenotypes and increase confidence in the novel candidates.

Data description

Human islet bulk transcriptomic dataset acquisition, processing, and analysis

In this study, human islet bulk RNA-seq raw count data (aligned to GRCh38) and donor metadata from Marselli, et al. GSE159984 was downloaded from the Gene Expression Omnibus 16 (Table S1). The rationale for using this dataset was its large sample size of both non-diabetic and T2D samples and the agreement of differentially-expressed genes between the GSE159984 dataset and a similarly large independent dataset17. All genes in the read count data were filtered based on the one-to-one identifier correspondences between gene symbol, Entrez gene identifier, and Ensembl identifier from the Matched Annotation from NCBI and EBI table (MANE GRCh38 v1.1) 18.

Sample data were labeled and grouped by their RRID and disease status (ND, non-diabetes vs. T2D, type 2 diabetes). After filtering, the GSE159984 read count table contained 19,058 genes from N = 27 T2D samples and N = 58 non-diabetes samples (Fig 1A). Differential gene expression analysis between ND and T2D groups was performed using the edgeR likelihood ratio test and cutoffs were FDR ≤ 0.05 and |log2 fold-change| ≥ 0.58. The processed and filtered read count table and edgeR results are provided in Table S2 and the R script used for processing is available on GitHub (https://github.com/kalwatlab/Islet_DEGAS_v1).

Data acquisition and workflow to train DEGAS using human pancreatic islet transcriptomic data for prediction of T2D-associated cells.

A) Read count data for human islet bulk RNA-seq from GSE138748 was downloaded, processed, and matched with donor metadata. The dataset included 58 non-diabetic and 27 type 2 diabetic (T2D) samples. B) Human islet single-cell RNA-seq (scRNA-seq) count matrices were generated by realignment of reads and the datasets were integrated in Seurat. C) DEGAS transfers the bulk donor expression data and clinical trait information to individual cells in the single cell matrix for the purpose of prioritizing cells. This allows cells to be assigned scores which can then be thresholded for downstream analysis. D,E) Bulk RNA-seq data from Marselli et al. was analyzed by edgeR and displayed as both (D) volcano plot and (E) MA plot. Differentially expressed genes (DEGs) are those with p<0.05 and >1.5-fold changed and are highlighted in red. The MA plot in (E) provides a sense of relative transcript abundance among DEGs. F) Gene set enrichment analysis (GSEA) results for Hallmark_Inflammatory_Response show an enrichment for DEGs in T2D human islets. G) A simple comparison of up- and down-regulated genes in T2D human islets in RNA-seq data between Marselli et al. and Asplund et al. shows consistent findings. Significant DEGs from Marselli et al are outlined in black. H) DEGs from Marselli et al. that were also found on the Type 2 Diabetes Knowledge Portal database of known diabetes effector genes.

Human islet single-cell data acquisition, filtering, and integration

We obtained read count tables for five single-cell human islet datasets (GSE84133; GSE85241; E-MTAB-5061; GSE81608; GSE86469 (Table S1)) which were realigned to GRCh38.p5 (hg38) 8 (Fig 1B). Metadata were downloaded from GEO or obtained from the supplemental information of the respective publications. These five datasets were previously integrated and analyzed 8. To prepare the data for our machine learning analyses, we repeated the integration and analysis of these datasets in R using Seurat to exclude cells with low expressed genes and cells with over 20% mitochondrial gene expression (low viability). The upper and lower limits varied between the datasets to account for differences in library preparation and sequencing platform following guidelines in the Seurat documentation.

Datasets were integrated into a single dataset using Seurat version 4.1.3, the Seurat objects were normalized using regularized negative binomial regression (SCTransform) to correct for batch effect and clustered using default settings. The most variable features were used to identify integration anchors which are passed to the IntegrateData function to return a Seurat object containing an integrated expression matrix of all cells.

The clusters were visualized using UMAP and expression of pancreatic hormone genes INS, GCG, SST, PPY, GHRL were used to annotate β, α, delta, γ and ε cell clusters, respectively. Clustering analysis used the Louvain algorithm 19 and labelling was performed to identify different cell types because our integrated dataset contained a mixture of cells from pancreatic islets. In total, 22 clusters were identified from 17,273 pancreatic cells and the clusters were classified into endocrine and non-endocrine cell types based on cell-specific marker expression (Fig 2A). R scripts are available on GitHub (https://github.com/kalwatlab/Islet_DEGAS_v1). The Seurat object containing all read count matrices and merged data and metadata is available on Mendeley20.

Validation of merged scRNA-seq datasets from five human islet studies.

A) After integrating the five scRNA-seq datasets from Fig 1B, gene expression plots were created for the major endocrine and exocrine pancreas genes to demonstrate successful clustering of specific cell types. Metadata was overlaid onto single cell UMAP plots for show cex (B), BMI (C), age (D), and diabetes status (E).

Analyses

Differentially expressed genes from T2D bulk RNA-seq human islets show inflammatory signature and correlate with independent islet datasets

We selected bulk RNA-seq data from Marselli, et al. 2020 which contains 58 non-diabetic and 27 T2D samples with associated metadata for BMI, age, and sex 16 (Fig 1A-C). To determine the suitability of this data for DEGAS, we reanalyzed the read count matrix from GSE159984 using the edgeR likelihood ratio test to identify up- and down-regulated genes, visualized in both volcano (Fig 1D) and MD plot (Fig 1E, Fig S1A) formats to highlight fold-change versus significance and versus overall expression levels, respectively (Table S2). We observed altered expression of genes well-known to be dysregulated in T2D, including IAPP 21, 22, PAX4 23, 24, SLC2A2 25, FFAR426, and ENTPD3 27, 28 (Fig 1D-E).

Gene set enrichment analysis (GSEA) indicated that the gene expression profile of T2D human islets from this dataset was associated with an inflammatory response phenotype (Fig 1F). To compare our analysis of Marselli, et al. GSE159984 with an independent cohort, we selected a bulk RNA-seq human islet transcriptomic analysis from Asplund et al. 17. Because the raw read count data from Asplund et al. is not accessible, we used the published supplementary table of log2-fold changes for differentially expressed genes and compared it with the Marselli dataset (Fig 1G) (Table S2). Significant differentially expressed genes from both Marselli and Asplund data included up-regulation of SFRP4 and PODN, and down-regulation of UNC5D and FFAR4 (Fig S1B). We also found substantial agreement in the directional changes of most other differentially expressed genes from both datasets, indicating that the Marselli islet data represents a suitable cohort for DEGAS analysis. Finally, we compared up- and down-regulated genes from T2D islets in the Marselli dataset with T2D effector genes from the T2D Knowledge Portal 29. Down-regulated genes PAX4 and SLC2A2 and up-regulated genes APOE each have evidence of a strong or causal role in T2D at the genetic level, while CYTIP was up-regulated in T2D islets and has intron variant SNPs associated with T2D (e.g. rs13384965) 30 (Fig 1H).

DEGAS revealed T2D-associated β-cells and marker genes within integrated human islet scRNA-seq data

We obtained five realigned scRNA-seq datasets 8 including Baron (GSE84133), Muraro (GSE85241), Segerstolpe (E-MTAB-5061), Lawlor (GSE86469), and Xin (GSE81608), and integrated them using Seurat 4.3.0 to subtype each of the major cell types for use with DEGAS. All five datasets were successfully integrated resulting in 17,273 cells (Fig S2A) and different cell types were clearly stratified from the scRNA-seq data (Fig 2A). Distinct clusters of cell types were enriched for their respective markers including β-cells (INS), α-cells (GCG), δ-cells (SST), ε-cells (GHRL), PP-cells (PPY), acinar cells (REG1A, PRSS1), ductal cells (KRT19, CFTR), stellate cells (COL6A1), endothelial cells (PLVAP), and mesenchymal cells (CD44) (Fig 2A).

Next, we implemented DEGAS 31, a unique tool that leverages the increased sequencing depth, larger number of clinical covariates and sample sizes of bulk sequencing data. We used the deep learning architecture within DEGAS to project bulk islet sequencing data into the same latent representation as single cell islet data via domain adaptation. From this common latent representation, clinical covariates from the bulk sequencing data (e.g. T2D status and BMI) were projected onto single cells in a process called transfer learning. This process generates unitless disease-association scores for each single cell. After mapping the T2D-association scores to the single cell map (Fig 3A), we observed that β-cells had the lowest scores compared to other cell types (Fig 3B). This may be interpreted as reflecting fewer β-cells or lower insulin expression after onset of T2D 32. We focused on differential T2D-association scores only among β-cells by subsetting them based on INS expression (Fig 4A) and superimposed the T2D-association scores onto each β-cell in the UMAP plot (Fig 4B). β-cells with higher scores (pink coloration) transcriptionally associate with human islet T2D expression profiles more than the lower scoring (black) β-cells. Next, we set quantile thresholds for the β-cell T2D-association scores and grouped the cells into high (upper 20%), medium and low (lower 20%) categories (Fig 4C). We compared the high versus low βT2D-DEGAS groups to genes enriched in either subset of β-cells (Fig 4D, Table S3). High-scoring βT2D-DEGAS cells had elevated expression of CDKN1C (p57Kip2) which binds and inhibits G1 cyclin/CDK complexes and is also involved in cellular sensescence 33 34, and reduced expression of IAPP, consistent with results from bulk T2D islet expression data (Fig 1G). These findings show that the DEGAS can successfully identify diabetes-relevant genes enriched in specific subclusters.

DEGAS analysis based on T2D status.

A) DEGAS T2D association scores (T2D-DEGAS) for each cell was overlaid onto the single cell UMAP plot. Higher scores in pink/red indicate a strong positive association and negative scores in darker black indicate a negative association of those cells with T2D. B) Violin plot displaying the aggregate T2D-DEGAS scores per cell type.

Identification of differentially-expressed genes in high- and low-scoring βT2D-DEGAS cell populations.

A) β-cells were subsetted from all other cells based on INS expression and reclustered. B) DEGAS scores for T2D association (βT2D-DEGAS) were overlaid onto the cell plot. C) β-cells were classified as low, medium and high-scoring βT2D-DEGAS subpopulations for downstream analysis. D) Differential expression analysis of high vs. low βT2D-DEGAS scores. Genes with p<0.05 and >1.5-fold change are highlighted in red. E) Genes from (D) were filtered to remove DEGs that could be identified by comparing β-cells of T2D vs non-diabetic donors in the single cell data (Fig S1D). Gene ontology (GO) analysis of genes enriched in the high (F) and low (G) scoring βT2D-DEGAS subpopulations. H) GSEA results for high-vs. low-scoring βT2D-DEGAS subpopulations. I) Bubble plot highlighting genes driving GO and GSEA categories.

As an additional way to highlight genes that were identified by DEGAS, we excluded genes that were differentially expressed between all non-diabetic (ND) and T2D β-cells (based on the merged scRNA-seq donor metadata Fig S4D, Table S3). This process left over 90% of significant differentially expressed genes remaining which were only identified through DEGAS analysis (Fig 4E). This emphasizes the advantage of using DEGAS to generate continuous variables (disease-association scores) which enable thresholding and prioritizing cellular subpopulations. Gene ontology (GO) analysis on high-scoring βT2D-DEGAS cells showed enrichment in biological process categories including neutrophil degranulation, cell growth, and negative regulation of cell development (Fig 4F, Table S4). These categories were driven in part by immune-related genes (e.g. HLA-B, PSAP) and cell cycle regulator BTG1. Conversely, GO analysis on low-scoring βT2D-DEGAS cells showed enrichment of categories primarily related membrane protein targeting, translation, and mitochondrial electron transport (Fig 4G, Table S4). The top six enriched GO categories were driven largely by RPS/RPL genes. We also performed ranked-list GSEA on these same high-vs. low-scoring βT2D-DEGAS subpopulations which largely agreed with the GO analysis (Fig 4H). We observed significant positive enrichment for hypoxia, TNFα signaling, estrogen response, and glycolysis (Fig S3D), while significant negative enriched pathways included overlapping genes in oxidative phosphorylation and Myc-target pathways (Fig S3E). Examples of genes that drive the GO and GSEA categories are show (Fig 4I) and heterogeneity in the expression of these genes can be seen in β-cell UMAP plots (Fig S3C).

β-cell clusters with high BMI-association scores (high βobese-DEGAS) show distinct differences between non-diabetic and T2D donors

Obesity and T2D are related, but not mutually inclusive. We surmised that obesity-association scores in β-cells may highlight unique subpopulations compared to the βT2D-DEGAS cells. Therefore, we categorized bulk RNA-seq donors by BMI (lean, <25; overweight, 25-30; obese >30) and ran the DEGAS analysis based on these categories to calculate BMI-association scores similar to our approach with T2D. Overlaying the BMI-DEGAS scores onto the islet single-cell data showed a distinct pattern of cell labeling between the obesity-association score (Fig 5A) compared to lean- or overweight-association scores (Fig S5A,B). It is interesting to note that the subpopulation of βlean-DEGAS cells appear to overlap with the subpopulation of low-scoring βT2D-DEGAS cells (Fig S4C), suggesting that single β-cells share features of low BMI and potentially lower T2D risk. The cell types with the highest obesity-association scores were β-cells, acinar cells, and stellate cells (Fig 5B).

DEGAS analysis based on obesity status.

A) DEGAS obesity association scores (obese-DEGAS) for each cell was overlaid onto the single cell UMAP plot. Higher scores in pink/red indicate a strong positive association and negative scores in darker black indicate a negative association of those cells with obesity. B) Violin plot displaying the aggregate obese -DEGAS scores per cell type.

Reclustering β-cells with overlaid obesity-association scores enabled us to distinguish two major subpopulations of high-scoring βobese-DEGAS cells (Fig 6A), a feature that did not occur with βT2D-DEGAS cells (Fig 4B). We observed that a distinguishing difference between these two groups of obesity-associated β-cells was the donor diabetes status (ND vs T2D) in the single-cell metadata. We applied a median quantile threshold to extract these two subpopulations and classified the cell clusters based on their donor diabetes status to create ND-βobese-DEGAS and T2D-βobese-DEGAS subpopulations (Fig 6B). We next determined differentially expressed genes between the ND-βobese-DEGAS and T2D-βobese-DEGAS cells (Fig 6C). Gene ontology analysis of genes upregulated in the ND-βobese-DEGAS cluster indicated an enrichment for unfolded protein response (UPR)_ processes like vesicle transport, translation, and protein folding and stability (Fig 6D). Specifically, the ND-βobese- DEGAS cluster was enriched for expression of adaptive (e.g. MANF and HSPA5) and maladaptive (e.g. TRIB3, DDIT3) UPR genes, as well as genes involved in ER-associated degradation (e.g. SEL1L, DERL2, and SDF2L1). Gene ontology analysis of genes upregulated in the T2D-βobese-DEGAS cluster showed enrichment for hormone transport and secretion pathways (e.g. IGFBP5, UCN3, G6PC2) and inflammatory/immune-related pathways (e.g., HLA-A/B/E, STAT3) (Fig 6E). In parallel, we performed ranked-list GSEA on the ND- and T2D-βobese- DEGAS subpopulations which was largely consistent with gene ontology results (Fig 6F). Significantly enriched pathways for ND-βobese- DEGAS included the UPR and proliferation (E2F and MTORC signaling hallmarks) (Fig S4I), while T2D-βobese- DEGAS enriched pathways were similar to that of high-scoring βT2D-DEGAS cells and included hypoxia and glycolysis (Fig S4J). Examples of genes driving the GO and GSEA categories are shown, as well as CDKN1C and DLK1, given their differential enrichment in both βT2D-DEGAS and T2D-βobese- DEGAS cells (Fig 6G). The transcript heterogeneity for genes highlighted in Fig 6G can be observed in single β-cell expression plots (Fig S4H). Our results support the idea that these genes and pathways may underlie heterogeneity in β-cell resilience or susceptibility to obesity-related stress in the context of T2D development.

Two subpopulations of high βobese-DEGAS cells defined by underlying single cell donor T2D status uncovers stress signature in non-diabetic high-scoring obese-DEGAS β-cells.

A) β-cells were subsetted from Fig 5A based on INS expression and obese-DEGAS scores overlaid onto the single cells. B) β-cells were further subsetted based on both their high obese-DEGAS scores and by the underlying T2D status of the single cell donors, leading to two major subpopulations, ND-βobese-DEGAS and T2D-βobese-DEGAS cells. C) Differential expression analysis comparing ND-βobese-DEGAS and T2D-βobese-DEGAS subpopulations. Genes with p<0.05 and >1.5-fold change are highlighted in red. Gene ontology (GO) analysis of genes enriched in the ND-βobese-DEGAS (D) and T2D-βobese-DEGAS (E) subpopulations. F) Gene set enrichment analysis (GSEA) of ND-vs. T2D-βobese-DEGAS subpopulations. G) Bubble plot highlighting genes driving GO and GSEA categories.

DLK1 is a candidate identified in βT2D-DEGAS and ND-βobese-DEGAS cells and is heterogeneously expressed among diabetic and non-diabetic human pancreatic islets

To support the DEGAS approach for identifying true markers of β-cell subpopulations or heterogeneity associated with diabetes, we selected a candidate gene for immunostaining in FFPE human pancreas sections from non-diabetic and T2D donors. The subpopulations of cells for both βT2D-DEGAS and T2D-βobese-DEGAS exhibited a subset of overlapping enriched genes. One of the more interesting candidates we noted was DLK1 (Delta-Like 1 Homolog), which is known to have strong associations with T1D 35 and T2D 36. DLK1 expression highly overlapped with high scoring βT2D- DEGAS cells (Fig 7A) and with T2D-βobese-DEGAS cells (Fig 7B). DLK1 immunostaining primarily colocalized with β-cells in non-diabetic human pancreas (Fig 7C). DLK1 showed heterogeneous expression within islets and between islets within the same pancreas section, wherein some islets had DLK1/INS co-staining in most β-cells and other islets had only a few DLK1+ β-cells. In T2D pancreas, DLK1 staining was much less intense and in fewer β-cells, yet DLK1+/INS+ cells were observed (Fig 7C). This contrasts with the relatively higher DLK1 gene expression seen in the β-cells from the βT2D-DEGAS and T2D-βobese-DEGAS subpopulations (Fig 4D & 6C) as highlighted in Fig 7A,B.

DLK1 is a β-cell gene heterogeneously expressed between cells and islets of non-diabetic and T2D humans.

A) Formalin-fixed paraffin-embedded (FFPE) sections of human pancreas from non-diabetic (ND) and T2D donors was stained for INS and DLK1. Representative images are shown from three different ND and T2D donors. Arrows indicate INS+/DLK1+ cells. B) UMAP plot overlaying DLK1 expression and βT2D-DEGAS (i) or βobese-DEGAS (ii) scores.

Discussion

Deep transfer learning is a useful approach to identify disease-associated genes in subsets of heterogeneous islet cells

The rapid advancement of deep learning has unveiled new opportunities in diabetes research, specifically the ability to merge and analyze the large amount of publicly available omics data. The DEGAS deep transfer learning framework provides an extremely powerful and versatile tool allowing for time to event outcomes and classification outcomes to be transferred to individual cells. Notably, this direct transfer to single cells avoids the cluster resolution problems that arise using deconvolution approaches, allowing for subsets of cells within a cluster to be assigned different associations with clinical outcomes. In this way, DEGAS allows the definition of essentially new cell subpopulations that would not necessarily be identified by standard clustering algorithms and these subpopulations are rationally selected based on assigned disease association scores. A key point from our study is that the disease-associated single cells identified by DEGAS are those cells whose transcriptomic signature share complex patterns with patient transcriptomics that have known clinical and metabolic attributes. Because of this, the genes identified as up-or down-regulated in subsets of β-cells may be: 1) altered independent of disease due to genetic background of donors; 2) altered downstream of the obese or diabetic states; or 3) representative of resilient or dysfunctional β-cells in the face of metabolic syndrome. Through careful cohort inclusion criteria and batch effect removal, our analyses aim to limit these caveats when possible.

Various deconvolution methods enable estimating relative proportions of cell types in bulk samples using scRNA-seq data (e.g. BSEQ-sc 1, MuSiC 37, CIBERSORT 38, and DWLS 39). These methods have been instrumental in identifying changes to relative quantity of cell types within normal and disease tissues; however, they do not provide detailed information about subsets of cell types which may be related to disease and are limited to the resolution of the predefined cell types through clustering. Early attempts to address this include unsupervised learning algorithms (e.g. RaceID) 40. Now, there are multiple advanced computational approaches to identify subsets of cells related to disease status, survival, drug response, and other disease metrics. These tools can be broadly categorized as ‘cell prioritization algorithms’ and include: DEGAS 31, scAB 41, Scissor 42, and scDEAL 43. Scissor and scAB can assign survival or clinical information from bulk expression data to disparate scRNA-seq datasets using regression and matrix factorization, respectively. scDEAL is a deep learning framework specifically designed for the assignment of drug response information from bulk expression data to subsets of single cells. The major advantages of DEGAS are 1) the ability to select subsets of β-cells at an individual cell resolution based on their disease association; 2) unique qualities of the highly non-linear DEGAS technology vs. linear models of other tools to detect complex molecular programs; and 3) speed and scalability based on efficient implementation. DEGAS has the potential to be used on datasets containing well over 1 million cells, unlike other methods.

Nevertheless, certain linear model-based approaches, like RePACT (regressing principal components for the assembly of continuous trajectory), have successfully uncovered genes associated with β-cell heterogeneity in T2D and obesity 5. In that study, T2D and obesity ‘trajectory’ genes were identified. We compared the genes enriched in our high-scoring βT2D-DEGAS and in our T2D-βobese-DEGAS cells with the corresponding T2D and obesity trajectory genes from Fang et al., respectively (Fig S5, Table S5). We noted substantial agreement between DEGAS and RePACT, for example the transcription factor SIX3 was enriched in T2D-βobese-DEGAS cells and in Fang et al.’s obese trajectory β-cells 5. SIX3 has been implicated as a diabetes risk gene and is important for β-cell function 44, 45. Additionally, both approaches highlighted the association of DLK1 with obesity, but only DEGAS identified DLK1 association with T2D. Overall, the directionality of T2D or obesity gene associations agreed between DEGAS and RePACT results. However, not all the same genes were identified by both approaches, for example, FXYD2 was identified via RePACT, but not in DEGAS, although FXYD2 is a down-regulated gene in T2D in Marselli et al. and Asplund et al. (Fig 1G). Interestingly, FXYD2 has was recently shown to be a downstream effector gene of HNF1A in single human β-cells 46. Down-regulation of HNF1A in T2D β-cells, and reduced FXYD2 expression as a result, may contribute to membrane hyperpolarization and reduced function 46. Distinct from the application of RePACT in Fang, et al.5, we did not observe any differences in HNF1A expression in our DEGAS analyses and FXYD2 was slightly enriched, but not significantly altered, in T2D-βobese- DEGAS cells. Taken together, it is necessary to apply multiple approaches to merged sets of publicly available data because each approach will likely uncover unique and important subsets of β-cells or specific genes.

Differences between high- and low-scoring βT2D-DEGAS subpopulations

High-scoring βT2D-DEGAS cells had enrichment of pathways like hypoxia and TNFα signaling which contain some overlapping genes like BTG1, BTG2, and BHLHE40. BHLHE40 was recently shown to cause hypoxia-induced β-cell dysfuction via interfering with MAFA expression 47. BTG1 and BTG2 are known as anti-proliferation factors. Given the expression of both BTG1 and CDKN1C in high-scoring βT2D-DEGAS cells, it is possible these cells have very low proliferative potential, even in the context of β-cells which are well-known to have low proliferation. Although some genes like CDKN1C can be identified as differentially expressed in simple comparisons of all T2D vs ND donor β-cells in scRNA-seq data, DEGAS has the advantage of prioritization or ranking of candidates. For example, CDKN1C is within the top ten enriched candidates (by adjusted P value and fold-change) in the βT2D-DEGAS cluster but is 286th when simply comparing all T2D vs ND single β-cells (Table S3). Additionally, deletion of CDKN1C has been shown to improve human islet function 48 and gain-of-function mutants are associated with diabetes development and hyperinsulinism 49, 50.

The low-scoring βT2D-DEGAS cells were enriched in pathways involving oxidative phosphorylation and Myc targets. We also noted enrichment of many small and large ribosomal genes (RPS/RPL) in this subpopulation. RPS/RPL ribosomal genes are known to be highly expressed in primary human β-cells 51, and glucose stimulation induced translation of over 50 different RPS/RPL genes in human β-cell line EndoC-βH2 52. This suggest that low-scoring βT2D-DEGAS cells may have relatively increased translation and metabolic activity. Taken together, high-scoring βT2D-DEGAS cells may be reflective of dysfunctional β-cells in either a pre-T2D or T2D-onset state, while low-scoring βT2D-DEGAS cells could potentially represent more resilient β-cells.

Differences between ND-βobese-DEGAS and T2D-βobese-DEGAS subsets of β-cells

Our understanding of why some individuals with obesity develop T2D while others do not is incomplete. A 2021 NHANES report indicated a 41.9% prevalence for obesity in US adults aged 20 and over, with a 14.8% prevalence for diabetes 53, in line with CDC reports 54. This suggests that many individuals with obesity have either not progressed, or may never progress, to T2D. This phenomenon may be related to those individuals categorized as metabolically healthy obese as compared to metabolically unhealthy obese 55. Expanded work with DEGAS may uncover specific genes that underly this relationship. In our analysis we were able to stratify based on T2D status of the single cell donors and overlay the relative obesity-association scores from DEGAS. That analysis allowed us to set rational thresholds for grouping subsets of β-cells to compare ND- and T2D-βobese-DEGAS cells. Perhaps counterintuitively, UPR genes were highly enriched in the ND-βobese-DEGAS cells as opposed to the T2D-βobese- DEGAS cells. It is possible these ND-βobese-DEGAS cells are in a stressed pre-diabetic state on the path to T2D. Dominguez-Gutierrez et al. analyzed islet scRNA-seq data using pseudotime analysis and identified three major β-cell states which were defined by their insulin expression and UPR level 56. The authors speculate that β-cells periodically pass through these states, which in scRNA-seq are visualized as subpopulations of cells. Previously, a FTH1 subpopulation of β-cells was identified with implications in the UPR 57. In agreement with those results, we found FTH1 was upregulated in ND-βobese-DEGAS cells. Therefore, another possibility is that the UPR induced in ND-βobese-DEGAS cells marks a population with lower functionality but higher proliferative potential to combat the insulin resistant obese state, and the relatively lower UPR in T2D-βobese-DEGAS cells possibly correlates with enhanced functionality in established T2D. Concordantly, T2D-βobese-DEGAS cells had enrichment of hormone secretion genes (e.g. SYT7, G6PC2, NEUROD1, UCN3, FFAR1) in our pathway analysis (Fig 6E,F). In a study of human, mouse, and pig islet scRNA-seq data, subpopulations of ‘stressed’ β-cells were identified that exhibited enriched hallmark pathways similarly to what we observe in ND-βobese-DEGAS cells 6. Thus, multiple studies including our current analysis support the existence of subpopulations of stressed β-cells, whether transient or stable, that could in principle be targeted by therapeutics.

We also looked for enrichment of potential secreted biomarkers in high-scoring βT2D-DEGAS and T2D-βobese-DEGAS cells and identified LRPAP1 and C1QL1. LRPAP1 encodes the LDL receptor-related protein-associated protein 1 (LRPAP) and is enriched in this subpopulation. LRPAP is ubiquitously expressed and is predicted to be an ER resident protein 58. LRPAP appears to be heterogeneously expressed among islet cells and was detected in human plasma in the in the Human Protein Atlas 59. C1QL1 encodes a secreted peptide that is highly-expressed in human islets 60 and was enriched in βT2D-DEGAS cells, although little else is known about its role in β-cells. Increased C1QL1 release could potentially signal through its GPCR BAI3 to suppress insulin secretion of surrounding β-cells, as has been shown for C1QL3 61.

Potential for DEGAS in identifying β-cell heterogeneity markers

Subsets or subpopulations of β-cells are an emergent property of β-cell heterogeneity. Two of the most widely used scRNA-seq datasets have also identified significant heterogeneity within pancreatic cell types, notably β-cells 7, 57. Various markers have been identified to define these β-cell subpopulations, including expression of ST8SIA1 and CD9 62 or Flattop expression 63. Additionally, UCN3 marks mature β cells 64 and RBP4+ β-cells correlated with reduced function 65. Functional heterogeneity has also been described as in the case of leader/first-responder β cells 66, 67 or hub β cells 68, and has been linked to transcription factors including PDX169 and HNF1A 46. Nevertheless, cellular heterogeneity within a single cell type is complicated and there is uncertainty in the single cell analytics field about what constitutes a stable cell type versus a transient cell state 70. Tying transcriptionally distinct clusters to function or morphology may be key to making this distinction 71. By inferring disease outcomes in high resolution cellular subtypes, application of DEGAS can help to inform this debate.

In our analysis, DLK1 expression was enriched in high-scoring βT2D-DEGAS and in T2D-βobese-DEGAS cells. We also observed heterogeneous DLK1 staining of β-cells from human pancreas sections. DLK1 was previously noted to be enriched in subpopulations of β-cells 1, 2, 4, 8, 72, however its heterogeneity in transcript and protein abundance in non-diabetic versus T2D human islet β-cells had not been described until now. DLK1 is a maternally-imprinted gene73 with described roles in Notch signaling7476 and glucose homeostasis77. Supporting an active role β-cell disease, a recent preprint reported DLK1 was required for proper maturation, function, and stress resilience of β-like cells differentiated from human embryonic stem cells 78. Our results combined with the published data indicate DLK1 as a significant player in β-cell function in diabetes. DLK1 may serve as a surface marker of, or be released from, stressed or at-risk β-cells. However, increased sample sizes and in-depth analyses of both non-diabetic, T1D, and T2D human islets will be required to better describe the differences seen in transcript and protein abundance of DLK1 between β-cells within the same islet, between islets of the same and different donors, and across different stages of diabetes development.

Applying tools like DEGAS to available data will increase our understanding of the underlying β-cell features associated with progression to T2D or with an obese non-T2D state. Expanded DEGAS analyses will be needed to include scRNA-seq data from islets subjected to various models of T1D/T2D (e.g. cytokines, glucolipotoxicity). Major questions include whether the gene candidates identified by DEGAS are protective, disease-causing, or simply markers in β-cell subpopulations. In the future, these questions may be addressed by functional validation studies or observations in genome-wide association studies.

Limitations of the study and future directions

We think the approach of joining islet omics data to deep learning and artificial intelligence is an area in need of increased attention; however, we appreciate that our current study has limitations. First, our analysis utilized a single bulk RNA-seq human islet dataset, although the Marselli study contains a relatively large number of islet donors compared to most available datasets. Second, we merged only five scRNA-seq human islet datasets for this application of DEGAS. The datasets were chosen because of the extensive meta-analyses to which they have already been subjected 8, and their use as benchmarking data in many scRNA-seq analysis tools 7981. scRNA-seq has a limitation on the number of detectable genes in each cell, as compared to bulk RNA-seq samples. While these limitations may eventually be mitigated by improved technologies, we can begin to overcome these by merging a larger number of human islet bulk and scRNA-seq studies which contain more donors. Increased sample sizes and study variables come with their own challenges, however the future gene candidates identified by DEGAS will be of even higher confidence. Additionally, scRNA-seq is a snapshot in time and it may be tenuous to claim that a particular cell state is transient or persistent. To support a claim that a given subset/subtype of β-cells represents an actual stable population in scRNA-seq data requires at least finding that the cells comprising the candidate population occur across multiple individual donors. In future studies, our use of the diabetes field’s considerable investment in islet transcriptomics data and state-of-the-art cell prioritization tools will enable simultaneous identification of disease-associated cellular subtypes and associated biomarkers of function and dysfunction. Additionally, increasing the number of human pancreas donor tissue samples for high-content image analysis will improve confidence in candidate gene validation.

Potential implications

DEGAS has the potential to be applied to even larger mergers of single cell data (> 1 million cells) using the newly developed DEGAS atlas implementations. A vast amount of human (and other species) islet single cell transcriptomics data is publicly available, but often requires substantial reformatting of metadata and realigning of reads to be harmonized. As our endeavors and those of others proceed, DEGAS will lead to even higher confidence predictions of β-cell subpopulations. It is also apparent from our DEGAS analyses that other non-β-cells are highlighted in the islet single cell map for associations with T2D and obesity. Although outside the scope of this work, further exploration of our current DEGAS analyses, or of analyses using larger single cell integrations, will have implications for the other major islet and pancreas cell types, including α-cells, δ-cells and PP-cells. For example, PP cells have been shown to have a role in pancreatogenic diabetes as opposed to T2D 82, and there appear to be subpopulations of high and low scoring PPT2D-DEGAS cells within the DEGAS analysis (Fig 3A). Our successful application of deep transfer learning in human islet data opens the possibility of predicting subtypes of β-cells in other diseases, like T1D and congenital hyperinsulinism, to find potential biomarkers and therapeutic candidates. In some rare disease, like congenital hyperinsulinism, single cell data is not available, however spatial transcriptomics (ST) could be applied to FFPE sections. DEGAS has already been implemented to analyze ST data in prostate cancer 83, and will be applicable to human pancreas as ST technology continues to advance. It is important to note that DEGAS is just one of many machine learning approaches to analyze transcriptomic data. It will be important to include DEGAS in combination with both linear and other non-linear models to capture as many relevant gene candidates as possible.

Methods

Transfer learning using DEGAS on human islet transcriptomic data

The integrated single cell and bulk datasets were processed with the DEGAS (v1.0) pipeline (https://github.com/tsteelejohnson91/DEGAS) 31 to calculate disease risk scores associated with T2D status or BMI status. The bulk expression data were scaled and normalized prior to DEGAS analysis using the preprocessCounts function. The merged scRNA-seq expression matrix, the bulk expression matrix, and donor sample labels (matched with the bulk samples) were used as input. The intersection of highly variable genes between scRNA-seq and bulk expression data were used for further analysis. The DEGAS model was trained and predicted on the formatted data. Updated scripts and instructions for running DEGAS are available on GitHub (version 1.0) (https://github.com/tsteelejohnson91/DEGAS). DEGAS for T2D and BMI were run independently.

In DEGAS for T2D, the donor labels were “normal” vs “T2D.” In DEGAS for BMI metadata, donors from bulk RNA-seq were categorized and labeled as “lean” (<25 BMI), “overweight” (25-30 BMI) and “obese” (>30 BMI) according to CDC guidelines. The models were trained to calculate T2D-association scores and BMI-association scores for each of the respective categories. We identified differentially-expressed genes using FindMarkers function in β-cells with high vs low T2D-association scores (βT2D-DEGAS) or high obesity scores (βobese-DEGAS) among healthy or T2D donor single-cells.

Subclustering of β-cells and post-DEGAS scRNA-seq analysis

We isolated the β-cell cluster using subset function and reclustered them seperately from the other islet cell types. The β-cells were further classified by different thresholding parameters including median and quantiles. For quantile and median thresholds, we generated high, medium and low, as well upper 50% and lower 50% of median. We used the FindMarkers function to identify differentially-expressed genes in high vs. low (Fig 4D), high vs. medium+low, and median (upper 50% vs lower 50%) (Table S3).

Pathway analysis

To analyze differentially-expressed genes that were either up- or down-regulated between clusters, enrichment analyses and plot generation were completed using ClusterProfiler, EnhancedVolcano, and ggplot2 R packages in R version 4.2.2 and RStudio. Enrichment of gene sets within the the Biological Process GO terms are shown and cluster profiler outputs are provided in Table S4. Venn diagrams for comparing DEGAS and RePACT hit genes (Table S5) were created using https://molbiotools.com/listcompare.php. Gene set enrichment analysis was run using GSEA software 84, 85 downloaded from www.gsea-msigdb.org/gsea and the MSigDB Hallmark Gene Set 86 from the Broad Institute. For bulk RNA-seq analysis, the edgeR results containing all expressed genes was used as input to run standard GSEA. For GSEA analysis of comparisons between subpopluations of β-cells from scRNA-seq data, a ranked list based on log2fold-change was generated without a cutoff for adjusted P value. GSEAPreranked mode used the ’classic’ enrichnment statistic for 1000 permutations.

Human pancreas tissue staining and microscopy

Formalin-fixed paraffin-embedded (FFPE) de-identified human pancreas tissue were obtained through the NDRI (Table S6). Tissue was processed into 5 µm sections and mounted on glass slides by the Indiana University School of Medicine Histology Lab Service Core. Slides were deparaffinized by xylene and ethanol washes. Antigen retrieval was performed by heating for 40 min in an Epitope Retrival Steamer with slides submerged in Epitope Retrieval Solution (IHC-Tek). Subsequently, slides were placed onto disposable immunostaining coverplates and inserted into the Sequenza slide rack (Ted Pella / EMS) for washing, blocking, and antibody incubations. After three 10 min washes in IHC wash buffer (0.1% Triton X-100 and 0.01% sodium azide in PBS pH 7.4), slides were blocked for 1 h at room temperature in normal donkey serum (NDS) block solution (5% donkey serum, 1% bovine serum albumin, 0.3% Triton X-100, 0.05% Tween-20, and 0.05% sodium azide in PBS pH 7.4). Slides were then incubated overnight at 4°C with primary antibodies diluted in NDS block solution (Table S7). After three washes in IHC wash buffer 200 µL each), slides were incubated in secondary antibodies in NDS block solution for 1 h at room temperature. The washed slides were mounted in polyvinyl alcohol (PVA) mounting medium (5% PVA, 10% glycerol, 50mM Tris pH 9.0, 0.01% sodium azide) with DAPI (300 nM) added and imaged on Zeiss LSM710 confocal microscope equipped with a Plan-Apochromat 20x/0.8 objective (#420650-9901). Images were processed in the Zeiss Zen software to add scale bars, set coloration for channels, and generate merged images. Scale bars indicate 50 µm.

Statistical Analysis

In edgeR bulk RNAseq differential gene expression analysis, the likelihood ratio test was applied. In the FindMarkers function in Seurat the default Wilcoxon Rank Sum test was used for differential gene expression analysis. Cut-offs for signifcant differentially expressed genes were set at |log2fold-change| > 0.58 and adjusted p-value < 0.05.

Availability of source code and requirements

Data availability

The raw data sets supporting the results of this article are available in the NCBI GEO and the ArrayExpress repositories under the following persistent identifiers (also shown in Table S1): bulk RNA-seq: GSE159984; scRNA-seq: GSE84133, GSE85241, GSE86469, E-MTAB-5061, GSE81608.

Declarations

List of abbreviations

  • ND: non-diabetic

  • T1D: type 1 diabetes

  • T2D: type 2 diabetes

  • RNA-seq: RNA sequencing

  • scRNA-seq: single cell RNA sequencing

  • FFPE: formalin-fixed paraffin-embedded

  • DEGAS: Diagnostic Evidence GAuge of Single cells

  • DLK1: Delta Like Non-Canonical Notch Ligand 1

  • MANE: Matched Annotation from NCBI and EBI

  • RRID: Research Resource identifer

  • GSEA: gene set enrichment analysis

  • MSigDB: Molecular signatures database

  • BMI: body mass index

  • UMAP: uniform manifold approximation and projection

  • βT2D-DEGAS: T2D-DEGAS disease assocation score for beta cells

  • βobese-DEGAS: obese-DEGAS disease assocation score for beta cells

  • ND-βobese-DEGAS: obese-DEGAS disease assocation score for beta cells from ND donors

  • T2D-βobese-DEGAS: obese-DEGAS disease assocation score for beta cells from T2D donors

  • GO: gene ontology

  • RePACT: regressing principle components for the assembly of continuous trajectory

  • NDS: normal donkey serum

  • PBS: phosphate buffered saline

Competing interests

The authors declare that they have no competing interests.

Funding

This work was supported by internal funds at the Indiana Biosciences Research Institute (M.A.K. and T.S.J.), AnalytixIN (T.S.J.), Indiana University Precision Health Initiative (T.S.J), 1R01GM148970 (T.S.J), 1R21CA264339 (T.S.J).’

Authors’ contributions

G.R., R.S., O.L., S.R., S.D.M., A.M.M., T.S.J., and M.A.K. performed data analyses. G.R. and D.R. performed experiments. M.A.K. and T.S.J. conceived and initiated this project. M.A.K. and T.S.J. supervised the project. A.M.M. provided processed data files. G.R., T.S.J. and M.A.K. wrote the manuscript. All authors read and approved the manuscript.

Acknowledgements

We thank Dr. Mark Huising for helpful conversations and providing access to processed data. Thank you to Dr. Andrew Templin for helpful conversations and critical review of the manuscript. Thank you to Dr. Anthony Piron at Université Libre de Bruxelles for assistance with metadata mapping for GSE159984. This work was supported by the Histology Core of the Indiana Center for Musculoskeletal Health at IU School of Medicine and the Bone and Body Composition Core of the Indiana Clinical Translational Sciences Institute (CTSI). A specific thanks to Drew Brown in the Histology Lab Service Core for assisting with pancreas sectioning.

Supplemental Information

Islet Gene View results of DEGs found in both Marselli et al. and Asplund et al.

FFAR4 (A), UCN5D (B), SFRP4 (C), and PODN (D) were each significantly regulated in T2D vs. non-T2D human islet RNA-seq data in Marselli et al. and Asplund et al. Plots shown were generated using Islet Gene View browser (https://mae.crc.med.lu.se/IsletGeneView/). FFAR4 and UCN5D were both down-regulated in T2D and SFRP4 and PODN were both up-regulated.

Human islet single cell RNA-seq clusters and diabetes marker genes for all cells and β-cells.

A) Seurat clustering of merged human islet scRNA-seq data. B) Study identifiers overlaid on single cell plot. C) Differential gene expression analysis comparing all cells from T2D donors vs. non-diabetic (ND) donors. D) β-cells reclustered and labeled based on donor T2D status. E) Differential expression analysis of β-cells from T2D vs. ND donors.

Thresholding analysis for βT2D-DEGAS scores and differential gene expression analysis and plots.

A) Quantile thresholding and volcano plot comparing the upper 20% of βT2D-DEGAS scores to the lower 80%. B) Median thresholding and volcano plot comparing the upper 50% vs lower 50% of βT2D-DEGAS scoring cells. C) Gene expression overlays on β-cells of selected genes that drive enrichment of GO terms and GSEA categories. D) GSEA results for high βT2D-DEGAS scoring cells. E) GSEA results of low-scoring βT2D-DEGAS scoring cells.

Score overlay on all islet cells for βlean-DEGAS (A) and βoverweight-DEGAS (B). C,D) Similar score overlays are also shown for β-cells alone. E) Median thresholding of βobese-DEGAS score. F) Differential expression analysis for high vs low βobese-DEGAS scores based on median threshold. G) Subsetting highest-scoring cells based on βobese-DEGAS score. H) Gene expression overlays on β-cell UMAP. I) GSEA results for ND-βobese-DEGAS cells. J) GSEA results for T2D-βobese-DEGAS cells.

Overlap of RePACT trajectory genes with genes identified by DEGAS in human islet scRNA-seq data.

Genes from βT2D-DEGAS (A), or ND-βobese-DEGAS and T2D-βobese-DEGAS (B) that had adjusted p-values < 0.05 and |log2fold-changed| > 0.58 were selected for comparison to Fang et al. Table S4 T2D and obesity trajectory genes, respectively.

Additional Files

Table S1. List of bulk and single cell transcriptomic datasets.

Table S2. Bulk RNA-seq metadata, analysis, and comparisons to publicly-available datasets.

Table S3. Differential gene expression in DEGAS-identified β-cell subpopulations.

Table S4. Pathway analysis results.

Table S5. FFPE human pancreas metadata.

Table S6. Antibodies.

Table S7. Comparison of DEGAS T2D and obesity genes to Fang et al. RePACT trajectory genes.