Knowledge synthesis of 100 million biomedical documents augments the deep expression profiling of coronavirus receptors

9 figures, 1 table and 2 additional files

Figures

Figure 1 with 1 supplement
Knowledge synthesis and the nferX Single Cell resource.

(A) Knowledge synthesis: capturing association between concepts from over 100 million documents. Schematic shows the workflow for generating literature-derived associations between phrases. Local score and global score are defined and the types of literature-derived associations are shown for combinations of high and low local and global scores. (B) Datasets enabling knowledge synthesis-powered scRNAseq analysis platform (https://academia.nferx.com/). Single-cell RNAseq data was obtained from publicly available human and mouse single-cell RNA-seq datasets. Bulk RNA-seq data was obtained from Gene Expression Omnibus (GEO) and the Genotype Tissue Expression (GTEx) project portal. Protein-level expression of coronavirus receptors was assessed using a collection of immunohistochemistry (IHC) images and tissue proteomics datasets from the Human Protein Atlas and the Human Proteome Map. Literature-derived association scores are obtained from over 100 million biomedical documents. (C) Highlighting selected tissues and cell types identified by one or more modalities to express ACE2, the putative receptor of SARS-CoV-2 spike protein. Image template: https://www.proteomicsdb.org/.

Figure 1—figure supplement 1
Validation of metrics used to assess literature-derived associations.

(A–B) One-dimensional logistic model predicting true vs random concept pair associations from (A) cosine score and (B) Exponential Local Score. The extent of separation between the green true associations and the red random associations indicates the extent to which each score captures known concept associations. (C) Normalized histograms of co-occurrence counts (on logarithmic scale) for high-cosine vs low-cosine token pairs. (D) Distribution of cosines between gene-disease token vector pairs vs null distributions of cosines between pairs of random 300-d vectors. (E) Null cosine distribution between two random vectors, as the dimension of the vectors varies.
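The null distribution in panel (E) can be approximated with a small simulation. The sketch below is only an illustration, not the authors' procedure: it assumes random vectors with independent standard normal components (the caption does not say how the random vectors were drawn), and the names `cosine`, `null_cosines`, and `null_spread` are hypothetical.

```python
import math
import random
import statistics


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def null_cosines(dim, n_pairs, rng):
    """Cosines between pairs of random vectors (assumed i.i.d. standard normal)."""
    values = []
    for _ in range(n_pairs):
        u = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        values.append(cosine(u, v))
    return values


rng = random.Random(0)  # fixed seed so the sketch is reproducible
null_spread = {}
for dim in (10, 50, 300):
    null_spread[dim] = statistics.pstdev(null_cosines(dim, 1000, rng))
    # Compare the simulated spread with the 1/sqrt(dim) reference value.
    print(f"dim={dim}: sd={null_spread[dim]:.3f}, 1/sqrt(dim)={1 / math.sqrt(dim):.3f}")
```

Under this assumption the null cosines concentrate around zero as the dimension grows, so a given cosine between two 300-d token vectors is far less likely to arise by chance than the same cosine in a low-dimensional space.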

Figure 2 with 21 supplements
Triangulation of knowledge synthesis with ACE2 expression profile by scRNAseq across cells and tissues.

Scatterplot shows comparison of percentage of cells with non-zero expression (x-axis) against literature-derived associations: local score (y-axis and size of circles) and global score (transparency of circles). Data includes cell types identified from ~1.8 million human cells and 462,000 mouse cells cumulatively from 72 studies.
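The x-axis quantity, the percentage of cells with non-zero expression, can be computed per cell type as in the minimal sketch below. The input layout (a dictionary of per-cell expression values keyed by cell type), the function name `percent_nonzero`, and the toy numbers are all assumptions made for illustration; they are not data from the study.

```python
def percent_nonzero(expression_by_cell_type):
    """Map each cell type to the percentage of its cells with non-zero expression."""
    result = {}
    for cell_type, values in expression_by_cell_type.items():
        expressing = sum(1 for value in values if value > 0)
        result[cell_type] = 100.0 * expressing / len(values)
    return result


# Invented toy values, one number per cell (hypothetical cell types).
toy_expression = {
    "cell_type_1": [0.0, 2.1, 3.4, 0.0, 1.2],
    "cell_type_2": [0.0, 0.0, 0.0, 0.0],
}
pct = percent_nonzero(toy_expression)
print(pct)
```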

Figure 2—figure supplement 1
ACE2 expression in cell types from murine and human pancreas by scRNAseq.

(A) Ace2 expression in pancreatic cell types from the Tabula Muris study (Tabula Muris Consortium et al., 2018), including polypeptide (gamma) and alpha cells. (B) ACE2 expression in pancreatic cell types from human samples (Muraro et al., 2016; Segerstolpe et al., 2016; Grün et al., 2016), with notable absence in gamma and alpha cells.

Figure 2—figure supplement 2
Multimodal analysis of ACE2 expression using bulk RNA-seq, proteomics, and IHC.

(A) ACE2 expression by bulk RNA-seq across healthy tissues from GTEx. Gray box highlights the tissues with highest expression including small intestine and kidney; green arrow indicates lung. (B–C) Expression of ACE2 and other coronavirus receptors by proteomics across over 20 healthy human tissues from the Human Proteome Map (B) and the Human Protein Atlas (C). (D) IHC stained images showing ACE2 expression in the kidney, small intestine, and colon, with minimal or low expression detected in respiratory tissues.

Figure 2—figure supplement 3
ACE2, DPP4, and ANPEP show similar expression profiles in the renal proximal tubule epithelial cells by scRNAseq and bulk RNA-seq.

(A–C) Cosine similarities between gene expression vectors of all genes and ACE2 (A), DPP4 (B), and ANPEP (C) among all annotated proximal tubule epithelial cells from a human kidney scRNAseq dataset (Stewart et al., 2019). ACE2, DPP4, and ANPEP are reciprocally among the most similar genes to each other. (D–F) Analysis of Pearson correlation coefficients between all genes and ACE2 (D), DPP4 (E), and ANPEP (F) from human kidney cortex samples (GTEx; n = 85). ACE2, DPP4, and ANPEP are reciprocally among the most strongly correlated genes to each other.

Figure 2—figure supplement 4
Overview of nferX Single Cell platform functionality.

Use case shows results for a query of Ins1 in the Tabula Muris study. (A) Violin Plot section shows that pancreatic beta cells express Ins1 at the highest level among 148 annotated cell types from 18 tissues. Summary statistics for each row indicate the percent of cells within a given cluster expressing Ins1, the mean expression of Ins1 in that cluster, and a metric of expression specificity (Cohen’s d) for each cluster. On the left side of the table, ‘Signals’ columns highlight literature-derived associations between Ins1 and the tissue (‘pancreas’) or cell type (e.g. ‘pancreatic beta cells’). (B) Dimensionality reduction-based visualization of all ~100,600 cells from Tabula Muris. Left-sided plot colors cells (points) based on Ins1 expression in the individual cell; right-sided plot colors cells based on a selected metadata variable (here, tissue of origin). (C) List of cluster-defining genes for a selected cell population; by default, this is any cell population which highly expresses the query gene. Here, cluster-defining genes for pancreatic beta cells include Ins1, Ins2, and Iapp. Blue box highlights functionality to triangulate such cluster-defining gene lists with any literature query of interest.

Figure 2—figure supplement 5
Assessment of DPP4, ANPEP, and TMPRSS2 across healthy tissues using bulk RNA-seq and IHC.

(A–C) GTEx tissues with highest expression observed for DPP4 (A), ANPEP (B), and TMPRSS2 (C). (D) Assessment of ACE2 expression along with other coronavirus receptors in non-GTEx tissues from the Janssen BodyMap dataset (bulk RNA-seq). (E) IHC staining showing that DPP4 and ANPEP are apically expressed in the small intestine (top), while DPP4, ANPEP, and TMPRSS2 are all expressed apically in renal tubular cells (bottom).

Figure 2—figure supplement 6
Single-cell RNAseq analysis of coronavirus receptors in the adult and fetal human kidney.

UMAP-based dimensionality reduction plot denoting cell types (A, G) and corresponding expression levels of ACE2, DPP4, ANPEP and TMPRSS2 (B, H) from adult (A–B) and fetal (G–H) human kidney samples (Stewart et al., 2019). (C–F) Violin plot visualization of gene expression for ACE2 (C), TMPRSS2 (D), ANPEP (E), and DPP4 (F) in adult human kidney scRNAseq dataset. Automated literature synthesis highlights strong associations of ACE2, DPP4, and ANPEP to the kidney.

Figure 2—figure supplement 7
Single-cell RNAseq analysis of coronavirus receptors in the human and murine heart.

UMAP-based dimensionality reduction plot denoting cell types (A, C, E) and corresponding expression levels of ACE2, DPP4, ANPEP and TMPRSS2 (B, D, F) from multiple studies: (A–B) human heart (Wang et al., 2020b), (C–D) murine heart from Tabula Muris (Tabula Muris Consortium et al., 2018), and (E–F) murine heart from Mouse Cell Atlas (Han et al., 2018).

Figure 2—figure supplement 8
Single-cell RNAseq analysis of coronavirus receptors in adipose tissue.

UMAP-based dimensionality reduction plot denoting cell types (A, C, E) and corresponding expression levels of Ace2, Dpp4, Anpep and Tmprss2 (B, D, F) from multiple studies: (A–B) adipose stromovascular fraction from mice treated with PBS or a beta-3 adrenergic agonist (Rajbhandari and Arneson, 2019), (C–D) murine adipose stromovascular fraction from Tabula Muris (Tabula Muris Consortium et al., 2018), and (E–F) adipose tissue from mice subjected to 24 hr of cold shock and processed for single nucleus RNA-sequencing (Rajbhandari and Arneson, 2019). (G–H) Identification of adipocytes as the most likely ACE2-expressing cell type in adipose tissue based on literature associations. The query shown is from the nferX Signals platform and effectively extracts any textual fragments which contain [‘ACE2’ or its gene synonyms] AND [displayed terms related to gene expression/protein detection] AND [‘adipose’ OR 'fat']. Among these textual fragments, we extract the tokens which represent cell types and calculate enrichments of each cell type among these fragments, normalized by the number of occurrences of this cell type elsewhere in the literature.
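The enrichment step described for panels (G–H) can be sketched roughly as follows. The caption does not give the exact formula, so this example assumes a simple ratio: the per-fragment frequency of a cell type among matching fragments divided by its per-fragment frequency among all other fragments, with add-one smoothing. The function name, the pre-tokenized input format, and the toy fragments are hypothetical.

```python
from collections import Counter


def cell_type_enrichment(matching_fragments, other_fragments):
    """Ratio of each cell type's frequency in matching fragments to its
    frequency in all other fragments (assumed formula, for illustration)."""
    in_counts = Counter(ct for frag in matching_fragments for ct in set(frag))
    out_counts = Counter(ct for frag in other_fragments for ct in set(frag))
    enrichment = {}
    for cell_type, n_in in in_counts.items():
        freq_in = n_in / len(matching_fragments)
        # Add-one smoothing so cell types unseen elsewhere do not divide by zero.
        freq_out = (out_counts[cell_type] + 1) / (len(other_fragments) + 1)
        enrichment[cell_type] = freq_in / freq_out
    return enrichment


# Invented fragments, each already reduced to the cell-type tokens it mentions.
matching = [["adipocyte"], ["adipocyte", "macrophage"], ["adipocyte"]]
others = [["macrophage"], ["macrophage"], ["adipocyte"], ["endothelial cell"], []]

scores = cell_type_enrichment(matching, others)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

Normalizing by the frequency elsewhere keeps cell types that are common throughout the literature from dominating the ranking simply because they are mentioned often.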

Figure 2—figure supplement 9
Single-cell RNAseq analysis of coronavirus receptors in the human and murine testis.

UMAP-based dimensionality reduction plot denoting cell types (A, C) and corresponding expression levels of ACE2, DPP4, ANPEP and TMPRSS2 (B, D) from multiple studies: (A–B) adult human testis (Guo et al., 2018), and (C–D) murine testis from Mouse Cell Atlas (Han et al., 2018).

Figure 2—figure supplement 10
Single-cell RNAseq analysis of coronavirus receptors in the human and murine ovary.

UMAP-based dimensionality reduction plot denoting cell types (A, C) and corresponding expression levels of ACE2, DPP4, ANPEP and TMPRSS2 (B, D) from multiple studies: (A–B) adult human ovary (Fan et al., 2019), and (C–D) murine ovary from Mouse Cell Atlas (Han et al., 2018).

Figure 2—figure supplement 11
IHC images of coronavirus receptors in healthy pancreas and liver samples from the Human Protein Atlas.

(A–B) ACE2 expression on the apical surface of gallbladder epithelial cells (A) and on ductal surfaces in the pancreas (B). (C) ANPEP is expressed on apical membranes in pancreatic acini and ducts. (D) TMPRSS2 is expressed on apical membranes in pancreatic acini and ducts. (E) DPP4 is expressed weakly in the pancreas, both on ductal surfaces and in endocrine islets. (F) DPP4 expression in healthy liver. (G) ANPEP expression in healthy liver; ANPEP appears to be the most strongly expressed coronavirus receptor in the human liver.

Figure 2—figure supplement 12
Single-cell RNAseq analysis of coronavirus receptors in the human liver and pancreas.

UMAP-based dimensionality reduction plot denoting cell types (A, C) and corresponding expression levels of ACE2, DPP4, ANPEP and TMPRSS2 (B, D) from multiple studies: (A–B) human liver (Aizarani et al., 2019), and (C–D) three integrated studies of human pancreas (Muraro et al., 2016; Segerstolpe et al., 2016; Grün et al., 2016).

Figure 2—figure supplement 13
Single-cell RNAseq analysis of coronavirus receptors in human blood, spleen, and bone marrow.

UMAP-based dimensionality reduction plot denoting cell types (A, C, E) and corresponding expression levels of ACE2, DPP4, ANPEP and TMPRSS2 (B, D, F) from multiple studies: (A–B) human peripheral blood mononuclear cells (Single Cell Portal, 2020b), (C–D) human spleen (Madissoon et al., 2020), and (E–F) human bone marrow (Data Browser HCA, 2020).

Figure 2—figure supplement 14
Single-cell RNAseq analysis of coronavirus receptors in murine spleen, bone marrow, and thymus.

UMAP-based dimensionality reduction plot denoting cell types (A, C, E) and corresponding expression levels of ACE2, DPP4, ANPEP and TMPRSS2 (B, D, F) from multiple tissues in the Tabula Muris study (Tabula Muris Consortium et al., 2018): (A–B) spleen, (C–D) bone marrow, (E–F) thymus.

Figure 2—figure supplement 15
Single-cell RNAseq analysis of coronavirus receptors in human and murine bladder and prostate.

UMAP-based dimensionality reduction plot denoting cell types (A, C, E, G) and corresponding expression levels of ACE2, DPP4, ANPEP and TMPRSS2 (B, D, F, H) from multiple studies: (A–B) human bladder (Yu et al., 2019), (C–D) murine bladder from Tabula Muris (Tabula Muris Consortium et al., 2018), (E–F) murine bladder from Mouse Cell Atlas (Han et al., 2018), and (G–H) human prostate and prostatic urethra.

Figure 2—figure supplement 16
Single-cell RNAseq analysis of coronavirus receptors in the murine uterus.

UMAP-based dimensionality reduction plot denoting cell types (A) and corresponding expression levels of ACE2, DPP4, ANPEP and TMPRSS2 (B) in uterus-derived cells from the Mouse Cell Atlas (Han et al., 2018).

Figure 2—figure supplement 17
Single-cell RNAseq analysis of coronavirus receptors in human and murine central nervous system tissues.

UMAP-based dimensionality reduction plot denoting cell types (A, C, E) and corresponding expression levels of ACE2, DPP4, ANPEP and TMPRSS2 (B, D, F) in CNS samples from multiple studies: (A–B) murine brain from Tabula Muris (Tabula Muris Consortium et al., 2018), (C–D) murine brain from Mouse Cell Atlas (Han et al., 2018), and (E–F) human retina (Menon et al., 2019).

Figure 2—figure supplement 18
IHC analysis of DPP4 expression in human brain cortex.

Images from the Human Protein Atlas showing brain cortex samples from three different individuals stained to assess DPP4 expression. Staining is notably absent from all cell types in all samples tested.

Figure 2—figure supplement 19
Association of age with ACE2 expression and co-administered drugs with COVID-19 outcomes.

(A) Among GTEx transverse colon samples (n = 405), ACE2 expression tends to be higher in samples derived from younger individuals. (B) Among GTEx gastroesophageal junction samples, ACE2 expression tends to be higher in samples derived from older individuals. (C) In an analysis of differential adverse events between ACE inhibitors and other classes of antihypertensive medications (example shown for beta-blockers), real-world evidence shows that patients taking ACE inhibitors are more likely to experience angioedema of various tissues, including the small intestine and epiglottis.

Figure 2—figure supplement 20
Characterization of oral epithelium cluster-defining genes from Xu et al., 2020b.

From a non-public scRNAseq study, oral mucosal epithelial cells (including tongue-derived epithelial cells) were defined by expression of SFN, KRT6A, and KRT10. Assessment of SFN (A), KRT6A (B), and KRT10 (C) expression in tongue cells sequenced in the Tabula Muris study (Tabula Muris Consortium et al., 2018). Each gene is most strongly expressed in murine tongue keratinocytes, with expression in basal epidermal cells as well.

Figure 2—figure supplement 21
Expression of ACE2 in respiratory tract-associated samples from GEO in comparison to lung (GTEx).

Plots compare the distribution of ACE2 expression between lung samples from GTEx (purple) and samples from four GEO studies (green). The GEO identifier for each comparison is shown.

Figure 3 with 2 supplements
Triangulation of ACE2 expression in the respiratory tract with literature-derived insights.

(A) Schematic representation of the respiratory system highlighting key cell types from the nasal cavity, airway and alveoli. Scatterplot shows comparison of percentage of cells with non-zero expression (x-axis) from eight single-cell studies against literature-derived associations: local score (y-axis and size of circles) and global score (transparency of circles). (B) Assessing literature-based and scRNAseq-based associations between ACE2 and respiratory tract cells. On the left, the dimensionality reduction plots show different cell populations associated with lung and olfactory epithelium. On the right, violin plots show the distribution of ACE2 expression levels in selected populations with non-zero expression. The cell types and the literature-derived local and global association scores are shown.

Figure 3—figure supplement 1
ACE2 is strongly correlated to surfactant protein-encoding genes across GTEx lung samples.

Distribution of Pearson correlation coefficients between expression (in TPM) of ACE2 and all other genes detected with a mean of at least 1 TPM in the GTEx lung study (n = 573). The right tail of the distribution (shaded red) includes the top 4% of correlated genes (top). Each gene encoding a known surfactant protein is strongly correlated to ACE2 in lung samples, with its Pearson correlation coefficient falling within this top 4%.
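The analysis in this caption amounts to correlating one gene against every other sufficiently expressed gene and then inspecting the right tail. The sketch below illustrates that idea on synthetic values only; the gene names, sample count, and helper names (`pearson`, `top_fraction`) are hypothetical, and nothing here reproduces GTEx data.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def top_fraction(correlations, fraction=0.04):
    """Genes whose correlation lies in the top `fraction` of the distribution."""
    ranked = sorted(correlations, key=correlations.get, reverse=True)
    n_top = max(1, math.ceil(fraction * len(ranked)))
    return ranked[:n_top]


rng = random.Random(1)
n_samples = 60
query = [rng.uniform(1, 20) for _ in range(n_samples)]  # stand-in for the query gene
tpm = {f"gene_{i}": [rng.uniform(1, 20) for _ in range(n_samples)] for i in range(99)}
# One synthetic gene built to track the query, standing in for a co-regulated gene.
tpm["tracking_gene"] = [2.0 * q + rng.gauss(0.0, 1.0) for q in query]

# Keep only genes with a mean of at least 1 TPM, as in the caption.
expressed = {g: v for g, v in tpm.items() if sum(v) / len(v) >= 1.0}
correlations = {g: pearson(query, v) for g, v in expressed.items()}
top_genes = top_fraction(correlations)
print(top_genes)
```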

Figure 3—figure supplement 2
Negative IHC staining of ACE2 in nasopharynx from Human Protein Atlas.

Shown are two samples from different patients stained with different anti-ACE2 antibodies.

Figure 4
Triangulation of ACE2 expression in the gastrointestinal (GI) tract with literature-derived insights.

(A) Schematic representation of the GI tract highlighting key cell types. Scatterplot shows comparison of percentage of cells with non-zero expression (x-axis) from nine single cell studies against literature-derived associations: local score (y-axis and size of circles) and global score (transparency of circles). (B) Assessing literature-based and scRNAseq-based association between ACE2 and tongue keratinocytes. Violin plot shows distributions of ACE2 expression in keratinocytes. The literature-derived local and global associations between ACE2 and keratinocytes are shown. (C) ACE2 transcriptional expression is correlated to enterocyte maturation. Violin plots show distribution of ACE2 expression levels in enterocytes at different stages of differentiation.

Figure 5 with 1 supplement
Coronavirus receptors share a transcriptional signature correlated to maturation of small intestinal enterocytes.

(A) Distribution of cosine similarity between the ‘gene expression vectors’ of ACE2 and all genes in a scRNAseq study of the murine small intestine. The gene expression vector corresponds to the set of CP10K values for a given gene in each individual cell from the selected populations in the selected study. (B) Genes similar to ACE2 (cosine similarity >0.4) sorted by literature-derived association. Arrow indicates a sort option available on the platform. (C) Transcriptional expression of ANPEP correlated to enterocyte maturation in murine small intestine. Violin plots show distribution of ANPEP expression levels in enterocytes at different stages of differentiation. (D) Transcriptional expression of DPP4 correlated to enterocyte maturation in murine small intestine. Violin plots show distribution of DPP4 expression levels in enterocytes at different stages of differentiation.
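The gene expression vector comparison in panels (A–B) can be illustrated with a small example. This is a sketch under stated assumptions: CP10K is taken to mean counts scaled so that each cell sums to 10,000, the gene names and counts are invented, and only the 0.4 cutoff comes from the caption.

```python
import math


def cp10k(cells):
    """Scale each cell's raw counts to sum to 10,000 (assumed meaning of CP10K)."""
    normalized = []
    for cell in cells:
        total = sum(cell.values())
        normalized.append({gene: 1e4 * count / total for gene, count in cell.items()})
    return normalized


def gene_vector(cells, gene):
    """Expression of one gene across all cells, in cell order."""
    return [cell.get(gene, 0.0) for cell in cells]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Invented raw counts for four cells (hypothetical gene names).
raw_cells = [
    {"gene_a": 4, "gene_b": 8, "gene_c": 0, "other": 88},
    {"gene_a": 2, "gene_b": 5, "gene_c": 0, "other": 93},
    {"gene_a": 0, "gene_b": 0, "gene_c": 9, "other": 91},
    {"gene_a": 0, "gene_b": 1, "gene_c": 6, "other": 93},
]
cells = cp10k(raw_cells)
query = gene_vector(cells, "gene_a")
similarity = {g: cosine(query, gene_vector(cells, g)) for g in ("gene_b", "gene_c")}
similar_genes = [g for g, s in similarity.items() if s > 0.4]  # cutoff from panel (B)
print(similarity, similar_genes)
```

Here `gene_b` is expressed in the same cells as `gene_a` and passes the cutoff, while `gene_c` is expressed in a disjoint set of cells and has a cosine similarity of zero.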

Figure 5—figure supplement 1
Coronavirus receptors show highly correlated expression patterns by single cell and bulk RNA-seq in human small intestine.

(A) Distribution of cosine similarity between the ‘gene expression vectors’ of ACE2 and all genes in a scRNAseq study of the human small intestine. (B) Genes similar to ACE2 sorted by literature-derived association. (C) Gene expression correlations to ACE2 from bulk RNA-sequencing (GTEx) of human small intestine samples.

Author response image 1
Author response image 2
Author response image 3
Author response image 4

Tables

Table 1
Results of evaluation.

Performance of each association score on approximately 2100 disease-gene pairs.

| Association score | Cohen’s d (+) | Mann-Whitney U, normalized (-) | Logistic log loss (-) | Logistic Brier score (-) |
| --- | --- | --- | --- | --- |
| Cosine (w2v) | 1.31 | 0.197 | 0.51 | 0.168 |
| Raw PMI | 2.07 | 0.0953 | 0.374 | 0.116 |
| Raw PMI -log(pctile) | 2.15 | 0.0947 | 0.355 | 0.111 |
| Exp PMI | 2.17 | 0.0897 | 0.356 | 0.109 |
| Exp PMI -log(pctile) | 2.21 | 0.0903 | 0.341 | 0.105 |
| Raw Local Score | 2.35 | 0.0828 | 0.312 | 0.0947 |
| Raw Local Score -log(pctile) | 2.28 | 0.0832 | 0.317 | 0.0963 |
| Exp Local Score | 2.34 | 0.0812 | *0.301* | 0.0915 |
| Exp Local Score -log(pctile) | *2.36* | 0.0811 | 0.308 | 0.093 |
| log(coocc) | 2.24 | 0.097 | 0.348 | 0.105 |
  1. Interpretation of the above table.

    Each row corresponds to an association score, whereas each column corresponds to one of the evaluation metrics. A (+) in the column header means that the higher the evaluation metric value, the better the association score in that row separates the positive and random pairs. A (-) means that a lower evaluation metric value is better. Note that all the metrics are invariant to linear rescaling of the association score; the Mann-Whitney U score is also nonparametric.
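Two of the evaluation metrics can be sketched in a few lines to make the (+) and (-) conventions concrete. The definitions below are assumptions, since the table does not spell them out: Cohen's d uses the pooled standard deviation, and the normalized Mann-Whitney U is taken as the fraction of (positive, random) pairs in which the random pair scores higher, with ties counted as one half. The scores are invented.

```python
import math
import statistics


def cohens_d(positive, random_pairs):
    """Difference in means divided by the pooled standard deviation (higher is better)."""
    n1, n2 = len(positive), len(random_pairs)
    v1, v2 = statistics.variance(positive), statistics.variance(random_pairs)
    pooled = math.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    return (statistics.mean(positive) - statistics.mean(random_pairs)) / pooled


def normalized_u(positive, random_pairs):
    """Fraction of (positive, random) pairs where the random score is higher,
    counting ties as one half (lower is better; assumed normalization)."""
    wins = 0.0
    for p in positive:
        for r in random_pairs:
            if r > p:
                wins += 1.0
            elif r == p:
                wins += 0.5
    return wins / (len(positive) * len(random_pairs))


# Invented association scores for known (positive) and random concept pairs.
positive_scores = [2.9, 3.4, 1.8, 4.1, 2.2]
random_scores = [1.1, 2.0, 0.7, 1.5, 2.5]

d = cohens_d(positive_scores, random_scores)
u = normalized_u(positive_scores, random_scores)

# Apply a positive linear rescaling to every score and recompute both metrics.
rescaled_pos = [10 * s - 3 for s in positive_scores]
rescaled_rand = [10 * s - 3 for s in random_scores]
d_rescaled = cohens_d(rescaled_pos, rescaled_rand)
u_rescaled = normalized_u(rescaled_pos, rescaled_rand)
print(d, d_rescaled, u, u_rescaled)
```

Both metrics are unchanged by the rescaling, which is why association scores on different scales (cosine, PMI, local score) can be compared directly in the table.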

Additional files

Supplementary file 1

List of studies included in the Single Cell Platform.

The set of studies which have currently been analyzed and made accessible for analysis in the Single Cell Platform are listed below.

https://cdn.elifesciences.org/articles/58040/elife-58040-supp1-v2.docx
Transparent reporting form
https://cdn.elifesciences.org/articles/58040/elife-58040-transrepform-v2.pdf

Cite this article

  1. AJ Venkatakrishnan
  2. Arjun Puranik
  3. Akash Anand
  4. David Zemmour
  5. Xiang Yao
  6. Xiaoying Wu
  7. Ramakrishna Chilaka
  8. Dariusz K Murakowski
  9. Kristopher Standish
  10. Bharathwaj Raghunathan
  11. Tyler Wagner
  12. Enrique Garcia-Rivera
  13. Hugo Solomon
  14. Abhinav Garg
  15. Rakesh Barve
  16. Anuli Anyanwu-Ofili
  17. Najat Khan
  18. Venky Soundararajan
(2020)
Knowledge synthesis of 100 million biomedical documents augments the deep expression profiling of coronavirus receptors
eLife 9:e58040.
https://doi.org/10.7554/eLife.58040