ANTIPODE Provides a Global View of Cell Type Homology and Transcriptomic Divergence in the Developing Mammalian Brain

  1. Allen Institute for Brain Science, Seattle, United States
  2. Eli and Edythe Broad Center of Regeneration Medicine and Stem Cell Research, University of California, San Francisco, San Francisco, United States
  3. Department of Neurology, University of California, San Francisco, San Francisco, United States
  4. Department of Neurological Surgery, University of California, San Francisco, San Francisco, United States
  5. Department of Anatomy, University of California, San Francisco, San Francisco, United States
  6. Weill Institute for Neurosciences, University of California, San Francisco, San Francisco, United States
  7. Department of Psychiatry and Behavioral Sciences, University of California, San Francisco, San Francisco, United States
  8. Kavli Institute for Fundamental Neuroscience, University of California, San Francisco, San Francisco, United States
  9. Institute for Human Genetics, University of California, San Francisco, San Francisco, United States
  10. Division of Rheumatology, Department of Medicine, University of California, San Francisco, San Francisco, United States

Peer review process

Not revised: This Reviewed Preprint includes the authors’ original preprint (without revision), an eLife assessment, public reviews, and a provisional response from the authors.


Editors

  • Reviewing Editor
    Genevieve Konopka
    David Geffen School of Medicine at UCLA, Los Angeles, United States of America
  • Senior Editor
    Claude Desplan
    New York University, New York, United States of America

Reviewer #1 (Public review):

Summary:

The integration of single-cell datasets across species is a powerful approach to understanding how cell types and patterns of gene expression have evolved. Current methods to perform such integrations require multiple steps: clustering, the integration itself, and downstream differential expression analysis. In this study, the authors describe a new approach, called ANTIPODE, that combines these steps by integrating deep learning with interpretable decoding and linear modeling. This method builds on previous deep learning approaches to dataset integration, namely scVI and scANVI, that employ a variational autoencoder to model single-cell RNA-sequencing datasets. However, gene expression estimates from these previous methods are challenging to interpret due to non-linear decoding from the modeled latent space. ANTIPODE seeks to address this issue by using a single-layer decoder coupled to a linear model to estimate patterns of differential expression, for example by coexpression module or across cell types.
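The interpretability argument can be illustrated with a minimal NumPy sketch. This is not ANTIPODE's implementation: the dimensions, the random weights, and the helper names `linear_decode` and `nonlinear_decode` are assumptions made purely for illustration. The sketch shows that with a single-layer decoder, a shift along one latent dimension changes each gene's log-scale expression by a fixed, readable amount, whereas with a non-linear decoder the effect of the same shift differs from cell to cell.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells, n_latent, n_genes = 200, 4, 6

# Hypothetical latent embedding (in a VAE this would come from the encoder).
z = rng.normal(size=(n_cells, n_latent))

# Single-layer (linear) decoder: one weight per latent dimension per gene.
W = rng.normal(size=(n_latent, n_genes))
b = rng.normal(size=n_genes)


def linear_decode(z, W, b):
    """Log-scale expression as an affine function of the latent space."""
    return z @ W + b


def nonlinear_decode(z, W1, W2, b):
    """Two-layer decoder with a ReLU, included only for contrast."""
    return np.maximum(z @ W1, 0.0) @ W2 + b


# Shift latent dimension k by delta in every cell.
k, delta = 2, 0.5
z_shift = z.copy()
z_shift[:, k] += delta

# Linear decoder: the change in gene g is exactly delta * W[k, g] in all cells,
# so row k of W can be read directly as a gene-level effect.
effect = linear_decode(z_shift, W, b) - linear_decode(z, W, b)
linear_effect_is_constant = np.allclose(effect, delta * W[k])

# Non-linear decoder: the same shift has a cell-dependent effect,
# so no single weight summarizes it.
W1 = rng.normal(size=(n_latent, 8))
W2 = rng.normal(size=(8, n_genes))
effect_nl = nonlinear_decode(z_shift, W1, W2, b) - nonlinear_decode(z, W1, W2, b)
nonlinear_effect_varies = effect_nl.std(axis=0).max() > 1e-6
```

In this toy setting the rows of `W` play the role of interpretable per-gene effects; how ANTIPODE parameterizes and couples its decoder to the linear model is described in the manuscript, not here.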

The authors apply their framework to a large single-cell RNA-seq dataset (~1.8M cells) containing cells from the central nervous systems of humans, macaques, and mice spanning in utero developmental time points. They identify a consensus set of cell clusters across each species. They find that ANTIPODE performs at least as well as scVI in terms of species integration and batch correction. The authors demonstrate several use cases of this integrated approach by analyzing differential expression that correlates with gene structure, the evolution of expression differences in neuropeptide systems, and the anatomical and phylogenetic variation in neurodevelopmental timing.

Strengths:

ANTIPODE is a welcome addition to techniques that integrate large single-cell RNA-seq datasets across multiple species. The approach's simultaneous inference of cell clusters, integration manifolds, and differential expression should streamline analysis pipelines whose elements are often disjointed and sometimes work at cross purposes.

Weaknesses:

The authors note several limitations to their method that will be targets for future development. First, clustering "resolution" is inferred from the data and cannot be tuned as with other approaches. Second, because of the linear decoding, ANTIPODE does not accommodate combining datasets obtained from different modalities (e.g. single-cell with single-nucleus RNA-seq). Third, as currently implemented, ANTIPODE does not explicitly model phylogenetic relationships. However, the authors describe an extension that could enable this, enhancing the power of multiple species integrations. A weakness with the current manuscript is the organization and readability of the figures. The supplemental figures in particular need to be restructured and reformatted to increase their interpretability.

Reviewer #2 (Public review):

Summary:

This work presents ANTIPODE, a bilinear generative model developed for the simultaneous integration and identification of cell types across species and developmental stages using single-cell RNA-seq data. ANTIPODE is inspired by scANVI, a well-established semi-supervised framework for single-cell transcriptomics. After describing its implementation, the authors use ANTIPODE to integrate data from three species comprising 1,854,767 cells. Then, the authors benchmark ANTIPODE against commonly used methods (scVI, Harmony, and Scanorama) using two snRNA-seq datasets and report comparable or superior performance. They then return to the initial integrated dataset and analyse patterns of gene expression evolution. Finally, they leverage the model to study the "later-is-larger" concept, evaluating the relationship between gene expression, developmental timing and structure size and finding gene expression signatures of this concept.

Strengths:

A major strength of the paper is that ANTIPODE employs a bilinear decoding architecture, which produces more interpretable model parameters while performing at least as well as existing, more opaque nonlinear integration approaches.

The authors demonstrate the utility of ANTIPODE by integrating single-cell mRNA sequencing data from mouse, macaque, and human brains and confirming general principles regarding developmental timing and cell-type-specific gene expression divergence.

They also propose a conceptually interesting framework for studying gene expression evolution: instead of focusing solely on differentially expressed genes between homologous cell types, they jointly model gene expression across developmental states and species-specific divergence, allowing them to define and analyse four categories of differential expression.

Finally, the authors' conclusions are well supported by the analyses presented, although these conclusions remain relatively conservative and reinforce already established principles.

Weaknesses:

A central weakness of the paper is its limited accessibility to a broad audience. Despite attempting to keep computational details in the supplement, the main text still uses substantial jargon, undermining the goal of providing an intuitive explanation of the model. The figures are also insufficiently annotated (e.g., colour schemes in Figure 2 heatmap, bubble plot details in Figure 3, entropy definition in Figure 3), and the figure legends are overly brief and lack essential information. I strongly recommend that the authors revise both text and figures to improve clarity and readability.

Similarly, the materials and methods omit considerable detail about the implementation of the model, the statistical tests used, the calculations of entropy, etc.

The study sits between tool development and biological discovery but does not fully commit to either. As a result, it cannot be evaluated as a full benchmarking study, yet it also does not provide new biological insights that are validated experimentally.

Finally, the GitHub repository for ANTIPODE is not yet functional and lacks documentation or tutorials, making it impossible to assess usability or reproducibility.

Author response:

Public Reviews:

Reviewer #1 (Public review):

We thank this reviewer for their positive feedback regarding the utility of the model and how it may simplify challenging evolutionary analysis.

We acknowledge that the figures are a bit difficult to read, and we will improve annotation and tidiness to make them more accessible to the reader.

We have implemented ANTIPODE version 0.2, which includes regression of gene expression differences on a phylogeny, and have updated the GitHub repository with this “antipode.phylo” module. For this study, flat and phylogenetic regression are equivalent in the three-species case (for example, mouse up is equivalent to primate down), so we do not plan to redo the analyses in the text using this new version.
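The three-species equivalence can be illustrated with a small worked example. This is a sketch under stated assumptions, not the antipode.phylo implementation: it assumes the topology ((human, macaque), mouse), uses simple indicator coding with an intercept, and the expression values in `y` are hypothetical. It shows that a flat species coding and a branch-based coding give identical fitted values, and that raising the mouse branch by `d` produces the same species pattern as raising the baseline by `d` and lowering the primate branch by `d`.

```python
import numpy as np

# One entry per species, in the order: human, macaque, mouse.
intercept = np.array([1.0, 1.0, 1.0])
human = np.array([1.0, 0.0, 0.0])
macaque = np.array([0.0, 1.0, 0.0])
mouse = np.array([0.0, 0.0, 1.0])
primate = human + macaque  # branch shared by human and macaque (assumed topology)

# Flat coding: baseline plus one offset per non-reference species (mouse as reference).
flat = np.column_stack([intercept, human, macaque])
# Branch-based coding: baseline plus primate branch plus human branch.
phylo = np.column_stack([intercept, primate, human])

# Hypothetical per-species mean log expression for a single gene.
y = np.array([2.0, 1.5, 0.5])

coef_flat, *_ = np.linalg.lstsq(flat, y, rcond=None)
coef_phylo, *_ = np.linalg.lstsq(phylo, y, rcond=None)

# Both codings span the same three-dimensional space, so the fits coincide.
same_fit = np.allclose(flat @ coef_flat, phylo @ coef_phylo)

# "Mouse up" by d equals "baseline up by d, primate down by d".
d = 0.7
mouse_up = d * mouse
primate_down = d * intercept - d * primate
branches_interchangeable = np.allclose(mouse_up, primate_down)
```

With more than three species the two codings are no longer interchangeable in this way, since shared internal branches then constrain the fit.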

We have already provided examples of running ANTIPODE on our own and public datasets (https://github.com/mtvector/scANTIPODE/tree/main/real_examples), as well as in-line documentation of classes and functions. However, these may be insufficient for new users. We will provide explanatory tutorials for both to address the reviewer’s concerns. ANTIPODE version 0.1 is currently installable from either GitHub or PyPI.
