Tools and Resources

Machine learning sequence prioritization for cell type-specific enhancer design

Computational Biology Department, School of Computer Science, Carnegie Mellon University, United States
Biological Sciences Department, Mellon College of Science, Carnegie Mellon University, United States
Neuroscience Institute, Carnegie Mellon University, United States
Medical Scientist Training Program, University of Pittsburgh, United States
Department of Psychiatry, Translational Neuroscience Program, University of Pittsburgh, United States
Department of Neurobiology, University of Pittsburgh, United States
Systems Neuroscience Center, Brain Institute, Center for Neuroscience, Center for the Neural Basis of Cognition, United States
Department of Ophthalmology, University of Pittsburgh, United States
Division of Experimental Retinal Therapies, Department of Clinical Sciences & Advanced Medicine, School of Veterinary Medicine, University of Pennsylvania, United States
Department of Bioengineering, University of Pittsburgh, United States

May 16, 2022

Open access
Copyright information

Abstract
Editor's evaluation
Introduction
Results
Discussion
Materials and methods
Data availability
References
Article and author information
Metrics

Abstract

Recent discoveries of extreme cellular diversity in the brain warrant rapid development of technologies to access specific cell populations within heterogeneous tissue. Available approaches for engineering-targeted technologies for new neuron subtypes are low yield, involving intensive transgenic strain or virus screening. Here, we present Specific Nuclear-Anchored Independent Labeling (SNAIL), an improved virus-based strategy for cell labeling and nuclear isolation from heterogeneous tissue. SNAIL works by leveraging machine learning and other computational approaches to identify DNA sequence features that confer cell type-specific gene activation and then make a probe that drives an affinity purification-compatible reporter gene. As a proof of concept, we designed and validated two novel SNAIL probes that target parvalbumin-expressing (PV+) neurons. Nuclear isolation using SNAIL in wild-type mice is sufficient to capture characteristic open chromatin features of PV+ neurons in the cortex, striatum, and external globus pallidus. The SNAIL framework also has high utility for multispecies cell probe engineering; expression from a mouse PV+ SNAIL enhancer sequence was enriched in PV+ neurons of the macaque cortex. Expansion of this technology has broad applications in cell type-specific observation, manipulation, and therapeutics across species and disease models.

Editor's evaluation

This article describes an exciting new approach for tagging and isolation of unique neuronal subpopulations based on machine learning selection of cell-specific enhancer elements in the genome. The article highlights a specific test case of this technology with neurons expressing Parvalbumin, but this method could be applied to any neuronal or even non-neuronal cell type. The tools and overall approach described here will enable cell tagging in model organisms for which transgenic lines are not commonly available or even expression of other transgenes for control of cell function or genetic perturbation.

https://doi.org/10.7554/eLife.69571.sa0

Introduction

The biology of the brain is complicated by vast diversity in cell types, subtypes, and states. Contemporary advancements in single-cell sequencing have identified over a hundred molecularly distinct neuronal populations in the mammalian cortex (Lake et al., 2016; Saunders et al., 2018; Tasic et al., 2018; Hodge et al., 2019; Zeisel et al., 2015), including several small subpopulations of gamma aminobutyric acid (GABA)ergic neurons with specialized functions that are critical for the control of neuronal inhibition (Kepecs and Fishell, 2014; Lim et al., 2018). Understanding neurological function in health and disease from a cell type-specific perspective is critical to the progress of neuroscience.

Such endeavors necessitate cell type-specific technologies to identify, isolate, and manipulate discrete cell populations. Transgenic mouse strains targeting major inhibitory neuron subclasses, including parvalbumin-expressing (PV+), somatostatin-expressing (SST+), and serotonergic (5-HT+) neurons, are widely used today and have been instrumental toward our understanding of these cell types (Madisen et al., 2010; Taniguchi et al., 2011). More recently, many developers have turned toward scalable virus-based cell type-specific tools (Dimidschstein et al., 2016; Hrvatin et al., 2019; Nair et al., 2020; Vormstein-Schneider et al., 2020; Graybuck et al., 2021; Mich et al., 2021). Adeno-associated virus (AAV) technologies became particularly attractive with the invention of AAV variants that cross the blood–brain barrier to transduce the central nervous system, AAV-PHP.B, and AAV-PHP.eB (Chan et al., 2017; Deverman et al., 2016). In line with certain transgenic engineering, an emerging AAV targeting strategy is to incorporate cell type-specific enhancer elements into the viral genome to promote restricted expression. Enhancer activity can be extremely selective, even more so than the activity of most genes and their associated promoters (Hoffman et al., 2013; Kellis et al., 2014; Roadmap Epigenomics Consortium et al., 2015 2015). Thus, enhancers may be used to confer specificity for neuron subtypes that cannot be resolved by the expression of a single marker gene (Tasic et al., 2018) or for which the marker gene promoter is not specific on its own (Nathanson et al., 2009).

Despite the enthusiasm for enhancer sequences in cell type-specific AAV development, their selection remains challenging. ATAC-seq (Buenrostro et al., 2013) is an advantageous technique for defining potential cell type-specific enhancer regions because of its high nucleotide resolution and its compatibility with small cell populations and even with single-cell technologies (Buenrostro et al., 2015b; Cusanovich et al., 2015). Though convenient, chromatin accessibility is a noisy approximation for enhancer activity and differentially accessible sequence elements often fail to produce selective expression in vivo. A parallel screening approach involving single-nucleus sequencing of barcoded enhancer libraries, PESCA, was proposed to speed up the selection process toward a successful enhancer-driven virus (Hrvatin et al., 2019). Other approaches leveraged marker gene proximity and conservation across species for enhancer prioritization (Vormstein-Schneider et al., 2020). These efforts highlight the low conversion rate from experimentally suggested cell type-specific open chromatin regions (OCRs) to desired cell type-specific activity. We hypothesized that machine learning models could be applied as additional in silico filters to reduce the burden of experimental screening in cell type-specific AAV development while maintaining the convenience of ATAC-seq as the sole input data requirement.

The nucleotide sequence code that links transcription factor (TF) binding sites and other DNA features to enhancer activity is underutilized in AAV enhancer design, perhaps due to its complexity (Jindal and Farley, 2021). We reasoned that machine learning classifiers could be leveraged to identify the most characteristic and specific enhancer sequence patterns for a cell population, enabling efficient prioritization of sequences that are likely to drive selective expression. Convolutional neural networks (CNNs) (Le Cun et al., 1989) and support vector machines (SVMs), for example, have achieved state-of-the-art performance on predicting enhancer activity from sequence (Arvey et al., 2012; Ghandi et al., 2014; Zhou and Troyanskaya, 2015). Here, we present a framework for machine learning-assisted engineering of cell type-specific AAVs, which we refer to as Specific Nuclear-Anchored Independent Labeling (SNAIL). Similar to our previously described Cre-activated AAV technology cSNAIL (Lawler et al., 2020), SNAIL probes have the unique advantage of an affinity purification-compatible fluorescent reporter that can be used to isolate rare cell types (Mo et al., 2015; Deal and Henikoff, 2010; Lawler et al., 2020). Unlike cSNAIL, which relies on Cre activation, SNAIL probes are stand-alone AAVs driven by cell type-specific enhancer sequences selected through machine learning models. Thus, SNAIL probes may be used in a wide variety of systems, including wild-type mice, primates, and other species.

We used our framework to create two novel AAV probes for PV+ neurons. In the mouse primary motor cortex, PV+ SNAIL probes labeled PV+ neurons with >70% specificity when compared to Pvalb antibody staining. PV+ SNAIL probes in this context were more specific to GABAergic PV+ neurons than the common Pvalb-2A-Cre/Ai14 mouse strain. In addition, isolated populations of tagged cells from the cortex, striatum, and external globus pallidus (GPe) were each heavily enriched for known PV+ open chromatin signatures. Nucleotide-resolution model interpretation highlighted a collection of 14 TF binding motif families contributing to PV+ neuron-specific enhancer activation. Machine learning can aid in cross-species probe design as well; at least one SNAIL probe showed PV+ neuron selectivity in both mouse and macaque. These results demonstrate concrete utility in sequence-level information for AAV enhancer selection, setting the stage for efficient probe design for a wide range of cell types.

Results

SVMs discriminate known cell type-specific regulatory sequences

We built machine learning classifiers, primarily SVMs, to discriminate sequences of differential OCRs between two cell populations. These models take 500 bp candidate enhancer DNA sequence strings as input and they output, for each sequence, the cell type in which that candidate enhancer is active and an associated score. The similarity between sequences is determined based on gapped k-mer count vectors, that is, the number of occurrences of all short subsequences of length k, tolerating some gaps or mismatches, as implemented by LS-GKM (Ghandi et al., 2014; Lee, 2016). Where there are sufficient reliable sequence features associated with differential enhancer activation between two cell types, the model should learn these principles during the training phase and then be able to apply these principles to determine the cell type-specific activities of new sequences. We imposed upfront that training sequences have a minimum fold difference in chromatin accessibility between the two cell types used to define the classes. This should help the model learn cell type-specific features of enhancer activation rather than general enhancer features. We chose this strategy because it closely aligned with our goal of prioritizing sequences that would activate enhancers in one cell type and not others.

To evaluate this strategy, we first built SVMs comparing select broad classes of cell types in the brain using publicly available single-nucleus (sn)ATAC-seq data from the mouse motor cortex (Mop) (Li et al., 2021; Figure 1—figure supplement 1). These were (i) a neuron vs. astrocyte classifier and (ii) an excitatory neuron vs. inhibitory neuron classifier. To assess model performance, we used standard classifier metrics, the area under the receiver operator curve (auROC), and the area under the precision-recall curve (auPRC). These scores quantify model performance by comparing the predicted class to the actual class, where a randomly guessing binary classifier would have an auROC score of 0.5 and an auPRC score equal to the fraction of actual positives in the data, and a perfect classifier would have a maximum auROC score of 1.0 and auPRC score of 1.0. Both models performed well on held-out data, achieving auROCs of 0.95 and 0.93 and auPRCs of 0.98 and 0.82 (Figure 1—figure supplement 2).

Although the models were built for enhancer sequences, they recapitulated known cell type-specific activation patterns of some commonly used AAV promoter sequences: Gfap, Camk2a, and Dlx (Figure 1—figure supplement 2). The Gfap promoter sequence, which has a heavy astrocyte bias in vivo, was classified as astrocyte-specific in our neuron vs. astrocyte model (5580 astrocyte-specific enhancers evaluated; Gfap ranks 3298/5580). The same neuron vs. astrocyte model scored the Camk2a promoter and Dlx promoter sequences, which are known to have neuron-specific activity, as neuron-specific (14,347 neuron-specific enhancers evaluated; Camk2a = rank 13,300/14,347, Dlx = rank 9736/14,347). Also consistent with empirical expectations, the excitatory vs. inhibitory neuron model predicted the Camk2a sequence to have an excitatory neuron preference and the Dlx sequence to have an inhibitory neuron preference (15,391 excitatory neuron-specific enhancers evaluated; Camk2a = rank 11,541; 4,608 inhibitory neuron-specific enhancers tested; Dlx = rank 1142/4608) (Figure 1—figure supplement 2). Therefore, this classification strategy is capable of correctly predicting cell type-specific regulatory sequence activity in the viral context for these very distinct cell classes.

Machine learning models accurately predict PV+ neuron-specific open chromatin from sequence

Next, we assessed whether the same modeling strategy could be applied to more narrowly defined neuron subtypes, using PV+ neurons as a target. Again, we trained binary classifiers on sequences underlying differential ATAC-seq peaks between two cell populations (PV+ neurons and another population). To maximize sequence signal detection, we built several models comparing PV+ neurons to either PV- cells generally, excitatory neurons, VIP+ neurons, or SST+ neurons. In addition, we varied the input data source to be either ‘population-derived’ (from bulk ATAC-seq sequencing on purified nuclei populations) or ‘sn-derived’ (from single-nucleus open chromatin assays such as droplet-based snATAC-seq). Finally, we implemented multiple modeling strategies, including linear gapped k-mer SVMs, nonlinear gapped k-mer SVMs using the radial basis function (rbf), and CNNs. The ratios of positive and negative examples during training ranged from 1:0.4 to 1:3.7. Detailed information about all model parameters and training, validation, and test data is available in Figure 1—source data 1 (SVMs) or Figure 1—source data 2 (CNNs).

First, to define potential PV+ neuron and PV- cell enhancer sequences in the mouse cortex in a population data-driven manner, we conducted ATAC-seq on the PV+ and PV- nuclei of Pvalb-2A-Cre mice. We isolated the nuclei populations using previously described Cre-dependent AAV affinity purification technology, cSNAIL (Lawler et al., 2020). cSNAIL probes activate an isolatable nuclear envelope tag in the presence of Cre recombinase protein. Therefore, purified populations from these mice are a direct reflection of cells labeled by the Pvalb-2A-Cre mouse strain, a current standard for PV+ neuron labeling. Population-derived PV+ vs. PV- SVMs could predict the correct classification on held-out data with high accuracy (test set auROC = 0.87–0.88, auPRC = 0.95–0.96; gray lines Figure 1b and c), indicating that there were substantial sequence pattern differences between the PV+ and PV- classes and that the models were able to learn these differences.

Figure 1 with 6 supplements see all

Download asset Open asset

Classification of neuron subtype-specific enhancer activity from sequence.

(a) Schematic representation of the Specific Nuclear-Anchored Independent Labeling (SNAIL) workflow. (**b–e**) Receiver operator characteristic and precision-recall performance metrics for various cell type-specific enhancer sequence model strategies and data modalities. The reported numbers are the areas under the curves for each model. (f) Scatter plots for support vector machine (SVM) scores reported by equivalent population-derived models and single-nucleus-derived models. ***p-Value of Pearson correlation <0.001.

Figure 1—source data 1 Support vector machine (SVM) parameter tuning and performance evaluations.: https://cdn.elifesciences.org/articles/69571/elife-69571-fig1-data1-v1.xlsx
Download elife-69571-fig1-data1-v1.xlsx
Figure 1—source data 2 Convolutional neural network (CNN) parameter tuning and performance evaluations.: https://cdn.elifesciences.org/articles/69571/elife-69571-fig1-data2-v1.xlsx
Download elife-69571-fig1-data2-v1.xlsx
Figure 1—source data 3 Model scores on externally screened parvalbumin-expressing (PV+) enhancer candidates.: https://cdn.elifesciences.org/articles/69571/elife-69571-fig1-data3-v1.xlsx
Download elife-69571-fig1-data3-v1.xlsx
Figure 1—source data 4 Identification of motif sites within external parvalbumin-expressing (PV+) neuron enhancer screen, E1–E34.: https://cdn.elifesciences.org/articles/69571/elife-69571-fig1-data4-v1.xlsx
Download elife-69571-fig1-data4-v1.xlsx

We then considered the possibility that the PV+ vs. PV- models were learning features of general neuron vs. glia enhancer sequence properties and not necessarily features specific to PV+ neurons as the PV- data contained a high proportion of glial cells, a developmental outgroup to neurons. To address this issue, we trained additional population-derived SVMs that directly discriminated between enhancer sequences of PV+ neurons and other neuron subtypes using publicly available ATAC-seq data from INTACT-sorted excitatory (EXC) neurons and VIP+ neurons (Mo et al., 2015). These models performed well (auROC = 0.89–0.93, auPRC = 0.79–0.92; Figure 1b and c), demonstrating that, even at the level of neuron subtypes, OCR sequence information is rich enough to reliably distinguish cell type-specific activity.

To survey an additional machine learning strategy, we built CNN classifiers from the same underlying data (Figure 1—figure supplement 3). CNNs are a type of artificial neural network defined by multiple convolutional layers. The CNNs were trained to take in 1000 bp DNA sequence strings with different accessibility between two cell types, automatically extract predictive sequence features, and output a cell class probability between 0 and 1 (see Materials and methods for details). Compared with SVMs, CNNs are better equipped to learn higher-order interactions between sequence features due to CNNs’ capacity for flexible feature representation and automated feature selection (Le Cun et al., 1989). The CNNs were highly accurate (auROC = 0.95–0.96, auPRC = 0.76–0.97; Figure 1d). Thus, CNNs represent an additional successful approach to discriminate OCR sequence differences between purified neuron populations.

While ATAC-seq from purified cell populations is advantageous for its depth and recovers many examples of differentially accessible reads between neuron subtypes, many neuron populations of interest are not yet isolatable, even through transgenic means. Single-nucleus sequencing technologies can be applied to measure neuron subtype-resolution open chromatin without cell sorting by performing several parallel microreactions that introduce unique cell barcodes into ATAC-seq sequencing reads and then clustering the cells with similar chromatin profiles. We explored whether cell type-specific enhancer sequences derived from mouse motor cortex snATAC-seq (Li et al., 2021) were sufficient to produce neuron subtype-level classifiers. We trained five pairwise linear gapped k-mer SVMs to discriminate differential open chromatin sequences from snATAC-seq clusters or groups of clusters. These included analogous models to the population-derived models comparing PV+ vs. PV-, PV+ vs. EXC, and PV+ vs. VIP+. In this case, the single-nucleus-derived PV+ vs. PV- model refers to a model trained on differential OCR sequences comparing PV+ cluster nuclei to all other nuclei with a random sampling probability. The PV+ vs. k-nearest-neighbor (KNN) model is an additional variation on the PV+ vs. PV- model where the PV- nuclei sampling for differential OCR analysis was selected for similarity to the PV+ cluster as implemented in SnapATAC (Fang et al., 2021). We also produced a model comparing PV+ vs. SST+ neurons, the most similar subtype to PV+. Single-nucleus-derived SVMs were also able to classify cell type-specific enhancer sequences with high accuracy (auROC = 0.89–0.93, auPRC = 0.72–0.92; Figure 1e).

Moreover, models built independently from different data sources identified similar sequence contributions for equivalent tasks. When scoring the population-derived sequences through both the population-derived SVMs and the single-nucleus-derived SVMs, individual sequences had highly similar scores (Figure 1f). These findings highlight the prevalence of reliable cell type-specific enhancer sequence signatures that can be defined by a variety of classifier types and sources of open chromatin measurements. The parameter and performance details of all models can be found in Figure 1—source data 1 (SVMs) and Figure 1—source data 2 (CNNs).

Models learn biological signatures relevant for AAV probe design

To ensure that the neuron subtype-level models were identifying signatures that were relevant for the specific purpose of creating selective PV+ neuron viruses, we evaluated model predictions on 34 externally validated successful and unsuccessful PV+ probe enhancer candidates from Vormstein-Schneider et al., 2020, named E1–E34. Importantly, the enhancer sequence from the probe with the lowest PV+ specificity (E4; 14% specificity) received a negative score from every model, and two of the enhancer sequences with highest cortical PV+ specificity (E22 and E29; 94% specificity) received high positive scores from every model (Figure 1—source data 3).

We binarized PV+ enhancer AAV sequences E1–E34 into either the high-specificity (>70%) or low-specificity (<70%) group and assessed the relationships between these groups and ATAC-seq log2 fold difference or SVM score (Figure 1—figure supplement 5). These groups had little variation in mammalian nucleotide sequence conservation levels measured by PhyloP, but many of the sequences with highly confidently predicted PV+-specific accessibility had PV+ accessibility in humans (Figure 1—figure supplement 5b). On the complete set of enhancers, SVM score was predictive of the PV+ specificity group (p=0.008), and differential activity log2 fold difference had a weak association with PV+ specificity group (p=0.069) (Figure 1—figure supplement 5c). Much of the log2 fold difference association with group PV+ specificity was driven by enhancer sequences with low log2 fold difference and low PV+ specificity. Undesirably, there were some sequences with high log2 fold difference and low in vivo specificity. Within the subset of enhancers with high log2 fold difference (log2 fold difference > 1), the log2 fold difference was not associated with specificity group (p=0.601), while SVM score was weakly associated (p=0.057) (Figure 1—figure supplement 5d). Unlike log2 fold difference scores alone, SVM scores may limit false-positive candidates and improve efficient enhancer AAV selection.

This result emphasizes the benefit of enhancer pre-selection with machine learning, which could reduce in vivo screening efforts by identifying the best PV+ enhancer sequences before experimentation. The models predicted which PV+ enhancer sequence candidates were likely to be cell type-specific expression drivers. They also carry information about precisely which subsequences were responsible for the predicted PV+ neuron-specific activation. Sequence E29, within the Inpp5j locus, was predicted to have PV+ neuron-specific activity due to a central Mef2 motif site and nearby Esrrg motif site, among others (Figure 1—figure supplement 6, Figure 1—source data 4). Sequence E22, within the Tmem132c locus, was predicted to have PV+ specificity in part due to Nkx28 and Lhx6 motif sites (Figure 1—figure supplement 6, Figure 1—source data 4). Nevertheless, none of these enhancers were our highest predicted PV+ neuron sequences, so we continued to investigate additional enhancer candidates genome-wide for PV+ SNAIL probe implementation.

Two candidate PV+ SNAIL probes successfully target PV+ neurons in the mouse cortex

Based on the predictions of all PV+ enhancer models on our candidates, we prioritized two highly characteristic PV+ neuron enhancer sequences and tested their ability to drive targeted expression in vivo (Figure 2). We refer to these sequence candidates as SC1 and SC2. Among true PV+ neuron-specific enhancer sequences that (i) were differential OCRs in PV+ vs. PV-, PV+ vs. EXC, and PV+ vs. VIP+ sorted population data and (ii) scored PV+ positive across all SVM evaluations (1755 sequences), SC1 was the highest predicted sequence candidate, while SC2 was in the 90th percentile (Figure 2b, Figure 2—source data 1). We chose these sequences to represent two different confidence levels and evaluate the general potential of the top 10% of machine learning-prioritized PV+ neuron-specific enhancers.

Figure 2

Download asset Open asset

Two sequence candidates selectively activate adeno-associated virus (AAV) expression in parvalbumin-expressing (PV+) neurons.

(a) Genome browser visualization of PV+-specific ATAC-seq signal at sequence candidates SC1 and SC2. * cSNAIL data, † INTACT data from Mo et al., 2015, ‡ snATAC-seq from Li et al., 2021. (b) Percentile rank of support vector machine (SVM) scores among 1755 true PV+-specific enhancer sequence candidates that scored positively across all models. Linear population-derived models are denoted with ‘pop,’ nonlinear population-derived models are denoted with ‘pop, rbf,’ and linear single-nucleus-derived models are denoted with ‘sn.’ (c) Example images of AAV Sun1GFP expression against parvalbumin (Pvalb) antibody staining. (**d, e**) Quantification of AAV Sun1GFP or Pvalb-2A-Cre/Ai14 reporter overlap with Pvalb+ cells. Bar heights represent the mean among images, and the error of the mean is shown. N cells = 1322 (SC1), 2570 (SC2), 1340 (Ai14), 2013 (*Ef1a*), and 504 (N.C.). N.C., negative control.

Figure 2—source data 1 Detailed information for 1755 experimental parvalbumin (PV)-enriched open chromatin regions (OCRs) with positive parvalbumin-expressing (PV+) model scores. Genomic coordinates, peak log2 fold difference, adjusted p-value, support vector machine (SVM) scores, convolutional neural network (CNN) scores, snATAC-seq overlap, species conservation, and subcortical SVM scores are provided.: https://cdn.elifesciences.org/articles/69571/elife-69571-fig2-data1-v1.xlsx
Download elife-69571-fig2-data1-v1.xlsx
Figure 2—source data 2 Image quantification for Specific Nuclear-Anchored Independent Labeling (SNAIL) viruses or the Pvalb-2A-Cre mouse strain reporter with Pvalb immunohistochemistry.: https://cdn.elifesciences.org/articles/69571/elife-69571-fig2-data2-v1.xlsx
Download elife-69571-fig2-data2-v1.xlsx

SC1 and SC2 sequences were cloned into separate vectors upstream of the cSNAIL reporter gene, Sun1GFP. To minimize off-target effects, PV+ SNAIL probes directly rely on transcriptional activation from SC1 or SC2, without a minimal promoter (see Materials and methods). We also prepared two control vectors: a negative control that was the identical vector but with no enhancer or promoter and a nonspecific control that was the identical vector but with a common Ef1a promoter sequence in place of the candidate sequence. When packaged with AAV-PHP.eB and delivered to the mouse through systemic injection, the SC1-Sun1GFP and SC2-Sun1GFP constructs promoted cortical fluorescence that was restricted to PV+ neurons, while the Ef1a virus did not (Figure 2c–e, Figure 2—source data 2). We quantified images with consistent positioning in the primary motor cortex, with representation from all layers. Compared with immunohistochemistry-labeled Pvalb protein, SC1 and SC2-mediated expression of Sun1GFP was restricted to Pvalb+ neurons in ~70–74% of cases. This was an 11-fold enrichment in precision over the Ef1a promoter and notably an almost twofold enrichment over Cre reporter labeling in Pvalb-2A-Cre/Ai14 double transgenic mice (Figure 2c–e, Figure 2—source data 2). We expect these to be conservative estimates of PV+ targeting due to incomplete antibody capture. On average, Sun1GFP expression from SC1 and SC2 SNAIL probes labeled ~71–73% of Pvalb+ neurons. The rate is limited by the transduction properties of the AAV-PHP.eB capsid, which is expected to transduce 55–70% of neurons in the cortex (Chan et al., 2017). The negative control virus was injected at as high of a concentration as possible (14× the concentration of SC1, SC2, and Ef1a injections) to detect any biases in spurious background expression. At this titer, a number of cells exhibited GFP expression (Figure 2—source data 2), but these did not appear biased toward PV+ neurons (Figure 2d).

Isolation of PV+ SNAIL-labeled nuclei captures PV+ cortical interneurons

Expression of Sun1GFP differentiates SNAIL probes from other cell type-specific AAV technology. The stable nuclear envelope association of this tag enables affinity purification using magnetic beads coated with anti-GFP antibodies, which is advantageous for rare population isolation and downstream epigenetic assays. In many contexts, purification of a cell population is more efficient than single-nucleus sequencing technologies, especially if the population of interest is in low proportion or the desired downstream applications are not available in single-nucleus approaches. Taking advantage of this property, we isolated Sun1GFP-expressing nuclei induced by SC1-Sun1GFP, SC2-Sun1GFP, or Ef1a-Sun1GFP SNAIL virus from the mouse cortex and performed ATAC-seq. Through comparison with known PV+ neuron ATAC-seq (via cSNAIL in the Pvalb-2A-Cre strain) and PV- or bulk ATAC-seq, including cSNAIL PV- cell fractions and Ef1a virus signatures, we determined that both SC1-Sun1GFP and SC2-Sun1GFP cells were highly enriched for PV+ neurons.

The first principal component, accounting for 84% of the total variance, separated known PV+ neuron samples from PV- and bulk tissue samples. Likewise, SC1-Sun1GFP and SC2-Sun1GFP samples grouped with the PV+ samples while Ef1a-Sun1GFP samples grouped with the PV- and bulk sample signatures (Figure 3a). At the Pvalb locus, there were highly reproducible OCR signals between PV+ cSNAIL, PV+ snATAC-seq, SC1-Sun1GFP, and SC2-Sun1GFP samples that did not appear in bulk tissue, PV-, or Ef1a-Sun1GFP samples (Figure 3b).

Figure 3

Download asset Open asset

Cortical SC1 and SC2 Specific Nuclear-Anchored Independent Labeling (SNAIL)-isolated nuclei recapitulate parvalbumin-expressing (PV+) GABAergic interneuron ATAC-seq signatures.

(a) Principal component analysis (PCA) of ATAC-seq counts across samples. (b) Genome browser visualization of ATAC-seq signal at the *Pvalb* gene locus. Tracks represent the pooled sample p-value signal. Each track of similar data type is normalized to the same scale: SNAIL data range 0–335, *cSNAIL data range 0–93, †INTACT data range 0–200, ‡snATAC-seq data range 0–2. (c) Scatter plots of ATAC-seq log2 fold difference relative to bulk tissue ATAC-seq, comparing PV+ cSNAIL to other adeno-associated viruses (AAVs). The density of overlapping points is shown by the plot color. (d) snATAC-seq nuclei clusters as visualized by t-SNE. The dendrograms show hierarchical clustering of Euclidean sample distances by Ward’s minimum variance method D2. The heatmap shows the percentage of population open chromatin regions (OCRs) enriched relative to bulk that are also cluster-specific marker OCRs. *Hypergeometric enrichment p<0.01.

Figure 3—source data 1 Differential open chromatin region (OCR) statistics for parvalbumin-expressing (PV+) Specific Nuclear-Anchored Independent Labeling (SNAIL)-isolated cortical ATAC-seq relative to bulk tissue cortical ATAC-seq.: https://cdn.elifesciences.org/articles/69571/elife-69571-fig3-data1-v1.xlsx
Download elife-69571-fig3-data1-v1.xlsx

A major goal for PV+ SNAIL probes was to replace transgenic mouse strain technologies in certain contexts. Ideally then, ATAC-seq from Sun1GFP-sorted cells from SNAIL probes in wild-type mice should provide similar information as ATAC-seq from Sun1GFP-sorted cells from cSNAIL in Pvalb-2A-Cre transgenic mice. Therefore, we defined PV+ cSNAIL ATAC-seq log2FoldDifference over bulk cortical tissue ATAC-seq as a gold standard for each OCR. For SC1 and SC2, we computed the correlations between the log2FoldDifference of OCR signal relative to bulk tissue and the log2FoldDifference of OCR signal in PV+ cSNAIL relative to bulk tissue. To establish an upper limit for correlation, we compared two different batches of cortical PV+ cSNAIL samples, which had a Pearson correlation of 0.86 and a Spearman correlation of 0.85. As a lower limit, we evaluated the nonspecific Ef1a control virus, which had a Pearson correlation of 0.38 and a Spearman correlation of 0.26. Because the AAV-PHP.eB capsid has a neuron bias, these lowly correlated signatures are likely to be general neuron specifications shared among PV+ and other neurons. Within this range, SC1 and SC2 had very high correlation with cSNAIL, with SC1 achieving almost equivalent correlation as the two cSNAIL batches (SC1 Pearson = 0.85 and Spearman = 0.84; SC2 Pearson = 0.81 and Spearman = 0.79) (Figure 3c). The details for differential OCRs in each virus relative to bulk tissue can be found in Figure 3—source data 1.

Finally, we compared SC1-Sun1GFP+ and SC2-Sun1GFP+ cell open chromatin signatures to those of snATAC-seq clusters from the mouse motor cortex (Figure 3d; Li et al., 2021). We defined cluster-specific OCRs for each snATAC-seq cluster and population-enriched OCRs for SNAIL-isolated cells relative to bulk tissue (see Materials and methods) and assessed the overlaps. We found that cSNAIL-isolated PV+ OCRs, SC1-isolated OCRs, and SC2-isolated OCRs were each significantly enriched for PV+ cluster-specific markers (34–47% overlap, hypergeometric p=0), while OCRs from Ef1a-isolated cells were not enriched for PV+ cluster-specific markers (4% overlap, p=1). Ef1a OCRs instead had the highest enrichment for markers of a layer 4 excitatory neuron cluster (25% overlap, p=5.3 × 10^–5). We also note that cSNAIL PV+ ATAC-seq had an additional 8% overlap with excitatory cluster L5 PT markers (p=2.5 × 10^–45), possibly reflective of Pvalb-2A-Cre line labeling in layer 5 parvalbumin-expressing excitatory neurons (Tanahira et al., 2009; Jinno and Kosaka, 2004; Roccaro-Waldmeyer et al., 2018). These OCRs were absent in SC1- and SC2-isolated cells. In fact, SC1 and SC2 had no enrichment for cluster-specific OCRs of any cluster other than PV+ (≤2% overlap, p>0.1), including the closely related SST+ neuron population. This suggests that SC1 and SC2 SNAIL probes target a stricter subset of the cells than the Pvalb-2A-Cre mouse strain, likely restricted to PV+inhibitory interneurons.

Chromatin accessibility differences between PV+ neurons in different brain regions

SC1 and SC2 SNAIL probes were designed based on the sequence properties of cortical PV+ neurons. Many PV+ neurons throughout the brain have a common developmental origin in the medial ganglionic eminence (MGE), but there are substantial OCR differences between mature PV+ neuron populations in different brain regions. From cSNAIL-isolated PV+ populations in Pvalb-2A-Cre mice (Lawler et al., 2020), we characterized thousands of OCRs with differential accessibility between the cortex, striatum, and GPe (p_adj<0.01, |log2FoldDifference| > 1) (Figure 4a, Figure 4—source data 1). These differences were associated with distinct TF binding motifs (Figure 4b, Figure 4—source data 2). For example, OCRs that were more accessible in cortical PV+ neurons relative to striatal and GPe PV+ had enrichment for Mef2 family motifs, which are important in neuroplasticity (Donato et al., 2015), may play a role in specifying or maintaining the MGE PV+ neuron lineage in mouse and human (Mayer et al., 2018), and have been linked to neurodevelopmental disorders involving the prefrontal cortex (Mitchell et al., 2018). TFs with motifs enriched in PV+ neuron OCRs that are more open in striatum relative to cortex and GPe included Tgif1, a key homeodomain gene involved in holoprosencephaly (Taniguchi et al., 2012). At 6654 differential OCRs, GPe-specific PV+ OCRs were the most unique and had TF motif enrichments, including the Lhx3, Pou5f1, Esrrg, and Pax3 motifs.

Figure 4 with 2 supplements see all

Download asset Open asset

SC1 and SC2 generalize to parvalbumin-expressing (PV+) neurons in the striatum and external globus pallidus (GPe).

(a) Numbers of differential open chromatin regions (OCRs) between PV+ neuron populations in three brain regions (DESeq2 padj<0.01 and |log2FoldDifference| > 1). Brain region-specific OCRs are those that were significantly enriched in that tissue relative to each of the other two tissues. OCRs shared between two brain regions on the Venn diagram are those that were significantly enriched in each of those tissues relative to the excluded tissue. The shared center of the Venn diagram shows all remaining OCRs that have ambiguous or no tissue preference. (b) Examples of enriched motifs in brain region-specific PV+ open chromatin relative to all PV+ open chromatin. (**c, f**) Distributions of validation data support vector machine (SVM) scores and SC1 and SC2 scores within striatum and GPe PV+ vs. PV- models. (**d, g**) Principal component analysis (PCA) visualization of ATAC-seq counts in each sample. (**e, h**) Pearson correlation coefficients when comparing the log2 fold difference of cSNAIL PV+ ATAC-seq relative to bulk tissue ATAC-seq and the log2 fold difference of SNAIL ATAC-seq relative to bulk tissue ATAC-seq. Error bars show the 95% confidence intervals.

Figure 4—source data 1 Differential open chromatin region (OCR) statistics for tissue-specific parvalbumin-expressing (PV+) neuron OCRs in cortex, striatum, and external globus pallidus (GPe).: https://cdn.elifesciences.org/articles/69571/elife-69571-fig4-data1-v1.xlsx
Download elife-69571-fig4-data1-v1.xlsx
Figure 4—source data 2 Motif enrichments among tissue-specific parvalbumin-expressing (PV+) neuron open chromatin region (OCR) sequences.: https://cdn.elifesciences.org/articles/69571/elife-69571-fig4-data2-v1.xlsx
Download elife-69571-fig4-data2-v1.xlsx
Figure 4—source data 3 Pathway enrichments among tissue-specific parvalbumin-expressing (PV+) neuron open chromatin region (OCR) sequences.: https://cdn.elifesciences.org/articles/69571/elife-69571-fig4-data3-v1.xlsx
Download elife-69571-fig4-data3-v1.xlsx
Figure 4—source data 4 Differential open chromatin region (OCR) statistics for parvalbumin-expressing (PV+) Specific Nuclear-Anchored Independent Labeling (SNAIL)-isolated striatal ATAC-seq relative to bulk tissue striatal ATAC-seq.: https://cdn.elifesciences.org/articles/69571/elife-69571-fig4-data4-v1.xlsx
Download elife-69571-fig4-data4-v1.xlsx
Figure 4—source data 5 Differential open chromatin region (OCR) statistics for parvalbumin-expressing (PV+) Specific Nuclear-Anchored Independent Labeling (SNAIL)-isolated external globus pallidus (GPe) ATAC-seq relative to PV- GPe ATAC-seq.: https://cdn.elifesciences.org/articles/69571/elife-69571-fig4-data5-v1.xlsx
Download elife-69571-fig4-data5-v1.xlsx

These molecular differences likely relate to functional differences, such as the tendency of PV+ cells in the GPe to project to other brain regions vs. the local nature of PV+ cells in the cortex (Hernández et al., 2015; Saunders et al., 2016). We used GREAT, which associates OCRs to one or more genes using proximity-based gene regulatory domains and then performs pathway enrichment analysis for those genes, to assess Gene Ontology enrichments in the brain region-specific PV+ ATAC-seq OCR sets relative to all PV+ OCRs (Figure 4—source data 3; McLean et al., 2010). The set of PV+ OCRs enriched in cortical PV+ neurons included 10 regions associated with the Bdnf gene (Ensembl Genes; false discovery rate [FDR] Q = 0.0035). Bdnf is generally expressed in excitatory forebrain neurons but not PV+ interneurons. The presence of PV+ neuron OCRs near Bdnf could represent genomic regions with nonenhancer functions, enhancers that regulate another gene, the binding regions of repressive TFs, or trace contamination from excitatory populations. Other cortex-specific PV+ enrichments included terms related to sensory perception, especially smell. Striatum-specific PV+ neuron OCRs were enriched for the adenylate cyclase-inhibiting dopamine receptor signaling pathway (GO:BP; FDR Q = 0.010) and bradykinesia (Mouse Phenotype; FDR Q = 0.046). OCRs preferentially open in GPe PV+ neurons were enriched for neuropeptide signaling pathways, for example, acetylcholine receptor binding (GO:MF; FDR Q = 0.0044) and neuropeptide receptor activity (GO:MF; FDR Q = 1.2 × 10^–5). This suggests unique epigenetic mechanisms for the regulation of transcription related to receptor signaling in GPe PV+ neurons, but further work is needed to discern these relationships.

PV+ SNAIL probes generalize to subcortical brain regions in the mouse

Given these complexities, we were interested in the extent to which PV+ enhancer probes chosen from data in one tissue could generalize to other brain regions. Here, we assessed whether SC1 and SC2 SNAIL probes, designed in the cortex, were also selective for PV+ neurons in the striatum and GPe. First, we used cSNAIL ATAC-seq data from the striatum and GPe to model the regulatory sequence properties of PV+ neurons vs. PV- cells in these brain regions, using the population-derived linear SVM strategy (Figure 4—figure supplement 1). We assessed the correlation between the score outputs of the cortex PV+ vs. PV- SVM with the striatum PV+ vs. PV- or GPe PV+ vs. PV- SVMs for our set of experimentally identified mouse cortical PV+ neuron enhancers sequences. Enhancer scores were well correlated between the cortex and striatum models (Pearson = 0.73, Spearman = 0.72), indicating shared regulatory sequence determinants between PV+ neurons in the cortex and striatum. There was a low correlation between enhancer scores of the cortex and GPe models (Pearson = 0.25, Spearman = 0.24), indicating differences in the learned PV+ regulatory sequence properties in these regions. SC1 and SC2 sequences were predicted to have PV+-specific activation in both the striatum and GPe (Figure 4c and f). However, there were between 1000 and 3000 sequences with more confident PV+ scores than SC1 and SC2 in the striatum and GPe.

We proceeded to isolate SC1- and SC2-labeled cells from these tissues in wild-type mice using Sun1GFP affinity purification and performed ATAC-seq on the tagged populations. We have previously shown high agreement between cSNAIL and Pvalb-2A-Cre labeling in the striatum and GPe (Lawler et al., 2020), so we again used cSNAIL ATAC-seq samples from these regions as true PV+ neuron signals. By principal component analysis (PCA), we recovered separation between PV+ samples, including SC1- and SC2-isolated populations, and PV- samples (Figure 4d and g). We assessed the correlations between log2FoldDifference in SNAIL and cSNAIL samples, each relative to bulk tissue (striatum) or, where there were no bulk samples available, cSNAIL PV- cells (GPe) (Figure 4e and h, Figure 4—source data 4, Figure 4—source data 5). Pearson correlation coefficients were similar or slightly lower for SC1 and SC2 in the striatum and GPe than for equivalent comparisons in the cortex, indicating less conservation between cSNAIL and SNAIL probe targets (SC1 cortex = 0.85, striatum = 0.71, GPe = 0.68; SC2 cortex = 0.81, striatum = 0.82, GPe = 0.73). Still, these were substantially increased over Ef1a correlation with cSNAIL in these tissues, especially for the striatum (Ef1a cortex = 0.38, striatum = 0.18, GPe = 0.51).

Finally, by comparing the peak overlaps of SC1 and SC2-enriched OCRs in striatum and GPe with cortical snATAC-seq cluster-specific OCRs, we still identified the PV+ cluster as most similar to SC1 and SC2 cells. As expected, all overlaps in striatum-cortex and GPe-cortex comparisons were lower than those from cortex-cortex comparisons, but the magnitudes of SC1 and SC2 overlapped with the Pvalb cluster in these brain regions were similar to the magnitudes of cSNAIL PV+ overlap with the Pvalb cluster in these brain regions (Figure 4—figure supplement 2). In the striatum, the overlaps with the Pvalb cluster were 8% for SC1, 14% for SC2, and 14% for cSNAIL. In the GPe, the overlaps with the Pvalb cluster were 7% for SC1, 7% for SC2, and 9% for cSNAIL. From these interpretations, SC1 and SC2 SNAIL viruses seem to generalize to the striatum and GPe, though they may not be as robust as they are within the cortical context. Additional experimental evidence may be necessary to confirm the appropriateness of PV+ SNAIL viruses for certain applications outside the cortex.

Esrrg and Mef2 motifs are important for the PV+ neuron-specific activity score of SC1 and SC2 sequences

To interpret the specific sequence patterns within SC1 and SC2 that contribute to their PV+ neuron-specific activity prediction, we assessed commonly used motifs for each model and identified potential matches within the candidate sequences. For all SVMs, we calculated per-base importance scores and hypothetical importance scores for the set of PV+ neuron-specific OCRs that were true positives according to all SVMs (score >0; N = 1755) (Shrikumar et al., 2019). Then, for each model, we used TF-MoDISco (Shrikumar et al., 2018) to cluster commonly important subsequences called ‘seqlets’ within these PV+-specific examples. The resulting clusters represent motifs that were high contributors to a positive score in each model. Among the 11 SVMs comparing PV+ neuron open-chromatin against PV- cells, EXC neurons, VIP+ neurons, or SST+ neurons, we recovered 124 well-supported motifs. Many motifs appeared to be shared across multiple models. Thus, we performed UPGMA clustering on the 124 motifs by sequence similarity using STAMP (Mahony and Benos, 2007) and identified 14 motif clusters (Figure 5a).

Figure 5

Download asset Open asset

Motif interpretation of parvalbumin-expressing (PV+) neuron-specific open chromatin region (OCR) activity.

(a) Motifs with high contributions to PV+ scores in each support vector machine (SVM), clustered by sequence similarity. The bubble color at each node shows the model that motif was discovered in and the size of the bubble shows the number of seqlets supporting that motif. Clusters are labeled by the clade majority best match for known transcription factor binding motifs. The full list of matches can be found in Figure 5—source data 1. (**b, c**) Normalized importance of each base in SC1 (b) and SC2 (c) sequences for their PV+-specific scores in each SVM. Locations with sequence matches for identified motifs in each SVM (from panel a) are shown at the bottom. (d) Predicted impacts of motif scrambling on PV+ specificity. Motif mutation sites are shown with asterisks in panels (a) and (b). Each point is the sequence score from one SVM and ‘x’ is the mean.

Figure 5—source data 1 Association of TF-MoDISco motifs to known transcription factor binding motifs.: https://cdn.elifesciences.org/articles/69571/elife-69571-fig5-data1-v1.xlsx
Download elife-69571-fig5-data1-v1.xlsx
Figure 5—source data 2 Identification of motif sites within SC1 and SC2.: https://cdn.elifesciences.org/articles/69571/elife-69571-fig5-data2-v1.xlsx
Download elife-69571-fig5-data2-v1.xlsx
Figure 5—source data 3 Normalized per-base importance scores for SC1 and SC2 sequences.: https://cdn.elifesciences.org/articles/69571/elife-69571-fig5-data3-v1.xlsx
Download elife-69571-fig5-data3-v1.xlsx

The largest cluster, with 23 motif members, contained representation from all 11 models and had matches to known motifs including the motifs for Esrrg and Rora (Figure 5—source data 1). Consistent with an important role for Esrrg in PV+ neurons, Esrrg (a.k.a. Err3) transcript levels were differentially overexpressed in the PV+ neuron cluster relative the rest of the frontal cortex in snRNA-seq (DropViz subcluster #2–7 Neuron.Gad1Gad2.Pvalb Esrrg fold ratio = 8.0, p=1.14 × 10^–198) (Saunders et al., 2018). Esrrg and Rora are key TFs in the Pgc1a transcriptional program, which regulates Pvalb expression, mitochondrial function, and transmitter release (Lin et al., 2005; Lucas et al., 2010). Pgc1a signaling is restricted to PV+ neurons in the brain and may mediate the unique energy demands of fast-spiking neurons (Paul et al., 2017; Lucas et al., 2014).

The second-largest motif cluster contained 16 motifs, also representing all 11 models, and the motifs had the best matches to motifs for Mef2a, Mef2c, and Mef2d. A cluster of Lhx6-like motifs, a TF necessary for MGE interneuron differentiation from interneuron progenitors (Liodis et al., 2007; Vogt et al., 2014), was detected with high support from PV+ vs. PV- models and PV+ vs. EXC models, detected with low support from PV+ vs. VIP+ models, and not detected between MGE neuron subtypes PV+ vs. SST+. Interestingly, two clusters of motifs were dominated by PV+ vs. VIP+ activity, including matches for Stat6, Nkx28, and Cux2 motifs. Cux2 expression is induced by Lhx6 in the MGE, supporting a role in specification of the MGE interneuron lineage (including PV+ and SST+ neurons) from other interneuron lineages (Zhao et al., 2008). Overall, these findings indicate both shared and unique sequence properties dictating PV+ neuron-specific regulatory sequence activity relative to other cell types.

SC1 and SC2 represent two experimentally validated PV+ neuron-selective regulatory sequences. To interpret the sequence determinants of their success, we mapped potential motif sites for the 124 TF-MoDISco motifs (Figure 5—source data 2) and overlaid these with per-base importance scores for each of the SVMs (Figure 5—source data 3). This strategy revealed multiple high-importance subsequences with potential TF binding function. SC1 contained two Esrrg motifs near the sequence center that were high contributors to the PV+ neuron-specific model predictions and matched TF-MoDISco motifs for every model (Figure 5b). An additional subsequence with contributions specific to PV+ vs. VIP+ models matched motifs for Sp7. SC2 contained a highly important Mef2 sequence near the center (Figure 5c). Additionally, SC2 contained an Esrrg motif with shared importance across all models. Interestingly, the most important features of the SC2 sequence closely resemble those of successful PV+ probe E29 from Vormstein-Schneider et al., 2020 (Figure 1—figure supplement 6). Disruption of Esrrg and Mef2 motifs within SC1 or SC2 resulted in a sharply decreased prediction of PV+ specificity according to the scores across the SVMs (Figure 5d). While these impacts are untested in vivo, these analyses provide an intuition for potential nucleotide contributions to PV+ neuron-specific enhancer sequence function in SC1 and SC2.

Cross-species analyses with SNAIL

The utility of modern cell type-specific technologies often depends on transferability between species, particularly between rodents and primates. Machine learning with SNAIL has the potential to expedite cell-type isolation from multiple species by pre-selecting enhancers with sequence composition that is likely to drive cell type-specific expression across species. To investigate this, we built human PV+ SVMs analogous to the mouse SVMs using single-nucleus open chromatin data from the human cortex (Corces et al., 2020). These models learned to discriminate between the sequences of open chromatin peaks in human PV+ neurons relative to human open chromatin peaks in excitatory neurons, VIP+ neurons, SST+ neurons, or the combination of all PV- cells. Although the human data contained fewer cell profiles, models were able to reasonably perform the classification task, with auROC 0.73–0.83 and auPRC 0.68–0.81 (Figure 6a). Such models may be useful for predicting enhancer sequence activity in primates, regardless of the species origin of the enhancer sequence. We used these human-trained models to predict PV+ neuron-specific activity for 4428 mouse sequences from experimentally identified PV+ neuron-specific open chromatin peaks. Score outputs from human-trained models were correlated with score outputs from mouse-trained models (Figure 6b), suggesting shared PV+ neuron-specific enhancer sequence features between mouse and human PV+ neurons. SC1 was one of the highest-scoring enhancer candidates in both mouse and human SVMs.

Figure 6

Download asset Open asset

Extensions of Specific Nuclear-Anchored Independent Labeling (SNAIL) technologies in primates.

(a) Receiver operator characteristic and precision-recall performance metrics for parvalbumin-expressing (PV+) support vector machines (SVMs) derived from single-nucleus chromatin accessibility assays of human cortical tissue. The reported numbers are the areas under the curve for each model. (b) Comparison of mouse PV+ open chromatin sequences scored by mouse and human SVMs. Axes are the mean SVM scores among the 11 mouse SVMs or 4 human SVMs. (c) Images of SC1 AAV activation in the rhesus macaque cortex. (d) Quantification of SC1 and Pvalb antibody staining in the macaque cortex. Image sets near the center or peripheral of the injection site were quantified separately.

Because SC1 was predicted to retain PV+ neuron specificity in primates, we tested the SC1 SNAIL probe in the rhesus macaque cortex via localized intracranial AAV injection. We observed high expression of Sun1GFP with enrichment for PV+ neurons. There was a trade-off between precision and recall of PV+ neuron labeling related to the spatial viral load, with a mean precision of 22% near the injection site center and 67% at the periphery. At the injection center, at least 98% of PV+ neurons were transduced by the SC1 SNAIL probe, while at the periphery, at least 75% of PV+ neurons were transduced. Our preliminary findings are optimistic for PV+ SNAIL probe adoption in primates, but optimal AAV titer should be carefully considered for the specific application.

Discussion

OCR sequence features provide valuable, underutilized information for cell type-specific enhancer design. Here, we showed that sequence was sufficient to discern the directionality of highly differential OCR activity in different neuron subtypes in most cases. Interpretation of these models revealed rich diversity among the biochemical underpinnings of these classification tasks, reflective of cis-trans interactions. The defining sequence properties of cell type-specific OCR activation were robust throughout different data modalities, including ATAC-seq from sorted populations and snATAC-seq, and different classifier types.

Thus far, success in cell type-specific AAV creation has been limited to intensive screens of dozens of individual AAV enhancer candidates in which most do not produce highly selective expression (Vormstein-Schneider et al., 2020; Mich et al., 2021; Graybuck et al., 2021). In SNAIL, our framework for cell type-specific AAV engineering, we incorporate machine learning classifiers as an additional filter for improved enhancer pre-selection in order to mitigate the experimental burden. On a set of externally tested PV+ enhancer-driven AAVs (Vormstein-Schneider et al., 2020), the average PV+ specificity score across our classifiers was more predictive of PV+ specific AAV expression than the log2 fold difference of snATAC-seq signal, sequence conservation, or accessibility conservation at these loci. With the SNAIL framework, we identified and validated two novel enhancers that drive targeted expression in PV+ neurons in the mouse cortex. While these do not represent enough trials to establish a new conversion rate from cell type-specific OCRs to cell type-specific AAVs, we were encouraged by the immediate success of the first probes we selected. We believe that the incorporation of differential sequence property analyses will continue to improve the throughput of targeted AAV development in new contexts and make cell type-specific enhancer tool development accessible for more researchers.

An additional advantage of incorporating classifiers for cell type-specific enhancer selection is increased interpretability of the factors that govern success. The sequence patterns learned by PV+ models reflected known PV+ neuron biology. Common motifs contributing to successful PV+ probe enhancers included Esrrg, Mef2, and Lhx6, important for the specification and maintenance of the cortical PV+ interneuron lineage (Zhao et al., 2008; Mayer et al., 2018; Liodis et al., 2007). It is interesting to note the varying expression patterns of these TFs. In the cortex, Esrrg expression is mainly restricted to PV+ neurons and Lhx6 is restricted to MGE interneurons, but Mef2 transcripts are widely expressed in many inhibitory and excitatory neurons. It is likely that combinatorial interactions between abundant and cell type-specific TF binding events specify PV+ neuron enhancer activation. The SNAIL framework provides an opportunity to meaningfully leverage these complex sequence codes.

We found that a combination of multiple direct comparisons between the target cell type and other cell types made for particularly useful screening. Here, we used a tiered approach to ensure specific activity at multiple levels of cellular relationships to PV+ neurons. At the broadest level, we modeled PV+ neuron OCR sequences against PV- OCRs, a mixed signature from all other neuron and non-neuron cell types in the mouse cortex. Within neurons, we modeled PV+ vs. EXC neurons, and then PV+ relative to more specific subtypes of inhibitory neurons VIP+ and SST+. Successful SC1 and SC2 sequences contained attributes that made them highly PV+ specific across these comparisons.

SC1-Sun1GFP and SC2-Sun1GFP are new AAV technologies for PV+ neuron labeling and isolation in diverse systems. A unique feature of these viruses is the modified Sun1GFP tag that enables nuclei purification by magnetic beads coated with anti-GFP antibodies. This process is advantageous for isolating genomic and epigenomic signals from the population of interest with no dependence on transgenic strains. In comparison to single-nucleus sequencing technologies, affinity purification with SNAIL is more efficient for addressing targeted hypotheses about a specific cell type. SNAIL may also be paired with single-nucleus sequencing technologies for unprecedented resolution of the substructures within minority cell populations. We took advantage of SNAIL affinity purification to isolate SC1-Sun1GFP and SC2-Sun1GFP nuclei for molecular assessment with ATAC-seq. This represents a novel approach for validating new cell type-specific AAVs. We found that SC1 and SC2 PV+ SNAIL probes had high molecular agreement with cells tagged in the Pvalb-2A-Cre mouse strain, making them a reasonable alternative to transgenic strain technology. In addition to their success in the intended brain region (cortex), SC1 and SC2 PV+ SNAIL viruses also generalized to subcortical regions, the striatum, and GPe enough to isolate PV+ ATAC-seq signal. Moreover, the SC1 enhancer has sequence characteristics consistent with primate PV+-specific activation and has biased expression in PV+ neurons in the macaque brain. It will be interesting to conduct further comparisons of the composition of cell populations that are labeled by these tools in different brain regions and species. We speculate that the SNAIL framework could be strategically employed to design tools for specific subtypes of PV+ neurons or other cell populations that may be brain region or species specific.

In general, pairing cell type-specific enhancers with AAVs provides much more flexibility and scalability than transgenic technologies. However, there are drawbacks to certain applications. AAVs require time to reach peak expression, usually 2–4 weeks, although some may be robust earlier. This means they are not appropriate for developmental studies in very young animals. These and other developmental time points outside of the scope of the training data may have unintended effects on enhancer expression and specificity. Additionally, there are limitations to the transduction efficiency, so AAVs may not be ideal for studies where it is important to label all cells of a certain type. Finally, an enhancer’s activity in AAV may differ from the same enhancer’s activity in the host genome due to the surrounding genomic and epigenomic context. Enhancer activity may also fluctuate in response to different conditions because enhancers are dynamic actors in the regulation of gene expression. However, machine learning model-based prioritization of characteristic sequences may minimize this risk.

Excitingly, there are many opportunities for extensions of the SNAIL framework that enable cell type-specific interrogation in unprecedented settings. In a direct application that we began to explore in this work, SNAIL probes could be used to isolate cell types of interest in diverse species. High evolutionary conservation in the rules for enhancer activity has been demonstrated by the success of machine learning models in predicting enhancers across mammals (Minnoye et al., 2020; Kelley, 2020; Kaplow et al., 2020; Chen et al., 2018). Models trained across multiple species could further improve the transferability of probes across species. For example, a new approach that explicitly encourages the model not to learn signatures of species-specific enhancer activity might be especially promising for designing cross-species SNAIL probes (Cochran et al., 2021).

Beyond the potential ability for cross-species probe design, the SNAIL framework provides the opportunity to develop viral tools targeting previously unexplored cell types that are identifiable in snATAC-seq both within and across species. There is potential to divide subpopulations at multiple levels and design extremely specific technologies, including specialization for neuron populations within a specific brain region. Other applications may exploit changes in enhancer sequence activity in disease and other contexts to target specific cell states. In addition to creating cell type-specific enhancers to isolate cell types of interest, machine learning model-selected enhancer sequences may be used to drive the expression of a gene for cell type-specific circuit manipulation, as has been achieved with channelrhodopsin and DREADDS (Lee et al., 2010; Vormstein-Schneider et al., 2020). This approach could also be used to overexpress a particular ion channel, neurotransmitter receptor, gene variant, or guide RNA for a CRISPR-based gene manipulation strategy. We anticipate that SNAIL will provide a foundation for using sequence signatures underlying cell type-specific enhancer activity to develop tools for understanding cell type-specific function of a variety of cell types in diverse contexts.

Materials and methods

Key resources table

Reagent type (species) or resource	Designation	Source or reference	Identifiers	Additional information
Strain, strain background (Mus musculus)	Pvalb-2A-Cre	Jackson Laboratories	Stock # 012358	B6.Cg-Pvalb^{tm1.1(cre)Aibs}
Strain, strain background (M. musculus)	Ai14	Jackson Laboratories	Stock # 007914	B6.Cg-Gt(ROSA)26Sor^{tm14(CAG-tdTomato)Hze}/J
Recombinant DNA reagent	pAAV-Ef1a-DIO-Sun1GFP-WPRE-pA	Addgene	Plasmid #160141	cSNAIL vector from Lawler et al., 2020
Recombinant DNA reagent	pAAV-Ef1a-Sun1GFP-WPRE-pA	This paper		Ef1a vector (+ control)
Recombinant DNA reagent	pAAV-SC1-Sun1GFP-WPRE-pA	This paper		SC1 vector targeting PV+ neuron expression
Recombinant DNA reagent	pAAV-SC2-Sun1GFP-WPRE-pA	This paper		SC2 vector targeting PV+ neuron expression
Recombinant DNA reagent	pAAV-neg-Sun1GFP-WPRE-pA	This paper		Negative control vector
Antibody	Anti-Pvalb (rabbit polyclonal)	Swant	Cat# PV27	IF (1:750)
Antibody	Anti-Pvalb (mouse monoclonal)	Swant	Cat# 235	IF (1:1000)
Antibody	Anti-GFP (rabbit polyclonal)	Invitrogen	Cat# A-11122	IF (1:1000)
Antibody	Anti-GFP (rabbit monoclonal)	Invitrogen	Cat# G10362	Affinity purification (1:150)
Software, algorithm	ls-gkm	Lee, 2016	https://github.com/Dongwon-Lee/lsgkm	SVM training
Software, algorithm	Keras	https://keras.io		CNN training

Share this article

Cite this article

Classification of neuron subtype-specific enhancer activity from sequence.

Figure 1—source data 1

Figure 1—source data 2

Figure 1—source data 3

Figure 1—source data 4

Two sequence candidates selectively activate adeno-associated virus (AAV) expression in parvalbumin-expressing (PV+) neurons.

Figure 2—source data 1

Figure 2—source data 2

Cortical SC1 and SC2 Specific Nuclear-Anchored Independent Labeling (SNAIL)-isolated nuclei recapitulate parvalbumin-expressing (PV+) GABAergic interneuron ATAC-seq signatures.

Figure 3—source data 1

SC1 and SC2 generalize to parvalbumin-expressing (PV+) neurons in the striatum and external globus pallidus (GPe).

Figure 4—source data 1

Figure 4—source data 2

Figure 4—source data 3

Figure 4—source data 4

Figure 4—source data 5

Motif interpretation of parvalbumin-expressing (PV+) neuron-specific open chromatin region (OCR) activity.

Figure 5—source data 1

Figure 5—source data 2

Figure 5—source data 3

Extensions of Specific Nuclear-Anchored Independent Labeling (SNAIL) technologies in primates.

Author details

Alyssa J Lawler

Present address

Contribution

For correspondence

Competing interests

Easwaran Ramamurthy

Contribution

Competing interests

Ashley R Brown

Contribution

Competing interests

Naomi Shin

Contribution

Competing interests

Yeonju Kim

Contribution

Competing interests

Noelle Toong

Contribution

Competing interests

Irene M Kaplow

Contribution

Competing interests

Morgan Wirthlin

Contribution

Competing interests

Xiaoyu Zhang

Contribution

Competing interests

BaDoi N Phan

Contribution

Competing interests

Grant A Fox

Contribution

Competing interests

Kirsten Wade

Contribution

Competing interests

Jing He

Contribution

Competing interests

Bilge Esin Ozturk

Contribution

Competing interests

Leah C Byrne

Contribution

Competing interests

William R Stauffer

Contribution

Competing interests

Kenneth N Fish

Contribution

Competing interests

Andreas R Pfenning

Contribution

For correspondence