Machine learning sequence prioritization for cell type-specific enhancer design

  1. Alyssa J Lawler  Is a corresponding author
  2. Easwaran Ramamurthy
  3. Ashley R Brown
  4. Naomi Shin
  5. Yeonju Kim
  6. Noelle Toong
  7. Irene M Kaplow
  8. Morgan Wirthlin
  9. Xiaoyu Zhang
  10. BaDoi N Phan
  11. Grant A Fox
  12. Kirsten Wade
  13. Jing He
  14. Bilge Esin Ozturk
  15. Leah C Byrne
  16. William R Stauffer
  17. Kenneth N Fish
  18. Andreas R Pfenning  Is a corresponding author
  1. Computational Biology Department, School of Computer Science, Carnegie Mellon University, United States
  2. Biological Sciences Department, Mellon College of Science, Carnegie Mellon University, United States
  3. Neuroscience Institute, Carnegie Mellon University, United States
  4. Medical Scientist Training Program, University of Pittsburgh, United States
  5. Department of Psychiatry, Translational Neuroscience Program, University of Pittsburgh, United States
  6. Department of Neurobiology, University of Pittsburgh, United States
  7. Systems Neuroscience Center, Brain Institute, Center for Neuroscience, Center for the Neural Basis of Cognition, United States
  8. Department of Ophthalmology, University of Pittsburgh, United States
  9. Division of Experimental Retinal Therapies, Department of Clinical Sciences & Advanced Medicine, School of Veterinary Medicine, University of Pennsylvania, United States
  10. Department of Bioengineering, University of Pittsburgh, United States

Abstract

Recent discoveries of extreme cellular diversity in the brain warrant rapid development of technologies to access specific cell populations within heterogeneous tissue. Available approaches for engineering-targeted technologies for new neuron subtypes are low yield, involving intensive transgenic strain or virus screening. Here, we present Specific Nuclear-Anchored Independent Labeling (SNAIL), an improved virus-based strategy for cell labeling and nuclear isolation from heterogeneous tissue. SNAIL works by leveraging machine learning and other computational approaches to identify DNA sequence features that confer cell type-specific gene activation and then make a probe that drives an affinity purification-compatible reporter gene. As a proof of concept, we designed and validated two novel SNAIL probes that target parvalbumin-expressing (PV+) neurons. Nuclear isolation using SNAIL in wild-type mice is sufficient to capture characteristic open chromatin features of PV+ neurons in the cortex, striatum, and external globus pallidus. The SNAIL framework also has high utility for multispecies cell probe engineering; expression from a mouse PV+ SNAIL enhancer sequence was enriched in PV+ neurons of the macaque cortex. Expansion of this technology has broad applications in cell type-specific observation, manipulation, and therapeutics across species and disease models.

Editor's evaluation

This article describes an exciting new approach for tagging and isolation of unique neuronal subpopulations based on machine learning selection of cell-specific enhancer elements in the genome. The article highlights a specific test case of this technology with neurons expressing Parvalbumin, but this method could be applied to any neuronal or even non-neuronal cell type. The tools and overall approach described here will enable cell tagging in model organisms for which transgenic lines are not commonly available or even expression of other transgenes for control of cell function or genetic perturbation.

https://doi.org/10.7554/eLife.69571.sa0

Introduction

The biology of the brain is complicated by vast diversity in cell types, subtypes, and states. Contemporary advancements in single-cell sequencing have identified over a hundred molecularly distinct neuronal populations in the mammalian cortex (Lake et al., 2016; Saunders et al., 2018; Tasic et al., 2018; Hodge et al., 2019; Zeisel et al., 2015), including several small subpopulations of gamma aminobutyric acid (GABA)ergic neurons with specialized functions that are critical for the control of neuronal inhibition (Kepecs and Fishell, 2014; Lim et al., 2018). Understanding neurological function in health and disease from a cell type-specific perspective is critical to the progress of neuroscience.

Such endeavors necessitate cell type-specific technologies to identify, isolate, and manipulate discrete cell populations. Transgenic mouse strains targeting major inhibitory neuron subclasses, including parvalbumin-expressing (PV+), somatostatin-expressing (SST+), and serotonergic (5-HT+) neurons, are widely used today and have been instrumental toward our understanding of these cell types (Madisen et al., 2010; Taniguchi et al., 2011). More recently, many developers have turned toward scalable virus-based cell type-specific tools (Dimidschstein et al., 2016; Hrvatin et al., 2019; Nair et al., 2020; Vormstein-Schneider et al., 2020; Graybuck et al., 2021; Mich et al., 2021). Adeno-associated virus (AAV) technologies became particularly attractive with the invention of AAV variants that cross the blood–brain barrier to transduce the central nervous system, AAV-PHP.B, and AAV-PHP.eB (Chan et al., 2017; Deverman et al., 2016). In line with certain transgenic engineering, an emerging AAV targeting strategy is to incorporate cell type-specific enhancer elements into the viral genome to promote restricted expression. Enhancer activity can be extremely selective, even more so than the activity of most genes and their associated promoters (Hoffman et al., 2013; Kellis et al., 2014; Roadmap Epigenomics Consortium et al., 2015 2015). Thus, enhancers may be used to confer specificity for neuron subtypes that cannot be resolved by the expression of a single marker gene (Tasic et al., 2018) or for which the marker gene promoter is not specific on its own (Nathanson et al., 2009).

Despite the enthusiasm for enhancer sequences in cell type-specific AAV development, their selection remains challenging. ATAC-seq (Buenrostro et al., 2013) is an advantageous technique for defining potential cell type-specific enhancer regions because of its high nucleotide resolution and its compatibility with small cell populations and even with single-cell technologies (Buenrostro et al., 2015b; Cusanovich et al., 2015). Though convenient, chromatin accessibility is a noisy approximation for enhancer activity and differentially accessible sequence elements often fail to produce selective expression in vivo. A parallel screening approach involving single-nucleus sequencing of barcoded enhancer libraries, PESCA, was proposed to speed up the selection process toward a successful enhancer-driven virus (Hrvatin et al., 2019). Other approaches leveraged marker gene proximity and conservation across species for enhancer prioritization (Vormstein-Schneider et al., 2020). These efforts highlight the low conversion rate from experimentally suggested cell type-specific open chromatin regions (OCRs) to desired cell type-specific activity. We hypothesized that machine learning models could be applied as additional in silico filters to reduce the burden of experimental screening in cell type-specific AAV development while maintaining the convenience of ATAC-seq as the sole input data requirement.

The nucleotide sequence code that links transcription factor (TF) binding sites and other DNA features to enhancer activity is underutilized in AAV enhancer design, perhaps due to its complexity (Jindal and Farley, 2021). We reasoned that machine learning classifiers could be leveraged to identify the most characteristic and specific enhancer sequence patterns for a cell population, enabling efficient prioritization of sequences that are likely to drive selective expression. Convolutional neural networks (CNNs) (Le Cun et al., 1989) and support vector machines (SVMs), for example, have achieved state-of-the-art performance on predicting enhancer activity from sequence (Arvey et al., 2012; Ghandi et al., 2014; Zhou and Troyanskaya, 2015). Here, we present a framework for machine learning-assisted engineering of cell type-specific AAVs, which we refer to as Specific Nuclear-Anchored Independent Labeling (SNAIL). Similar to our previously described Cre-activated AAV technology cSNAIL (Lawler et al., 2020), SNAIL probes have the unique advantage of an affinity purification-compatible fluorescent reporter that can be used to isolate rare cell types (Mo et al., 2015; Deal and Henikoff, 2010; Lawler et al., 2020). Unlike cSNAIL, which relies on Cre activation, SNAIL probes are stand-alone AAVs driven by cell type-specific enhancer sequences selected through machine learning models. Thus, SNAIL probes may be used in a wide variety of systems, including wild-type mice, primates, and other species.

We used our framework to create two novel AAV probes for PV+ neurons. In the mouse primary motor cortex, PV+ SNAIL probes labeled PV+ neurons with >70% specificity when compared to Pvalb antibody staining. PV+ SNAIL probes in this context were more specific to GABAergic PV+ neurons than the common Pvalb-2A-Cre/Ai14 mouse strain. In addition, isolated populations of tagged cells from the cortex, striatum, and external globus pallidus (GPe) were each heavily enriched for known PV+ open chromatin signatures. Nucleotide-resolution model interpretation highlighted a collection of 14 TF binding motif families contributing to PV+ neuron-specific enhancer activation. Machine learning can aid in cross-species probe design as well; at least one SNAIL probe showed PV+ neuron selectivity in both mouse and macaque. These results demonstrate concrete utility in sequence-level information for AAV enhancer selection, setting the stage for efficient probe design for a wide range of cell types.

Results

SVMs discriminate known cell type-specific regulatory sequences

We built machine learning classifiers, primarily SVMs, to discriminate sequences of differential OCRs between two cell populations. These models take 500 bp candidate enhancer DNA sequence strings as input and they output, for each sequence, the cell type in which that candidate enhancer is active and an associated score. The similarity between sequences is determined based on gapped k-mer count vectors, that is, the number of occurrences of all short subsequences of length k, tolerating some gaps or mismatches, as implemented by LS-GKM (Ghandi et al., 2014; Lee, 2016). Where there are sufficient reliable sequence features associated with differential enhancer activation between two cell types, the model should learn these principles during the training phase and then be able to apply these principles to determine the cell type-specific activities of new sequences. We imposed upfront that training sequences have a minimum fold difference in chromatin accessibility between the two cell types used to define the classes. This should help the model learn cell type-specific features of enhancer activation rather than general enhancer features. We chose this strategy because it closely aligned with our goal of prioritizing sequences that would activate enhancers in one cell type and not others.

To evaluate this strategy, we first built SVMs comparing select broad classes of cell types in the brain using publicly available single-nucleus (sn)ATAC-seq data from the mouse motor cortex (Mop) (Li et al., 2021; Figure 1—figure supplement 1). These were (i) a neuron vs. astrocyte classifier and (ii) an excitatory neuron vs. inhibitory neuron classifier. To assess model performance, we used standard classifier metrics, the area under the receiver operator curve (auROC), and the area under the precision-recall curve (auPRC). These scores quantify model performance by comparing the predicted class to the actual class, where a randomly guessing binary classifier would have an auROC score of 0.5 and an auPRC score equal to the fraction of actual positives in the data, and a perfect classifier would have a maximum auROC score of 1.0 and auPRC score of 1.0. Both models performed well on held-out data, achieving auROCs of 0.95 and 0.93 and auPRCs of 0.98 and 0.82 (Figure 1—figure supplement 2).

Although the models were built for enhancer sequences, they recapitulated known cell type-specific activation patterns of some commonly used AAV promoter sequences: Gfap, Camk2a, and Dlx (Figure 1—figure supplement 2). The Gfap promoter sequence, which has a heavy astrocyte bias in vivo, was classified as astrocyte-specific in our neuron vs. astrocyte model (5580 astrocyte-specific enhancers evaluated; Gfap ranks 3298/5580). The same neuron vs. astrocyte model scored the Camk2a promoter and Dlx promoter sequences, which are known to have neuron-specific activity, as neuron-specific (14,347 neuron-specific enhancers evaluated; Camk2a = rank 13,300/14,347, Dlx = rank 9736/14,347). Also consistent with empirical expectations, the excitatory vs. inhibitory neuron model predicted the Camk2a sequence to have an excitatory neuron preference and the Dlx sequence to have an inhibitory neuron preference (15,391 excitatory neuron-specific enhancers evaluated; Camk2a = rank 11,541; 4,608 inhibitory neuron-specific enhancers tested; Dlx = rank 1142/4608) (Figure 1—figure supplement 2). Therefore, this classification strategy is capable of correctly predicting cell type-specific regulatory sequence activity in the viral context for these very distinct cell classes.

Machine learning models accurately predict PV+ neuron-specific open chromatin from sequence

Next, we assessed whether the same modeling strategy could be applied to more narrowly defined neuron subtypes, using PV+ neurons as a target. Again, we trained binary classifiers on sequences underlying differential ATAC-seq peaks between two cell populations (PV+ neurons and another population). To maximize sequence signal detection, we built several models comparing PV+ neurons to either PV- cells generally, excitatory neurons, VIP+ neurons, or SST+ neurons. In addition, we varied the input data source to be either ‘population-derived’ (from bulk ATAC-seq sequencing on purified nuclei populations) or ‘sn-derived’ (from single-nucleus open chromatin assays such as droplet-based snATAC-seq). Finally, we implemented multiple modeling strategies, including linear gapped k-mer SVMs, nonlinear gapped k-mer SVMs using the radial basis function (rbf), and CNNs. The ratios of positive and negative examples during training ranged from 1:0.4 to 1:3.7. Detailed information about all model parameters and training, validation, and test data is available in Figure 1—source data 1 (SVMs) or Figure 1—source data 2 (CNNs).

First, to define potential PV+ neuron and PV- cell enhancer sequences in the mouse cortex in a population data-driven manner, we conducted ATAC-seq on the PV+ and PV- nuclei of Pvalb-2A-Cre mice. We isolated the nuclei populations using previously described Cre-dependent AAV affinity purification technology, cSNAIL (Lawler et al., 2020). cSNAIL probes activate an isolatable nuclear envelope tag in the presence of Cre recombinase protein. Therefore, purified populations from these mice are a direct reflection of cells labeled by the Pvalb-2A-Cre mouse strain, a current standard for PV+ neuron labeling. Population-derived PV+ vs. PV- SVMs could predict the correct classification on held-out data with high accuracy (test set auROC = 0.87–0.88, auPRC = 0.95–0.96; gray lines Figure 1b and c), indicating that there were substantial sequence pattern differences between the PV+ and PV- classes and that the models were able to learn these differences.

Figure 1 with 6 supplements see all
Classification of neuron subtype-specific enhancer activity from sequence.

(a) Schematic representation of the Specific Nuclear-Anchored Independent Labeling (SNAIL) workflow. (b–e) Receiver operator characteristic and precision-recall performance metrics for various cell type-specific enhancer sequence model strategies and data modalities. The reported numbers are the areas under the curves for each model. (f) Scatter plots for support vector machine (SVM) scores reported by equivalent population-derived models and single-nucleus-derived models. ***p-Value of Pearson correlation <0.001.

Figure 1—source data 1

Support vector machine (SVM) parameter tuning and performance evaluations.

https://cdn.elifesciences.org/articles/69571/elife-69571-fig1-data1-v1.xlsx
Figure 1—source data 2

Convolutional neural network (CNN) parameter tuning and performance evaluations.

https://cdn.elifesciences.org/articles/69571/elife-69571-fig1-data2-v1.xlsx
Figure 1—source data 3

Model scores on externally screened parvalbumin-expressing (PV+) enhancer candidates.

https://cdn.elifesciences.org/articles/69571/elife-69571-fig1-data3-v1.xlsx
Figure 1—source data 4

Identification of motif sites within external parvalbumin-expressing (PV+) neuron enhancer screen, E1–E34.

https://cdn.elifesciences.org/articles/69571/elife-69571-fig1-data4-v1.xlsx

We then considered the possibility that the PV+ vs. PV- models were learning features of general neuron vs. glia enhancer sequence properties and not necessarily features specific to PV+ neurons as the PV- data contained a high proportion of glial cells, a developmental outgroup to neurons. To address this issue, we trained additional population-derived SVMs that directly discriminated between enhancer sequences of PV+ neurons and other neuron subtypes using publicly available ATAC-seq data from INTACT-sorted excitatory (EXC) neurons and VIP+ neurons (Mo et al., 2015). These models performed well (auROC = 0.89–0.93, auPRC = 0.79–0.92; Figure 1b and c), demonstrating that, even at the level of neuron subtypes, OCR sequence information is rich enough to reliably distinguish cell type-specific activity.

To survey an additional machine learning strategy, we built CNN classifiers from the same underlying data (Figure 1—figure supplement 3). CNNs are a type of artificial neural network defined by multiple convolutional layers. The CNNs were trained to take in 1000 bp DNA sequence strings with different accessibility between two cell types, automatically extract predictive sequence features, and output a cell class probability between 0 and 1 (see Materials and methods for details). Compared with SVMs, CNNs are better equipped to learn higher-order interactions between sequence features due to CNNs’ capacity for flexible feature representation and automated feature selection (Le Cun et al., 1989). The CNNs were highly accurate (auROC = 0.95–0.96, auPRC = 0.76–0.97; Figure 1d). Thus, CNNs represent an additional successful approach to discriminate OCR sequence differences between purified neuron populations.

While ATAC-seq from purified cell populations is advantageous for its depth and recovers many examples of differentially accessible reads between neuron subtypes, many neuron populations of interest are not yet isolatable, even through transgenic means. Single-nucleus sequencing technologies can be applied to measure neuron subtype-resolution open chromatin without cell sorting by performing several parallel microreactions that introduce unique cell barcodes into ATAC-seq sequencing reads and then clustering the cells with similar chromatin profiles. We explored whether cell type-specific enhancer sequences derived from mouse motor cortex snATAC-seq (Li et al., 2021) were sufficient to produce neuron subtype-level classifiers. We trained five pairwise linear gapped k-mer SVMs to discriminate differential open chromatin sequences from snATAC-seq clusters or groups of clusters. These included analogous models to the population-derived models comparing PV+ vs. PV-, PV+ vs. EXC, and PV+ vs. VIP+. In this case, the single-nucleus-derived PV+ vs. PV- model refers to a model trained on differential OCR sequences comparing PV+ cluster nuclei to all other nuclei with a random sampling probability. The PV+ vs. k-nearest-neighbor (KNN) model is an additional variation on the PV+ vs. PV- model where the PV- nuclei sampling for differential OCR analysis was selected for similarity to the PV+ cluster as implemented in SnapATAC (Fang et al., 2021). We also produced a model comparing PV+ vs. SST+ neurons, the most similar subtype to PV+. Single-nucleus-derived SVMs were also able to classify cell type-specific enhancer sequences with high accuracy (auROC = 0.89–0.93, auPRC = 0.72–0.92; Figure 1e).

Moreover, models built independently from different data sources identified similar sequence contributions for equivalent tasks. When scoring the population-derived sequences through both the population-derived SVMs and the single-nucleus-derived SVMs, individual sequences had highly similar scores (Figure 1f). These findings highlight the prevalence of reliable cell type-specific enhancer sequence signatures that can be defined by a variety of classifier types and sources of open chromatin measurements. The parameter and performance details of all models can be found in Figure 1—source data 1 (SVMs) and Figure 1—source data 2 (CNNs).

Models learn biological signatures relevant for AAV probe design

To ensure that the neuron subtype-level models were identifying signatures that were relevant for the specific purpose of creating selective PV+ neuron viruses, we evaluated model predictions on 34 externally validated successful and unsuccessful PV+ probe enhancer candidates from Vormstein-Schneider et al., 2020, named E1–E34. Importantly, the enhancer sequence from the probe with the lowest PV+ specificity (E4; 14% specificity) received a negative score from every model, and two of the enhancer sequences with highest cortical PV+ specificity (E22 and E29; 94% specificity) received high positive scores from every model (Figure 1—source data 3).

We binarized PV+ enhancer AAV sequences E1–E34 into either the high-specificity (>70%) or low-specificity (<70%) group and assessed the relationships between these groups and ATAC-seq log2 fold difference or SVM score (Figure 1—figure supplement 5). These groups had little variation in mammalian nucleotide sequence conservation levels measured by PhyloP, but many of the sequences with highly confidently predicted PV+-specific accessibility had PV+ accessibility in humans (Figure 1—figure supplement 5b). On the complete set of enhancers, SVM score was predictive of the PV+ specificity group (p=0.008), and differential activity log2 fold difference had a weak association with PV+ specificity group (p=0.069) (Figure 1—figure supplement 5c). Much of the log2 fold difference association with group PV+ specificity was driven by enhancer sequences with low log2 fold difference and low PV+ specificity. Undesirably, there were some sequences with high log2 fold difference and low in vivo specificity. Within the subset of enhancers with high log2 fold difference (log2 fold difference > 1), the log2 fold difference was not associated with specificity group (p=0.601), while SVM score was weakly associated (p=0.057) (Figure 1—figure supplement 5d). Unlike log2 fold difference scores alone, SVM scores may limit false-positive candidates and improve efficient enhancer AAV selection.

This result emphasizes the benefit of enhancer pre-selection with machine learning, which could reduce in vivo screening efforts by identifying the best PV+ enhancer sequences before experimentation. The models predicted which PV+ enhancer sequence candidates were likely to be cell type-specific expression drivers. They also carry information about precisely which subsequences were responsible for the predicted PV+ neuron-specific activation. Sequence E29, within the Inpp5j locus, was predicted to have PV+ neuron-specific activity due to a central Mef2 motif site and nearby Esrrg motif site, among others (Figure 1—figure supplement 6, Figure 1—source data 4). Sequence E22, within the Tmem132c locus, was predicted to have PV+ specificity in part due to Nkx28 and Lhx6 motif sites (Figure 1—figure supplement 6, Figure 1—source data 4). Nevertheless, none of these enhancers were our highest predicted PV+ neuron sequences, so we continued to investigate additional enhancer candidates genome-wide for PV+ SNAIL probe implementation.

Two candidate PV+ SNAIL probes successfully target PV+ neurons in the mouse cortex

Based on the predictions of all PV+ enhancer models on our candidates, we prioritized two highly characteristic PV+ neuron enhancer sequences and tested their ability to drive targeted expression in vivo (Figure 2). We refer to these sequence candidates as SC1 and SC2. Among true PV+ neuron-specific enhancer sequences that (i) were differential OCRs in PV+ vs. PV-, PV+ vs. EXC, and PV+ vs. VIP+ sorted population data and (ii) scored PV+ positive across all SVM evaluations (1755 sequences), SC1 was the highest predicted sequence candidate, while SC2 was in the 90th percentile (Figure 2b, Figure 2—source data 1). We chose these sequences to represent two different confidence levels and evaluate the general potential of the top 10% of machine learning-prioritized PV+ neuron-specific enhancers.

Two sequence candidates selectively activate adeno-associated virus (AAV) expression in parvalbumin-expressing (PV+) neurons.

(a) Genome browser visualization of PV+-specific ATAC-seq signal at sequence candidates SC1 and SC2. * cSNAIL data, † INTACT data from Mo et al., 2015, ‡ snATAC-seq from Li et al., 2021. (b) Percentile rank of support vector machine (SVM) scores among 1755 true PV+-specific enhancer sequence candidates that scored positively across all models. Linear population-derived models are denoted with ‘pop,’ nonlinear population-derived models are denoted with ‘pop, rbf,’ and linear single-nucleus-derived models are denoted with ‘sn.’ (c) Example images of AAV Sun1GFP expression against parvalbumin (Pvalb) antibody staining. (d, e) Quantification of AAV Sun1GFP or Pvalb-2A-Cre/Ai14 reporter overlap with Pvalb+ cells. Bar heights represent the mean among images, and the error of the mean is shown. N cells = 1322 (SC1), 2570 (SC2), 1340 (Ai14), 2013 (Ef1a), and 504 (N.C.). N.C., negative control.

Figure 2—source data 1

Detailed information for 1755 experimental parvalbumin (PV)-enriched open chromatin regions (OCRs) with positive parvalbumin-expressing (PV+) model scores.

Genomic coordinates, peak log2 fold difference, adjusted p-value, support vector machine (SVM) scores, convolutional neural network (CNN) scores, snATAC-seq overlap, species conservation, and subcortical SVM scores are provided.

https://cdn.elifesciences.org/articles/69571/elife-69571-fig2-data1-v1.xlsx
Figure 2—source data 2

Image quantification for Specific Nuclear-Anchored Independent Labeling (SNAIL) viruses or the Pvalb-2A-Cre mouse strain reporter with Pvalb immunohistochemistry.

https://cdn.elifesciences.org/articles/69571/elife-69571-fig2-data2-v1.xlsx

SC1 and SC2 sequences were cloned into separate vectors upstream of the cSNAIL reporter gene, Sun1GFP. To minimize off-target effects, PV+ SNAIL probes directly rely on transcriptional activation from SC1 or SC2, without a minimal promoter (see Materials and methods). We also prepared two control vectors: a negative control that was the identical vector but with no enhancer or promoter and a nonspecific control that was the identical vector but with a common Ef1a promoter sequence in place of the candidate sequence. When packaged with AAV-PHP.eB and delivered to the mouse through systemic injection, the SC1-Sun1GFP and SC2-Sun1GFP constructs promoted cortical fluorescence that was restricted to PV+ neurons, while the Ef1a virus did not (Figure 2c–e, Figure 2—source data 2). We quantified images with consistent positioning in the primary motor cortex, with representation from all layers. Compared with immunohistochemistry-labeled Pvalb protein, SC1 and SC2-mediated expression of Sun1GFP was restricted to Pvalb+ neurons in ~70–74% of cases. This was an 11-fold enrichment in precision over the Ef1a promoter and notably an almost twofold enrichment over Cre reporter labeling in Pvalb-2A-Cre/Ai14 double transgenic mice (Figure 2c–e, Figure 2—source data 2). We expect these to be conservative estimates of PV+ targeting due to incomplete antibody capture. On average, Sun1GFP expression from SC1 and SC2 SNAIL probes labeled ~71–73% of Pvalb+ neurons. The rate is limited by the transduction properties of the AAV-PHP.eB capsid, which is expected to transduce 55–70% of neurons in the cortex (Chan et al., 2017). The negative control virus was injected at as high of a concentration as possible (14× the concentration of SC1, SC2, and Ef1a injections) to detect any biases in spurious background expression. At this titer, a number of cells exhibited GFP expression (Figure 2—source data 2), but these did not appear biased toward PV+ neurons (Figure 2d).

Isolation of PV+ SNAIL-labeled nuclei captures PV+ cortical interneurons

Expression of Sun1GFP differentiates SNAIL probes from other cell type-specific AAV technology. The stable nuclear envelope association of this tag enables affinity purification using magnetic beads coated with anti-GFP antibodies, which is advantageous for rare population isolation and downstream epigenetic assays. In many contexts, purification of a cell population is more efficient than single-nucleus sequencing technologies, especially if the population of interest is in low proportion or the desired downstream applications are not available in single-nucleus approaches. Taking advantage of this property, we isolated Sun1GFP-expressing nuclei induced by SC1-Sun1GFP, SC2-Sun1GFP, or Ef1a-Sun1GFP SNAIL virus from the mouse cortex and performed ATAC-seq. Through comparison with known PV+ neuron ATAC-seq (via cSNAIL in the Pvalb-2A-Cre strain) and PV- or bulk ATAC-seq, including cSNAIL PV- cell fractions and Ef1a virus signatures, we determined that both SC1-Sun1GFP and SC2-Sun1GFP cells were highly enriched for PV+ neurons.

The first principal component, accounting for 84% of the total variance, separated known PV+ neuron samples from PV- and bulk tissue samples. Likewise, SC1-Sun1GFP and SC2-Sun1GFP samples grouped with the PV+ samples while Ef1a-Sun1GFP samples grouped with the PV- and bulk sample signatures (Figure 3a). At the Pvalb locus, there were highly reproducible OCR signals between PV+ cSNAIL, PV+ snATAC-seq, SC1-Sun1GFP, and SC2-Sun1GFP samples that did not appear in bulk tissue, PV-, or Ef1a-Sun1GFP samples (Figure 3b).

Cortical SC1 and SC2 Specific Nuclear-Anchored Independent Labeling (SNAIL)-isolated nuclei recapitulate parvalbumin-expressing (PV+) GABAergic interneuron ATAC-seq signatures.

(a) Principal component analysis (PCA) of ATAC-seq counts across samples. (b) Genome browser visualization of ATAC-seq signal at the Pvalb gene locus. Tracks represent the pooled sample p-value signal. Each track of similar data type is normalized to the same scale: SNAIL data range 0–335, *cSNAIL data range 0–93, †INTACT data range 0–200, ‡snATAC-seq data range 0–2. (c) Scatter plots of ATAC-seq log2 fold difference relative to bulk tissue ATAC-seq, comparing PV+ cSNAIL to other adeno-associated viruses (AAVs). The density of overlapping points is shown by the plot color. (d) snATAC-seq nuclei clusters as visualized by t-SNE. The dendrograms show hierarchical clustering of Euclidean sample distances by Ward’s minimum variance method D2. The heatmap shows the percentage of population open chromatin regions (OCRs) enriched relative to bulk that are also cluster-specific marker OCRs. *Hypergeometric enrichment p<0.01.

Figure 3—source data 1

Differential open chromatin region (OCR) statistics for parvalbumin-expressing (PV+) Specific Nuclear-Anchored Independent Labeling (SNAIL)-isolated cortical ATAC-seq relative to bulk tissue cortical ATAC-seq.

https://cdn.elifesciences.org/articles/69571/elife-69571-fig3-data1-v1.xlsx

A major goal for PV+ SNAIL probes was to replace transgenic mouse strain technologies in certain contexts. Ideally then, ATAC-seq from Sun1GFP-sorted cells from SNAIL probes in wild-type mice should provide similar information as ATAC-seq from Sun1GFP-sorted cells from cSNAIL in Pvalb-2A-Cre transgenic mice. Therefore, we defined PV+ cSNAIL ATAC-seq log2FoldDifference over bulk cortical tissue ATAC-seq as a gold standard for each OCR. For SC1 and SC2, we computed the correlations between the log2FoldDifference of OCR signal relative to bulk tissue and the log2FoldDifference of OCR signal in PV+ cSNAIL relative to bulk tissue. To establish an upper limit for correlation, we compared two different batches of cortical PV+ cSNAIL samples, which had a Pearson correlation of 0.86 and a Spearman correlation of 0.85. As a lower limit, we evaluated the nonspecific Ef1a control virus, which had a Pearson correlation of 0.38 and a Spearman correlation of 0.26. Because the AAV-PHP.eB capsid has a neuron bias, these lowly correlated signatures are likely to be general neuron specifications shared among PV+ and other neurons. Within this range, SC1 and SC2 had very high correlation with cSNAIL, with SC1 achieving almost equivalent correlation as the two cSNAIL batches (SC1 Pearson = 0.85 and Spearman = 0.84; SC2 Pearson = 0.81 and Spearman = 0.79) (Figure 3c). The details for differential OCRs in each virus relative to bulk tissue can be found in Figure 3—source data 1.

Finally, we compared SC1-Sun1GFP+ and SC2-Sun1GFP+ cell open chromatin signatures to those of snATAC-seq clusters from the mouse motor cortex (Figure 3d; Li et al., 2021). We defined cluster-specific OCRs for each snATAC-seq cluster and population-enriched OCRs for SNAIL-isolated cells relative to bulk tissue (see Materials and methods) and assessed the overlaps. We found that cSNAIL-isolated PV+ OCRs, SC1-isolated OCRs, and SC2-isolated OCRs were each significantly enriched for PV+ cluster-specific markers (34–47% overlap, hypergeometric p=0), while OCRs from Ef1a-isolated cells were not enriched for PV+ cluster-specific markers (4% overlap, p=1). Ef1a OCRs instead had the highest enrichment for markers of a layer 4 excitatory neuron cluster (25% overlap, p=5.3 × 10–5). We also note that cSNAIL PV+ ATAC-seq had an additional 8% overlap with excitatory cluster L5 PT markers (p=2.5 × 10–45), possibly reflective of Pvalb-2A-Cre line labeling in layer 5 parvalbumin-expressing excitatory neurons (Tanahira et al., 2009; Jinno and Kosaka, 2004; Roccaro-Waldmeyer et al., 2018). These OCRs were absent in SC1- and SC2-isolated cells. In fact, SC1 and SC2 had no enrichment for cluster-specific OCRs of any cluster other than PV+ (≤2% overlap, p>0.1), including the closely related SST+ neuron population. This suggests that SC1 and SC2 SNAIL probes target a stricter subset of the cells than the Pvalb-2A-Cre mouse strain, likely restricted to PV+inhibitory interneurons.

Chromatin accessibility differences between PV+ neurons in different brain regions

SC1 and SC2 SNAIL probes were designed based on the sequence properties of cortical PV+ neurons. Many PV+ neurons throughout the brain have a common developmental origin in the medial ganglionic eminence (MGE), but there are substantial OCR differences between mature PV+ neuron populations in different brain regions. From cSNAIL-isolated PV+ populations in Pvalb-2A-Cre mice (Lawler et al., 2020), we characterized thousands of OCRs with differential accessibility between the cortex, striatum, and GPe (padj<0.01, |log2FoldDifference| > 1) (Figure 4a, Figure 4—source data 1). These differences were associated with distinct TF binding motifs (Figure 4b, Figure 4—source data 2). For example, OCRs that were more accessible in cortical PV+ neurons relative to striatal and GPe PV+ had enrichment for Mef2 family motifs, which are important in neuroplasticity (Donato et al., 2015), may play a role in specifying or maintaining the MGE PV+ neuron lineage in mouse and human (Mayer et al., 2018), and have been linked to neurodevelopmental disorders involving the prefrontal cortex (Mitchell et al., 2018). TFs with motifs enriched in PV+ neuron OCRs that are more open in striatum relative to cortex and GPe included Tgif1, a key homeodomain gene involved in holoprosencephaly (Taniguchi et al., 2012). At 6654 differential OCRs, GPe-specific PV+ OCRs were the most unique and had TF motif enrichments, including the Lhx3, Pou5f1, Esrrg, and Pax3 motifs.

Figure 4 with 2 supplements see all
SC1 and SC2 generalize to parvalbumin-expressing (PV+) neurons in the striatum and external globus pallidus (GPe).

(a) Numbers of differential open chromatin regions (OCRs) between PV+ neuron populations in three brain regions (DESeq2 padj<0.01 and |log2FoldDifference| > 1). Brain region-specific OCRs are those that were significantly enriched in that tissue relative to each of the other two tissues. OCRs shared between two brain regions on the Venn diagram are those that were significantly enriched in each of those tissues relative to the excluded tissue. The shared center of the Venn diagram shows all remaining OCRs that have ambiguous or no tissue preference. (b) Examples of enriched motifs in brain region-specific PV+ open chromatin relative to all PV+ open chromatin. (c, f) Distributions of validation data support vector machine (SVM) scores and SC1 and SC2 scores within striatum and GPe PV+ vs. PV- models. (d, g) Principal component analysis (PCA) visualization of ATAC-seq counts in each sample. (e, h) Pearson correlation coefficients when comparing the log2 fold difference of cSNAIL PV+ ATAC-seq relative to bulk tissue ATAC-seq and the log2 fold difference of SNAIL ATAC-seq relative to bulk tissue ATAC-seq. Error bars show the 95% confidence intervals.

Figure 4—source data 1

Differential open chromatin region (OCR) statistics for tissue-specific parvalbumin-expressing (PV+) neuron OCRs in cortex, striatum, and external globus pallidus (GPe).

https://cdn.elifesciences.org/articles/69571/elife-69571-fig4-data1-v1.xlsx
Figure 4—source data 2

Motif enrichments among tissue-specific parvalbumin-expressing (PV+) neuron open chromatin region (OCR) sequences.

https://cdn.elifesciences.org/articles/69571/elife-69571-fig4-data2-v1.xlsx
Figure 4—source data 3

Pathway enrichments among tissue-specific parvalbumin-expressing (PV+) neuron open chromatin region (OCR) sequences.

https://cdn.elifesciences.org/articles/69571/elife-69571-fig4-data3-v1.xlsx
Figure 4—source data 4

Differential open chromatin region (OCR) statistics for parvalbumin-expressing (PV+) Specific Nuclear-Anchored Independent Labeling (SNAIL)-isolated striatal ATAC-seq relative to bulk tissue striatal ATAC-seq.

https://cdn.elifesciences.org/articles/69571/elife-69571-fig4-data4-v1.xlsx
Figure 4—source data 5

Differential open chromatin region (OCR) statistics for parvalbumin-expressing (PV+) Specific Nuclear-Anchored Independent Labeling (SNAIL)-isolated external globus pallidus (GPe) ATAC-seq relative to PV- GPe ATAC-seq.

https://cdn.elifesciences.org/articles/69571/elife-69571-fig4-data5-v1.xlsx

These molecular differences likely relate to functional differences, such as the tendency of PV+ cells in the GPe to project to other brain regions vs. the local nature of PV+ cells in the cortex (Hernández et al., 2015; Saunders et al., 2016). We used GREAT, which associates OCRs to one or more genes using proximity-based gene regulatory domains and then performs pathway enrichment analysis for those genes, to assess Gene Ontology enrichments in the brain region-specific PV+ ATAC-seq OCR sets relative to all PV+ OCRs (Figure 4—source data 3; McLean et al., 2010). The set of PV+ OCRs enriched in cortical PV+ neurons included 10 regions associated with the Bdnf gene (Ensembl Genes; false discovery rate [FDR] Q = 0.0035). Bdnf is generally expressed in excitatory forebrain neurons but not PV+ interneurons. The presence of PV+ neuron OCRs near Bdnf could represent genomic regions with nonenhancer functions, enhancers that regulate another gene, the binding regions of repressive TFs, or trace contamination from excitatory populations. Other cortex-specific PV+ enrichments included terms related to sensory perception, especially smell. Striatum-specific PV+ neuron OCRs were enriched for the adenylate cyclase-inhibiting dopamine receptor signaling pathway (GO:BP; FDR Q = 0.010) and bradykinesia (Mouse Phenotype; FDR Q = 0.046). OCRs preferentially open in GPe PV+ neurons were enriched for neuropeptide signaling pathways, for example, acetylcholine receptor binding (GO:MF; FDR Q = 0.0044) and neuropeptide receptor activity (GO:MF; FDR Q = 1.2 × 10–5). This suggests unique epigenetic mechanisms for the regulation of transcription related to receptor signaling in GPe PV+ neurons, but further work is needed to discern these relationships.

PV+ SNAIL probes generalize to subcortical brain regions in the mouse

Given these complexities, we were interested in the extent to which PV+ enhancer probes chosen from data in one tissue could generalize to other brain regions. Here, we assessed whether SC1 and SC2 SNAIL probes, designed in the cortex, were also selective for PV+ neurons in the striatum and GPe. First, we used cSNAIL ATAC-seq data from the striatum and GPe to model the regulatory sequence properties of PV+ neurons vs. PV- cells in these brain regions, using the population-derived linear SVM strategy (Figure 4—figure supplement 1). We assessed the correlation between the score outputs of the cortex PV+ vs. PV- SVM with the striatum PV+ vs. PV- or GPe PV+ vs. PV- SVMs for our set of experimentally identified mouse cortical PV+ neuron enhancers sequences. Enhancer scores were well correlated between the cortex and striatum models (Pearson = 0.73, Spearman = 0.72), indicating shared regulatory sequence determinants between PV+ neurons in the cortex and striatum. There was a low correlation between enhancer scores of the cortex and GPe models (Pearson = 0.25, Spearman = 0.24), indicating differences in the learned PV+ regulatory sequence properties in these regions. SC1 and SC2 sequences were predicted to have PV+-specific activation in both the striatum and GPe (Figure 4c and f). However, there were between 1000 and 3000 sequences with more confident PV+ scores than SC1 and SC2 in the striatum and GPe.

We proceeded to isolate SC1- and SC2-labeled cells from these tissues in wild-type mice using Sun1GFP affinity purification and performed ATAC-seq on the tagged populations. We have previously shown high agreement between cSNAIL and Pvalb-2A-Cre labeling in the striatum and GPe (Lawler et al., 2020), so we again used cSNAIL ATAC-seq samples from these regions as true PV+ neuron signals. By principal component analysis (PCA), we recovered separation between PV+ samples, including SC1- and SC2-isolated populations, and PV- samples (Figure 4d and g). We assessed the correlations between log2FoldDifference in SNAIL and cSNAIL samples, each relative to bulk tissue (striatum) or, where there were no bulk samples available, cSNAIL PV- cells (GPe) (Figure 4e and h, Figure 4—source data 4, Figure 4—source data 5). Pearson correlation coefficients were similar or slightly lower for SC1 and SC2 in the striatum and GPe than for equivalent comparisons in the cortex, indicating less conservation between cSNAIL and SNAIL probe targets (SC1 cortex = 0.85, striatum = 0.71, GPe = 0.68; SC2 cortex = 0.81, striatum = 0.82, GPe = 0.73). Still, these were substantially increased over Ef1a correlation with cSNAIL in these tissues, especially for the striatum (Ef1a cortex = 0.38, striatum = 0.18, GPe = 0.51).

Finally, by comparing the peak overlaps of SC1 and SC2-enriched OCRs in striatum and GPe with cortical snATAC-seq cluster-specific OCRs, we still identified the PV+ cluster as most similar to SC1 and SC2 cells. As expected, all overlaps in striatum-cortex and GPe-cortex comparisons were lower than those from cortex-cortex comparisons, but the magnitudes of SC1 and SC2 overlapped with the Pvalb cluster in these brain regions were similar to the magnitudes of cSNAIL PV+ overlap with the Pvalb cluster in these brain regions (Figure 4—figure supplement 2). In the striatum, the overlaps with the Pvalb cluster were 8% for SC1, 14% for SC2, and 14% for cSNAIL. In the GPe, the overlaps with the Pvalb cluster were 7% for SC1, 7% for SC2, and 9% for cSNAIL. From these interpretations, SC1 and SC2 SNAIL viruses seem to generalize to the striatum and GPe, though they may not be as robust as they are within the cortical context. Additional experimental evidence may be necessary to confirm the appropriateness of PV+ SNAIL viruses for certain applications outside the cortex.

Esrrg and Mef2 motifs are important for the PV+ neuron-specific activity score of SC1 and SC2 sequences

To interpret the specific sequence patterns within SC1 and SC2 that contribute to their PV+ neuron-specific activity prediction, we assessed commonly used motifs for each model and identified potential matches within the candidate sequences. For all SVMs, we calculated per-base importance scores and hypothetical importance scores for the set of PV+ neuron-specific OCRs that were true positives according to all SVMs (score >0; N = 1755) (Shrikumar et al., 2019). Then, for each model, we used TF-MoDISco (Shrikumar et al., 2018) to cluster commonly important subsequences called ‘seqlets’ within these PV+-specific examples. The resulting clusters represent motifs that were high contributors to a positive score in each model. Among the 11 SVMs comparing PV+ neuron open-chromatin against PV- cells, EXC neurons, VIP+ neurons, or SST+ neurons, we recovered 124 well-supported motifs. Many motifs appeared to be shared across multiple models. Thus, we performed UPGMA clustering on the 124 motifs by sequence similarity using STAMP (Mahony and Benos, 2007) and identified 14 motif clusters (Figure 5a).

Motif interpretation of parvalbumin-expressing (PV+) neuron-specific open chromatin region (OCR) activity.

(a) Motifs with high contributions to PV+ scores in each support vector machine (SVM), clustered by sequence similarity. The bubble color at each node shows the model that motif was discovered in and the size of the bubble shows the number of seqlets supporting that motif. Clusters are labeled by the clade majority best match for known transcription factor binding motifs. The full list of matches can be found in Figure 5—source data 1. (b, c) Normalized importance of each base in SC1 (b) and SC2 (c) sequences for their PV+-specific scores in each SVM. Locations with sequence matches for identified motifs in each SVM (from panel a) are shown at the bottom. (d) Predicted impacts of motif scrambling on PV+ specificity. Motif mutation sites are shown with asterisks in panels (a) and (b). Each point is the sequence score from one SVM and ‘x’ is the mean.

Figure 5—source data 1

Association of TF-MoDISco motifs to known transcription factor binding motifs.

https://cdn.elifesciences.org/articles/69571/elife-69571-fig5-data1-v1.xlsx
Figure 5—source data 2

Identification of motif sites within SC1 and SC2.

https://cdn.elifesciences.org/articles/69571/elife-69571-fig5-data2-v1.xlsx
Figure 5—source data 3

Normalized per-base importance scores for SC1 and SC2 sequences.

https://cdn.elifesciences.org/articles/69571/elife-69571-fig5-data3-v1.xlsx

The largest cluster, with 23 motif members, contained representation from all 11 models and had matches to known motifs including the motifs for Esrrg and Rora (Figure 5—source data 1). Consistent with an important role for Esrrg in PV+ neurons, Esrrg (a.k.a. Err3) transcript levels were differentially overexpressed in the PV+ neuron cluster relative the rest of the frontal cortex in snRNA-seq (DropViz subcluster #2–7 Neuron.Gad1Gad2.Pvalb Esrrg fold ratio = 8.0, p=1.14 × 10–198) (Saunders et al., 2018). Esrrg and Rora are key TFs in the Pgc1a transcriptional program, which regulates Pvalb expression, mitochondrial function, and transmitter release (Lin et al., 2005; Lucas et al., 2010). Pgc1a signaling is restricted to PV+ neurons in the brain and may mediate the unique energy demands of fast-spiking neurons (Paul et al., 2017; Lucas et al., 2014).

The second-largest motif cluster contained 16 motifs, also representing all 11 models, and the motifs had the best matches to motifs for Mef2a, Mef2c, and Mef2d. A cluster of Lhx6-like motifs, a TF necessary for MGE interneuron differentiation from interneuron progenitors (Liodis et al., 2007; Vogt et al., 2014), was detected with high support from PV+ vs. PV- models and PV+ vs. EXC models, detected with low support from PV+ vs. VIP+ models, and not detected between MGE neuron subtypes PV+ vs. SST+. Interestingly, two clusters of motifs were dominated by PV+ vs. VIP+ activity, including matches for Stat6, Nkx28, and Cux2 motifs. Cux2 expression is induced by Lhx6 in the MGE, supporting a role in specification of the MGE interneuron lineage (including PV+ and SST+ neurons) from other interneuron lineages (Zhao et al., 2008). Overall, these findings indicate both shared and unique sequence properties dictating PV+ neuron-specific regulatory sequence activity relative to other cell types.

SC1 and SC2 represent two experimentally validated PV+ neuron-selective regulatory sequences. To interpret the sequence determinants of their success, we mapped potential motif sites for the 124 TF-MoDISco motifs (Figure 5—source data 2) and overlaid these with per-base importance scores for each of the SVMs (Figure 5—source data 3). This strategy revealed multiple high-importance subsequences with potential TF binding function. SC1 contained two Esrrg motifs near the sequence center that were high contributors to the PV+ neuron-specific model predictions and matched TF-MoDISco motifs for every model (Figure 5b). An additional subsequence with contributions specific to PV+ vs. VIP+ models matched motifs for Sp7. SC2 contained a highly important Mef2 sequence near the center (Figure 5c). Additionally, SC2 contained an Esrrg motif with shared importance across all models. Interestingly, the most important features of the SC2 sequence closely resemble those of successful PV+ probe E29 from Vormstein-Schneider et al., 2020 (Figure 1—figure supplement 6). Disruption of Esrrg and Mef2 motifs within SC1 or SC2 resulted in a sharply decreased prediction of PV+ specificity according to the scores across the SVMs (Figure 5d). While these impacts are untested in vivo, these analyses provide an intuition for potential nucleotide contributions to PV+ neuron-specific enhancer sequence function in SC1 and SC2.

Cross-species analyses with SNAIL

The utility of modern cell type-specific technologies often depends on transferability between species, particularly between rodents and primates. Machine learning with SNAIL has the potential to expedite cell-type isolation from multiple species by pre-selecting enhancers with sequence composition that is likely to drive cell type-specific expression across species. To investigate this, we built human PV+ SVMs analogous to the mouse SVMs using single-nucleus open chromatin data from the human cortex (Corces et al., 2020). These models learned to discriminate between the sequences of open chromatin peaks in human PV+ neurons relative to human open chromatin peaks in excitatory neurons, VIP+ neurons, SST+ neurons, or the combination of all PV- cells. Although the human data contained fewer cell profiles, models were able to reasonably perform the classification task, with auROC 0.73–0.83 and auPRC 0.68–0.81 (Figure 6a). Such models may be useful for predicting enhancer sequence activity in primates, regardless of the species origin of the enhancer sequence. We used these human-trained models to predict PV+ neuron-specific activity for 4428 mouse sequences from experimentally identified PV+ neuron-specific open chromatin peaks. Score outputs from human-trained models were correlated with score outputs from mouse-trained models (Figure 6b), suggesting shared PV+ neuron-specific enhancer sequence features between mouse and human PV+ neurons. SC1 was one of the highest-scoring enhancer candidates in both mouse and human SVMs.

Extensions of Specific Nuclear-Anchored Independent Labeling (SNAIL) technologies in primates.

(a) Receiver operator characteristic and precision-recall performance metrics for parvalbumin-expressing (PV+) support vector machines (SVMs) derived from single-nucleus chromatin accessibility assays of human cortical tissue. The reported numbers are the areas under the curve for each model. (b) Comparison of mouse PV+ open chromatin sequences scored by mouse and human SVMs. Axes are the mean SVM scores among the 11 mouse SVMs or 4 human SVMs. (c) Images of SC1 AAV activation in the rhesus macaque cortex. (d) Quantification of SC1 and Pvalb antibody staining in the macaque cortex. Image sets near the center or peripheral of the injection site were quantified separately.

Because SC1 was predicted to retain PV+ neuron specificity in primates, we tested the SC1 SNAIL probe in the rhesus macaque cortex via localized intracranial AAV injection. We observed high expression of Sun1GFP with enrichment for PV+ neurons. There was a trade-off between precision and recall of PV+ neuron labeling related to the spatial viral load, with a mean precision of 22% near the injection site center and 67% at the periphery. At the injection center, at least 98% of PV+ neurons were transduced by the SC1 SNAIL probe, while at the periphery, at least 75% of PV+ neurons were transduced. Our preliminary findings are optimistic for PV+ SNAIL probe adoption in primates, but optimal AAV titer should be carefully considered for the specific application.

Discussion

OCR sequence features provide valuable, underutilized information for cell type-specific enhancer design. Here, we showed that sequence was sufficient to discern the directionality of highly differential OCR activity in different neuron subtypes in most cases. Interpretation of these models revealed rich diversity among the biochemical underpinnings of these classification tasks, reflective of cis-trans interactions. The defining sequence properties of cell type-specific OCR activation were robust throughout different data modalities, including ATAC-seq from sorted populations and snATAC-seq, and different classifier types.

Thus far, success in cell type-specific AAV creation has been limited to intensive screens of dozens of individual AAV enhancer candidates in which most do not produce highly selective expression (Vormstein-Schneider et al., 2020; Mich et al., 2021; Graybuck et al., 2021). In SNAIL, our framework for cell type-specific AAV engineering, we incorporate machine learning classifiers as an additional filter for improved enhancer pre-selection in order to mitigate the experimental burden. On a set of externally tested PV+ enhancer-driven AAVs (Vormstein-Schneider et al., 2020), the average PV+ specificity score across our classifiers was more predictive of PV+ specific AAV expression than the log2 fold difference of snATAC-seq signal, sequence conservation, or accessibility conservation at these loci. With the SNAIL framework, we identified and validated two novel enhancers that drive targeted expression in PV+ neurons in the mouse cortex. While these do not represent enough trials to establish a new conversion rate from cell type-specific OCRs to cell type-specific AAVs, we were encouraged by the immediate success of the first probes we selected. We believe that the incorporation of differential sequence property analyses will continue to improve the throughput of targeted AAV development in new contexts and make cell type-specific enhancer tool development accessible for more researchers.

An additional advantage of incorporating classifiers for cell type-specific enhancer selection is increased interpretability of the factors that govern success. The sequence patterns learned by PV+ models reflected known PV+ neuron biology. Common motifs contributing to successful PV+ probe enhancers included Esrrg, Mef2, and Lhx6, important for the specification and maintenance of the cortical PV+ interneuron lineage (Zhao et al., 2008; Mayer et al., 2018; Liodis et al., 2007). It is interesting to note the varying expression patterns of these TFs. In the cortex, Esrrg expression is mainly restricted to PV+ neurons and Lhx6 is restricted to MGE interneurons, but Mef2 transcripts are widely expressed in many inhibitory and excitatory neurons. It is likely that combinatorial interactions between abundant and cell type-specific TF binding events specify PV+ neuron enhancer activation. The SNAIL framework provides an opportunity to meaningfully leverage these complex sequence codes.

We found that a combination of multiple direct comparisons between the target cell type and other cell types made for particularly useful screening. Here, we used a tiered approach to ensure specific activity at multiple levels of cellular relationships to PV+ neurons. At the broadest level, we modeled PV+ neuron OCR sequences against PV- OCRs, a mixed signature from all other neuron and non-neuron cell types in the mouse cortex. Within neurons, we modeled PV+ vs. EXC neurons, and then PV+ relative to more specific subtypes of inhibitory neurons VIP+ and SST+. Successful SC1 and SC2 sequences contained attributes that made them highly PV+ specific across these comparisons.

SC1-Sun1GFP and SC2-Sun1GFP are new AAV technologies for PV+ neuron labeling and isolation in diverse systems. A unique feature of these viruses is the modified Sun1GFP tag that enables nuclei purification by magnetic beads coated with anti-GFP antibodies. This process is advantageous for isolating genomic and epigenomic signals from the population of interest with no dependence on transgenic strains. In comparison to single-nucleus sequencing technologies, affinity purification with SNAIL is more efficient for addressing targeted hypotheses about a specific cell type. SNAIL may also be paired with single-nucleus sequencing technologies for unprecedented resolution of the substructures within minority cell populations. We took advantage of SNAIL affinity purification to isolate SC1-Sun1GFP and SC2-Sun1GFP nuclei for molecular assessment with ATAC-seq. This represents a novel approach for validating new cell type-specific AAVs. We found that SC1 and SC2 PV+ SNAIL probes had high molecular agreement with cells tagged in the Pvalb-2A-Cre mouse strain, making them a reasonable alternative to transgenic strain technology. In addition to their success in the intended brain region (cortex), SC1 and SC2 PV+ SNAIL viruses also generalized to subcortical regions, the striatum, and GPe enough to isolate PV+ ATAC-seq signal. Moreover, the SC1 enhancer has sequence characteristics consistent with primate PV+-specific activation and has biased expression in PV+ neurons in the macaque brain. It will be interesting to conduct further comparisons of the composition of cell populations that are labeled by these tools in different brain regions and species. We speculate that the SNAIL framework could be strategically employed to design tools for specific subtypes of PV+ neurons or other cell populations that may be brain region or species specific.

In general, pairing cell type-specific enhancers with AAVs provides much more flexibility and scalability than transgenic technologies. However, there are drawbacks to certain applications. AAVs require time to reach peak expression, usually 2–4 weeks, although some may be robust earlier. This means they are not appropriate for developmental studies in very young animals. These and other developmental time points outside of the scope of the training data may have unintended effects on enhancer expression and specificity. Additionally, there are limitations to the transduction efficiency, so AAVs may not be ideal for studies where it is important to label all cells of a certain type. Finally, an enhancer’s activity in AAV may differ from the same enhancer’s activity in the host genome due to the surrounding genomic and epigenomic context. Enhancer activity may also fluctuate in response to different conditions because enhancers are dynamic actors in the regulation of gene expression. However, machine learning model-based prioritization of characteristic sequences may minimize this risk.

Excitingly, there are many opportunities for extensions of the SNAIL framework that enable cell type-specific interrogation in unprecedented settings. In a direct application that we began to explore in this work, SNAIL probes could be used to isolate cell types of interest in diverse species. High evolutionary conservation in the rules for enhancer activity has been demonstrated by the success of machine learning models in predicting enhancers across mammals (Minnoye et al., 2020; Kelley, 2020; Kaplow et al., 2020; Chen et al., 2018). Models trained across multiple species could further improve the transferability of probes across species. For example, a new approach that explicitly encourages the model not to learn signatures of species-specific enhancer activity might be especially promising for designing cross-species SNAIL probes (Cochran et al., 2021).

Beyond the potential ability for cross-species probe design, the SNAIL framework provides the opportunity to develop viral tools targeting previously unexplored cell types that are identifiable in snATAC-seq both within and across species. There is potential to divide subpopulations at multiple levels and design extremely specific technologies, including specialization for neuron populations within a specific brain region. Other applications may exploit changes in enhancer sequence activity in disease and other contexts to target specific cell states. In addition to creating cell type-specific enhancers to isolate cell types of interest, machine learning model-selected enhancer sequences may be used to drive the expression of a gene for cell type-specific circuit manipulation, as has been achieved with channelrhodopsin and DREADDS (Lee et al., 2010; Vormstein-Schneider et al., 2020). This approach could also be used to overexpress a particular ion channel, neurotransmitter receptor, gene variant, or guide RNA for a CRISPR-based gene manipulation strategy. We anticipate that SNAIL will provide a foundation for using sequence signatures underlying cell type-specific enhancer activity to develop tools for understanding cell type-specific function of a variety of cell types in diverse contexts.

Materials and methods

Key resources table
Reagent type (species) or resourceDesignationSource or referenceIdentifiersAdditional information
Strain, strain background (Mus musculus)Pvalb-2A-CreJackson LaboratoriesStock # 012358B6.Cg-Pvalbtm1.1(cre)Aibs
Strain, strain background (M. musculus)Ai14Jackson LaboratoriesStock # 007914B6.Cg-Gt(ROSA)26Sortm14(CAG-tdTomato)Hze/J
Recombinant DNA reagentpAAV-Ef1a-DIO-Sun1GFP-WPRE-pAAddgenePlasmid #160141cSNAIL vector from
Lawler et al., 2020
Recombinant DNA reagentpAAV-Ef1a-Sun1GFP-WPRE-pAThis paperEf1a vector (+ control)
Recombinant DNA reagentpAAV-SC1-Sun1GFP-WPRE-pAThis paperSC1 vector targeting PV+ neuron expression
Recombinant DNA reagentpAAV-SC2-Sun1GFP-WPRE-pAThis paperSC2 vector targeting PV+ neuron expression
Recombinant DNA reagentpAAV-neg-Sun1GFP-WPRE-pAThis paperNegative control vector
AntibodyAnti-Pvalb (rabbit polyclonal)SwantCat# PV27IF (1:750)
AntibodyAnti-Pvalb (mouse monoclonal)SwantCat# 235IF (1:1000)
AntibodyAnti-GFP (rabbit polyclonal)InvitrogenCat# A-11122IF (1:1000)
AntibodyAnti-GFP (rabbit monoclonal)InvitrogenCat# G10362Affinity purification (1:150)
Software, algorithmls-gkmLee, 2016https://github.com/Dongwon-Lee/lsgkmSVM training
Software, algorithmKerashttps://keras.ioCNN training

Experimental design

Request a detailed protocol

We designed the ATAC-seq experiments to meet the current ENCODE standards of two or more biological replicates (i.e., individuals) per condition, each consisting of 50 million non-duplicate non-mitochondrial paired-end sequencing reads. The initial cSNAIL experiments to define candidate PV+ enhancers were performed on primary motor cortex and isocortex samples in triplicate on female mice aged 2–3 months old. All subsequent cSNAIL and SNAIL molecular experiments for the validation of PV+ SNAIL probes were performed in the cortex, striatum, and GPe with two or three biological replicates. Each of these cohorts included at least one male and one female mouse, all 2–4 months old. Control samples for SNAIL comparisons included cSNAIL PV+, cSNAIL PV-, and cells labeled by the Ef1a-Sun1GFP virus. Details for all experiment samples can be found in Supplementary file 1. Data primary to this publication can be accessed through the NCBI Gene Expression Omnibus (https://www.ncbi.nlm.nih.gov/geo/), accession number GSE171549.

Nuclei isolation for ATAC-seq

Request a detailed protocol

ATAC-seq data were generated using an affinity purification approach with cSNAIL or SNAIL to isolate PV+ neurons from the mouse isocortex, as described in Lawler et al., 2020. Briefly, mice were overdosed with isoflurane, decapitated, and rapidly dissected. Fresh brain tissue was sectioned coronally on a vibratome for precision, and we dissected brain regions relevant to the specific experiment to be processed as separate samples. All dissections took place in cold, oxygenated artificial cerebrospinal fluid (aCSF). After dissection, we isolated nuclei from the samples by 30 strokes of Dounce homogenization with the loose pestle (0.005 in clearance) in lysis buffer as described in Buenrostro et al., 2015a. The nuclei were filtered through a 70 µm strainer and pelleted with 10 min of centrifugation at 2000 × g at 4°C. We resuspended the nuclei pellets in wash buffer (0.25 M sucrose, 25 mM KCl, 5 mM MgCl2, 20 mM Tricine with KOH to pH 7.8, and 0.4% IGEPAL) for the affinity purification steps.

Affinity purification of Sun1GFP+ and Sun1GFP- nuclei

Request a detailed protocol

The nuclei suspension was incubated with anti-GFP antibody (Invitrogen, Carlsbad, CA; #G10362) in wash buffer for 30 min at 4°C with end-to-end rotation. After this period, we added Protein G Dynabeads (Thermo Fisher Scientific, Waltham, MA; Cat# 10004D) to the reaction and incubated again for 20 min. We separated the Sun1GFP+ fraction from the Sun1GFP- fraction on a magnetic bead rack. Sun1GFP- nuclei in the supernatant were centrifuged at 2000 × g for 10 min to pellet nuclei, washed one time, and filtered with a 40 µm cell strainer. The Sun1GFP+ nuclei attached to the beads were washed 3–4 times with 800 µL wash buffer by resuspending the sample, letting it settle onto the magnet, and removing the buffer. Where cell yield was not a concern, we also performed a large-volume wash with 10 mL wash buffer and filtered through a 20 µm cell strainer. All nuclei preparations were resuspended in water for the ATAC-seq reaction.

ATAC-seq library construction

Request a detailed protocol

For each sample, a small aliquot was stained with DAPI (Thermo Fisher Scientific; Cat# 62248) and the concentration of nuclei was determined by counting DAPI+ nuclei with a hemocytometer. Next, we combined 50,000 nuclei, 25 µL Tagment DNA Buffer, and 2.5 µL Tagment DNA Enzyme I (Illumina, San Diego, CA; Cat# 20034198) into 50 µL total for the transposition reaction. The reaction was incubated at 37°C for 30 min with 300 rpm mixing. Samples containing beads were gently resuspended every 5–10 min throughout the incubation to prevent the beads from staying settled at the bottom. Immediately following incubation, the DNA was column purified with the QIAGEN MinElute PCR Purification kit (QIAGEN, Hilden Germany; Cat# 28004). Libraries were amplified to ⅓ saturation with dual-indexed Illumina primers (Preissl et al., 2018). We ensured that samples had the characteristic periodic fragment length distribution of high-quality ATAC-seq using TapeStation assessment (Agilent Technologies, Santa Clara, CA). Successful samples were sequenced at low depth on the Illumina MiSeq system to determine appropriate library pooling and sequencing depth, then paired-end sequenced for 2 × 150 cycles with the Illumina NovaSeq 6000.

Animal use

Request a detailed protocol

All animals for ATAC-seq experiments were either wild-type mice (C57BL/6J; Jackson Laboratory, Bar Harbor, ME; Stock # 000664) for SNAIL experiments or heterozygous Pvalb-2A-Cre mice (B6.Cg-Pvalbtm1.1(cre)Aibs/J; Jackson Laboratory; Stock # 012358) (Madisen et al., 2010) on a C57BL/6J background for cSNAIL experiments. Imaging animals were either Pvalb-2A-Cre or double transgenic Pvalb-2A-Cre/Ai14 (Ai14 strain; B6.Cg-Gt(ROSA)26Sortm14(CAG-tdTomato)Hze/J; Jackson Laboratory; Stock # 007914). All mice were 2–4 months old at the time of the tissue experiments. Initial PV+ cSNAIL data for creating the sorted cell PV+ vs. PV- model was collected from female mice, but all subsequent validation experiments included representation from both sexes. All animals were housed with a 12 hr light cycle, and experiments were performed 2–3 hr after lights on. Animals for the data primary to this study received no treatments other than the retro-orbital AAV injections. However, previously published cSNAIL data used in the analysis included healthy animals that received stereotaxic saline injections to the medial forebrain bundle (Lawler et al., 2020).

Molecular cloning

Request a detailed protocol

To make the nonspecific control viral vector pAAV-Ef1a-Sun1GFP, we made modifications to pAAV-Ef1a-Cre with restriction enzyme cloning. pAAV-Ef1a-Cre was a gift from Karl Deisseroth (Addgene, Watertown, MA; plasmid #55636; http://n2t.net/addgene:55636; RRID:Addgene_55636). First, we added a multiple cloning site before the Ef1a promoter to create easy promoter swapping for later use. The multiple cloning site insert was synthesized by Integrated DNA Technologies, Coralville, IA, and was inserted between BshTI and MluI sites upstream of the Ef1a promoter. Next, we used BamHI and EcoRI sites to replace the Cre gene with a modified Sun1GFP gene identical to the one in our cSNAIL technologies.

The resulting pAAV-Ef1a-Sun1GFP vector was further modified to create the other constructs. The PV+ SNAIL probes were designed to contain one PV+-specific enhancer candidate sequence, a synthetic intron for RNA stabilization, the Sun1GFP gene, a WPRE signal, and a polyA signal. From pAAV-Ef1a-Sun1GFP, the Ef1a promoter and intron region was removed and replaced with the sequence for a PV+-specific enhancer candidate and the synthetic intron. Inserts for SC1 and SC2 were synthesized by Integrated DNA Technologies and cloned into the vector using restriction sites for NdeI and BamHI. To ensure that no expression was being driven from the synthetic intron sequence itself, we similarly cloned a negative control construct containing the synthetic intron, but no enhancer candidate sequence. All transformations during cloning were performed in MegaX DH10B cells (Invitrogen, #C640003) and confirmed with Sanger sequencing.

AAV PHP.eB production for mice

Request a detailed protocol

AAV was produced in AAVpro(R) 293T cells (Takara, Kyoto, Japan; #632273) by co-transfection of the genome pAAV, an AAV helper plasmid, and pUCmini-iCAP-PHP.eB. pUCmini-iCAP-PHP.eB was a gift from Viviana Gradinaru (http://n2t.net/addgene:103005; RRID: Addgene 103005) (Chan et al., 2017). The AAV particles were precipitated with polyethylene glycol (PEG 8000, Sigma-Aldrich, St. Louis, MO; Cat# P2139-500G) and purified on an iodixanol gradient (OptiPrep, Sigma-Aldrich, Cat# D1556-250ML) with ultracentrifugation for 2.5 hr at 350,000 × g at 18°C. We filtered and concentrated the virus in PBS using Amicon Ultra-15 centrifugation filters (Millipore, Burlington, MA; #UFC905024). The viral titer was measured with the AAVpro(R) Titration Kit (Takara, #6233) and AAV was stored single-use aliquots at –80°C until injection.

AAV9-2YF production for primates

Request a detailed protocol

AAV vectors were produced in 293AAV cells (Cell Biolabs; Cat# AAV-100) using a triple transfection method (Grieger et al., 2006). Recombinant AAVs were purified by iodixanol gradient ultracentrifugation, buffer exchanged, and concentrated with Amicon Ultra-15 Centrifugal Filter Units (Cat# UFC8100) in DPBS and titered by using QuickTiter AAV Quantitation Kit (Cell Biolabs; Cat# VPK-145) and quantitative PCR relative to a standard curve.

Mouse AAV delivery

Request a detailed protocol

Animals were anesthetized with 2–3% isoflurane until no pedal withdrawal reflex was observed. Then, we injected 4 × 1011 vg total (50 µL) of virus into the retro-orbital cavity and treated the eye with 0.5% proparacaine hydrochloride ophthalmic solution. The animals were monitored while the virus incubated for 3–4 weeks until endpoint experiments.

Primate AAV delivery

Request a detailed protocol

All animal procedures were in accordance with the National Institutes of Health Guide for the Care and Use of Laboratory Animals and approved by the University of Pittsburgh’s Institutional Animal Care and Use Committee (IACUC) (Protocol ID 19024431). Monkey B was a 3-year-old male (4.6 kg). The monkey was sedated with ketamine (15 mg/kg IM) and transported to a surgery room. The animal was kept anesthetized with isoflurane. The brain was placed and fixed in a stereotaxic frame (Kopf Instruments). We shaved, cleaned the brain, and removed the calvarium above the region of interest. We cut open the dura and exposed the brain. We injected 20 µL PV58 (R58) around Brodmann areas 4 and 6 using Harvard Apparatus infusion pump.

Mouse brain immunofluorescence

Request a detailed protocol

Tissues were fixed with whole-body 4% paraformaldehyde (PFA) perfusion, and the brains were incubated in 4% PFA for an additional 12–24 hr after dissection. Coronal slices 80 µm thick were made with a vibratome. Free-floating sections were stained for parvalbumin with Pvalb (Swant, Marley, Switzerland; PV27) primary antibody with Alexa Fluor 405 (Invitrogen, #A-31556) or Alexa Fluor 594 (Cell Signaling Technology, Danvers, MA; #8889) secondary antibodies.

Primate brain immunofluorescence

Request a detailed protocol

The animal was sedated with ketamine (15 mg/kg IM) and kept anesthetized with isoflurane. We removed the calvarium first and then perfused the monkey through the circulatory systems with 4 L of ice-cold aCSF (124 mM NaCl, 5 mM KCl, 2 mM MgSO4, 2 mM CaCl2, 23 mM NaHCO3, 3 mM NaH2PO4, 10 mM glucose; pH 7.4, osmolarity 290–300 mOsm) oxygenated with 95% O2:5% CO2. We then cut open the dura and extracted the brain. We partitioned the cortical tissue into blocks, which were drop-fixed in 4% PFA for 48 hr and then changed through a series of three graded sucrose solutions. Sections (40 µm) were cut on a cryostat and stored at 4°C in a solution of 30% glycerol/30% ethylene glycol. Sections were rinsed three times in 0.1 M PB and treated with 0.1% NaBH4 for 30 min at room temperature (RT). The tissue was rinsed eight times in PBS-B, treated with 0.3% Triton (30 min at RT), and placed in 20% normal donkey serum blocking solution (Jackson, 017-000-121) for 2 hr at RT. Sections were subsequently incubated for 72 hr at 4°C using parvalbumin (Swant, PV, mouse, 235, 1:1000) and green fluorescent protein (Invitrogen, GFP, rabbit, A-11122, 1:1000) primary antibodies. Tissue was rinsed four times in PBS-B and incubated for 24 hr at 4°C in secondary antibodies (donkey) conjugated to Alexa 488 (Invitrogen, anti-rabbit, 1:500, A21206) or Alexa 568 (Invitrogen, anti-mouse, 1:500, A10037). Sections were rinsed three times in PBS-B. Tissue was mounted using ProLong Diamond Antifade mountant (Thermo Fisher, P36970), sealed with #1.5 coverslips, and stored at 4°C until imaging.

Microscopy

Request a detailed protocol

Images were taken of the motor cortex with laser scanning confocal microscopy. Cells were counted in each channel with Fiji (Schindelin et al., 2012) and assigned as double-labeled or single-labeled manually. Individual images from 1 to 3 animals were treated as replicates to determine the mean and standard error of the mean for specificity and efficiency quantifications.

ATAC-seq data processing

Request a detailed protocol

Samples were processed from the paired-end FASTQ files using the ENCODE ATAC-seq pipeline (https://github.com/ENCODE-DCC/atac-seq-pipeline; ENCODE DCC, 2022) with the following changes from default behaviors: atac.cap_num_peak = 300000, atac.idr_thresh = 0.1. All samples had high TSS enrichment (>15) and clear periodicity, indicative of good data quality. Optimal IDR peaks were determined for biological replicates of the same cell type, brain region, and sequencing batch (https://github.com/kundajelab/idr; Li et al., 2011). IDR peaks were then merged to define the combined peak regions (OCRs) for each analysis using bedtools (Quinlan and Hall, 2010). Specifically, we defined sets of OCRs for (i) cortex PV+ and PV- cSNAIL samples, (ii) PV+, EXC, and VIP+ INTACT samples (Mo et al., 2015), and (iii) cortex, striatum, and GPe bulk samples, PV+ and PV- cSNAIL samples, SC1-Sun1GFP samples, SC2-Sun1GFP samples, and Ef1a-Sun1GFP samples. We constructed count tables including the relevant samples on each of these OCR backgrounds using Rsubread featureCounts version 1.28.1 (Liao et al., 2019). These three count tables were used to form the basis of (i) the sorted population PV+ vs. PV- models, (ii) the sorted population PV+ vs. EXC and PV+ vs. VIP+ models, and (iii) analysis of SC1 and SC2 SNAIL PV+ probes in the cortex, striatum, and GPe, respectively.

The counts were modeled using the negative binomial distribution in DESeq2 (Love et al., 2014). We assessed the coefficient of cell group, where cell groups were unique tissue, virus, cell-type combinations, and we controlled for sex differences where both were present: DESeq2 design ~sex + cellGroup. Differential peaks were defined strictly for applications i and ii related to building models (padj<0.01 and |Log2FoldDifference| > 1) and more loosely for application iii to compare across viruses (padj<0.05 and |Log2FoldDifference| > 0.5). Related to Figure 3, only cortical samples from count matrix iii were included in the DESeq2 model, while the Figure 4 DESeq2 model included samples from all three brain regions.

snATAC-seq processing

Request a detailed protocol

We downloaded the following samples of snATAC-seq from the mouse MOp in Snap file format from http://data.nemoarchive.org/biccn/ (RRID:SCR_016152): CEMBA171206_3C, CEMBA171207_3C, CEMBA171212_4B, CEMBA171213_4B, CEMBA180104_4B, CEMBA180409_2C, CEMBA180410_2C, CEMBA180612_5D, and CEMBA180618_4D (Li et al., 2021). We processed these samples using SnapATAC version 1.0.0 (Fang et al., 2021). We restricted the analysis to nuclei that passed filtering as defined by the original authors (Li et al., 2021). This removed nuclei that had fewer than 1000 reads, TSS enrichment < 10, or doublet signatures detected by Scrublet (Wolock et al., 2019). Filtered samples contained 6700–10,983 nuclei each, making a total of 78,525 nuclei. We transformed the data to a count matrix over 5000 bp bins and combined the snap objects. We removed bins overlapping with the ENCODE blacklist, mitochondrial regions, and the top 5% of bins that overlapped with invariant features. We reduced dimensionality and selected 18 significant components. We used these to correct for batch effects with Harmony (Korsunsky et al., 2019). We performed Louvain clustering on the outputs from Harmony using runCluster() with the option louvain.lib=“R-igraph”.

We assigned cell types to clusters by accessibility at promoters and gene bodies of marker genes (Figure 1—figure supplement 1) and by comparison to the cell annotations from the original authors (Li et al., 2021). We called peaks for each cluster using MACS2 with the options --nomodel --shift 0 --ext 73 --qval 1e-2 -B --SPMR --call-summits (Zhang et al., 2008). We merged overlapping peaks across all clusters, resulting in 415,813 total OCRs. We defined differential OCRs using the findDAR() function with test.method = “exactTest” and were required to meet padj<0.01 (Benjamini–Hochberg corrected) and |log2FoldDifference| > 1. For comparisons to groups of clusters, for example, PV+ vs. EXC, we performed separate tests for PV+ vs. each excitatory cluster and selected the intersection of differential OCRs.

For the human models, we downloaded human snATAC-seq data from the parietal, temporal, and frontal cortex (Corces et al., 2020) from the NCBI Gene Expression Omnibus, accession number GSE147672. We assigned the cluster identities assigned in the original data publication (Corces et al., 2020). We defined differential peaks using the ArchR package (Granja et al., 2021) with the cutoffs padj<0.01 (Benjamini–Hochberg corrected) and |log2FoldDifference| > 1.

SVM data preparation

Request a detailed protocol

SVMs were developed to predict the direction of differential activity from sequences underlying differential OCRs between two cell types or groups of cell types. Because ATAC-seq summit regions are highly enriched for TF binding motifs, we centered on the peak summits within differential ATAC-seq OCRs and extended in both directions for a total fixed sequence length of 500 bp, a convenient length for AAV cloning. Peak summits were defined by MACS2 (Zhang et al., 2008), and only summit regions of peaks called within the cell type of interest were retained. For data from sorted cells, we used optimal IDR peaks across biological replicates of the given cell type. For example, in a PV+ vs. VIP+ model comparison, the positive model input examples were 500 bp summit-centered regions of PV+ IDR peaks that overlapped PV+-specific differential OCRs and the negative model input examples were 500 bp summit-centered regions of VIP+ neuron IDR peaks that overlapped VIP+ neuron-specific differential OCRs. For snATAC-seq data, we used peaks called within a cluster to define the relevant summit regions. If multiple cell clusters were involved in the comparison, for example, the excitatory neuron vs. inhibitory neuron model, we used summits found in any peak set from a cluster within that category. In cases where there were multiple summits within a differential OCR, all summits greater than 100 bp apart from each other were retained.

After defining the genomic locations of the summit-centered differential OCRs, we did additional filtering to finish preparing the data for model training. First, we restricted the data to enhancer regions because they usually have more cell type specificity than promoters and may be governed by different sequence properties. Therefore, we filtered out regions that were within 2000 bp of a TSS using RefSeq annotations downloaded from the UCSC Table browser in July 2020 (Kuhn et al., 2013). Next, for mouse models, we removed super-enhancers because they also may be governed by different sequence features and are not useful for AAV probe design because they are too large. We downloaded mm9 coordinates of mouse cortex super enhancers defined by H3K27ac from the dbSuper database (Khan and Zhang, 2016) and converted these to mm10 coordinates using UCSC liftOver with minmatch = 0.95 (Kuhn et al., 2013). Using bedtools intersect (Quinlan and Hall, 2010), we removed regions with any super-enhancer overlap. Finally, we used bedtools getfasta (Quinlan and Hall, 2010) to retrieve the sequences at these genomic coordinates from the mm10 assembly downloaded from UCSC genome browser in May 2018 (Kuhn et al., 2013) or the hg38 genome assembly GCA_000001405.27 downloaded from NCBI, and we removed any sequences that contained uncertain bases (Ns).

SVM model construction

Request a detailed protocol

We divided sequences into separate partitions by chromosome for model training, validation, and final testing. For mouse models, the training sets included chromosomes 3–7, 10–19, and X, the validation sets included chromosomes 8 and 9, and the test sets included chromosomes 1 and 2. For human models, the training sets included chromosomes 3–7, 10–22, and X, the validation sets included chromosomes 8 and 9, and the test sets included chromosomes 1 and 2. We input the training data into LS-GKM’s gkmtrain, and we evaluated model performance with gkmpredict (Lee, 2016). Because the input data was summit-centered, all models used the center-weighted gkm kernel, option -t 4, or the center weighted gkm rbf kernel, option -t 5. We tuned the -l, -k, -d, -c, and -w parameters for word length, number of informative columns, number of mismatches to consider, regularization, and class-weighted regularization, respectively, to maximize the validation set F1 scores. We used default values for other parameters. We calculated auROC and auPRC metrics and visualized on training, validation, and test sets using the ROCR package in R (http://ipa-tys.github.io/ROCR/; Sing et al., 2020). All paper figures reflect final test set performance. The details of all parameter settings and performance metrics of the final models are reported in Figure 1—source data 1.

CNN data preparation

Request a detailed protocol

We conducted differential accessibility analysis using DESeq2 (Love et al., 2014) to identify regulatory regions that display cell type-specific accessibility in ATAC-seq in PV+ neurons relative to other background cell types (PV-, VIP+, EXC). We used PV+ and PV- neuron ATAC-seq samples generated in this study as well as PV+, VIP+, and EXC neuron ATAC-seq samples from Mo et al., 2015. To conduct differential accessibility analysis, we obtained genomic coordinates of all 200 bp bins in the mm10 reference genome, starting from the 200 bp bin at the beginning of each chromosome and including all following contiguous nonoverlapping 200 bp bins. We then filtered out any bin that overlaps with an artifact region (Amemiya et al., 2019) or with regions that have unknown nucleotides (obtained from the UCSC twoBitInfo utility using the -nBed option). During this step, regions near the ends of chromosomes were filtered out. Then, using the featureCounts function in the subread package (Liao et al., 2014), we counted the reads mapping to each of the 200 bp bins in the ATAC-seq samples obtained from every included ATAC-seq sample. We then use the DESeq2 R package (Love et al., 2014) to identify bins that were differentially accessible between (i) PV+ and PV-, (ii) PV+ and VIP+, and (iii) PV+ and EXC neurons at a Benjamini–Hochberg FDR-adjusted p-value cutoff of 0.01. For each of the three comparisons, significant differential bins that displayed PV+ specificity (log2FoldDifference > 0) were used as positive examples for CNN training and significant differential bins that displayed negative log2FoldDifference (log2FoldDifference < 0) were used as negative examples for CNN training.

CNN model construction

Request a detailed protocol

We trained three separate CNN models that relate sequence to comparative regulatory activity (Zhou and Troyanskaya, 2015; Kelley et al., 2016; Quang and Xie, 2016). For each significant differential 200 bp bin, we obtained the 1000 bp sequence surrounding the center of the bin from the mm10 reference genome and trained the CNN to predict the positive or negative class label. We held out sequence examples underlying all significant differential bins on chromosome 4 as a validation set to evaluate hyperparameter settings and to choose the best performing final model. We also held out sequence examples underlying all significant differential bins on chromosomes 8 and 9 as a test set for final evaluation. Because we had different validation and test sets from those used for the SVM, we did not use any results from the SVM to influence our approach to designing the CNN architecture or any other aspects of CNN training. We implemented our CNN model in Keras 2.2.4 (https://keras.io/) with a theano backend (Al-Rfou et al., 2016). We created a one-hot encoded representation of the sequence, a 4 × 1000 binary matrix representing positions and occurrences of the 4 nucleotide characters (A,T,G and C) on the sequence, which was propagated through the network. Our CNN architecture consisted of multiple layers of convolution kernels stacked on top of each other (Figure 1—figure supplement 3). The first such layer consisted of 1000 convolution kernels, each with a kernel width of 8 and height of 4, which scan the input sequence in chunks of 8 nucleotides. We applied rectified linear unit (ReLu) activations on the outputs of these convolution kernels. This initial layer is followed by a variable number of convolution layers with the same number of kernels (100), each of width 8 and height 1. We applied ReLu activations on these convolution outputs as well. These convolution layers are then followed by a set of max pooling operations that selects the maximum value from a set of 13 adjacent units (pooling size = 13). We set the stride for the max pooling operation to 13 units, meaning that it selected the maximum values from contiguous chunks of 13 adjacent outputs from the previous layer. We applied dropout regularization (Srivastava et al., 2014) on the outputs of the max pooling operation to prevent overfitting to the training set. We then flattened the outputs of the max pooling layer into a single vector and passed them to a single output unit with a sigmoid activation function. We used stochastic gradient descent (SGD) to minimize binary cross-entropy loss (log loss) between the output of this unit and the positive/negative class label to learn model parameters.

Each model was trained for 100 passes through the training set (or ‘epochs’). For the PV+ vs. PV- and the PV+ vs. VIP+ tasks, we evaluated model performance and chose the best performing model based on the value of the binary cross-entropy loss on the validation set. For the PV+ vs. EXC task, we chose the final model based on a combination of auROC and auPRC on the validation set. We ignored small differences in validation auROC and auPRC (± 0.02) while selecting the final PV+ vs. EXC model. Tuning only the number of variable convolution layers (0, 1, or 2), and the dropout probability for the max pooling output (0.2, 0.4, or 0.5), we were able to achieve strong auROCs and auPRCs on the held-out validation sets. Therefore, we did not attempt to vary learning rate for SGD (0.01), momentum (0.0), batch size (30), number of training epochs (100), number of filters in the first convolution layer (1000), number of filters in subsequent convolution layers (100), kernel sizes (8), max pooling size (13), and stride (13). A table of hyperparameter settings and associated performance metrics (loss value, auROC, auPRC) on training, validation, and test sets is provided in Figure 1—source data 2.

Broad promoter sequences

Request a detailed protocol

The sequences of Gfap, Camk2a, and Dlx promoters (Figure 1—figure supplement 2) were extracted from AAV plasmids with confirmed cell type-specific activity in vivo. The Gfap promoter sequence was from hGFAP-GFP (Addgene plasmid #40592; http://n2t.net/addgene:40592; RRID:Addgene_40592). The Camk2a promoter sequence was from pENN.AAV.CamKII0.4.eGFP.WPRE.rBG (Addgene plasmid #105541; http://n2t.net/addgene:105541; RRID:Addgene_105541). The Dlx promoter sequence was from pAAV-mDlx-GFP-Fishell-1 (Addgene plasmid #83900; http://n2t.net/addgene:83900; RRID:Addgene_83900) (Dimidschstein et al., 2016).

SVM score analysis for external PV+ AAV screen

Request a detailed protocol

Externally tested PV+ AAV enhancer sequences (Vormstein-Schneider et al., 2020) were scored through all cortical PV+ SVMs. To enable comparison between models, scores were normalized to standard deviations from 0 using the standard variation of the validation data set for each model. For each pair of models, the sequence scores were assessed for correlation with cor() function from the R Stats package (https://www.rdocumentation.org/packages/stats/versions/3.6.2) with the Pearson method and visualized using the corrplot package in R (https://github.com/taiyun/corrplot; Taiyun, 2022) (Figure 1—figure supplement 4).

Alternative prioritization explorations for external PV+ AAV screen

Request a detailed protocol

Common alternative approaches for prioritizing enhancer candidates for cell type-specific AAV design include log2FoldDifference and conservation-based ranking. We show that machine learning models are more predictive of success than these approaches by evaluating on the external PV+ enhancer AAV screen (Vormstein-Schneider et al., 2020). The log2FoldDifference of ATAC-seq signal in different cell-type comparisons was evaluated from snATAC-seq data (Li et al., 2021). We added the exact genomic locations of each test sequence to the genomic peak set for assessment and applied the findDAR() function with test.method = “exactTest” in SnapATAC version 1.0.0 (Fang et al., 2021). The log2FoldDifference was determined for (i) the PV+ cluster relative to all PV- cells using cluster.neg = “random”, (ii) the PV+ cluster relative to closely related cells using cluster.neg = “knn”, (iii) the PV+ cluster relative to the pool of excitatory neuron clusters, (iv) the PV+ cluster relative to the VIP+ cluster, and (v) the PV+ cluster relative to the SST+ cluster (Figure 1—figure supplement 5).

Euarchontoglires PhyloP scores were extracted for all bases within each PV+ enhancer candidate using the UCSC Table Browser (phyloP60wayEuarchontoGlires track for the Grcm38/mm10 genome, accessed March 2021) (Kuhn et al., 2013). Regions were mapped from mouse (mm10) to human (hg38) using UCSC LiftOver, requiring a minimum ratio of bases that must remap of 0.1. All regions were mappable between species. Finally, we assessed overlapping human PV+ neuron OCRs from motor cortex snATAC-seq (Bakken et al., 2020) using bedtools intersect (Quinlan and Hall, 2010). Any peak overlap of at least 1 bp was recorded as an overlapping peak.

Evaluation of SC1 and SC2 ATAC-seq

Request a detailed protocol

PCA was performed using plotPCA() on the DESeqDataSet object with variance stabilizing transformation in DESeq2 version 1.26.0 (Love et al., 2014). Using the DESeq2 models described above for cell groups, we extracted OCR statistics for particular cell group comparisons by using the results contrasts. Correlations between log2FoldDifferences for PV+ cSNAIL vs. bulk tissue and log2FoldDifferences for SNAIL probes vs. bulk tissue were assessed using the R function cor.test() with both ‘Spearman’ and ‘Pearson’ methods. Genome browser tracks were visualized in the mm10 genome using IGV (Robinson et al., 2011) and track heights were normalized between samples of the same experimental ATAC-seq method (cSNAIL, SNAIL, bulk tissue, or single nucleus). Comparisons to snATAC-seq cluster markers (Figure 3d, Figure 4—figure supplement 2) represent the percentage of cSNAIL/SNAIL ATAC-seq OCRs enriched relative to bulk (padj<0.05 and log2FoldDifference > 0.5) that overlap snATAC-seq cluster markers. snATAC-seq cluster markers were defined as enriched OCRs for that cluster relative to its k-nearest neighbors (padj<0.01 and log2FoldDifference > 1) that were not enriched OCRs for any other cluster. The significance of the enrichments was assessed using the hypergeometric test with the phyper() function in R, setting lower.tail = FALSE. Enrichments for cluster-specific OCRs were assessed using a background of all snATAC-seq OCRs (N = 415,813), and p-values were corrected for 84 tests with Bonferroni correction.

Assessment of PV+ neuron OCRs in different brain regions

Request a detailed protocol

We assessed PV+ neuron cSNAIL ATAC-seq samples from the cortex, striatum, and GPe tissue of healthy control mice from Lawler et al., 2020 (one male, one female) for differential open chromatin using DESeq2 as described above. We evaluated OCRs that were preferentially open in one brain region relative to each of the other brain regions (padj<0.01 and log2FoldDifference > 1) for sequence motif and pathway enrichments. We computed motif enrichments for tissue-specific PV+ OCRs using AME version 5.3.3 (McLeay et al., 2010) against a background of PV+ OCRs from all three tissues. Similarly, we computed pathway enrichments using GREAT version 4.0.4 (McLean et al., 2010) for tissue-specific PV+ OCRs relative to a background of PV+ OCRs from all three tissues.

Model interpretation

Request a detailed protocol

We used GkmExplain (Shrikumar et al., 2019) to calculate actual and hypothetical importance scores per base for each of 11 SVMs among 1755 true-positive PV+-specific OCR sequences that also scored PV+-specific across all SVMs. First, sequences were one-hot encoded. The importance scores were normalized based on the hypothetical importance scores of all possibilities per base, so that a base position decreased in importance if there were other nucleotide possibilities that produced similar scores. We identified sequence motifs with high contributions to PV+ scores for each SVM separately using TF-MoDISco version 0.4.2.3 (Shrikumar et al., 2018) with options chosen to align with final SVM parameters: sliding_window_size = 7, flank_size = 3, min_seqlets_per_task = 3000, trim_to_window_size = 7, initial_flank_to_add = 3, final_flank_to_add = 4, kmer_len = 7, num_gaps = 1, and num_mismatches = 1. The resulting sequence patterns, representing motifs generated from seqlet clusters, were trimmed to the 13 central bases and patterns with support from more than 100 seqlets were used in downstream analysis. The position weight matrices (PWMs) of these patterns were associated with known motifs in the Human and Mouse HOCOMOCO v11 FULL database using Tomtom (Gupta et al., 2007) with the Pearson correlation coefficient motif comparison function (Figure 5—source data 1). Motifs from all models were clustered based on PWM similarity using STAMP (Mahony and Benos, 2007); STAMP operations were performed after trimming motif edges with information content less than 0.4, using ungapped Smith–Waterman alignment, the iterative refinement multiple alignment strategy, Pearson correlation coefficient comparison metrics, and UPGMA tree construction. Finally, individual instances of motif sites were mapped in SC1 and SC2 sequences using FIMO with default parameters (Grant et al., 2011).

For in silico mutagenesis of high-importance motifs, we first ranked individual nucleotides based on their GkmExplain importance scores. We used nucleotides in the top 5% of importance scores to estimate motif site boundaries within SC1 and SC2, and we identified the top two important motif sites in each sequence, which were 6–8 bp each. We randomly scrambled motifs and then input the partially scrambled sequences into Tomtom (Gupta et al., 2007) to ensure that no other known motif site was introduced. We scored the partially scrambled sequences for each SVM using gkmpredict (Lee, 2016) in the same way that we did for other sequences.

Data availability

Sequencing data have been deposited in GEO under accession code GSE171549.

The following data sets were generated
The following previously published data sets were used
    1. Mo A
    2. Mukamel EA
    3. Davis FP
    4. Luo C
    5. Eddy SR
    6. Ecker JR
    7. Nathans J
    (2015) NCBI Gene Expression Omnibus
    ID GSE63137. Epigenomic Signatures of Neuronal Diversity in the Mammalian Brain.
    1. Lawler AJ
    2. Brown AR
    3. Bouchard RS
    4. Toong N
    5. Kim Y
    6. Velraj N
    7. Fox G
    8. Kleyman M
    9. Kang B
    10. Gittis AH
    11. Pfenning AR
    (2020) NCBI Gene Expression Omnibus
    ID GSE157359. Cell type-specific oxidative stress genomic signatures in the globus pallidus of dopamine depleted mice.
    1. Srinivasan C
    2. Phan BN
    3. Lawler AJ
    4. Ramamurthy E
    5. Kleyman M
    6. Brown AR
    7. Kaplow IM
    8. Wirthlin ME
    9. Pfenning AR
    (2021) NCBI Gene Expression Omnibus
    ID GSE161374. Addiction-associated genetic variants implicate brain cell type- and region-specific cis-regulatory elements in addiction neurobiology.

References

    1. Srivastava N
    2. Hinton G
    3. Krizhevsky A
    4. Sutskever I
    5. Salakhutdinov R
    (2014)
    Dropout: A Simple Way to Prevent Neural Networks from Overfitting
    Journal of Machine Learning Research 15:1929–1958.

Decision letter

  1. Jeremy J Day
    Reviewing Editor; University of Alabama at Birmingham, United States
  2. Naama Barkai
    Senior Editor; Weizmann Institute of Science, Israel
  3. Jeremy J Day
    Reviewer; University of Alabama at Birmingham, United States
  4. Cliff Kentros
    Reviewer; Norwegian University of Science and Technology, Norway

Our editorial process produces two outputs: i) public reviews designed to be posted alongside the preprint for the benefit of readers; ii) feedback on the manuscript for the authors, including requests for revisions, shown below. We also include an acceptance summary that explains what the editors found interesting or important about the work.

Decision letter after peer review:

Thank you for submitting your article "Machine learning sequence prioritization for cell type-specific enhancer design" for consideration by eLife. Your article has been reviewed by 3 peer reviewers, including Jeremy J Day as Reviewing Editor and Reviewer #1, and the evaluation has been overseen by Naama Barkai as the Senior Editor. The following individual involved in review of your submission has agreed to reveal their identity: Cliff Kentros (Reviewer #3).

The reviewers have discussed their reviews with one another, and the Reviewing Editor has drafted this to help you prepare a revised submission.

Essential revisions:

1) To expand on the overall applicability of the cSNAIL approach, it would be useful to determine whether identified PV-specific sequences using this approach extend to ATAC-seq signal from PV neurons in other species. Showing this would also help to generate confidence that the selected AAV sequences can drive expression in PV neurons in other systems. At a minimum, it would be important to demonstrating the accessibility of the machine-learning identified enhancer sets studied here in publicly available snATAC-seq datasets from other brain regions and in other species. This addition would go a long way to indicating the ability to generalize this approach, which is important for a resource manuscript like this.

2) In many cases, the computational approaches are not fully described and may introduce confusion for a more general readership. It would be useful for descriptions of what each model actually does to be incorporated into the manuscript.

3) Points raised by all reviewers regarding technical and interpretational clarifications should be addressed in a revised manuscript.

Reviewer #1 (Recommendations for the authors):

1. The introduction is a bit lengthy, and may be improved by efforts to consolidate the last three paragraphs.

2. It is encouraging that population-derived SVMs and single nucleus-derived SVMs arrive at similar conclusions with respect to selected sequences (Figure 1F). However, if I understand correctly all of this data is from mouse cortex. To extend the applicability of this approach, it would be useful to determine whether identified PV-specific sequences using this approach extend to ATAC-seq signal from PV neurons in other species. Showing this would also help to generate confidence that the selected AAV sequences can drive expression in PV neurons in other systems (thereby significantly expanding the applicability of this tool beyond the mouse).

3. Similar to the above comment, the extension of this approach to other brain regions will largely drive the application of this technology by other groups. While Figure 4 shows data demonstrating that the predictions generated from mouse cortex may hold for mouse striatum and GPe (although not as well as cortex), this claim is only tested using ATAC-seq. This claim could be strengthened by the addition of co-labelling evidence (as in Figure 2 for cortex) that targeted SUN-GFP cells are also PV+.

4. The text identifies Err3 and Mef2 motifs as being important for the PV-specific activity of SC1 and SC2 sequences. However, without additional experimental evidence demonstrating that loss or mutation of these motifs abolishes the PV-specific expression pattern of these sequences, this conclusion should be moderated.

5. Other reports using open chromatin profiling to identify enhancers for transgene expression in AAV have recently been published (PMID 33789083, 33789096). While these reports are cited here, the Discussion section of this manuscript would benefit from a more systematic comparison of these approaches with the current approach.

Reviewer #2 (Recommendations for the authors):

1) Twice the authors compare the PV specificity of their viral vectors to that achieved with the Pvalb-2A-Cre mouse (line 91, line 271), however no data or references are cited to explain what data the authors are comparing from the mouse strain. Published studies using Cre-dependent viruses in adult Pvalb-2A-Cre show excellent specificity for expressing in PV+ neurons, it is when the strain is crossed to other transgenic reporter mice (like the original Ai14 mouse in the Madisen et al., paper) that there is expression outside the PV+ population in the adult brain. This may not reflect Cre expression outside the PV+ population as much as it reflects Pvalb gene activation during development in cells beyond the highly PV+ population of the adult. Therefore, the authors need to clarify what data from the Pvalb-2A-Cre mice to which they are comparing.

2) in Line 275-276, and Figures 2d-e, the authors state that their PV derived enhancer vectors have a 9-fold increase in PV specificity over the negative control. However, the negative control vector has no promoter and thus should really have no expression (there is certainly exceptionally little expression in Figure 2C in the right hand most panel). If the image is truly representative, then it seems that the authors probably have too few expressing cells to say much that is meaningful in a quantitative way about the specificity of that negative control.

3) In line 338 the authors use their motif analysis to suggest a distinction in the MEF2 family members that may control the differentiation of different inhibitory cell types. However, to my knowledge there is no rigorous experimental identification of distinct MEF2A versus MEF2C versus MEF2D binding sites (if the authors are aware of a validated study this reference would be good to add). The slight variations in the motifs in the databases interrogated here may rather reflect differences in the methods of the groups that deposited the motifs. The authors would benefit the field to think broadly about their interpretation of these variations in motif enrichment and what they might mean.

4) As to the evidence that MEF2C is required for the PV+ interneuron lineage, what the Mayer 2018 paper actually showed is that conditional knockout of MEF2C in PV+ lineage cells (with Dlx6-Cre) led to loss of PV expression in the adult cortex. However, whether this is loss of PV expression, versus failure of PV cells to develop, was not addressed. The authors should adjust their language here to account for that uncertainty. It is also important that the Mayer study settled on MEF2C because it was more highly expressed in PV interneurons compared with other interneurons (esp. SST interneurons). However, MEF2C is highly expressed in many classes of excitatory neurons especially during development but also in the adult brain. Therefore, MEF2C alone cannot explain cell type specific enhancer function except in collaboration with other cell type specific transcription factors.

5) In Line 351 the authors highlight the association of open chromatin near BDNF especially at promoter IV. However, this is somewhat surprising, given that BDNF is really not expressed in interneurons (see PMID 24855953). This example highlights that the authors need to consider the possibility that not every region of differentially open chromatin reflects enhancers – these may instead be regions bound by repressors that silence nearby genes, or architectural factors that affect long-distance regulation of gene expression. The authors could strengthen this aspect of their analysis if they compared their differential OCRs to differential gene expression at least for their major cell types of interest.

Reviewer #3 (Recommendations for the authors):

To sum up the public review, it is not that we don't believe that there is added value in the SNAIL approach, we suspect there definitely is, and would honestly like to try it ourselves once it is published. It just needs to be properly and clearly demonstrated. The ideal way to show this would be to analyse the same tissue twice, once via SNAIL and once via any of the competing approaches discussed above, find areas of non-overlap and see which one works best at making specific vectors, but this is admittedly a big ask, which I am not making. However, there needs to be a bit more to demonstrate how useful it might be… another less onerous way one could do so can be seen in Figure 1 suppl 6, which shows the predicted PV specificity from SNAIL (1S6b) versus accessibility alone (1S6c). Note that the two panels are largely similar, but there are particular enhancers which have wildly different predicted specificity than from chromatin accessibility alone. The authors could illustrate how well their approach worked by taking an enhancer with very different values by the two methods (e.g. E14, E12) and seeing which prediction holds true in an AAV: SNAIL vs. raw accessibility.

Specific points are below.

Line 16: "introduce" seems out of place, since cSNAIL was already described in the previous 2020 paper… In general this is a real issue… the paper would be greatly aided by making clearer the functional distinctions between the two acronyms, rather than simply say that cSNAIL is Cre-based. The similarity of the acronyms will be confusing to the casual reader who doesn't look up the prior paper, because they are actually quite distinct in application. It is unclear how much of the paper is about the label versus the algorithm (it's almost entirely the latter). This is central to the novelty of the work, so it should be clearer.

Lines 56 to 68: in this description of selection of enhancers one relevant aspect is overlooked, namely that scATAC-seq is a relatively noisy approach to the prediction of active enhancers. A combination of different chromatin marks would likely be equally effective in selecting relevant enhancers. Naturally, this would be harder to accomplish than a simple scATAC-seq. For example the work by Ernst and Kellis is very useful here (for example PMID: 29120462).

Discovery of functional elements does not require necessarily any modeling, the addition of K27Ac marks alone for example improves prediction greatly in the case of active promoters and enhancers.

Figure 1 – supplement 2 and lines 112-114: even though this presentation is convincing in showing the predictive capabilities of machine learning, it does not put the empirically tested promoters into context of all predicted regulatory elements. I would like to see in the text an inclusion of which rank the cell class specific promoters reach (i.e. "the SVM model ranked the Gfap promoter as xx out of xxx astrocyte specific OCRs").

Furthermore, in supplement 2b, I would like to see the inclusion of all enhancers/promoters as gray, points. This will put the verified promoters into context.

Line 130 and other instances: please use "PV+" rather than "PV", in the context of "PV-", only using "PV" for "PV+" is confusing. In the same vein "PV-" may work better to prevent confusion with "PV-specific" in line 132.

Line 133: how many OCRs were present on the merged reproducible list? This information would help to put into context the differentally accessible regions in line 140.

Lines 141 to 153: it is unclear what ROC does for a naïve reader. Please include some more information what actually happens in this model. Furthermore, with OCRs of 500bp and motifs typically 6-20bp (possibly several), that would leave many basepairs of noise. Are the OCRs first screened for relevant motifs, or is the ROC applied directly the full OCRs?

How well does the CNN models perform on the more broad peaks of ATAC-seq compared to the data they were originally designed for with a much higher basepair resolution (ie. CHiP-NEXUS)?

Line 160: how many peaks were present in the merged reproducible list of peaks?

Line 165 to 174: it is unclear what the CNN does for a naïve reader. Please add a few sentences of explanation. Additionally, the true/false positive rate graphs are presented without further explanation.

Lines 221 to 225: Rather than picking out 3 values that correlate with the hypothesis, please report the full data.

In other words, please include a full list/table of enhancers E1-E34 with scores from both models, as well as values of specificity. Then ideally include two graphs with on the X-axis specificity and on the Y axis the model score, for a selection of the contrasts. I requests this because in many cases in figure 1 supplementary 4 and 5 the correlation seems to be driven primarily by E4, E22 and E29.

Could you also provide the correlation between predicted activity and specificity without E4, E22 and E29? This should hold up even without these most obvious examples.

Figure 1 supplement 5a: according to the predicted activity E10, E7 and E9 are particularly specific for PV+ cells compared to VIP+ cells. Similarly E14 should be particularly specific for PV+ cells compared to SST+ cells. This predicted specify is not apparent from the accessibility (supplement 5b) and the general, empirically determined specificity is not particularly high. But, based on the models, these particular enhancers should display specificity for PV+ cells over VIP+ cells. This hypothesis is easily tested by injecting viral vectors with these particular enhancers and counting the transgene expression in PV+ cells compared to VIP+/SST+ cells.

This hypothesis is particularly interesting because the tested enhancers are fully in line with the accessibility predictor of specificity, whereas E10, E7 and E9 predict something different than accessibility.

Figure 1 supplement 5b: It appears the correlation between specificity and accessibility is stronger than the correlation between specificity and predicted activity. Please provide the Pearson correlation of specificity and accessibility also and comment on the difference between the two correlations.

Line 241 to 250: it appears E14, and in lesser extend E11 and E2, have a similarly high specificity. Do these enhancers contain motifs too? For that matter, all other enhancers?

Line 258: what is the motivation for picking SC2 over all other candidates in the 90th percentile?

Figure 2 A-B: It appears the snATAC signal for SC1 and SC2 alone would be very strong in pointing these enhancers out as PV specific enhancers. Based on only Accessibility, how high would they rank? In other words, could you make a list with all enhancers sorted based on log2 fold accessibility difference, and provide the percentile ranks of these enhancers based on this compared to the model predicted ranks?

Figure 2C-E and lines 266 to 276: which region was investigated? Were the same cortical regions investigated to establish the percentages? Some cortical regions are naturally more abundant with PV+ cells. Please include more details on this analysis.

If this is done right: very strong, excellent to see this working! Absolutely convincing SC1 and SC2 drive PV specific expression.

Line 297 to 327: Strong, convincing evidence that SC1 and SC2 are indeed PV specific. This is not only interesting for those researching PV cells, but also an indication that the selection of these enhancers based on the in silico models was successful.

Line 360: it would be interesting to see a GPe or cortical PV+ neuron specific enhancer in a subsequent publication!

Line 444: most interesting, a good lead into functional understanding of enhancers-TF interaction in particular cell types in the brain.

Line 449: This sentence seems a bit of an overstatement. As I understand, first OCRs are selected on differential activity. Meaning a requirement of scATAC data with defined and annontated clusters. For this statement to be true, the models need to be run on all peaks, rather than pre-selected regions with differential activity. Perhaps I misunderstood and the models were actually run on the merged lists of peaks, in that case disregard this comment.

Line 459: I'm not sure which data this is based on. I don't think a direct comparison was made between the average score in Figure 1.sup5a and b. I would like to see this explicitly state in the text, along with correlation plots for specificity vs. predicted activity and specificity vs. accessibility, with pearson correlations.

Line 491: Could you speculate on the possibility to find or generate regulatory elements specific for cell types in these regions. So, instead of generalization, specification.

Lines 493 to 501: Another limitation, is that when supplemented in mature neurons, the viral vector will not undergo the same developmental, epigenomic modifications. This may result in different levels of expression. At least, in our hands we found discrepancies in expression between transgenically provided genes and virally provided ones. There does not seem to be much literature on this topic though, so a discussion may be beyond the scope of this paper.

https://doi.org/10.7554/eLife.69571.sa1

Author response

Essential revisions:

1) To expand on the overall applicability of the cSNAIL approach, it would be useful to determine whether identified PV-specific sequences using this approach extend to ATAC-seq signal from PV neurons in other species. Showing this would also help to generate confidence that the selected AAV sequences can drive expression in PV neurons in other systems. At a minimum, it would be important to demonstrating the accessibility of the machine-learning identified enhancer sets studied here in publicly available snATAC-seq datasets from other brain regions and in other species. This addition would go a long way to indicating the ability to generalize this approach, which is important for a resource manuscript like this.

We agree that demonstrating the generalizability of the approach is of particular importance. We made several substantial computational and experimental advances to address these concerns:

1. We compared the predictions of machine learning models trained on mouse cortex, mouse striatum, and mouse GPe. High concordance between PV+ neuron-specific predictions in the cortex and striatum suggest generalizability between brain regions where PV+ neurons are similar.

2. We trained new machine learning models using human single nucleus open chromatin data (Corces et al., 2020). The success of these models shows the applicability of the SNAIL framework for source data and applications beyond the mouse brain.

3. We demonstrate that PV+ machine learning models trained from human data show substantial correlation to those trained from mouse data. This shows cross-species conservation in the regulatory sequence features that discriminate PV+ neuron-specific enhancers. It also suggests that many machine learning-designed enhancer AAVs chosen from the mouse genome are likely to be transferable to primates and other species. In fact, one of our enhancer candidates SC1 scored as one of the highest predicted PV+ neuron-specific sequences for both mouse and human.

4. We tested one of our SNAIL PV+ AAVs in the macaque cortex where it exhibited similar selectivity for PV+ neurons. This validates the wide utility of the SC1 SNAIL virus and underscores the generalizability of the SNAIL system to other contexts and species.

These results of items 2 – 4 above are described in a new section in the text “Cross-species analyses with SNAIL” and reference a new main figure, Figure 6.

2) In many cases, the computational approaches are not fully described and may introduce confusion for a more general readership. It would be useful for descriptions of what each model actually does to be incorporated into the manuscript.

Based on the broad audience of eLife, we have added more background and detail about the computational approaches, with special attention to the machine learning modeling. Some examples include:

“These models take 500 bp candidate enhancer DNA sequence strings as input and they output, for each sequence, the cell type in which that candidate enhancer is active and an associated score. The similarity between sequences is determined based on gapped k-mer count vectors, i.e. the number of occurrences of all short subsequences of length k, tolerating some gaps or mismatches, as implemented by LS-GKM (Ghandi et al., 2014; Lee 2016).Where there are sufficient reliable sequence features associated with differential enhancer activation between two cell types, the model should learn these principles during the training phase and then be able to apply these principles to determine the cell type-specific activities of new sequences.”

“CNNs are a type of artificial neural network defined by multiple convolutional layers. The CNNs were trained to take in 1000 bp DNA sequence strings with different accessibility between two cell types, automatically extract predictive sequence features, and output a cell class probability between 0 and 1 (see methods for details). Compared with SVMs, CNNs are better-equipped to learn higher-order interactions between sequence features due to CNNs’ capacity for flexible feature representation and automated feature selection (Cun et al., 1989).”

“To assess model performance, we used standard classifier metrics, the area under the receiver operator curve (auROC) and the area under the precision-recall curve (auPRC). These scores quantify model performance by comparing the predicted class to the actual class, where a randomly guessing binary classifier would have an auROC score of 0.5 and an auPRC score equal to the fraction of actual positives in the data, and a perfect classifier would have a maximum auROC score of 1.0 and auPRC score of 1.0.”

Reviewer #1 (Recommendations for the authors):

1. The introduction is a bit lengthy, and may be improved by efforts to consolidate the last three paragraphs.

We consolidated the introduction where possible, with particular attention to the last three paragraphs. In the second to last paragraph, we made the description of machine learning more concise.

“The nucleotide sequence code that links transcription factor binding sites and other DNA features to enhancer activity is underutilized in AAV enhancer design, perhaps due to its complexity (Jindal and Farley 2021). We reasoned that machine learning classifiers could be leveraged to identify the most characteristic and specific enhancer sequence patterns for a cell population, enabling efficient prioritization of sequences that are likely to drive selective expression. Convolutional neural networks (CNNs) (Cun et al., 1989) and support vector machines (SVMs), for example, have achieved state-of-the-art performance on predicting enhancer activity from sequence (Chen, Fish, and Capra 2018; Kaplow et al., 2021; Kelley 2020).”

2. It is encouraging that population-derived SVMs and single nucleus-derived SVMs arrive at similar conclusions with respect to selected sequences (Figure 1F). However, if I understand correctly all of this data is from mouse cortex. To extend the applicability of this approach, it would be useful to determine whether identified PV-specific sequences using this approach extend to ATAC-seq signal from PV neurons in other species. Showing this would also help to generate confidence that the selected AAV sequences can drive expression in PV neurons in other systems (thereby significantly expanding the applicability of this tool beyond the mouse).

We generated new human-based SVMs using single nucleus open chromatin data from human cortical regions (Corces et al., 2020) to parallel the mouse SVMs. Mouse SVM scores and human SVM scores were well correlated (Pearson R=0.77) for PV+ neuron OCRs. We incorporated these additional models into the manuscript in the new Results section “Cross-species analyses with SNAIL” line 455 and new Figure 6. We also demonstrated selectivity of the SC1 SNAIL probe for PV+ neurons in the macaque cortex (Figure 6c).

3. Similar to the above comment, the extension of this approach to other brain regions will largely drive the application of this technology by other groups. While Figure 4 shows data demonstrating that the predictions generated from mouse cortex may hold for mouse striatum and GPe (although not as well as cortex), this claim is only tested using ATAC-seq. This claim could be strengthened by the addition of co-labelling evidence (as in Figure 2 for cortex) that targeted SUN-GFP cells are also PV+.

Unfortunately, we did not collect imaging data from the mouse striatum and GPe. We have tempered the language to note that further experiments are necessary to confirm PV+ neuron labeling in additional brain regions. e.g.

“From these interpretations, SC1 and SC2 SNAIL viruses seem to generalize to the striatum and GPe, though they may not be as robust as they are within the cortical context. Additional experimental evidence may be necessary to confirm the appropriateness of PV+ SNAIL viruses for certain applications outside the cortex.”

We also added an additional broader comparison of the machine learning models from different brain regions, which shows high agreement between cortical models and striatal models, and a weak relationship between cortical models and GPe models. This result is biologically intuitive, GPe PV+ neurons have major distinguishing phenotypes, including projections to other brain regions (Hernández et al., 2015; Saunders, Huang, and Sabatini 2016).

“We assessed the correlation between the score outputs of the cortex PV+ vs. PV- SVM with the striatum PV+ vs. PV- or GPe PV+ vs. PV- SVMs for our set of experimentally identified mouse cortical PV+ neuron enhancers sequences. Enhancer scores were well correlated between the cortex and striatum models (pearson = 0.73, spearman = 0.72), indicating shared regulatory sequence determinants between PV+ neurons in the cortex and striatum. There was a low correlation between enhancer scores of the cortex and GPe models (pearson = 0.25, spearman = 0.24), indicating differences in the learned PV+ regulatory sequence properties in these regions.”

Overall, we believe that the ATAC-seq data from multiple brain regions, as well as the addition of the cross-species analyses, demonstrate high utility for PV SNAIL technologies.

4. The text identifies Err3 and Mef2 motifs as being important for the PV-specific activity of SC1 and SC2 sequences. However, without additional experimental evidence demonstrating that loss or mutation of these motifs abolishes the PV-specific expression pattern of these sequences, this conclusion should be moderated.

We agree that further experimentation is needed to assess this claim, and we have moderated the language in the manuscript. We amended the section title to be “Err3 and Mef2 motifs are important for the PV+ neuron-specific activity score of SC1 and SC2 sequences” to direct our interpretation to the model score and not necessarily to functional impact. To add to this discussion, we have added predictions for the contributions of these motifs using in silico mutagenesis (Figure 5d). We added related text and moderated conclusions in lines 450 – 453:

“Disruption of Err3 and Mef2 motifs within SC1 or SC2 resulted in sharply decreased the prediction of PV+ specificity according to the scores across the SVMs (Figure 5d). While these impacts are untested in vivo, these analyses provide an intuition for potential nucleotide contributions to PV+ neuron-specific enhancer sequence function in SC1 and SC2.”

5. Other reports using open chromatin profiling to identify enhancers for transgene expression in AAV have recently been published (PMID 33789083, 33789096). While these reports are cited here, the Discussion section of this manuscript would benefit from a more systematic comparison of these approaches with the current approach.

We view SNAIL as a novel addition to existing large screen approaches which could make them more efficient and more accessible for boutique applications. We added to the second paragraph of the discussion to better explain SNAIL in the context of the field:

“Thus far, success in cell type-specific AAV creation has been limited to intensive screens of dozens of individual AAV enhancer candidates in which most do not produce highly selective expression (Vormstein-Schneider et al., 2020; Mich et al., 2021; Graybuck et al., 2021). In SNAIL, our framework for cell type-specific AAV engineering, we incorporate machine learning classifiers as an additional filter for improved enhancer pre-selection in order to mitigate the experimental burden. On a set of externally tested PV+ enhancer-driven AAVs (Vormstein-Schneider et al., 2020), the average PV+ specificity score across our classifiers was more predictive of PV+ specific AAV expression than the log2 fold difference of snATAC-seq signal, sequence conservation, or accessibility conservation at these loci. With the SNAIL framework, we identified and validated two novel enhancers that drive targeted expression in PV+ neurons in the mouse cortex. While these do not represent enough trials to establish a new conversion rate from cell type-specific OCRs to cell type-specific AAVs, we were encouraged by the immediate success of the first probes we selected. We believe that the incorporation of differential sequence property analyses will continue to improve the throughput of targeted AAV development in new contexts and make cell type-specific enhancer tool development accessible for more researchers.”

In later sections of the discussion, we go on to emphasize other unique advantages of SNAIL over other approaches, including the interpretability of sequence contributions (beginning in line 503) and the isolatable reporter Sun1GFP (beginning in line 521).

Reviewer #2 (Recommendations for the authors):

1) Twice the authors compare the PV specificity of their viral vectors to that achieved with the Pvalb-2A-Cre mouse (line 91, line 271), however no data or references are cited to explain what data the authors are comparing from the mouse strain. Published studies using Cre-dependent viruses in adult Pvalb-2A-Cre show excellent specificity for expressing in PV+ neurons, it is when the strain is crossed to other transgenic reporter mice (like the original Ai14 mouse in the Madisen et al., paper) that there is expression outside the PV+ population in the adult brain. This may not reflect Cre expression outside the PV+ population as much as it reflects Pvalb gene activation during development in cells beyond the highly PV+ population of the adult. Therefore, the authors need to clarify what data from the Pvalb-2A-Cre mice to which they are comparing.

These comparisons reference data collected alongside SC1 and SC2 and available in this paper. For imaging, we compared to PValb-2A-Cre/Ai14 fluorescent expression (~50% specificity to Pvalb+ cells in our images). We have amended the text to be more explicit about this, and to direct readers to the appropriate source data:

“This was an 11-fold enrichment in precision over the Ef1a promoter and notably, an almost 2-fold enrichment over Cre reporter labeling in Pvalb-2A-Cre/Ai14 double transgenic mice (Figure 2c-e, Figure 2-Source Data 2).”

For the affinity purification ATAC-seq data, we compared to single transgenic PValb-2A-Cre labeling using cSNAIL. Our claim that SNAIL probes are more specific to PV+ interneurons than the PValb-2A-Cre strain references the experiment comparing all Sun1GFP-purified populations with snATAC-seq (Figure 3d).

“We also note that cSNAIL PV+ ATAC-seq had an additional 8% overlap with excitatory cluster L5 PT markers (p = 2.5 x 10-45), possibly reflective of Pvalb-2A-Cre line labeling in layer 5 Parvalbumin-expressing excitatory neurons (Tanahira et al., 2009; Jinno and Kosaka 2004; Roccaro-Waldmeyer et al., 2018). These OCRs were absent in SC1- and SC2-isolated cells. In fact, SC1 and SC2 had no enrichment for cluster-specific OCRs of any cluster other than PV+ (≤ 2% overlap, p > 0.1), including the closely related SST+ neuron population. This suggests that SC1 and SC2 SNAIL probes target a stricter subset of the cells than the Pvalb-2A-Cre mouse strain, likely restricted to PV+ inhibitory interneurons.”

2) in Line 275-276, and Figures 2d-e, the authors state that their PV derived enhancer vectors have a 9-fold increase in PV specificity over the negative control. However, the negative control vector has no promoter and thus should really have no expression (there is certainly exceptionally little expression in Figure 2C in the right hand most panel). If the image is truly representative, then it seems that the authors probably have too few expressing cells to say much that is meaningful in a quantitative way about the specificity of that negative control.

There were 154 weakly expressing gfp+ cells total across the N.C. images, compared to thousands of gfp+ for SC1 and SC2 (see Figure 2-Source Data 2). We have removed the statement about the 9-fold increase over N.C. because there was a difference between titers. Instead, we use the N.C. data to determine that there is no PV+ bias in background expression. We have changed the text to reflect these differences:

“The negative control virus was injected at as high of a concentration as possible (14x the concentration of SC1, SC2, and Ef1a injections) to detect any biases in spurious background expression. At this titer, a number of cells exhibited GFP expression (Figure 2-Source Data 2), but these did not appear biased toward PV+ neurons (Figure 2d).”

3) In line 338 the authors use their motif analysis to suggest a distinction in the MEF2 family members that may control the differentiation of different inhibitory cell types. However, to my knowledge there is no rigorous experimental identification of distinct MEF2A versus MEF2C versus MEF2D binding sites (if the authors are aware of a validated study this reference would be good to add). The slight variations in the motifs in the databases interrogated here may rather reflect differences in the methods of the groups that deposited the motifs. The authors would benefit the field to think broadly about their interpretation of these variations in motif enrichment and what they might mean.

We have modified the paragraph beginning on line 429 to remove the claims about different Mef2 motifs. The utility and interpretations are not heavily dependent on the specificities of the Mef2 family members.

4) As to the evidence that MEF2C is required for the PV+ interneuron lineage, what the Mayer 2018 paper actually showed is that conditional knockout of MEF2C in PV+ lineage cells (with Dlx6-Cre) led to loss of PV expression in the adult cortex. However, whether this is loss of PV expression, versus failure of PV cells to develop, was not addressed. The authors should adjust their language here to account for that uncertainty. It is also important that the Mayer study settled on MEF2C because it was more highly expressed in PV interneurons compared with other interneurons (esp. SST interneurons). However, MEF2C is highly expressed in many classes of excitatory neurons especially during development but also in the adult brain. Therefore, MEF2C alone cannot explain cell type specific enhancer function except in collaboration with other cell type specific transcription factors.

Thank you for the suggestion. We completely agree that MEF2C alone cannot explain cell type-specificity. In fact, the strong performance of our computational models is almost certainly due to their ability to identify how complex combinations of transcription factor binding sites and other sequence signatures influence PV+ neuron-specific enhancer activity. We have added text to the discussion that clarifies the role we propose MEF2C plays in this system.

“An additional advantage of incorporating classifiers for cell type-specific enhancer selection is increased interpretability of the factors that govern success. The sequence patterns learned by PV+ models reflected known PV+ neuron biology. Common motifs contributing to successful PV+ probe enhancers included Err3, Mef2, and Lhx6, important for the specification and maintenance of the cortical PV+ interneuron lineage (Zhao et al., 2008; Mayer et al., 2018; Liodis et al., 2007). It is interesting to note the varying expression patterns of these transcription factors. In the cortex, Err3 expression is mainly restricted to PV+ neurons and Lhx6 is restricted to MGE interneurons, but Mef2 transcripts are widely expressed in many inhibitory and excitatory neurons. It is likely combinatorial interactions between abundant and cell type-specific transcription factor binding events that specify PV+ neuron enhancer activation. The SNAIL framework provides an opportunity to meaningfully leverage these complex sequence codes.”

5) In Line 351 the authors highlight the association of open chromatin near BDNF especially at promoter IV. However, this is somewhat surprising, given that BDNF is really not expressed in interneurons (see PMID 24855953). This example highlights that the authors need to consider the possibility that not every region of differentially open chromatin reflects enhancers – these may instead be regions bound by repressors that silence nearby genes, or architectural factors that affect long-distance regulation of gene expression. The authors could strengthen this aspect of their analysis if they compared their differential OCRs to differential gene expression at least for their major cell types of interest.

We have modified the BDNF OCR discussion and included speculation on this discrepancy.

“The set of PV+ OCRs enriched in cortical PV+ neurons included 10 regions associated with the Bdnf gene (Ensembl Genes; FDR Q = 0.0035). Bdnf is generally expressed in excitatory forebrain neurons but not PV+ interneurons. The presence of PV+ neuron OCRs near Bdnf could represent genomic regions with non-enhancer functions, enhancers that regulate another gene, the binding regions of repressive TFs, or trace contamination from excitatory populations.”

Reviewer #3 (Recommendations for the authors):

To sum up the public review, it is not that we don't believe that there is added value in the SNAIL approach, we suspect there definitely is, and would honestly like to try it ourselves once it is published. It just needs to be properly and clearly demonstrated. The ideal way to show this would be to analyse the same tissue twice, once via SNAIL and once via any of the competing approaches discussed above, find areas of non-overlap and see which one works best at making specific vectors, but this is admittedly a big ask, which I am not making. However, there needs to be a bit more to demonstrate how useful it might be… another less onerous way one could do so can be seen in Figure 1 suppl 6, which shows the predicted PV specificity from SNAIL (1S6b) versus accessibility alone (1S6c). Note that the two panels are largely similar, but there are particular enhancers which have wildly different predicted specificity than from chromatin accessibility alone. The authors could illustrate how well their approach worked by taking an enhancer with very different values by the two methods (e.g. E14, E12) and seeing which prediction holds true in an AAV: SNAIL vs. raw accessibility.

Specific points are below.

Line 16: "introduce" seems out of place, since cSNAIL was already described in the previous 2020 paper… In general this is a real issue… the paper would be greatly aided by making clearer the functional distinctions between the two acronyms, rather than simply say that cSNAIL is Cre-based. The similarity of the acronyms will be confusing to the casual reader who doesn't look up the prior paper, because they are actually quite distinct in application. It is unclear how much of the paper is about the label versus the algorithm (it's almost entirely the latter). This is central to the novelty of the work, so it should be clearer.

We have made attempts throughout the manuscript to clarify. Specifically, we note the similarity and difference between cSNAIL and SNAIL at first mention, line 90:

“Similar to our previously described Cre-activated AAV technology cSNAIL (Lawler et al., 2020), SNAIL probes have the unique advantage of an affinity purification-compatible fluorescent reporter that can be used to isolate rare cell types (Mo et al., 2015; Deal and Henikoff 2010; Lawler et al., 2020). Unlike cSNAIL, which relies on Cre activation, SNAIL probes are stand-alone AAVs driven by cell type-specific enhancer sequences selected through machine learning models. Thus, SNAIL probes may be used in a wide variety of systems including wild type mice, primates, and other species.”

We also changed our language to refer to SNAIL throughout the text as a “framework” for machine learning-assisted cell type-specific sequence design. For example:

“Here, we present a framework for machine learning-assisted engineering of cell type-specific AAVs, which we refer to as Specific Nuclear Anchored Independent Labeling (SNAIL).”

“In SNAIL, our framework for cell type-specific AAV engineering, we incorporate machine learning classifiers as an additional filter for improved enhancer pre-selection in order to mitigate the experimental burden.”

“With the SNAIL framework, we identified and validated two novel enhancers that drive targeted expression in PV+ neurons in the mouse cortex.”

Thank you, it is corrected.

Lines 56 to 68: in this description of selection of enhancers one relevant aspect is overlooked, namely that scATAC-seq is a relatively noisy approach to the prediction of active enhancers. A combination of different chromatin marks would likely be equally effective in selecting relevant enhancers. Naturally, this would be harder to accomplish than a simple scATAC-seq. For example the work by Ernst and Kellis is very useful here (for example PMID: 29120462).

Discovery of functional elements does not require necessarily any modeling, the addition of K27Ac marks alone for example improves prediction greatly in the case of active promoters and enhancers.

We have added to that paragraph to note that ATAC-seq is an imperfect proxy for enhancer activity.

“ATAC-seq (Buenrostro et al., 2013) is an advantageous technique for defining potential cell type-specific enhancer regions because of its high nucleotide resolution and its compatibility with small cell populations and even with single cell technologies (Buenrostro, Wu, Litzenburger, et al., 2015; Cusanovich et al., 2015). Though convenient, chromatin accessibility is a noisy approximation for enhancer activity and differentially accessible sequence elements often fail to produce selective expression in vivo.”

We agree that the addition of other epigenetic features, including H3K27ac and H3K4me1 (Ernst and Kellis 2017), could potentially improve the predictions. These, and other potential features, are not included in our approach for two reasons. First, these histone modifications tend to be found across broader regions than ATAC-Seq peaks (>1,000bp) (Roadmap Epigenomics Consortium [2015] 2015). In our experience, this can make it more difficult to learn important local sequence features. Second, the availability of this type of data for specific cell types is rare, because of the technical difficulty of ChIP-seq in tiny populations of cells. Overall, our approach helps isolate true cell type-specific sequence characteristics while only requiring convenient ATAC-seq data.

Figure 1 – supplement 2 and lines 112-114: even though this presentation is convincing in showing the predictive capabilities of machine learning, it does not put the empirically tested promoters into context of all predicted regulatory elements. I would like to see in the text an inclusion of which rank the cell class specific promoters reach (i.e. "the SVM model ranked the Gfap promoter as xx out of xxx astrocyte specific OCRs").

Furthermore, in supplement 2b, I would like to see the inclusion of all enhancers/promoters as gray, points. This will put the verified promoters into context.

We have added the rank information as suggested. We also included all test set enhancers as gray points in Figure 1—figure supplement 2b.

“The Gfap promoter sequence, which has a heavy astrocyte bias in vivo, was classified as astrocyte-specific in our neuron vs. astrocyte model (5,580 astrocyte-specific enhancers evaluated; Gfap ranks 3,298/5,580). The same neuron vs. astrocyte model scored the CamkII promoter and Dlx promoter sequences, which are known to have neuron-specific activity, as neuron-specific (14,347 neuron-specific enhancers evaluated; CamkII = rank 13,300/14,347, Dlx = rank 9,736/14,347). Also consistent with empirical expectations, the excitatory vs. inhibitory neuron model predicted the CamkII sequence to have an excitatory neuron preference and the Dlx sequence to have an inhibitory neuron preference (15,391 excitatory neuron-specific enhancers evaluated; CamkII = rank 11,541. 4,608 inhibitory neuron-specific enhancers tested; Dlx = rank 1,142/4,608) (Figure 1—figure supplement 2).”

Line 130 and other instances: please use "PV+" rather than "PV", in the context of "PV-", only using "PV" for "PV+" is confusing. In the same vein "PV-" may work better to prevent confusion with "PV-specific" in line 132.

All instances have been standardized to PV+ and PV-.

Line 133: how many OCRs were present on the merged reproducible list? This information would help to put into context the differentally accessible regions in line 140.

There were 160,450 merged reproducible peaks in this analysis. We added detailed information about the number of total peaks, differential peaks, and positive/negative set ratios for all comparisons into Figure 1-Source Data 1.

Lines 141 to 153: it is unclear what ROC does for a naïve reader. Please include some more information what actually happens in this model. Furthermore, with OCRs of 500bp and motifs typically 6-20bp (possibly several), that would leave many basepairs of noise. Are the OCRs first screened for relevant motifs, or is the ROC applied directly the full OCRs?

We have included a better description of what the models actually do at first mention. We do not screen OCRs for relevant motifs, and the full 500 bp sequence goes into the model and evaluations. A key advantage of the machine learning models is their ability to pick out relevant sequence signal (e.g. motifs) among noise, much better than a person could do manually with this volume of data.

“These models take 500 bp candidate enhancer DNA sequence strings as input and they output, for each sequence, the cell type in which that candidate enhancer is active and an associated score. The similarity between sequences is determined based on gapped k-mer count vectors, i.e. the number of occurrences of all short subsequences of length k, tolerating some gaps or mismatches, as implemented by LS-GKM (Ghandi et al., 2014; D. Lee 2016). Where there are sufficient reliable sequence features associated with differential enhancer activation between two cell types, the model should learn these principles during the training phase and then be able to apply these principles to determine the cell type-specific activities of new sequences.”

ROC is a measure of accuracy of the classifier, i.e., given the 500bp sequence alone, how well can the model correctly classify sequences into the PV+ or PV- category. ROC metrics are applied to the whole 500 bp sequence, where each 500 bp sequence counts as 1 true positive, false positive, true negative, or false negative.

How well does the CNN models perform on the more broad peaks of ATAC-seq compared to the data they were originally designed for with a much higher basepair resolution (ie. CHiP-NEXUS)?

CNNs have been widely adopted for a large variety of data modalities, from computer vision to natural language processing and genomics. Within genomics, they have been applied to wide variety of sequence signals. Some of the early approaches were indeed shorter sequences for transcription factor binding sites (Zeng et al., 2016). However, when predicting broader epigenetic features, including enhancers, 1000bp can be used to train accurate models (Zhou and Troyanskaya 2015). Subsequent models have under even longer 3kb inputs (Kelley 2020). Some recent methods have even used sequences over 100kb as input to predictions (Kelley 2020).

Line 160: how many peaks were present in the merged reproducible list of peaks?

There were 186,016 merged reproducible peaks. I have added all compiled peak information in Figure 1—figure supplement 1.

Line 165 to 174: it is unclear what the CNN does for a naïve reader. Please add a few sentences of explanation. Additionally, the true/false positive rate graphs are presented without further explanation.

We have added clarification on the CNN modeling and the graphs of auROC and auPRC.

“CNNs are a type of artificial neural network defined by multiple convolutional layers. The CNNs were trained to take in 1000 bp DNA sequence strings with different accessibility between two cell types, automatically extract predictive sequence features, and output a cell class probability between 0 and 1 (see methods for details). Compared with SVMs, CNNs are better-equipped to learn higher-order interactions between sequence features due to CNNs’ capacity for flexible feature representation and automated feature selection (Cun et al., 1989).”

“To assess model performance, we used standard classifier metrics, the area under the receiver operator curve (auROC) and the area under the precision-recall curve (auPRC). These scores quantify model performance by comparing the predicted class to the actual class, where a randomly guessing binary classifier would have an auROC score of 0.5 and an auPRC score equal to the fraction of actual positives in the data, and a perfect classifier would have a maximum auROC score of 1.0 and auPRC score of 1.0.”

Lines 221 to 225: Rather than picking out 3 values that correlate with the hypothesis, please report the full data.

In other words, please include a full list/table of enhancers E1-E34 with scores from both models, as well as values of specificity. Then ideally include two graphs with on the X-axis specificity and on the Y axis the model score, for a selection of the contrasts. I requests this because in many cases in figure 1 supplementary 4 and 5 the correlation seems to be driven primarily by E4, E22 and E29.

Could you also provide the correlation between predicted activity and specificity without E4, E22 and E29? This should hold up even without these most obvious examples.

We have included the full data in a new supplement Figure 1-Source Data 3. Based on the feedback, we have transitioned away from correlation estimates and toward binary analysis. This is because our models are binary classifiers, which predict a class label (+ or -), and the score should be thought of as a measure of confidence of this label, with scores near 0 being less confident, and scores further from 0 more confident. Therefore, we should not necessarily expect strong correlation between model scores and experimental specificity for enhancers with intermediate performance. A regression model would be more appropriate if trying to recapitulate this trend. Instead, as you note, our classifiers perform great on the best and worst examples. This is useful for narrowing a pool of candidate enhancers down to the best candidates. Essentially, the practical goal is to minimize false positives, where a high LFC might lead you to use an enhancer, but it actually has poor specificity in vivo due to sequence composition. We have revised this analysis to group E1 – E34 into either high experimental specificity (>70%) or low experimental specificity (<70%), and plotted the LFC or model scores in Figure 1—figure supplement 5. The results are described in the text:

“On the complete set of enhancers, SVM score was predictive of the PV+ specificity group (p = 0.008), and differential activity log2 fold change had a weak association with PV+ specificity group (p = 0.069) (Figure 1—figure supplement 5c). Much of the log2 fold difference association with group PV+ specificity was driven by enhancer sequences with low log2 fold difference and low PV+ specificity. Undesirably, there were some sequences with high log2 fold difference and low in vivo specificity. Within the subset of enhancers with high log2 fold difference (log2 fold difference > 1), the log2 fold difference was not associated with specificity group (p = 0.601), while SVM score was weakly associated (p = 0.057) (Figure 1—figure supplement 5d). Unlike log2 fold difference scores alone, SVM scores may limit false positive candidates and improve efficient enhancer AAV selection.”

Figure 1 supplement 5a: according to the predicted activity E10, E7 and E9 are particularly specific for PV+ cells compared to VIP+ cells. Similarly E14 should be particularly specific for PV+ cells compared to SST+ cells. This predicted specify is not apparent from the accessibility (supplement 5b) and the general, empirically determined specificity is not particularly high. But, based on the models, these particular enhancers should display specificity for PV+ cells over VIP+ cells. This hypothesis is easily tested by injecting viral vectors with these particular enhancers and counting the transgene expression in PV+ cells compared to VIP+/SST+ cells.

This hypothesis is particularly interesting because the tested enhancers are fully in line with the accessibility predictor of specificity, whereas E10, E7 and E9 predict something different than accessibility.

Thank you for the suggestion. It is a great experiment idea, but not within the scope of this project at this time.

Figure 1 supplement 5b: It appears the correlation between specificity and accessibility is stronger than the correlation between specificity and predicted activity. Please provide the Pearson correlation of specificity and accessibility also and comment on the difference between the two correlations.

We appreciate this suggestion. Based on the new models we constructed, we believe this comment no longer applies. See above note on binarized analysis.

Line 241 to 250: it appears E14, and in lesser extend E11 and E2, have a similarly high specificity. Do these enhancers contain motifs too? For that matter, all other enhancers?

Other enhancers do contain PV+ relevant motifs to some extent as well. We have included a new supplementary table, Figure 1-Source Data 4, with this information.

Line 258: what is the motivation for picking SC2 over all other candidates in the 90th percentile?

We used numerous criteria for identifying enhancers that are outlined in the methods (beginning line 746, section “SVM data preparation”). After scoring with all models (all scores available in Figure 2-Source Data 1), we arbitrarily chose one at the 90th percentile to get a better sense of whether only the very top enhancers are likely to drive cell type-specificity. We amended the main text to explain this:

“Among true PV+ neuron-specific enhancer sequences that i) were differential OCRs in PV+ vs. PV-, PV+ vs. EXC, and PV+ vs. VIP+ sorted population data and ii) scored PV+ positive across all SVM evaluations (1,755 sequences), SC1 was the highest predicted sequence candidate, while SC2 was in the 90th percentile (Figure 2b, Figure 2-Source Data 1). We chose these sequences to represent two different confidence levels and evaluate the general potential of the top 10% of machine learning-prioritized PV+ neuron-specific enhancers.”

Figure 2 A-B: It appears the snATAC signal for SC1 and SC2 alone would be very strong in pointing these enhancers out as PV specific enhancers. Based on only Accessibility, how high would they rank? In other words, could you make a list with all enhancers sorted based on log2 fold accessibility difference, and provide the percentile ranks of these enhancers based on this compared to the model predicted ranks?

We have provided the accessibility log2 fold difference from the PV+ vs PV- contrast from our sorted population cSNAIL data in Figure 2—figure supplement 1. Based only on accessibility log2 fold difference, SC1 ranks in the 85th percentile and SC2 ranks in the 99th percentile. We argue that log2 fold difference is informative, but not sufficient for probe selection as it calls many false positives. Our models provide an additional filter for efficient narrowing down of candidates and enable interpretation of candidate probe sequences.

Figure 2C-E and lines 266 to 276: which region was investigated? Were the same cortical regions investigated to establish the percentages? Some cortical regions are naturally more abundant with PV+ cells. Please include more details on this analysis.

If this is done right: very strong, excellent to see this working! Absolutely convincing SC1 and SC2 drive PV specific expression.

Yes, these were all consistently imaged in the primary motor cortex (M1), with representation from all layers. I have added a note to the text of this section.

“We quantified images with consistent positioning in the primary motor cortex, with representation from all layers.”

Line 297 to 327: Strong, convincing evidence that SC1 and SC2 are indeed PV specific. This is not only interesting for those researching PV cells, but also an indication that the selection of these enhancers based on the in silico models was successful.

Line 360: it would be interesting to see a GPe or cortical PV+ neuron specific enhancer in a subsequent publication!

Line 444: most interesting, a good lead into functional understanding of enhancers-TF interaction in particular cell types in the brain.

Line 449: This sentence seems a bit of an overstatement. As I understand, first OCRs are selected on differential activity. Meaning a requirement of scATAC data with defined and annontated clusters. For this statement to be true, the models need to be run on all peaks, rather than pre-selected regions with differential activity. Perhaps I misunderstood and the models were actually run on the merged lists of peaks, in that case disregard this comment.

The reviewer’s interpretation is correct. We have amended to qualify the statement.

“Here, we showed that sequence was sufficient to discern the directionality of highly differential OCR activity in different neuron subtypes in most cases.”

Line 459: I'm not sure which data this is based on. I don't think a direct comparison was made between the average score in Figure 1.sup5a and b. I would like to see this explicitly state in the text, along with correlation plots for specificity vs. predicted activity and specificity vs. accessibility, with pearson correlations.

We have now included explicit statements of the binary analysis in the text, which supports this claim.

“On the complete set of enhancers, SVM score was predictive of the PV+ specificity group (p = 0.008), and differential activity log2 fold difference had a weak association with PV+ specificity group (p = 0.069) (Figure 1—figure supplement 5c). Much of the log2 fold difference association with group PV+ specificity was driven by enhancer sequences with low log2 fold difference and low PV+ specificity. Undesirably, there were some sequences with high log2 fold difference and low in vivo specificity. Within the subset of enhancers with high log2 fold difference (log2 fold difference > 1), the log2 fold difference was not associated with specificity group (p = 0.601), while SVM score was weakly associated (p = 0.057) (Figure 1—figure supplement 5d). Unlike log2 fold difference scores alone, SVM scores may limit false positive candidates and improve efficient enhancer AAV selection.”

Line 491: Could you speculate on the possibility to find or generate regulatory elements specific for cell types in these regions. So, instead of generalization, specification.

We have added a sentence to state the possibility of using the SNAIL framework to tag specialized populations such as this.

“It will be interesting to conduct further comparisons of the composition of cell populations that are labeled by these tools in different brain regions and species. We speculate that the SNAIL framework could be strategically employed to design tools for specific subtypes of PV+ neurons or other cell populations that may be brain region or species specific.”

Lines 493 to 501: Another limitation, is that when supplemented in mature neurons, the viral vector will not undergo the same developmental, epigenomic modifications. This may result in different levels of expression. At least, in our hands we found discrepancies in expression between transgenically provided genes and virally provided ones. There does not seem to be much literature on this topic though, so a discussion may be beyond the scope of this paper.

We have added a sentence on this potential limitation.

“Finally, an enhancer’s activity in AAV may differ from the same enhancer’s activity in the host genome due to the surrounding genomic and epigenomic context. Enhancer activity may also fluctuate in response to different conditions because enhancers are dynamic actors in the regulation of gene expression.”

References

Buenrostro, Jason D., Paul G. Giresi, Lisa C. Zaba, Howard Y. Chang, and William J. Greenleaf. 2013. “Transposition of Native Chromatin for Fast and Sensitive Epigenomic Profiling of Open Chromatin, DNA-Binding Proteins and Nucleosome Position.” Nature Methods 10 (12): 1213–18.

Buenrostro, Jason D., Beijing Wu, Ulrike M. Litzenburger, Dave Ruff, Michael L. Gonzales, Michael P. Snyder, Howard Y. Chang, and William J. Greenleaf. 2015. “Single-Cell Chromatin Accessibility Reveals Principles of Regulatory Variation.” Nature 523 (7561): 486–90.

Chen, Ling, Alexandra E. Fish, and John A. Capra. 2018. “Prediction of Gene Regulatory Enhancers across Species Reveals Evolutionarily Conserved Sequence Properties.” PLoS Computational Biology 14 (10): e1006484.

Corces, M. Ryan, Anna Shcherbina, Soumya Kundu, Michael J. Gloudemans, Laure Frésard, Jeffrey M. Granja, Bryan H. Louie, et al. 2020. “Single-Cell Epigenomic Analyses Implicate Candidate Causal Variants at Inherited Risk Loci for Alzheimer’s and Parkinson's Diseases.” Nature Genetics 52 (11): 1158–68.

Cun, Y. L., L. D. Jackel, B. Boser, J. S. Denker, H. P. Graf, I. Guyon, D. Henderson, R. E. Howard, and W. Hubbard. 1989. “Handwritten Digit Recognition: Applications of Neural Network Chips and Automatic Learning.” IEEE Communications Magazine. https://doi.org/10.1109/35.41400.

Cusanovich, Darren A., Riza Daza, Andrew Adey, Hannah A. Pliner, Lena Christiansen, Kevin L. Gunderson, Frank J. Steemers, Cole Trapnell, and Jay Shendure. 2015. “Multiplex Single Cell Profiling of Chromatin Accessibility by Combinatorial Cellular Indexing.” Science 348 (6237): 910–14.

Deal, Roger B., and Steven Henikoff. 2010. “A Simple Method for Gene Expression and Chromatin Profiling of Individual Cell Types within a Tissue.” Developmental Cell 18 (6): 1030–40.

Ernst, Jason, and Manolis Kellis. 2017. “Chromatin-State Discovery and Genome Annotation with ChromHMM.” Nature Protocols 12 (12): 2478–92.

Ghandi, Mahmoud, Dongwon Lee, Morteza Mohammad-Noori, and Michael A. Beer. 2014. “Enhanced Regulatory Sequence Prediction Using Gapped K-Mer Features.” PLoS Computational Biology 10 (7): e1003711.

Graybuck, Lucas T., Tanya L. Daigle, Adriana E. Sedeño-Cortés, Miranda Walker, Brian Kalmbach, Garreck H. Lenz, Elyse Morin, et al. 2021. “Enhancer Viruses for Combinatorial Cell-Subclass-Specific Labeling.” Neuron, March. https://doi.org/10.1016/j.neuron.2021.03.011.

Hernández, Vivian M., Daniel J. Hegeman, Qiaoling Cui, Daniel A. Kelver, Michael P. Fiske, Kelly E. Glajch, Jason E. Pitt, Tina Y. Huang, Nicholas J. Justice, and C. Savio Chan. 2015. “Parvalbumin+ Neurons and Npas1+ Neurons Are Distinct Neuron Classes in the Mouse External Globus Pallidus.” The Journal of Neuroscience: The Official Journal of the Society for Neuroscience 35 (34): 11830–47.

Jindal, Granton A., and Emma K. Farley. 2021. “Enhancer Grammar in Development, Evolution, and Disease: Dependencies and Interplay.” Developmental Cell 56 (5): 575–87.

Jinno, Shozo, and Toshio Kosaka. 2004. “Parvalbumin Is Expressed in Glutamatergic and GABAergic Corticostriatal Pathway in Mice.” The Journal of Comparative Neurology 477 (2): 188–201.

Kaplow, Irene M., Daniel E. Schäffer, Morgan E. Wirthlin, Alyssa J. Lawler, Ashley R. Brown, Michael Kleyman, and Andreas R. Pfenning. 2021. “Predicting Lineage-Specific Differences in Open Chromatin across Dozens of Mammalian Genomes.” bioRxiv. https://doi.org/10.1101/2020.12.04.410795.

Kelley, David R. 2020. “Cross-Species Regulatory Sequence Activity Prediction.” PLoS Computational Biology 16 (7): e1008050.

Lawler, Alyssa J., Ashley R. Brown, Rachel S. Bouchard, Noelle Toong, Yeonju Kim, Nitinram Velraj, Grant Fox, et al. 2020. “Cell Type-Specific Oxidative Stress Genomic Signatures in the Globus Pallidus of Dopamine-Depleted Mice.” The Journal of Neuroscience 40 (50): 9772–83.

Lee, Dongwon. 2016. “LS-GKM: A New Gkm-SVM for Large-Scale Datasets.” Bioinformatics 32 (14): 2196–98.

Liodis, Petros, Myrto Denaxa, Marirena Grigoriou, Cynthia Akufo-Addo, Yuchio Yanagawa, and Vassilis Pachnis. 2007. “Lhx6 Activity Is Required for the Normal Migration and Specification of Cortical Interneuron Subtypes.” The Journal of Neuroscience: The Official Journal of the Society for Neuroscience 27 (12): 3078–89.

Li, Yang Eric, Sebastian Preissl, Xiaomeng Hou, Ziyang Zhang, Kai Zhang, Yunjiang Qiu, Olivier B. Poirion, et al. 2021. “An Atlas of Gene Regulatory Elements in Adult Mouse Cerebrum.” Nature 598 (7879): 129–36.

Mayer, Christian, Christoph Hafemeister, Rachel C. Bandler, Robert Machold, Renata Batista Brito, Xavier Jaglin, Kathryn Allaway, Andrew Butler, Gord Fishell, and Rahul Satija. 2018. “Developmental Diversification of Cortical Inhibitory Interneurons.” Nature 555 (7697): 457–62.

McLean, Cory Y., Dave Bristor, Michael Hiller, Shoa L. Clarke, Bruce T. Schaar, Craig B. Lowe, Aaron M. Wenger, and Gill Bejerano. 2010. “GREAT Improves Functional Interpretation of Cis-Regulatory Regions.” Nature Biotechnology 28 (5): 495–501.

Mich, John K., Lucas T. Graybuck, Erik E. Hess, Joseph T. Mahoney, Yoshiko Kojima, Yi Ding, Saroja Somasundaram, et al. 2021. “Functional Enhancer Elements Drive Subclass-Selective Expression from Mouse to Primate Neocortex.” Cell Reports 34 (13): 108754.

Mo, Alisa, Eran A. Mukamel, Fred P. Davis, Chongyuan Luo, Gilbert L. Henry, Serge Picard, Mark A. Urich, et al. 2015. “Epigenomic Signatures of Neuronal Diversity in the Mammalian Brain.” Neuron 86 (6): 1369–84.

Roadmap Epigenomics Consortium. (2015) 2015. “Integrative Analysis of 111 Reference Human Epigenomes.” Nature 518 (7539): 317–30.

Roccaro-Waldmeyer, Diana M., Franck Girard, Daniele Milani, Elisabetta Vannoni, Laurent Prétôt, David P. Wolfer, and Marco R. Celio. 2018. “Eliminating the VGlut2-Dependent Glutamatergic Transmission of Parvalbumin-Expressing Neurons Leads to Deficits in Locomotion and Vocalization, Decreased Pain Sensitivity, and Increased Dominance.” Frontiers in Behavioral Neuroscience 12: 146.

Saunders, Arpiar, Kee Wui Huang, and Bernardo Luis Sabatini. 2016. “Globus Pallidus Externus Neurons Expressing Parvalbumin Interconnect the Subthalamic Nucleus and Striatal Interneurons.” PloS One 11 (2): e0149798.

Tanahira, Chiyoko, Shigeyoshi Higo, Keisuke Watanabe, Ryohei Tomioka, Satoe Ebihara, Takeshi Kaneko, and Nobuaki Tamamaki. 2009. “Parvalbumin Neurons in the Forebrain as Revealed by Parvalbumin-Cre Transgenic Mice.” Neuroscience Research 63 (3): 213–23.

Vormstein-Schneider, Douglas, Jessica D. Lin, Kenneth A. Pelkey, Ramesh Chittajallu, Baolin Guo, Mario A. Arias-Garcia, Kathryn Allaway, et al. 2020. “Viral Manipulation of Functionally Distinct Interneurons in Mice, Non-Human Primates and Humans.” Nature Neuroscience, August. https://doi.org/10.1038/s41593-020-0692-9.

Zeng, Haoyang, Matthew D. Edwards, Ge Liu, and David K. Gifford. 2016. “Convolutional Neural Network Architectures for Predicting DNA–protein Binding.” Bioinformatics. https://doi.org/10.1093/bioinformatics/btw255.

Zhao, Yangu, Pierre Flandin, Jason E. Long, Melissa Dela Cuesta, Heiner Westphal, and John L. R. Rubenstein. 2008. “Distinct Molecular Pathways for Development of Telencephalic Interneuron Subtypes Revealed through Analysis of Lhx6 Mutants.” The Journal of Comparative Neurology 510 (1): 79–99.

Zhou, Jian, and Olga G. Troyanskaya. 2015. “Predicting Effects of Noncoding Variants with Deep Learning-Based Sequence Model.” Nature Methods 12 (10): 931–34.

https://doi.org/10.7554/eLife.69571.sa2

Article and author information

Author details

  1. Alyssa J Lawler

    1. Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, United States
    2. Biological Sciences Department, Mellon College of Science, Carnegie Mellon University, Pittsburgh, United States
    3. Neuroscience Institute, Carnegie Mellon University, Pittsburgh, United States
    Present address
    Broad Institute of Harvard and MIT, Stanley Center for Psychiatric Research, Cambridge, United States
    Contribution
    Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Visualization, Writing – original draft, Writing – review and editing
    For correspondence
    alawler@broadinstitute.org
    Competing interests
    Inventor on US Patent Application 62/921,452, "Specific nuclear-anchored independent labeling system"
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-2151-5164
  2. Easwaran Ramamurthy

    1. Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, United States
    2. Neuroscience Institute, Carnegie Mellon University, Pittsburgh, United States
    Contribution
    Formal analysis, Methodology, Visualization, Writing – review and editing
    Competing interests
    Inventor on US Patent Application 62/921,452, "Specific nuclear-anchored independent labeling system"
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-2439-0600
  3. Ashley R Brown

    1. Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, United States
    2. Neuroscience Institute, Carnegie Mellon University, Pittsburgh, United States
    Contribution
    Conceptualization, Investigation, Project administration, Resources, Supervision, Writing – review and editing
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-3091-3930
  4. Naomi Shin

    1. Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, United States
    2. Neuroscience Institute, Carnegie Mellon University, Pittsburgh, United States
    Contribution
    Investigation
    Competing interests
    No competing interests declared
  5. Yeonju Kim

    1. Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, United States
    2. Neuroscience Institute, Carnegie Mellon University, Pittsburgh, United States
    Contribution
    Investigation
    Competing interests
    No competing interests declared
  6. Noelle Toong

    1. Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, United States
    2. Neuroscience Institute, Carnegie Mellon University, Pittsburgh, United States
    Contribution
    Formal analysis, Investigation
    Competing interests
    No competing interests declared
  7. Irene M Kaplow

    1. Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, United States
    2. Neuroscience Institute, Carnegie Mellon University, Pittsburgh, United States
    Contribution
    Formal analysis
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-8924-8269
  8. Morgan Wirthlin

    1. Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, United States
    2. Neuroscience Institute, Carnegie Mellon University, Pittsburgh, United States
    Contribution
    Formal analysis
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0001-7967-7070
  9. Xiaoyu Zhang

    1. Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, United States
    2. Neuroscience Institute, Carnegie Mellon University, Pittsburgh, United States
    Contribution
    Formal analysis
    Competing interests
    No competing interests declared
  10. BaDoi N Phan

    1. Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, United States
    2. Neuroscience Institute, Carnegie Mellon University, Pittsburgh, United States
    3. Medical Scientist Training Program, University of Pittsburgh, Pittsburgh, United States
    Contribution
    Data curation, Formal analysis, Writing – review and editing
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0001-6331-5980
  11. Grant A Fox

    1. Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, United States
    2. Neuroscience Institute, Carnegie Mellon University, Pittsburgh, United States
    Contribution
    Investigation
    Competing interests
    No competing interests declared
  12. Kirsten Wade

    Department of Psychiatry, Translational Neuroscience Program, University of Pittsburgh, Pittsburgh, United States
    Contribution
    Data curation, Investigation, Writing – review and editing
    Competing interests
    No competing interests declared
  13. Jing He

    1. Department of Neurobiology, University of Pittsburgh, Pittsburgh, United States
    2. Systems Neuroscience Center, Brain Institute, Center for Neuroscience, Center for the Neural Basis of Cognition, Pittsburgh, United States
    Contribution
    Investigation, Writing – review and editing
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0001-9034-8390
  14. Bilge Esin Ozturk

    Department of Ophthalmology, University of Pittsburgh, Pittsburgh, United States
    Contribution
    Investigation, Writing – review and editing
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0001-5117-077X
  15. Leah C Byrne

    1. Department of Neurobiology, University of Pittsburgh, Pittsburgh, United States
    2. Department of Ophthalmology, University of Pittsburgh, Pittsburgh, United States
    3. Division of Experimental Retinal Therapies, Department of Clinical Sciences & Advanced Medicine, School of Veterinary Medicine, University of Pennsylvania, Philadelphia, United States
    4. Department of Bioengineering, University of Pittsburgh, Pittsburgh, United States
    Contribution
    Supervision
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-3229-4993
  16. William R Stauffer

    Department of Neurobiology, University of Pittsburgh, Pittsburgh, United States
    Contribution
    Supervision
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0003-1031-8824
  17. Kenneth N Fish

    Department of Psychiatry, Translational Neuroscience Program, University of Pittsburgh, Pittsburgh, United States
    Contribution
    Supervision
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0003-1774-3815
  18. Andreas R Pfenning

    1. Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, United States
    2. Neuroscience Institute, Carnegie Mellon University, Pittsburgh, United States
    Contribution
    Conceptualization, Funding acquisition, Project administration, Supervision, Writing – review and editing
    For correspondence
    apfenning@cmu.edu
    Competing interests
    Inventor on US Patent Application 62/921,452, "Specific nuclear-anchored independent labeling system"
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-3447-9801

Funding

National Institutes of Health (UG3-MH-120094)

  • William R Stauffer

National Institutes of Health (1DP1DA046585)

  • Andreas R Pfenning

National Science Foundation (DGE1745016)

  • Alyssa J Lawler

National Institute on Drug Abuse (1F30DA053020)

  • BaDoi N Phan

National Institute of Mental Health (MH096985)

  • Kenneth N Fish

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Ethics

This study was performed in strict accordance with the PHS Policy on the Humane Care and Use of Laboratory Animals and the Animal Welfare Act. All animal use and procedures were approved and overseen by the Institutional Animal Care & Use Committee (IACUC) of Carnegie Mellon University (Protocol ID PROTO201600003) or the University of Pittsburgh (Protocol ID 19024431).

Senior Editor

  1. Naama Barkai, Weizmann Institute of Science, Israel

Reviewing Editor

  1. Jeremy J Day, University of Alabama at Birmingham, United States

Reviewers

  1. Jeremy J Day, University of Alabama at Birmingham, United States
  2. Cliff Kentros, Norwegian University of Science and Technology, Norway

Publication history

  1. Preprint posted: April 15, 2021 (view preprint)
  2. Received: April 20, 2021
  3. Accepted: April 25, 2022
  4. Version of Record published: May 16, 2022 (version 1)

Copyright

© 2022, Lawler et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Metrics

  • 1,117
    Page views
  • 160
    Downloads
  • 0
    Citations

Article citation count generated by polling the highest count across the following sources: Crossref, PubMed Central, Scopus.

Download links

A two-part list of links to download the article, or parts of the article, in various formats.

Downloads (link to download the article as PDF)

Open citations (links to open the citations from this article in various online reference manager services)

Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)

  1. Alyssa J Lawler
  2. Easwaran Ramamurthy
  3. Ashley R Brown
  4. Naomi Shin
  5. Yeonju Kim
  6. Noelle Toong
  7. Irene M Kaplow
  8. Morgan Wirthlin
  9. Xiaoyu Zhang
  10. BaDoi N Phan
  11. Grant A Fox
  12. Kirsten Wade
  13. Jing He
  14. Bilge Esin Ozturk
  15. Leah C Byrne
  16. William R Stauffer
  17. Kenneth N Fish
  18. Andreas R Pfenning
(2022)
Machine learning sequence prioritization for cell type-specific enhancer design
eLife 11:e69571.
https://doi.org/10.7554/eLife.69571

Further reading

    1. Chromosomes and Gene Expression
    2. Genetics and Genomics
    Joseph V Geisberg, Zarmik Moqtaderi ... Kevin Struhl
    Research Advance

    Alternative polyadenylation yields many mRNA isoforms whose 3' termini occur disproportionately in clusters within 3' UTRs. Previously, we showed that profiles of poly(A) site usage are regulated by the rate of transcriptional elongation by RNA polymerase (Pol) II (Geisberg et., 2020). Pol II derivatives with slow elongation rates confer an upstream-shifted poly(A) profile, whereas fast Pol II strains confer a downstream-shifted poly(A) profile. Within yeast isoform clusters, these shifts occur steadily from one isoform to the next across nucleotide distances. In contrast, the shift between clusters from the last isoform of one cluster to the first isoform of the next - is much less pronounced, even over large distances. GC content in a region 13-30 nt downstream from isoform clusters correlates with their sensitivity to Pol II elongation rate. In human cells, the upstream shift caused by a slow Pol II mutant also occurs continuously at the nucleotide level within clusters, but not between them. Pol II occupancy increases just downstream of the most speed-sensitive poly(A) sites, suggesting a linkage between reduced elongation rate and cluster formation. These observations suggest that 1) Pol II elongation speed affects the nucleotide-level dwell time allowing polyadenylation to occur, 2) poly(A) site clusters are linked to the local elongation rate and hence do not arise simply by intrinsically imprecise cleavage and polyadenylation of the RNA substrate, 3) DNA sequence elements can affect Pol II elongation and poly(A) profiles, and 4) the cleavage/polyadenylation and Pol II elongation complexes are spatially, and perhaps physically, coupled so that polyadenylation occurs rapidly upon emergence of the nascent RNA from the Pol II elongation complex.

    1. Genetics and Genomics
    Amel Lamri, Jayneel Limbachia ... Sonia S Anand
    Research Article Updated

    South Asian women are at increased risk of developing gestational diabetes mellitus (GDM). Few studies have investigated the genetic contributions to GDM risk. We investigated the association of a type 2 diabetes (T2D) polygenic risk score (PRS), on its own, and with GDM risk factors, on GDM-related traits using data from two birth cohorts in which South Asian women were enrolled during pregnancy. 837 and 4372 pregnant South Asian women from the SouTh Asian BiRth CohorT (START) and Born in Bradford (BiB) cohort studies underwent a 75-g glucose tolerance test. PRSs were derived using genome-wide association study results from an independent multi-ethnic study (~18% South Asians). Associations with fasting plasma glucose (FPG); 2 hr post-load glucose (2hG); area under the curve glucose; and GDM were tested using linear and logistic regressions. The population attributable fraction (PAF) of the PRS was calculated. Every 1 SD increase in the PRS was associated with a 0.085 mmol/L increase in FPG ([95% confidence interval, CI=0.07–0.10], p=2.85×10−20); 0.21 mmol/L increase in 2hG ([95% CI=0.16–0.26], p=5.49×10−16); and a 45% increase in the risk of GDM ([95% CI=32–60%], p=2.27×10−14), independent of parental history of diabetes and other GDM risk factors. PRS tertile 3 accounted for 12.5% of the population’s GDM alone, and 21.7% when combined with family history. A few weak PRS and GDM risk factors interactions modulating FPG and GDM were observed. Taken together, these results show that a T2D PRS and family history of diabetes are strongly and independently associated with multiple GDM-related traits in women of South Asian descent, an effect that could be modulated by other environmental factors.