Parallel functional testing identifies enhancers active in early postnatal mouse brain

  1. Jason T Lambert
  2. Linda Su-Feher
  3. Karol Cichewicz
  4. Tracy L Warren
  5. Iva Zdilar
  6. Yurong Wang
  7. Kenneth J Lim
  8. Jessica L Haigh
  9. Sarah J Morse
  10. Cesar P Canales
  11. Tyler W Stradleigh
  12. Erika Castillo Palacios
  13. Viktoria Haghani
  14. Spencer D Moss
  15. Hannah Parolini
  16. Diana Quintero
  17. Diwash Shrestha
  18. Daniel Vogt
  19. Leah C Byrne
  20. Alex S Nord  Is a corresponding author
  1. Department of Psychiatry and Behavioral Sciences, University of California, Davis, United States
  2. Department of Neurobiology, Physiology and Behavior, University of California, Davis, United States
  3. Department of Pediatrics and Human Development, Grand Rapids Research Center, Michigan State University, United States
  4. Helen Wills Neuroscience Institute, University of California, Berkeley, United States
  5. Departments of Ophthalmology and Neurobiology, University of Pittsburgh, United States

Abstract

Enhancers are cis-regulatory elements that play critical regulatory roles in modulating developmental transcription programs and driving cell-type-specific and context-dependent gene expression in the brain. The development of massively parallel reporter assays (MPRAs) has enabled high-throughput functional screening of candidate DNA sequences for enhancer activity. Tissue-specific screening of in vivo enhancer function at scale has the potential to greatly expand our understanding of the role of non-coding sequences in development, evolution, and disease. Here, we adapted a self-transcribing regulatory element MPRA strategy for delivery to early postnatal mouse brain via recombinant adeno-associated virus (rAAV). We identified and validated putative enhancers capable of driving reporter gene expression in mouse forebrain, including regulatory elements within an intronic CACNA1C linkage disequilibrium block associated with risk in neuropsychiatric disorder genetic studies. Paired screening and single enhancer in vivo functional testing, as we show here, represents a powerful approach towards characterizing regulatory activity of enhancers and understanding how enhancer sequences organize gene expression in the brain.

Introduction

Cis-regulatory elements such as enhancers are critical drivers of spatiotemporal gene expression within the developing and mature brain (Nord, 2013). Enhancers integrate the combinatorial functions of transcription factors (Fuxman Bass et al., 2015) and chromatin organizers (Calo and Wysocka, 2013) to drive cell and regional-specific expression of genes across brain development. DNA sequence variation within enhancers has been associated with both evolution (McLean et al., 2011) and the genetic etiology of neurological disorders (Dunham, 2012; Perenthaler et al., 2019; D’haene and Vergult, 2021). The first identified enhancers were defined by functional capacity to amplify transcriptional activity in reporter plasmids (Moreau et al., 1981; Banerji et al., 1981). Putative enhancers have been predicted via DNA conservation using comparative genomics and, more recently, by epigenetic signatures such as open chromatin from DNaseI hypersensitive site sequencing (DNase-seq) or assaying for transposase-accessible chromatin using sequencing (ATAC-seq) (Dunham, 2012), histone tail modifications from chromatin immunoprecipitation sequencing (ChIP-seq) (Nord, 2013; Visel, 2013), and 3D chromatin organization (Schoenfelder and Fraser, 2019); however, these approaches are proxy measurements that do not directly evaluate whether a DNA sequence acts as a functional enhancer (Benton et al., 2019).

Enhancer reporter assays assess the ability of candidate DNA sequences to drive expression of a reporter gene, and have been used as the primary means of functionally testing activity of predicted enhancers (Kvon, 2015). Transgenic mouse enhancer assays have been applied to characterize the regulatory activity of candidate DNA sequences in the mouse, in both the developing and mature brain (Nord, 2013; Visel, 2013; Silberberg, 2016), and recently viral vectors have been used to deliver enhancer reporter constructs to mouse brain as well (Graybuck et al., 2021; Mich, 2021). These assays offer exceptional information on tissue-specific enhancer activity but require one-by-one testing of individual candidates. The advancement of massively parallel reporter assays (MPRAs) has led to the ability to functionally screen a multitude of DNA sequences for enhancer activity in parallel within a single experiment using sequencing-based quantification (Inoue and Ahituv, 2015). To date, there are a few promising demonstrations of MPRA applied in vivo in brain (Shen et al., 2016; Shen, 2019), although screening approaches to characterize enhancers remain highly novel, particularly in developing brain. Functional assessment of enhancer activity and the use of enhancers to express reporters and genes in mouse brain cells has emerged as an area of major interest (Dimidschstein et al., 2016; Hrvatin, 2019; Nord and West, 2020; Vogt, 2014). In vivo screening approaches represent an ideal method to perform broad testing at scale and maintain tissue-type and developmental context.

Here, we adapted an MPRA functional enhancer screen, self-transcribing active regulatory region sequencing (STARR-seq) (Arnold et al., 2013), for application to early postnatal mouse brain via recombinant adeno-associated virus (rAAV) delivery. We screened a library of candidate human DNA sequences spanning putative brain-specific regulatory sequences and regions associated with genome-wide association studies of epilepsy and schizophrenia. We identified sequences capable of enhancing reporter expression in early postnatal mouse cortex, and validated positive and negative MPRA expression predictions, including within the neuropsychiatric disorder-associated third intron of CACNA1C (Moon et al., 2018; Ferreira et al., 2008; Sklar, 2008). Our results demonstrate the utility of parallel functional testing to interrogate the regulatory activity of enhancers in early postnatal mouse brain and highlight opportunities for functional screening in studies of normal brain development and function as well as the etiology of neurodevelopmental and neuropsychiatric disorders.

Results

Developing and validating AAV-MPRA strategy

We used a modified STARR-seq (Arnold et al., 2013) MPRA orientation, in which the candidates of interest were cloned in the 3’ untranslated region (3’-UTR) of a reporter gene (here EGFP) driven by the Hsp1a minimal promoter (HspMinP; formerly referred to as Hsp68 minP Bevilacqua et al., 1995). This reporter construct was flanked by inverted terminal repeat regions (ITRs) necessary for packaging self-complementary adeno-associated virus (scAAV), which has the advantage of faster transduction rates compared to conventional AAV (McCarty, 2008; Figure 1A, Figure 1—figure supplement 1). Enhancers are proposed to function independent of orientation and the position relative to the transcriptional start site (Serfling et al., 1985), as has been demonstrated in in vitro applications of STARR-seq (Arnold et al., 2013; Klein et al., 2020). However, we first sought to verify that an enhancer sequence in the STARR-seq orientation not only increased transcription generally, but also did so in a cell-type defined restricted manner in vivo when paired with HspMinP and delivered via scAAV. To validate this, we cloned a GABAergic-biased mouse Dlx5/6 enhancer (Dlx) (Dimidschstein et al., 2016; Lee et al., 2014) into the scAAV reporter vector. scAAV9 carrying HspMinP-EGFP-Dlx was mixed with AAV9 carrying the constitutively active CAG-mRuby3 reporter and this mixture was delivered to postnatal day (P)0 mouse forebrain via brain intraventricular injection and collected at P7. The expression of mRuby3 was used as a positive control and to locate the injection site for analysis. Red fluorescent cells, marking the focus of viral exposure, were found distributed in the cortex and hippocampus (Figure 1B, Figure 1—figure supplement 2A, B). We analyzed EGFP-expressing cells in these regions. The brains were fixed, cryosectioned, and stained with an antibody for GFP to enhance detection of Dlx activity (see Materials and methods).

Figure 1 with 4 supplements see all
Designing and validating 3’-UTR enhancer reporter AAV assay.

(A) Schematic of in vivo parallelized functional enhancer reporter assay. The test library was generated using the previral vector pscAAV-HspMinP-EGFP, which contained a multiple cloning site (light grey) between the EGFP reporter and polyadenylation site (PAS). Purified PCR products for test amplicons were cloned into the vector using Gibson assembly. The previral library was packaged into AAV9(2YF), and the viral library delivered to the brain via injection at P0. Brains were collected at P7. (B) Representative image of a coronal section of a P7 mouse brain injected at P0 with a virus mixture consisting of an AAV containing the STARR-seq vector carrying the inhibitory interneuron enhancer Dlx (scAAV9-HspMinP-EGFP-Dlx) and an injection control AAV containing an expression vector for mRuby3 under the control of CAG, a general mammalian promoter. EGFP expression was visualized via IHC using an anti-GFP antibody, while mRuby3 expression was visualized using native fluorescence. Insets show close up of boxed regions showing morphology of EGFP-expressing cells in the cortex. (C) Sections from P7 mouse cortex transduced with Dlx-driven STARR-seq reporter vector and mRuby3 injection control at P0, counterstained with an antibody for Lhx6, a transcription factor active in deep cortical layer interneurons. EGFP-expressing cells with Lhx6+ Nuclei are indicated with arrows. Inset graph shows fraction of EGFP- or mRuby3-expressing cells co-labeled with Lhx6 in three replicate animals injected with scAAV9-HspMinP-EGFP-Dlx (Animal 1, n = 20 EGFP+ cells, 218 mRuby3+ cells; Animal 2, n = 18 EGFP+ cells, 435 mRuby3+ cells; Animal 3, n = 32 EGFP+ cells, 311 mRuby3+ cells) or one animal injected with AAV9-Dlx-βGlobinMinP-EGFP (Dimidschstein et al., 2016; Lee et al., 2014) (n = 31 EGFP+ cells, 63 mRuby3+ cells). (D) Ratiometric (log2 RNA/DNA) activity of miniMPRA mouse library in P7 mouse cortex after injection at P0. Boxplot of distribution and individual replicates (N = 4) shown for the 16 tested candidates. NEG indicates putative negative candidate; otherwise, name indicates nearby gene and if applicable, embryonic enhancer ID, for positive candidates.

We evaluated cell-type specificity two ways: by assessing morphology and by counterstaining with an antibody for Lhx6, a cardinal transcription factor expressed in both developing and mature neurons derived from the medial ganglionic eminence (Sandberg, 2016; Alifragis et al., 2004). EGFP+ cells in these P7 brains had small cell somas and neurites, typical of interneurons compared to pyramidal cells and glia (Figure 1B, Figure 1—figure supplement 2A, B); many were Lhx6+, but this was quite variable across animals (Figure 1C, Figure 1—figure supplement 3). We noted that the EGFP+ cells that were negative for Lhx6 staining still primarily appeared to be interneurons based on morphology. Since Dlx should be active in multiple GABAergic interneuron types in the neocortex (Dimidschstein et al., 2016), we systematically analyzed the morphology of EGFP+ cells (Figure 1—figure supplement 2). The principal excitatory neurons in the cortex and hippocampus have a typical pyramidal shape with a tear-drop cell body and a long, primary apical dendrite, whereas inhibitory interneurons are more variable in their appearance. Furthermore, excitatory neurons in the hippocampus are restricted to specific pyramidal and granule layers, and spatial location could be used as a second classification criterion there to lend extra confidence to the morphological assessment. Based on these morphological and spatial criteria, the vast majority of EGFP-expressing cells resemble interneurons (92.7 %, n = 314 cells in five animals, Figure 1—figure supplement 2A, C). 3.2 % of EGFP+ cells exhibited pyramidal morphology. We compared these results with expression driven from the same Dlx sequence in a conventional orientation upstream of a minimal β-Globin promotor, a construct that has previously been validated to exhibit interneuron-biased expression (Dimidschstein et al., 2016). The specificity for interneurons in our hands was similar in both constructs (Figure 1—figure supplement 2B, C). These data, taken together, indicate that enhancer orientation and choice of minimal promoter did not greatly affect cell-type-specific activity of the enhancer reporter construct.

We next performed a small-scale parallel reporter assay (miniMPRA), testing 16 mouse sequences representing putative enhancer and negative control sequences (Figure 1D, Supplementary file 1). Long sequences of interest (~900 bp) were selected to maximize the chromosomal context of candidate enhancer elements within these regions. These sequences were predicted to have either enhancer activity or no activity in postnatal mouse brain based on previous transgenic mouse enhancer assays and on mouse brain H3K27ac ChIP-seq data (Nord, 2013; Visel, 2013). We batch-cloned these sequences into the 3’-UTR of our scAAV reporter construct, and we packaged this into a scAAV library using AAV9(2YF) (Dalkara et al., 2012). AAV9(2YF) is a tyrosine-mutated derivative of AAV9, which has increased transduction in the brain and similar tropism compared to standard AAV9. We delivered the scAAV library to P0 mouse prefrontal cortex by direct injection and collected the brains for DNA and RNA analysis at P7 (Figure 1A). In this proof-of-principle experiment, putative negative control sequences displayed low transcription levels based on the ratio of RNA-seq read counts to DNA-seq read counts. On the other hand, previously validated embryonic brain enhancers (Visel, 2013) and putative enhancers of interest near genes associated with neurodevelopmental disorders or epilepsy (Turner, 2016; Nakayama et al., 2010; Gao et al., 2017) were found to be active as enhancers in this assay at P7 (Figure 1D). Based on this proof-of-principle experiment, we moved forward with scaling up our scAAV STARR-seq strategy to testing hundreds of sequences in mouse brain.

Cloning and AAV packaging of MPRA library

We generated an MPRA library targeting 408 candidate human DNA sequences for testing, each ~900 bp in size, to assess functional enhancer activity in vivo in early postnatal mouse brain. These sequences were categorized into four groups (Figure 2A), two groups without ascertainment based on potential regulatory activity, and two groups selected based on presence of epigenomic signatures commonly used to predict enhancer activity (Supplementary file 2). For the first group (‘GWAS’), we selected genomic intervals containing single-nucleotide polymorphisms (SNPs) identified as lead SNPs from genome-wide association studies (GWAS) for epilepsy (Abou-Khalil, 2018) and schizophrenia (Ripke, 2014) as identified using the NHGRI-EBI GWAS Catalog (Buniello et al., 2019). For the second group (‘LD’), we selected five lead SNPs from these same studies, and included genomic intervals that included all SNPs in linkage disequilibrium (LD, r2 >0.8, Auton et al., 2015) within these non-coding regions associated with the CACNA1C and SCN1A loci. Because SNPs can be inherited together, GWAS lead SNPs are not necessarily causal and further evidence is necessary to prioritize linked variants. Thus, most SNP-containing amplicons in these first two groups were expected to be negative for enhancer activity. Indeed, less than 20 % of GWAS and LD group sequences had strong evidence of open chromatin in fetal brain or adult neurons (Figure 2A, lower panel). For the third group (‘FBDHS’), we selected regions overlapping DNaseI hypersensitive sites in fetal human brain from ENCODE (Dunham, 2012), selecting candidates within copy number variant regions near autism risk genes (Turner, 2016; Turner et al., 2017). We expected this group to be enriched for active enhancers in our assay, although enhancers active in fetal brain may not continue to have activity at the P7 postnatal time point tested in our assay. Finally, we included human orthologs of the 16 sequences tested in our miniMPRA experiment (‘PutEnh’), which included putative positive and negative sequences. Thus our final library included a balance of sequences that are expected to have no enhancer activity (59 %), as well as sequences with evidence for regulatory capacity in the brain (41 %).

Figure 2 with 5 supplements see all
In vivo AAV MPRA yields amplicons capable of enhancing transcription, enriched for signatures associated with enhancers.

(A) Graphical representation of library composition. Top panel shows the fraction of the total library made up by each group of amplicons. Bottom panel shows the fraction of amplicons in each group that were positive for the given epigenomic signature. (B) Mean RNA and DNA representation in the assay for candidates that passed inclusion criteria (N = 308). Amplicons with significantly (p < 0.05, FDR < 0.1) increased model residual value (Res.) in RNA compared to DNA are shown in orange and red. Normal p-values; empirical FDR q-values (See Materials and methods). (C) Bar plot representing mean activity based on RNA/DNA ratio in the test assay with individual replicates shown as dots. Amplicons are sorted by linear model residuals (p < 0.05 colored red). (D) The top 20 active amplicons with consistent activity across both linear and ratiometric models. Bars represent mean activity based on RNA/DNA ratio and individual replicates are shown as dots. Three amplicons were used for downstream validation in single-candidate deliveries (magenta). (E) Amplicon intersection with fetal brain epigenomic datasets including DNase Hypersensitive loci, H3K4me1, H3K4me3, H3K36me3, H3K9me3, and H3K27me3. Amplicons were divided into two groups based on the statistical significance of their activity in the MPRA. (F) Amplicon intersection with human neuron or glia ATAC-seq, vertebrate conserved elements, and digital transcription factor footprints in fetal brain, fetal lung, and K562 cells. Asterisks in E and F indicate significant enrichment for positive amplicons with annotation class (p < 0.05, permutation test).

We cloned all sequences of interest (referred to from here on as amplicons) from pooled human DNA via polymerase chain reaction (PCR) and Gibson assembly directly into the viral DNA plasmid backbone. Following batch cloning, we sequenced the pooled plasmid library, verifying presence of 345/408 (84.6 %) target amplicons. There was no obvious pattern among amplicons that did not make it into the batch cloned library, and we presume differences in PCR primer performance and general stochasticity were the primary drivers of batch cloning success. This cloned library of candidates was then packaged in an scAAV9(2YF) viral vector in the same manner as our miniMPRA.

Application of in vivo MPRA and assessment of reproducibility of biological replicates

We delivered the viral library to the left hemisphere of four neonatal mouse brains at P0 via direct injection into the prefrontal cortex and collected forebrain tissue seven days after transduction at P7. We isolated viral genomic DNA, representing input or delivery control, as well as total RNA, and generated amplicon sequencing libraries for the left hemispheres for each replicate as well as for the previral batch-cloned plasmid library. RNA samples were treated with DNase, to prevent cross-contamination, and tagmentation libraries were prepared to capture the 3’-UTR variable region. Following sequencing, de-duplicated aligned reads to the human genome were used to generate amplicon summary counts (Supplementary file 3). We then filtered the dataset by removing amplicons with raw counts less than 200 and mean amplicon library proportions less than 2–15 in any DNA sample, leaving 308/345 (89 %) passing these quality control thresholds. We observed significant correlation between biological replicates (Pearson correlation, p < 0.001), although correlation was consistently higher for DNA replicates compared to RNA replicates (r > 0.824 for DNA comparisons, and r > 0.546 for RNA comparisons, Figure 2—figure supplement 1A, B). Correlation within DNA or RNA was higher than between library type, suggesting limited, if any, impact of cross-contamination. We also observed strong correlation between amplicon read counts in the pooled previral plasmid library (‘Library’) and genomic AAV DNA (‘DNA’) collected from each injected P7 brain (r > 0.804, p < 0.001, Figure 2—figure supplement 1A). To test the effect of PCR bias during library preparation on amplicon counts, we generated one technical RNA replicate, Sample 4–35, using a higher cycle count (Figure 2—figure supplement 1B), and found similar amplicon representation and dropout patterns as its matched lower-cycle technical replicate (Pearson r = 0.827, p < 0.001). Taken together, this demonstrates that viral packaging, neonatal delivery, P7 sample collection and processing, library generation, and sequencing did not substantially affect amplicon representation. Although correlation between amplicons generated from P7 cDNA (‘RNA’) was strong overall, particularly for amplicons with robust cDNA expression, each delivery replicate included a small subset of lower cDNA expression amplicons that showed replicate-specific cDNA dropout (Figure 2—figure supplement 1B). Amplicon representation in the previral library and P7 viral DNA was correlated with amplicon GC content (Figure 2—figure supplement 2A-B), indicating GC-based differences impact PCR and cloning efficiency. There was reduced strength in correlation between GC content and the MPRA RNA (Figure 2—figure supplement 2C-D) and no further GC bias arose between viral packaging, delivery, and recapture of the library, suggesting that GC bias originated from the batch cloning process. Input viral DNA and MPRA RNA amplicon counts were also generally correlated, indicating some basal MPRA transcription from the minimal promoter independent of candidate enhancer sequence activity, as reported in other MPRA studies (Lee, 2020; Ashuach et al., 2019).

Identification of putative enhancer activity in P7 brain based on MPRA analysis

Having confirmed reproducibility of input viral DNA and MPRA RNA collection, we performed activity estimation. We used two approaches for activity estimates: first, regression residual-based estimate to correct for background or basal activity, and second, via simple comparison of RNA to DNA amplicon representation.

First, we used the middle 80 % of amplicons (N = 248) ranked by RNA/DNA ratio (Figure 2—figure supplement 2E) to build a linear model to account for background MPRA RNA based on amplicon representation and GC content (Figure 2—figure supplement 2E-H). For regression-based estimation of activity, amplicon representation across each replicate was combined to generate a summary activity value. To identify amplicons with activity above expected based on background (i.e. presumed enhancers), we applied the model to the full set of amplicons (N = 308) to generate residual values that represent observed RNA levels compared to expected. p-values for amplicon activity versus background were empirically defined using the distribution of standardized residuals (Figure 2B, Figure 2—figure supplement 3, Supplementary file 4). We identified 41 amplicons (13 %) with significantly increased RNA (p < 0.05), suggestive of positive activity, with 17 passing with a false discovery rate (FDR) < 0.1. Amplicon group was confirmed to be a significant predictor of mean amplicon activity (One-way ANOVA, p = 0.006), with the highest activity reported for the PutEnh and FBDHS groups.

Second, we also estimated activity based on RNA/DNA ratio (Figure 2C), a common metric used to report activity in published MPRAs (Inoue and Ahituv, 2015; Klein et al., 2020; Ashuach et al., 2019). We found that 71 amplicons were considered active using the criteria RNA/DNA ratio > 1.5 and RNA/DNA standard deviation less than its mean. Amplicons that exhibited RNA/DNA ratio > 1.5 but did not pass the regression model-based p-value potentially included sequences with weaker enhancer activity. 78 % (32/41) amplicons identified using model residuals were also active in the ratiometric comparison. Among 41 positive amplicons with significant regression residual activity, 20 (49 %) had mean ratio – 1 s.d. >1.5 across individual replicates, representing the amplicons with the strongest and most consistent MPRA-defined activity (Figure 2D). For statistical testing of RNA/DNA ratios, we used a Wilcoxon rank sum, a non-parametric test comparing DNA vs RNA proportion values across delivery replicates. 43 amplicons passed p < 0.05 with confidence intervals greater than 0 (one-tailed for increased RNA), and all of them had Benjamini-Hochberg corrected FDR < 0.05. In comparison to the linear model approach, 26 amplicons were significant at p < 0.05 in the Wilcoxon rank sum test and in the linear model.

We compared the 41 model-based significant amplicons to the remainder of the library for intersection with fetal brain epigenomics datasets from the Roadmap Epigenomics Project (Bernstein et al., 2010). We found higher than expected intersection of significant amplicons with genomic loci characterized as DNase hypersensitive (DNase, p = 0.0355). We also observed that significant amplicons had increased enrichment for H3K4me1 (p = 0.0274), and H3K4me3 (p = 0.0131), epigenomic marks associated with transcriptional activity and enhancer function. We did not find enrichment for H3K36me3, a histone mark associated with gene bodies, or for H3K9me3 and H3K27me3, histone marks associated with heterochromatin (Figure 2E).

We also intersected amplicons with a human neuron and glia ATAC-seq dataset from the Brain Open Chromatin Atlas (BOCA) (Fullard et al., 2018), vertebrate evolutionary conserved elements (UCSC Genome Browser), and digital transcription factor (TF) footprints (Vierstra et al., 2020). Similar to Figure 2E, we found marginally significant increased overlap of significant amplicons with human neuron ATAC-seq peaks (p = 0.0497). We also found increased enrichment of significant amplicons in open chromatin in human glia ATAC-seq (p = 0.0159). There were no significant differences in enrichment among amplicon groups in TF footprints in fetal brain or lung, nor in K562 cell lines, nor for conserved elements (Figure 2F). We also examined each epigenetic mark as a co-variate of a general linear model, to see if it improved a regression model predicting cDNA levels considering all amplicons (N = 308). If cDNA levels in our MPRA are reflective of true enhancer activity, we would expect epigenomic signatures associated with enhancer elements to improve prediction of cDNA expression. Indeed, Fetal DNase hypersensitivity (p = 0.0002), H3K4me1 (p = 0.0002), adult neuron ATAC-seq (p = 0.0073), and fetal brain TF footprints (p = 0.04) were found to be significant, while signatures not relevant to enhancer activity in brain were not (Supplementary files 5 and 6). Fetal brain DNase hypersensitivity, H3K4me1, and human neuron ATAC-seq improved the regression model (reduced BIC), and fetal brain DNase hypersensitivity and H3K4me1 signatures had the strongest predictive power of amplicon activity (0.2 and 0.178 point-biserial correlations, respectively).

Based on the intersections with DNaseI hypersensitive and neuronal ATAC-seq signatures, there was a high representation of ascertained likely enhancers (20/41) among significantly active amplicons. These 20 active amplicons represented 15.9 % of the set of ascertained likely enhancers, while the remaining 21 active amplicons represented 11.5 % of predicted negative sequences. The difference in amplicons with MPRA activity (model-based p-value < 0.05) comparing ascertained positive and negative amplicon sets was not significant based on Fisher’s exact test (odds ratio of 1.45, one-tailed p-value = 0.176). The representation of ascertained likely enhancers was higher among the most significant amplicons with FDR < 0.1 (12/17). These 12 amplicons represented 9.5 % of the ascertained likely enhancers, while the remaining five active amplicons represented 2.7 % of the predicted negative sequences, resulting in an odds ratio of 3.73. The enrichment of this highly significant set of MPRA amplicons for pre-ascertained likely enhancers was statistically significant (Fisher’s exact test, one-tailed p-value = 0.011). Therefore, consistent with our expectations, enhancer ascertainment based on open chromatin patterns in the brain was associated with MPRA activity.

Lack of strong effects and insufficient power for allele-specific activity analysis

Our MPRA library included a subset of amplicons harboring SNPs that were either lead SNPs from GWAS studies (121 amplicons), or SNPs in LD with lead SNPs within associated intervals at CACNA1C (21 amplicons) and SCN1A (62 amplicons). As amplicons were cloned from pooled human population DNA, we anticipated capturing representative alleles for these SNPs in our MPRA library. Indeed, 439/440 of these SNPs, representing 313/345 cloned amplicons, were represented in our library before viral packaging, but only 147 of them had minor allele frequency above 0.1, critical for allele comparison. Among the eight amplicons with significant MPRA activity, none exhibited significant allelic differences (Figure 2—figure supplement 4G). We further used luciferase assays in two cell lines (HEK and SH-SY5Y) to compare haplotype activity for one MPRA active enhancer, finding no significant difference between the alleles of the CACNA1C SNP containing amplicon #3 (Figure 2—figure supplement 4H). We note that despite strong DNA reference allele frequency correlation in the viral library DNA, limited allele coverage, unbalanced allele frequencies, and reduced correlation of RNA allele frequencies resulted in lack of power to detect moderate to small allelic differences in our experiments (Figure 2—figure supplement 4, Supplementary file 7).

Activity reproduces between full MPRA and miniMPRA experiments

Since the PutEnh group in our MPRA contained human orthologs of the miniMPRA sequences, this experiment enabled comparison of human and mouse sequences in an in vivo mouse brain context and offered further validation of reproducibility for our MPRA results. Eight amplicons were included in both experiments, and we observed general correlation between activity in the mouse miniMPRA and the human orthologs in the full MPRA library. These results further indicate reproducible activity and that there is conserved function between orthologous mouse and human sequences tested in the same postnatal mouse brain context (Figure 2—figure supplement 5).

Confirmation of in vivo P7 cortex MPRA results for single candidate sequences

To validate enhancer activity in the mouse brain at P7, we next cloned individual amplicons for selected positive and negative hits from the MPRA into the same HspMinP-EGFP 3’-UTR oriented reporter and generated scAAV9 for each construct. Amplicon #161 (FBDHS group), an enhancer candidate that overlaps both a DNaseI hypersensitive site in fetal human brain and a copy number variant region near the autism- and epilepsy-associated gene SCN2A, and that displayed particularly strong activity in the screen (Figure 3—figure supplement 1A), showed consistent expression of EGFP in the mouse brain at P7 (Figure 3A, top row). EGFP expression driven by this amplicon was seen wherever the brain had been exposed to the virus (based on mRuby3 expression), with a trend toward greater density of EGFP+ cells observed in the lower-middle layers of the cortex (lower half of Layer IV and Layer V). On the other hand, a predicted negative sequence that did not display enhancer activity in the screen, amplicon #264, did not show expression of EGFP (Figure 3A, bottom row, Figure 3—figure supplement 1B).

Figure 3 with 2 supplements see all
Functional validation of STARR-seq screen.

(A) Validation of positive and negative hits from in vivo AAV MPRA screen. Representative images of coronal sections of AAV-transduced P7 brains stained with an anti-GFP antibody is shown (left panels). Closeup of the boxed regions are shown in the panels on the right (from left to right: Red channel, mRuby3 injection control; Green channel, EGFP expression driven by candidate amplicon; Merge with DAPI in gray). The brain shown in the top row was transduced with AAV9-CAG-mRuby3 (injection delivery control) and scAAV9-HspMinP-EGFP carrying in the 3’-UTR Amplicon #161, a highly active amplicon in the AAV MPRA. In the bottom row, a similar transduction is shown for Amplicon #264, a negative control with no predicted enhancer activity that did not display activity in the in vivo AAV MPRA. (B) Functional validation of enhancer activity in different cell types. Brains were transduced as in A with AAV9-CAG-mRuby3 and scAAV9-HspMinP-EGFP carrying either Amplicon #161 or Dlx in the 3’-UTR. Brains were collected at P7 and stained for GFP and Ctip2, a transcription factor necessary for axon development in excitatory projection neurons in Layer V during embryonic development. Representative staining of coronal sections is shown (left panels). Zoomed in views of boxed regions are shown in single-channel images in the panels on the right (Green, EGFP; Red, mRuby3; Magenta, Ctip2). Ctip2 channel images are shown with EGFP+ cells outlined (top) and mRuby3+ cells outlined (bottom). (C) Quantification of Ctip2 co-labeling shown in B with additional animals co-transduced with AAV9-CAG-mRuby3 and AAV9-Dlx-βGlobinMinP-EGFP included for comparison. Individual GFP+ and mRuby3+ cells were counted and scored for whether each cell contained a Ctip2-positive nucleus. Cell counts were summed across images for the same brain. Data is presented as mean ± SEM for the fraction of fluorescent cells that are Ctip2+. Cells that expressed EGFP under the control of the inhibitory interneuron enhancer Dlx displayed a lower frequency of Ctip2+ nuclei compared to cells that drove EGFP under the control of amplicon #161 or drove mRuby3 under the control of the general mammalian promoter CAG (n = 5 animals co-injected with HspMinP-EGFP-#161 and CAG-mRuby3, four animals co-injected with HspMinP-EGFP-Dlx and CAG-mRuby3, and two animals co-injected with Dlx-βGlobinMinP-EGFP and CAG-mRuby3).

To further validate enhancer function in excitatory glutamatergic neurons, we counter-stained for Ctip2, a transcription factor involved in axonal development in excitatory cortical projection neurons in Layer V of the cortex (Arlotta et al., 2008; Leyva-Díaz and López-Bendito, 2013). Although expression of Ctip2 is not exclusive to excitatory neurons in adult mice (Nikouei et al., 2016), it is commonly used as an excitatory Layer V marker during embryonic and early postnatal development (Leyva-Díaz and López-Bendito, 2013; Alcamo et al., 2008; Leone et al., 2015; Gompers et al., 2017). We reasoned that if our in vivo MPRA reporter construct accurately reproduced cell-type specific enhancer activity, then we should see a difference in the Ctip2 overlap with EGFP driven by amplicon #161 compared to EGFP driven by Dlx. Indeed, after counterstaining brain sections transduced as described above with antibodies for Ctip2, we found that EGFP+ cells driven by Dlx exhibited significantly lower frequencies of Ctip2+ nuclei compared to either CAG-driven mRuby3+ cells or EGFP+ cells driven by amplicon #161 (Figure 3B and C). These results demonstrate that our MPRA could accurately reflect enhancer activity of particular candidate sequences in vivo.

Dissection of regulatory elements within the third intron of CACNA1C

Our library included amplicons spanning across a psychiatric disorder-associated LD interval within the ~330 kb third intron of the gene CACNA1C, which encodes the α1 subunit of the L-type voltage-gated calcium channel CaV1.2. This region contains previously in vitro defined regulatory elements (Eckart et al., 2016; Roussos, 2014) harboring schizophrenia- or bipolar disorder-associated SNPs, predominantly rs1006737 (Ferreira et al., 2008; Sklar, 2008), rs2007044 (Ripke, 2014), rs4765905 (Hamshere et al., 2013), and rs4765913 (Ripke, 2014). Via MPRA, we assessed the activity of 17 amplicons within the CACNA1C intron covering SNPs in LD (r2 >0.8). As a comparison set, we also included five amplicons covering SNPs in LD (r2 >0.8) associated with a non-neuronal SNP associated with hematocrit (rs7312105) (van der Harst et al., 2012).

Three amplicons within the CACNA1C psychiatric disorder LD interval drove significant RNA transcript expression in both linear and ratiometric models in our MPRA: amplicons #3 (overlapping rs1108075 and rs11062166), #6 (overlapping rs12315711 and rs2159100), and #7 (overlapping rs11062170 and rs4765905) (Figure 4A). In comparison, no amplicons from the hematocrit LD interval passed significance threshold for activity (Figure 4—figure supplement 1). We validated #3 and #6 (highlighted in blue, Figure 4A) for enhancer activity in postnatal brain using our single-candidate reporter construct strategy. We also validated lack of activity for #2, an amplicon in the same LD block that did not have significant enhancer activity based on the MPRA results (highlighted in red, Figure 4A). Similar to the negative controls and consistent with MPRA findings, #2 did not drive significant EGFP expression in the brains of P7 mice (Figure 4B, top panels). On the other hand, #3 and #6 drove detectable EGFP expression in cells throughout the cerebral cortex wherever viral exposure was detected by the co-injected CAG-driven mRuby3 positive control (Figure 4B, middle and bottom panels respectively). In a follow-up experiment, we transduced scAAV9-HspMinP-EGFP-#3 and AAV9-CAG-mRuby3 at P0 but waited to collect the brains until P28, at which time we observed that amplicon #3 drove EGFP expression in cortical neurons of adolescent mice (Figure 4C, Figure 4—figure supplement 2).

Figure 4 with 3 supplements see all
Functional dissection of the large third intron of CACNA1C.

(A) UCSC Genome Browser representation of amplicons #1 through #7 in the third intron of CACNA1C (hg38, chr12:2,220,500–2,242,499). UCSC tracks for GENCODE v36 and 100 vertebrate conservation, normalized coverage of aligned reads for the previral library, and DNA and RNA samples for the four biological replicates are shown; y-axis scale is 0–50,000 reads. MPRA analysis is shown as graphs of linear model residuals and -log10 transformed p-values. Three amplicons, #3, #6, and #7, were found significantly active in our assay. Amplicons which were tested in single-candidate experiments are highlighted (red for no activity in MPRA, blue for significant activity in MPRA) (B) Confocal images of single-candidate validation of amplicons #2, #3, and #6. Mice were transduced at P0 with two AAV vectors: one for an HspMinP-EGFP-3’-UTR enhancer reporter construct carrying the indicated amplicon and a second control vector, CAG-mRuby3. Brains were fixed at P7 and sectioned and stained with an antibody for EGFP for signal amplification. Tiled, whole section images are shown on the left. Closeup of boxed regions are shown in the panels on the right. Green, EGFP; red mRuby3; grey, DAPI. These experiments validated robust EGFP expression driven by the two positive MPRA hits (#3 and #6), with substantial EGFP reduction for the MPRA negative amplicon #2. (C) Mice were transduced with AAV including positive amplicon #3 and processed as in B, but were raised to P28 before fixing, sectioning, and staining.

These results suggest that the AAV MPRA implementation was effective at screening putative regulatory sequences for in vivo activity in the brain, with reporter expression concordant between MPRA results and single candidate tests of EGFP expression in P7 mouse forebrain for five individually validated amplicons, and that these results could be extended to study activity in later development.

Discussion

The ability to test in parallel the regulatory capacity of candidate sequences in vivo offers considerable potential for elucidating the role of enhancers in the developing brain and for efficient identification of enhancers that are capable of driving precise expression patterns. Here, we report successful rAAV-mediated delivery of a 3’-oriented parallelized enhancer reporter assay to early postnatal mouse brain and demonstrate the utility of this approach in screening human DNA sequences for regulatory activity in the brain. Via this MPRA, we identified novel presumed enhancers active in P7 mouse cortex, showcasing example applications including identifying regulatory sequences associated with ASD-associated loci, screening amplicons that include lead SNPs from human genetic studies, and comprehensive testing for enhancers harboring SNPs from disease-associated non-coding LD intervals. We show that amplicons active in our MPRA were more likely to have enhancer signature across functional genomics datasets and that orthologous mouse sequences tested in an independent parallel reporter assay showed strong activity correlation. Finally, we validated MPRA activity predictions via imaging of EGFP for four positive and two negative sequences in vivo in P7 brain, and we confirmed that activity for one of these positives continued to P28. This study provides a model for applying this powerful screening approach in vivo in mammalian brain.

Our study represents one of the first parallelized enhancer assays testing human sequences in early postnatal mouse brain via rAAV-mediated delivery. We validated MPRA performance, verifying both the capacity for the construct to drive characteristic cell-type restricted expression for a known interneuron enhancer, as well as demonstrating EGFP protein expression driven by novel putative enhancers that were active in vivo in P7 mouse cortex. Based on MPRA and validation experiments, the main sources of variation across MPRA replicates were rAAV transduction rate and injection site (Figure 1—figure supplement 4, Figure 3—figure supplement 2, Figure 4—figure supplement 3), highlighting the need for delivery controls to ensure reproducibility when applying MPRAs in vivo via viral delivery. We observed some variability in amplicon RNA dropout across replicates, likely due to a combination of transduction efficiency variability across animals, sensitivity recapturing amplicons with low viral representation, and PCR stochasticity (Kebschull and Zador, 2015). Based on the high degree of correlation between biological replicate #4 and its high amplification cycle technical replicate, we do not believe that PCR amplification bias explains this replicate-specific dropout. On the other hand, we consider it likely that stochastic differences in the number and population of transduced cells in each replicate would alter the apparent activity of amplicons in the assay. For this reason, it is critical that AAV MPRAs use high titer preparations and take advantage of serotypes with high infectivity for the target tissue to maximize the number and diversity of cell types sampled. While not an obvious issue here, the effect of PCR-based clonal amplification bias in other MPRAs has been shown to be reduced with the addition of barcodes and unique molecular identifiers to the reporter construct (Neumayr et al., 2019).

Our rAAV-based approach enabled efficient transduction of the mammalian central nervous system. Our approach differs from recent MPRA strategies which use lentiviral vectors (Klein et al., 2020; Gordon et al., 2020; Morgan et al., 2020; Inoue et al., 2017) in that AAV is expressed episomally while lentivirus must first integrate into the genome. Lentivirus is especially useful for in vitro applications with difficult to transfect cell types, as AAV is not very efficient at transducing cell lines (Ellis et al., 2013). AAV offers a complementary strategy to lentivirus in vivo, as well as considerable flexibility in targeting various tissue types based on capsid serotypes (Srivastava, 2016). AAV is gaining traction as a powerful method for enhancer-driven, cell-type-specific manipulations (Graybuck et al., 2021; Mich, 2021; Rubin et al., 2020).

Amplicons in this assay were approximately 900 bp long, representing some of the longer sequences tested in MPRAs to date. Our rationale, that longer sequences afford more native biological context to enhancer activity, appears in accordance with a recent study comparing MPRA designs (Klein et al., 2020), which finds that longer sequences add biological signal including enrichment of an RNA polymerase III catalytic subunit, histone-modifying enzymes, and increased transcription factor binding. In our vector design, ~900 bp was the maximum length an amplicon could be and still be efficiently packaged into an scAAV. Using traditional, single-stranded AAV vectors would increase packaging capacity, potentially with a trade-off of reduced transduction efficiency. However, increased length, especially when inserted into the 3’-UTR, may contribute to increased variability across replicates (Klein et al., 2020), and may drive mRNA degradation (Rabani et al., 2017) that would inhibit RNA transcript detection in the assay. We detected a number of amplicons that had reduced RNA compared to expected background levels, which may represent sequences with silencer or repressor activity such as those that have been reported in recent MPRA screens (Doni Jayavelu et al., 2020). However, because of both the insertion of these amplicons into the 3’-UTR and the length of the amplicons, the modified 3’-UTR may have caused transcripts to be subject to RNA degradation. Thus, we are hesitant to draw conclusions about these amplicons with lower than expected activity without further characterization. Nevertheless, while artifacts from inserting sequences into the 3’-UTR may impact the STARR-seq design and our results, we show that orthologous mouse sequences exhibited correlated presence and absence of activity and similarly that amplicons with absence or presence of MPRA activity across replicates consistently reproduced in independent single deliveries to the brain. Further, active amplicons were enriched for DNase Hypersensitivity Sites, H3K4me1, and H3K4me3 peaks generated from fetal human brain tissue, and with ATAC-seq peaks from FACS-purified neurons and glia from postmortem human brain. Overall, the combination of reproducible MPRA activity, enrichment for neuron-specific enhancer signatures, and activity validation in AAV experiments provide evidence that positive results of the MPRA indeed act as enhancers.

Parallelized enhancer assays such as the one reported here have the possibility to become fundamental for assessing active enhancers in the brain and offer great potential for functional dissection of sequence-based enhancer activity. The vast majority of sequence variants associated with genetic risk for neurodevelopmental and neuropsychiatric disorders are found in non-coding regions (Zhang and Lupski, 2015), many of which are presumed to be located in enhancers (Ripke, 2014). As an example, our assay enabled functional annotation of 17 intronic CACNA1C amplicons spanning regions harboring schizophrenia- and bipolar disorder-associated SNPs within strong linkage disequilibrium and identified three regions with enhancer activity in the brain. Notably, we detected no activity in our assay from amplicon #5 (LD group), which spans a region harboring the SNPs rs1006737 and rs2007044, two of the most statistically significant risk SNPs (Moon et al., 2018) from GWAS of schizophrenia and bipolar disorder (Figure 4A). Two of the three amplicons we identified as enhancers in P7 brain had prior evidence of enhancer activity in in vitro models (Eckart et al., 2016; Roussos, 2014), indicating our assay can reproducibly detect enhancer activity identified in orthogonal studies. The region spanning amplicon #6 was previously annotated as a putative enhancer in physical proximity to the CACNA1C promoter in human induced pluripotent stem cell (hiPSC)-derived neurons (Roussos, 2014). The region spanning amplicon #7 exhibited enhancer activity via luciferase assay in SK-N-SH cells (Eckart et al., 2016). In addition, #7 was predicted as a putative enhancer based on open chromatin signatures in both fetal and postmortem human brain (Dunham, 2012; Bernstein et al., 2010; Fullard et al., 2018). Finally, we show that activity of one of these CACNA1C intronic enhancers continues at P28, highlighting the potential to apply the MPRA and single enhancer AAV testing at later ages. Although our design was not well powered to compare the activity of sequence variants (Figure 2—figure supplement 4), future studies using allele-balanced libraries, libraries with reduced complexity, or libraries with allele-specific barcodes could greatly improve the sensitivity and power of this assay to evaluate the effects of sequence variation on enhancer activity. We attempted to assess allelic differences in enhancer function for amplicon #3 using a luciferase assay, but we found no difference in activity between the reference and variant SNPs. However, disease-relevant SNPs may alter enhancer function by modulating activity in specific cell-type or developmental contexts which were not replicated in the in vitro luciferase assay. For this reason, in vivo parallel functional assays will be critical in understanding how sequence variation within enhancers contributes to altered gene expression, including for the disease-associated CACNA1C locus.

In summary, our in vivo parallelized enhancer reporter assay in P7 mouse brain enabled us to identify novel enhancers active in early postnatal mouse brain and pinpoint potential regulatory regions where disease-associated sequence variation (e.g. GWAS SNPs) may contribute to transcriptional pathology. Our results highlight the opportunities gained by in-depth functional dissection of enhancers in the developing brain via in vivo adaption of MPRAs, toward deeper understanding of enhancer biology across neurodevelopment and in the etiology of neuropsychiatric disorders.

Materials and methods

Selection of library candidates

Request a detailed protocol

For the miniMPRA, 16 noncoding sequences were manually selected based on putative regulatory activity by mouse H3K27ac data (Nord, 2013). For the full MPRA, 408 candidates were selected for screening based on the following criteria: 22 regions containing CACNA1C-associated SNPs (rs1006737, rs2007044, rs4765913, rs4765905, and rs7312105) and variants in linkage disequilibrium (r2 >0.8) of listed SNPs; 70 regions containing SCN1A-associated SNPs (rs7587026, rs12987787, rs11890028, rs6732655, rs11692675) and variants in linkage disequilibrium (r2 >0.8); 29 epilepsy-associated SNPs (GCST001662: rs13026414, rs72823592, rs10496964, rs12059546, rs39861, rs2717068, rs771390, rs10030601, rs12720541; GCST002547: rs28498976, rs2947349, rs1939012, rs55670112, rs535066, rs111577701; GCST002141: rs492146, rs72700966, rs61670327, rs143536437, rs11861787, rs72698613, rs12744221; GCST000691: rs346291, rs2601828, rs2172802, rs2841498, rs1490157, rs2475335; GCST001329: rs2292096); 142 regions overlapping schizophrenia-associated SNPs (Ripke, 2014); 129 regions overlapping DNaseI hypersensitive sites identified from fetal human brain (Dunham, 2012; Bernstein et al., 2010) found within autism-associated copy number variants (Turner, 2016; Turner et al., 2017); and 16 human homologs of the mouse miniMPRA sequences. Schizophrenia- and epilepsy-associated SNPs were identified using the NHGRI-EBI GWAS Catalog (Buniello et al., 2019). SNPs in linkage disequilibrium for query SNPs listed as CACNA1C- or SCN1A-associated above were identified using HaploReg v3 (Ward and Kellis, 2012) using LD reported from the 1000 Genomes Project (Auton et al., 2015). Full selection metadata for each region of interest can be found in Supplementary files 1 and 2.

Vector preparation and library cloning

Request a detailed protocol

We replaced the CMV-EGFP expression cassette in scAAV-CMV-EGFP (Rosas et al., 2012) with a synthesized gBlocks Gene Fragment (IDT) containing the Hsp1a minimal promoter followed by the gene for EGFP with a cloning site inserted into the 3’-UTR. This cloning site consisted of two restriction sites flanked by two 35 bp sequences used for Gibson homology. The resulting plasmid, pscAAV-HspMinP-EGFP, housed the 3’-UTR-oriented enhancer reporter construct between the ITR sequences necessary for packaging scAAV particles (Figure 1—figure supplement 1). Unlike other STARR-seq constructs (Arnold et al., 2013; Muerdter et al., 2018; Lea et al., 2018), we did not include an intronic sequence in the reporter ORF, as we preferred to reduce construct length and complexity considering the novel implementation in an scAAV context and 900 bp amplicon sequence.

We designed 16 PCR amplicons (miniMPRA) and 408 PCR amplicons (MPRA) covering the regions of interest described above by customizing a batch primer design pipeline using Primer3 (Untergasser et al., 2012; Supplementary files 1 and 2). PCR primers for the miniMPRA were designed manually using the mouse genome assembly mm10. All PCR primers for the full MPRA were designed using the human genome assembly GRCh37 and the Primer3 human mispriming library cat_humrep_and_simple.fa. In brief, amplicons were designed to fit the following criteria with all other parameters set to the Primer3 default: amplicon product size between 880–940 bases; primer size between 18–25 bases with an optimum of 20 bases; primer anneal temperature between 57 °C–61 °C, with an optimum of 60 °C; maximum primer pair annealing temperature difference of 2 °C; and primer GC content between 30 %–70 %. We excluded any primer pairs that contained a SNP within 50 bases of either end of the PCR product. PCR primers were then appended with homology sequences (forward primer: tgtacaagtaacatagtcatttctagaTTAATTAA; reverse primer: gctctagtcgacggtatcgataagcttGGCGCGCC) for Gibson Assembly cloning, up to a total primer length of 60 bases.

To generate the amplicon library, we performed individual PCRs using Phusion Hot Start II High Fidelity DNA Polymerase (Invitrogen #F549S) in 10 µL reaction volumes in 96-well plates with 50 ng of pooled human DNA (1:1 mix of Coriell DNA pool #NA13405 and Coriell DNA pool #NA16600) as the DNA template. Thirty cycles of PCR were performed with annealing at 60 °C and 30 s extension time. We then quantified the concentrations of each PCR reaction using Quant-iT PicoGreen dsDNA Assay Kit (Invitrogen #P7589) to calculate equimolar amounts of each PCR product for downstream pooling.

We next linearized the vector pscAAV-HspMinP-EGFP (Figure 1—figure supplement 1) using PacI (NEB #R0547L) and AscI (NEB #R0558L). AAV DNA vectors are prone to recombination at the ITR sites, so we also performed a separate restriction digest using SmaI (NEB #R0141L) using pscAAV-HspMinP-EGFP to check for vector integrity. The PacI/AscI digested vector was run on a 1.0 % agarose gel, excised, and purified using the Promega Gel & PCR Cleanup system (#A9282) and quantified using Nanodrop. After concentration quantification of both vector and PCR products, 5 ng of each PCR product was pooled into one tube for purification using the QIAquick PCR purification kit (Qiagen #28106) and eluted into 30 µL Nuclease Free Water. Pooled DNA concentration and quality was then assessed by Nanodrop and Qubit dsDNA High Sensitivity assay kit (Thermo Fisher #Q33230). We also ran the pooled PCR products on a 1.0 % agarose gel to verify the expected product sizes (~1–1.1 kb).

We performed two separate Gibson assemblies (NEB #E2611S) using 100 ng of PacI/AscI-linearized vector and 3-fold molar excess of pooled PCR amplicons in a 20 µL reaction volume. We then incubated the Gibson assembly mixtures for 1 hour at 50 °C. For bacterial transformation, 2.0 µL of Gibson assembly reaction was added to 50.0 µL of NEB Stable competent cells (#C3040I), in triplicate for each reaction. Competent cells were heat shocked at 42 °C for 30 s, and recovered in 950 µL of SOC media for 1 hour at 37 °C. A total of 100 µL of each transformation replicate was plated onto 100 µg/mL carbenicillin agar plates and grown overnight at 30 °C. 900 µL of cells were transferred directly into 250–500 mL of 100 µg/mL carbenicillin Luria-Bertani broth and grown overnight on a 30 °C shaker at >250 rpm. Agar plates were compared for colony growth, indicating Gibson assembly success. Individual liquid cultures were purified using the Macherey-Nagel Xtra Maxi EF kit (#740424), eluted into 500 µL Nuclease Free Water, and assessed for concentration and purity using Nanodrop. Randomly sampled individual colonies as well as all maxiprep replicates were assessed by Sanger sequencing to confirm assembly at the cloning site within pscAAV-HspMinP-EGFP, then maxiprep replicates were pooled 1:1 by concentration for the final library used for viral packaging.

Viral packaging

Request a detailed protocol

For both the miniMPRA and the full AAV MPRA, adeno-associated virus (AAV9(2YF)) (Dalkara et al., 2012) was produced using a triple plasmid transfection method in 293T cells (Grieger et al., 2006). After ultracentrifugation, the interphase between the 54 % and 40 % iodixanol fraction, and the lower three-quarters of the 40 % iodixanol fraction, were extracted and diluted with an equal volume of phosphate-buffered saline (PBS) plus 0.001 % Tween 20. Amicon Ultra-15 centrifugal filter units (Millipore, Bedford, MA) were preincubated with 5 % Tween in PBS and washed with PBS plus 0.001 % Tween. The diluted iodixanol fractions were buffer-exchanged and concentrated to 250 μL in these filter units. Virus was washed three times with 15 mL of sterile PBS plus 0.001 % Tween. Vector was then titered for DNase-resistant vector genomes by real-time PCR relative to a standard.

Injections of library viruses to neonatal mice

Request a detailed protocol

C57BL/6 mice from Jackson Laboratories were used for mouse experiments. All procedures were performed in accordance with the ARVO statement for the Use of Animals in Ophthalmic and Vision Research and were approved by the University of California Animal Care and Use Committee (AUP #R200-0913BC). Surgery was performed under anesthesia, and all efforts were made to minimize suffering. Neonatal mice (on a C57BL/6J background) were injected in the prefrontal cortex in two separate locations with 0.5 µL of virus. Only the left side of the brain was injected. A beveled 34-gauge needle (Hamilton syringe) was used to push through the skull and to inject virus into the brain. Virus was delivered slowly by hand over the course of 1 min. Mice were anesthetized on ice for 3–4 min prior to injection, and after injection mice were heated on a warm pad before being returned to the cage.

Tissue collection, library preparation, and sequencing

Request a detailed protocol

Bulk forebrains were dissected from P7 mice injected with the viral library at P0. Dissection included whole forebrain, separated into left and right hemispheres, after removing surface tissue and skull. Hemispheres were stored in RNAlater Stabilization Solution (Thermo Fisher #AM7020) for downstream dual DNA and RNA collection. Hemispheres were then physically homogenized and processed using the AllPrep DNA/RNA Mini Kit (Qiagen #80204) to collect total RNA and genomic DNA from the same tissue. RNA was treated on-column with 1 µL RNase-Free DNase (Qiagen #79254). First strand cDNA was synthesized from purified total RNA using the SuperScript VILO cDNA Synthesis Kit (Invitrogen #11755050) with random primers. We amplified the variable regions of the reporter library from cDNA and genomic (g)DNA using PCR with Phusion Hot Start II High Fidelity DNA polymerase (Invitrogen #F549S) using primers flanking the variable region within the 3’-UTR of pscAAV-HspMinP-EGFP (forward primer, Amplicon PCR Primer F: GATCACTCTCGGCATGGAC; reverse primer, Amplicon PCR Primer R: GATGGCTGGCAACTAGAAGG). PCR consisted of 30 cycles (or 35, for technical replicate Sample 4–35) at 60 °C for annealing and 1 min for extension. PCRs were then loaded onto a 1.0 % agarose gel, the ~1–1.1 kb fragment excised, and gel fragment purified using the Promega Gel & PCR Cleanup system (#A9282). Purified gel fragments were then re-purified using the Qiagen MinElute PCR Purification Kit (#28004) to remove residual chemical contamination, and concentration and quality assessed with Nanodrop and the Qubit dsDNA High Sensitivity assay kit (Thermo Fisher #Q33230). Purified amplicons were prepared for sequencing using the Nextera XT DNA Library Preparation Kit (Illumina #FC-131–1001) using 1 ng of purified amplicon library per sample. Quality and concentration of sequencing libraries was assessed using an Agilent BioAnalyzer instrument. Each library was quantified and pooled for sequencing on the Illumina HiSeq 4000 using a paired end 150 bp sequencing strategy (Supplementary file 8). The miniMPRA was sequenced on a MiSeq instrument since lower coverage was required due to lower library complexity.

Sequencing alignment

Request a detailed protocol

Raw sequencing reads were trimmed to remove Nextera XT adapters using NGmerge (v0.2_dev) (Gaspar, 2018) with maximum input quality score set to 41 (-u 41). Trimmed reads were aligned to the human genome (GRCh38) using BWA-MEM (v0.7.17) (Li, 2013) with an option marking shorter split hits as secondary (-M) and otherwise standard parameters. As additional quality control, reads were also aligned to the mouse genome (GRCm38) to check for cross-contamination of mouse DNA from the amplicon PCR. Aligned SAM files were processed using Picard tools (v 2.23.3) (Eckart et al., 2016): aligned reads were sorted by coordinates and converted to BAM files using SortSam, optical duplicates were flagged by MarkDuplicates, and BAM file indexing was conducted using BuildBamIndex. The number of de-duplicated reads within in silico PCR amplicon coordinates was counted using samtools view -F 0 × 400 c (v1.7) (Li et al., 2009) and reported as the raw amplicon count.

Allele-specific MPRA activity

Request a detailed protocol

Allele-specific counts were generated from deduplicated bam files using bam-readcount (v0.8.0) (https://github.com/genome/bam-readcount; Khanna et al., 2021) with min-base-quality set to 15, and min-mapping-quality set to 20. Allelic counts were further processed using a custom R script, applying min base quality filter of 25, and min allele frequency filter of 0.01, producing a set of 440 biallelic SNPs.

Bioinformatics analysis with multiple linear regression

Request a detailed protocol

Analyses were performed in R (v4.0.2) and RStudio (v1.3.1093) software. To consider an amplicon as present in our MPRA library, we set a threshold of at least 200 read counts per amplicon in our library sample, resulting in a reduced dataset from 408 to 345 amplicons. This threshold was established based on the distribution of library counts. No GC or sequence bias was detected following in silico PCR analysis when present and missing amplicons were compared. Two primer pairs, expected to produce PCR products in our original design using the hg19 reference genome had no predicted products in our in silico approach, as well as in our MPRA library aligned to the GRCh38 genome (Supplementary file 9). In silico PCR analysis was performed using a custom R script and DECIPHER R package (Erik, 2016).

Raw amplicon counts were converted to proportion by normalizing raw count values for each amplicon to all amplicon counts in a given sample. A value of 1 was added to the numerator and denominator to avoid dividing by zero. For downstream analysis, mean amplicon proportions for genomic DNA (‘DNA’) and cDNA (‘RNA’) were calculated across all samples excluding the technical replicate Sample 4–35. RNA/DNA ratio (‘ratiometric activity’) was calculated by dividing porportionRNA by proportionDNA for each amplicon per replicate. After proportion conversion, the dataset was cleaned by removing low count amplicons with less than 200 raw counts in any of the DNA samples, and mean amplicon library proportions less than 2–15, resulting in 308 amplicons for downstream analysis.

The dataset used for building the linear model was selected by first removing the 10 % of amplicons with the highest and 10 % with the lowest mean ratiometric activity across the four biological replicates, resulting in 247 amplicons used for training the model. The model was fit using log2-transformed mean RNA and log2-transformed mean DNA proportions, with amplicon GC content provided as a model covariate. GC content had a significant impact, but low effect size, on the model as demonstrated by higher P values, 1.416 × 10-8 vs < 2.2 × 10-16, and lower F statistics, 34.446 vs 659.44, when log2 DNA proportions were compared with the GC content. Inclusion of the GC content was justified based on the reduction of Bayesian information criterion (BIC), from 626.2976 to 599.1863, for models without and with GC content, respectively. The normality of residuals distribution was confirmed using the Kolmogorov-Smirnov test and a graphical approach. The model was applied to the complete dataset, and one-tailed p-values were calculated from the Z-scaled distribution of residuals, using the normal distribution implemented using R stats pnorm (R Development Core Team, 2018). Empirical tail area-based FDR q-values, estimated from Z scores, were calculated using fdrtool R package (Strimmer, 2008), with threshold adjustment for one-tail testing. We compared the linear activity model to using ratiometric activity alone. To control low-count variability, 1000 counts were added to RNA and DNA counts to stabilize ratios where counts were low. Mean ratiometric activity and standard deviation was then calculated for each amplicon across all biological replicates, excluding the technical replicate Sample 4–35. We compared the mean ratio, mean ratio – 1. s.d., ranked residuals, and p-value for all amplicons passing initial quality control. We also compared our linear model with a Wilcoxon rank sum, a non-parametric test comparing DNA vs RNA proportions, using Benjamini-Hochberg FDR < 0.05 for multiple testing correction.

Intersections with epigenomics datasets

Request a detailed protocol

Amplicons were intersected with epigenomics datasets from Roadmap (macs2 peaks filtered at q < 0.05) (Bernstein et al., 2010), with BOCA merged neuron and glia ATAC-seq peaks from postmortem human brain tissues (Fullard et al., 2018), with DNase I digestion-based digital TF footprints (Vierstra et al., 2020), and with vertebrate evolutionary conserved elements in the human genome from the UCSC Genome Browser. Consolidated peak intervals (genomic ranges) were merged prior to intersecting using R package GenomicRanges reduce function. Intersections were carried out using GenomicRanges findOverlaps function. Significance of enrichment was calculated using a one-tailed permutation test with 20,000 random samplings. The initial selection bias for active amplicons was accounted for by using the background of our MPRA library for random sampling, ensuring the robustness of statistical testing.

The predictive power of epigenomic signatures for MPRA activity was tested by adding the dichotomous Boolean variables of epigenomic intersection to a general linear model (GLM) predicting the level of RNA from DNA, and a GC content covariate, one epigenomic intersection at a time using the R stats glm function. T-test p-values were extracted from model summary, and models with and without the epigenomic covariates were compared using BIC. In addition to the GLM, we calculated point-biserial correlations between the Z-scaled residuals from our original lm model, and each epigenomic signature intersection using the biserial.cor function from the ltm R package (Rizopoulos, 2006).

Single-candidate enhancer reporter cloning

Request a detailed protocol

To validate selected positive hits, single-candidate enhancer reporter plasmids were constructed. Candidate sequences, amplicons #2, #3, #6 (LD group), #161 (FBDHS group), or #264 (PutEnh group negative control) were cloned by PCR from a pooled sample of human DNA (1:1 mix of Coriell DNA pool #NA13405 and Coriell DNA pool #NA16600) using the same primer sequences as used to clone the amplicons in the combined test library. Each amplicon was then inserted between the PacI/AscI cloning site downstream of EGFP in pscAAV-HspMinP-EGFP, using In-Fusion (Takara Bio #639649) or Gibson Assembly (NEB #E2611S) according to manufacturer’s specifications. In-Fusion or Gibson reaction products were used to transform Stellar competent cells (Takara Bio #636763) via heat shock at 42 °C and ampicillin-resistant clones were selected at 37 °C using LB-agar plates inoculated with carbenicillin. Clones were confirmed by PCR and Sanger sequencing. Amplicon #2 included the following alleles: A at rs4765904, A at rs1108221, and C at rs1108222. Amplicon #3 included the following alleles: G at rs1108075 and G at rs11062166. Amplicon #6 included the following alleles: C at rs2159100 and T at rs12315711. The integrity of the viral ITR sequences was verified by restriction digest with XmaI (NEB #R0180L) before proceeding with AAV packaging.

Single-candidate enhancer reporter AAV packaging

Request a detailed protocol

Adeno-associated viruses for single-candidate enhancer reporter constructs were packaged in-house using a triple-transfection protocol followed by concentration of viral particles from culture media using a table-top centrifuge protocol (Broussard et al., 2018). Briefly, for each construct, ~106 AAV293 cells (Agilent #240073) were plated on a 10 cm tissue culture dish in 10 mL of standard DMEM media (Corning #10–013-CV) supplemented with 10 % FBS (ThermoFisher #10082139) and 1 % penicillin/streptomycin (ThermoFisher #15070063) and maintained in a humidified incubator at 37 °C and 5 % CO2. After cells reached ~60 % confluency (1–2 days), the cells were passaged and split onto two 10 cm dishes and grown for an additional 1–2 days before passaging again onto two 15 cm dishes in 20 mL of supplemented culture media each. When the cells in 15 cm dishes were 60–80 % confluent the triple-transfection protocol was performed. The media was changed for fresh, supplemented DMEM 1 hr before transfection and the culture was returned to the incubator. 17 µg of enhancer reporter plasmid (or CAG-driven mRuby3 control plasmid) was mixed with 14.5 µg AAV helper plasmid (gift from Tian lab), 8.4 µg AAV9 rep/cap plasmid (gift from Tian lab), and 2 mL jetPRIME buffer (Polyplus #114–15), vortexed 10 s and incubated at room temperature for 5 min. Eighty µL jetPRIME transfection reagent (Polyplus #114–15) was then added to the plasmid/buffer mixture, vortexed, and incubated at room temperature for 10 min. One mL of jetPRIME transfection solution was added dropwise to each 15 cm culture dish which was then returned to the incubator overnight. The next day, the transfection media was exchanged for fresh, supplemented DMEM, and the cultures were incubated for an additional 2 days to allow viral particles to accumulate in the media. Approximately 3 days post-transfection, ~40 mL of media from both 15 cm culture plates was combined in a 50 mL conical centrifuge tube. Cellular debris was pelleted using a table-top centrifuge at 1500 x g for 1 min, and the supernatant was filtered over a 0.22 µm filter (EMD Millipore #SLGP033RS) into a fresh conical tube. The viral particles were precipitated from the media using AAVanced concentration reagent (System Biosciences #AAV100A-1) according to manufacturer’s instructions. Briefly, 10 mL of pre-chilled AAVanced was added to 40 mL filtered media, mixed by pipetting, and then incubated at 4 °C for 24–72 hr. After incubation, viral particles were pelleted by centrifugation at 1500 x g for 30 min at 4 °C. The supernatant was removed, the viral pellet was resuspended in 500 µL cold, serum-free DMEM, transferred to a 1.5 mL Eppendorf tube, and centrifuged again at 1500 x g for 3–7 min. At this step, if there was an AAVanced/viral pellet the supernatant was removed by aspiration and the pellet was resuspended in 40–60 µL cold Dulbecco’s PBS (Fisher Scientific #MT21031CV). If there was no viral pellet after this final centrifugation, we used the viral DMEM solution as our final virus prep. The virus solution was then aliquoted, flash frozen on liquid nitrogen, and stored at –80 °C until use. Once thawed, an aliquot was stored at 4 °C.

Viral titer was estimated by qPCR with an MGB-NFQ FAM-labeled probe that recognized GFP (Life Technologies, sequence: 5’-CTTCAAGGAGGACGGCAA-3’) and using pscAAV-HspMinP-EGFP vector plasmid as a standard. DNA was purified from 1 µL of virus preparation and 20 ng of plasmid DNA using the QIAGEN DNeasy Blood and Tissue kit (QIAGEN #69504). qPCR reactions of serial-dilutions of the DNA samples and plasmid standards were prepared using TaqMan Universal PCR Master Mix (ThermoFisher #4304437) in a 384-well plate and run on an Applied Biosystems QuantStudio Real-time PCR instrument at the UC Davis School of Veterinary Medicine Real-time PCR Core Facility. Viral titers ranged from 6.8 × 1010–2.6 x 1012 genome copies/mL, and all viruses were normalized to the lowest titer for each experiment.

Injections of single-candidate enhancer reporter viruses to neonatal mice

Request a detailed protocol

Intracranial virus injections were performed on C57BL/6 mice at P0-1 using protocols approved by the UC Davis Institutional Animal Care and Use Committee (Protocol #20506). First, mouse pups were cryo-anesthetized until cessation of pedal withdrawal reflex. Each pup was then injected in each lateral ventricle with 1 µL of a virus mixture consisting of a 2:1 ratio of enhancer reporter virus to CAG-mRuby3 control virus and containing 0.06 % Fast Green dye (Grainger #F0099-1G) (2 µL total, ~1.4 x 108 particles of enhancer reporter virus). The injection sites for each ventricle were located midway between anatomical landmarks lambda and bregma, approximately midway between the eye and the sagittal suture. After injection, the mouse pup was placed in a warmed recovery chamber for 5–10 min before being returned to its parent’s home cage.

Immunohistochemistry

Request a detailed protocol

After 7 or 28 days of virus incubation, the mice were anesthetized under isoflurane and sacrificed by transcardial perfusion with 4 % PFA in PBS. The brain was then dissected and placed in 4 % PFA overnight to complete fixation. The fixed brain was dehydrated in a 30 % sucrose PBS solution for 3–4 days before freezing in O.C.T. medium (VWR #25608–930). 30 to 35 µm coronal sections were cut on a cryostat and collected in PBS. Floating sections were stained in 24-well plates using the following protocol. All steps at 4 °C and room temperature were performed with gentle agitation on an orbital shaker (Cole-Parmer product #60–100 RPM). Antigen retrieval was performed in 1 x Citrate buffer (pH 6.0; Sigma-Aldrich #C9999-1000ML) at 60 °C for 1 h. After allowing to cool for 20 min, sections were permeabilized for 20 min at room temperature in PBS containing 0.5 % Triton X-100. Sections were then blocked for 1 h at room temperature in a blocking solution containing 0.1 % Triton X-100 and 5 % milk in PBS. The sections were then transferred to blocking solution containing primary antibodies and incubated for 1–3 days at 4 °C. After primary antibody incubation, sections were washed 5 times in 0.1 % Triton X-100 PBS for 20 min each time. Sections were then transferred to blocking solution containing secondary antibodies and incubated for 45 min to 1 hr at room temperature, followed by 3–5 more 20 min washes in 0.1 % Triton X-100. Finally, sections were stained with DAPI (ThermoFisher #D1306) at room temperature for 20–30 min and mounted on slides with ProLong Gold Antifade Mountant (ThermoFisher #P36934). Primary antibodies included chicken anti-GFP (ThermoFisher #A10262, 1:1000) and rat anti-Ctip2 (Abcam #ab18465, 1:500), and secondary antibodies included a Donkey anti-chicken AlexaFluor-488 (Jackson ImmunoResearch #703-545-155, 1:500) and a Donkey anti-rat AlexaFluor-647 (Jackson ImmunoResearch #712-605-153, 1:500 or 1:250).

For immuno-fluorescent co-staining of EGFP and Lhx6, a slightly different staining protocol was used. Thirty µm coronal brain sections were first mounted onto microscope slides and then incubated at 37 °C for 30 min. Next, the slides were treated to a series of steps to perform antigen retrieval to detect the Lhx6 antigen. These included 8 min shaking in a base solution of 0.5 g NaOH in 50 mL water, followed by 8 min shaking in 0.3 % glycine in 1 x PBS, and finally 8 min shaking in 0.3 % SDS in 1x PBS. The slides were then incubated in a blocking solution of 5 % BSA and 0.3 % Triton X-100 in 1x PBS for 1 h, followed by primary antibody incubation overnight at 4 °C. The next day, slides were washed in 0.3 % Triton X-100 in 1x PBS three times and then incubated in secondary antibodies for 1 h at room temperature. Slides were finally washed three more times and coverslips mounted with Vectashield anti-fade medium (Vector labs, H-1000). Primary antibodies included a mouse anti-Lhx6 (Santa Cruz Biotechnologies #sc-271443, 1:200) and a rabbit anti-GFP (ThermoFisher #A11122, 1:2000), and secondary antibodies included a goat anti-mouse AlexaFluor-647 (ThermoFisher #A32728, 1:300) and a goat anti-rabbit AlexaFluor-488 (ThermoFisher #A32731, 1:300).

Fluorescence microscopy and image analysis

Request a detailed protocol

Confocal images were acquired on a Zeiss LSM800 microscope using a 0.25 NA 5x air objective or a 0.8 NA 25x oil objective. Images were acquired using the same settings for laser intensity, PMT gain, and pixel dwell time. For whole-brain section images showing overall injection and virus expression, images were tiled across the section with a 10 % overlap and stitched in postprocessing using the Stitching plugin in Fiji64 (Preibisch et al., 2009). Brain sections co-labeled for Lhx6 were imaged using a Leica DM2000 fluorescent compound microscope with a mounted DFC3000 G monochrome camera. Images were acquired using a 10x objective which was sufficient to capture the entire neocortical region from white matter to pia. For all images, cells expressing EGFP or mRuby3 were counted and analyzed using Fiji (Schindelin et al., 2012). Images with co-labeling were then scored for EGFP and mRuby3 cells that were double-positive for the co-label. Example images shown have been processed with a 3 × 3 median filter in Fiji, background has been subtracted from each channel, and the contrast of each channel has been adjusted to the same levels for comparison across images.

Luciferase assay

Request a detailed protocol

We constructed luciferase reporter plasmids by replacing the minP promoter in pGL4.24 (Promega) with the Hsp1a minimal promoter using restriction enzymes HindIII and NcoI to make pGL4.24-HspMinP. We then cloned a ~900 bp region containing either Amplicon #256, a predicted negative (forward primer: 5’- ccggagctcTGGTATGGGTGAAAACGGCT-3’; reverse primer: 5’- cggctcgagGAGGTTTGTGGGGAGGAGTG-3’), or Amplicon #3 (forward primer: 5’- ttaattaaggtaccTCTAGAGCTCTCTTTTGAACTGC-3’; reverse primer: 5’- ttaattaactcgagATAATTCTCTTTGTGCAATGCTACA-3’) into the pGL4.24-HspMinP vector (Promega) upstream of HspMinP using restriction enzymes XhoI and either SacI or KpnI. Sequence validation of the plasmids allowed us to distinguish #3 reference and variant. HEK293 cells and SH-SY5Y cells were obtained from ATCC and cultured following standard protocols. No further validation or testing of cell line veracity or contamination was performed. When cultures were 40–60 % confluent, cells were transfected using Lipofectamine 3000 (Invitrogen) with each construct (400 ng) and the Renilla luciferase expression vector pRL-TK (40 ng; Promega) in triplicate. After 24 h, the luciferase activity in the cell lysates was determined using the Dual Luciferase Reporter System (Promega). Firefly luciferase activities were normalized to that of Renilla luciferase, and expression relative to the activity of the empty pGL4.24-HspMinP vector was noted.

Data availability

All supplementary information, including links to raw and processed data, can be found at the Nord Lab Resources page (https://nordlab.faculty.ucdavis.edu/resources/). Software can be found at the Nord Lab Git Repository (https://github.com/NordNeurogenomicsLab/) and https://github.com/NordNeurogenomicsLab/Publications/tree/master/Lambert_eLIFE_2021,(copy archived at swh:1:rev:5f884418d28e2baba09b91efc264b294a246f6a3). Sequencing data have been deposited in GEO under accession code GSE172058.

The following data sets were generated
The following previously published data sets were used
    1. Fullard JF
    2. Hauberg ME
    3. Bendl J
    4. Roussos P
    (2018) NCBI Gene Expression Omnibus
    ID GSE96949. An Atlas of Chromatin Accessibility in the Adult Human Brain.
    1. Vierstra J
    2. Lazar J
    3. Sandstrom R
    (2020) N/A
    ID resources. Global reference mapping of human transcription factor footprints.

References

  1. Book
    1. Broussard GJ
    2. Unger EK
    3. Liang R
    4. McGrew BP
    5. Tian L
    (2018) Imaging glutamate with genetically encoded fluorescent sensors
    In: Parrot S, Denoroy L, editors. In Biochemical Approaches for Glutamatergic Neurotransmission. Springer. pp. 117–153.
    https://doi.org/10.1007/978-1-4939-7228-9_5
  2. Software
    1. R Development Core Team
    (2018) R: A language and environment for statistical computing
    R Foundation for Statistical Computing, Vienna, Austria.
    1. Rizopoulos D
    (2006)
    Ltm: An R package for latent variable modeling and item response theory analyses
    Journal of Statistical Software 17:i05.
    1. van der Harst P
    2. Zhang W
    3. Mateo Leach I
    4. Rendon A
    5. Verweij N
    6. Sehmi J
    7. Paul DS
    8. Elling U
    9. Allayee H
    10. Li X
    11. Radhakrishnan A
    12. Tan S-T
    13. Voss K
    14. Weichenberger CX
    15. Albers CA
    16. Al-Hussani A
    17. Asselbergs FW
    18. Ciullo M
    19. Danjou F
    20. Dina C
    21. Esko T
    22. Evans DM
    23. Franke L
    24. Gögele M
    25. Hartiala J
    26. Hersch M
    27. Holm H
    28. Hottenga J-J
    29. Kanoni S
    30. Kleber ME
    31. Lagou V
    32. Langenberg C
    33. Lopez LM
    34. Lyytikäinen L-P
    35. Melander O
    36. Murgia F
    37. Nolte IM
    38. O’Reilly PF
    39. Padmanabhan S
    40. Parsa A
    41. Pirastu N
    42. Porcu E
    43. Portas L
    44. Prokopenko I
    45. Ried JS
    46. Shin S-Y
    47. Tang CS
    48. Teumer A
    49. Traglia M
    50. Ulivi S
    51. Westra H-J
    52. Yang J
    53. Zhao JH
    54. Anni F
    55. Abdellaoui A
    56. Attwood A
    57. Balkau B
    58. Bandinelli S
    59. Bastardot F
    60. Benyamin B
    61. Boehm BO
    62. Cookson WO
    63. Das D
    64. de Bakker PIW
    65. de Boer RA
    66. de Geus EJC
    67. de Moor MH
    68. Dimitriou M
    69. Domingues FS
    70. Döring A
    71. Engström G
    72. Eyjolfsson GI
    73. Ferrucci L
    74. Fischer K
    75. Galanello R
    76. Garner SF
    77. Genser B
    78. Gibson QD
    79. Girotto G
    80. Gudbjartsson DF
    81. Harris SE
    82. Hartikainen A-L
    83. Hastie CE
    84. Hedblad B
    85. Illig T
    86. Jolley J
    87. Kähönen M
    88. Kema IP
    89. Kemp JP
    90. Liang L
    91. Lloyd-Jones H
    92. Loos RJF
    93. Meacham S
    94. Medland SE
    95. Meisinger C
    96. Memari Y
    97. Mihailov E
    98. Miller K
    99. Moffatt MF
    100. Nauck M
    101. Novatchkova M
    102. Nutile T
    103. Olafsson I
    104. Onundarson PT
    105. Parracciani D
    106. Penninx BW
    107. Perseu L
    108. Piga A
    109. Pistis G
    110. Pouta A
    111. Puc U
    112. Raitakari O
    113. Ring SM
    114. Robino A
    115. Ruggiero D
    116. Ruokonen A
    117. Saint-Pierre A
    118. Sala C
    119. Salumets A
    120. Sambrook J
    121. Schepers H
    122. Schmidt CO
    123. Silljé HHW
    124. Sladek R
    125. Smit JH
    126. Starr JM
    127. Stephens J
    128. Sulem P
    129. Tanaka T
    130. Thorsteinsdottir U
    131. Tragante V
    132. van Gilst WH
    133. van Pelt LJ
    134. van Veldhuisen DJ
    135. Völker U
    136. Whitfield JB
    137. Willemsen G
    138. Winkelmann BR
    139. Wirnsberger G
    140. Algra A
    141. Cucca F
    142. d’Adamo AP
    143. Danesh J
    144. Deary IJ
    145. Dominiczak AF
    146. Elliott P
    147. Fortina P
    148. Froguel P
    149. Gasparini P
    150. Greinacher A
    151. Hazen SL
    152. Jarvelin M-R
    153. Khaw KT
    154. Lehtimäki T
    155. Maerz W
    156. Martin NG
    157. Metspalu A
    158. Mitchell BD
    159. Montgomery GW
    160. Moore C
    161. Navis G
    162. Pirastu M
    163. Pramstaller PP
    164. Ramirez-Solis R
    165. Schadt E
    166. Scott J
    167. Shuldiner AR
    168. Smith GD
    169. Smith JG
    170. Snieder H
    171. Sorice R
    172. Spector TD
    173. Stefansson K
    174. Stumvoll M
    175. Tang WHW
    176. Toniolo D
    177. Tönjes A
    178. Visscher PM
    179. Vollenweider P
    180. Wareham NJ
    181. Wolffenbuttel BHR
    182. Boomsma DI
    183. Beckmann JS
    184. Dedoussis GV
    185. Deloukas P
    186. Ferreira MA
    187. Sanna S
    188. Uda M
    189. Hicks AA
    190. Penninger JM
    191. Gieger C
    192. Kooner JS
    193. Ouwehand WH
    194. Soranzo N
    195. Chambers JC
    (2012) Seventy-five genetic loci influencing the human red blood cell
    Nature 492:369–375.
    https://doi.org/10.1038/nature11677

Decision letter

  1. Genevieve Konopka
    Reviewing Editor; University of Texas Southwestern Medical Center, United States
  2. Kathryn Song Eng Cheah
    Senior Editor; The University of Hong Kong, Hong Kong
  3. Stefan Barakat
    Reviewer
  4. Joseph D Dougherty
    Reviewer; Washington University in St. Louishington University, United States

In the interests of transparency, eLife publishes the most substantive revision requests and the accompanying author responses.

Acceptance summary:

This study identifies functional enhancers in vivo in the postnatal mouse brain using massively parallel reporter assays. The authors also carry out a number of validation assays to support their findings and show how the approach can be generalizable to other questions.

Decision letter after peer review:

Thank you for submitting your article "Parallel functional testing identifies enhancers active in early postnatal mouse brain" for consideration by eLife. Your article has been reviewed by 3 peer reviewers, and the evaluation has been overseen by a Reviewing Editor and Kathryn Cheah as the Senior Editor. The following individuals involved in review of your submission have agreed to reveal their identity: Stefan Barakat (Reviewer #1); Joseph D Dougherty (Reviewer #2).

The reviewers have discussed their reviews with one another, and the Reviewing Editor has drafted this to help you prepare a revised submission.

Essential revisions:

Please either provide additional experimentation to address these two points or at least discuss why they should be done but you have not provided that data in this manuscript.

1) Additional negative controls.

2) Assessment of CACNA1C variant activity.

Reviewer #1 (Recommendations for the authors):

The authors perform an in vivo screening to identify functional enhancers in early postnatal mouse cortex by means of a massively parallel reporter assay (MPRA). To this end they designed an MPRA library containing 408 human DNA sequences with potential regulatory activity in the brain and choose an AAV vector for delivery of the reporter library. With this approach 41 amplicons with regulatory activity were identified, including the ones located in the disease-associated third intron of CACNA1C. By performing miniMPRA screening of orthologous mouse sequences, the authors confirm the reproducibility of the approach and provide evidence for cross-species conservation of regulatory elements' function. In a series of confirmation experiments the authors validate the activity of a small number of individual amplicons and, also show that the cell type specificity of a small number of investigated enhancers is preserved in this reporter assay. The described methodology might be instrumental for further in vivo studies of regulatory elements.

Strengths:

The study successfully implements a MPRA in an in vivo system and highlights considerations for the experimental design.

Weaknesses:

As also noted by the authors in the discussion, the approach does not yet allow to compare the activity of sequence variants in this assay, which will be crucial for understanding the effects of variants on regulatory function. The number of regions assessed is relatively small, and this might be a bottleneck to apply this approach to study regulatory elements at large scale.

I would like to congratulate Nord and colleagues for an interesting application of STARR-seq in vivo. I think the manuscript is definitively interesting, but sometimes I feel the statements are a bit overrated, and I would suggest that the authors tone down a bit on some of their claims. Also, there could be a few experiments added to further strengthen the quality of the paper.

My specific points, in order of appearance in the manuscript:

– If I understand correctly, the authors start off with 408 regions of interest that they start to clone, but then in line 108 they say that they could upon batch cloning, only verify the presence of 345 of these regions. Why is that? Just failed PCRs for the remaining 63 regions? Is this introducing any bias that should be explained perhaps? Perhaps also good to keep in perspective that 408 regions of interest is rather small, and finding 41 of them as active is not a lot (also given that there were several layers of pre-selection for putative regulatory regions to reach to the 408 regions); Just in general, one might think how many enhancers are required, to put on the label "massively parallel reporter assay"; a question that perhaps also remains unanswered in the manuscript is how likely the author find it that such an approach can be further up-scaled, to investigate the many millions of putative enhancers or GWAS SNPs that are there for human brain?

– In line 94-99, the authors describe the different groups from which their regions of interest were selected. Amongst this, DNaseI hypersensitivity sites were chosen. In line 170, the authors then mention that amongst the region they found active, there was a significant enrichment for loci characterized by DNaseI hypersensitivity. How much of that is explained by the selection bias of selecting those regions in the first place? Same for the putative enhancer group that was used; I assume the reason that these enhancers were putative enhancers in the first place, was their epigenome landscape. So is significant enrichment for H3K4me1/3 then really surprising? The authors should comment on that potential bias, and explain whether their statistics correct for this. Especially given the weight that the authors put on these findings in their conclusive sentence in line 184-186. Some of this information might be in Figure 2C, but is not really clear to me.

– In the section "Confirmation of in vivo P7 cortex MPRA enhancer results" the authors do a couple of validation experiments, to show that their STARR-seq active enhancers show cell type specific enhancer activity. Although important, I feel that the number of enhancers studied in here is very small, so I am not sure that the data will allow to justify broadly extrapolated conclusions from this handful tested sequences. Preferably, the authors should extend this analysis, including more enhancer loci, and perhaps also include experiments in which they test the orientation of the enhancer insertion into the STARR-seq plasmid. E.g., does it affect the activity and cell type specific expression if an enhancer is cloned in the sense or antisense orientation in the STARR-seq cassette? This would also be important to know if one would apply this assay at a larger scale (e.g., testing millions of sequences at ones, in which bulk cloning would probably cause different enhancer orientations in the library). In a way, it doesn't seem so surprising that a reporter assay that has already been validated in many studies (e.g. STARR-seq), seems to behave as a bona fide reporter construct (e.g., if the authors were to clone their validation enhancers in a lacZ reporter, they would probably see the same). So despite being a good control, it is also not such a "special" finding, so I think looking at the sense or antisense orientation might be more impact full?

– In the discussion (line 373-374), the authors say that their assay was not designed to directly compare sequence variants for activity; I don't see why? Would that not be something to include, for example for the CACNA1C locus, to test activity of variants in some of the SNPs in the enhancer regions they found? By including such an experiments, I think the author could increase the impact of their study, and this could be "the icing on the cake", which I feel in the current manuscript is still a bit lacking to some extent.

– Technical question: in line 494-495 the authors describes the STARR-seq library prep. In the original STARR-seq protocol from Stark and colleagues, and also later used by us and others, there is a two-step PCR protocol, with the first set of primers spanning an intron in the STARR-seq plasmid (thereby repressing any potential remaining plasmid DNA contamination) followed by a second round of nested PCR to amplify the amplicons themselves; both PCRs together have <30 cycles. In this paper, the authors use a single set of primers, and amplify both RNA and DNA samples 30x; why is that? Did they modify the original STARR-seq cassette in their adenoviral application, to no longer contain that intron design? What is the reason behind that? Could this be a reason for only a modest number of regions where they find an increased RNA/plasmid ratio? Please explain.

Reviewer #2 (Recommendations for the authors):

Summary: In this paper the authors develop a high throughput assay for in vivo enhancer activity, adapting an MPRA/STARR-seq AAV approach. They use ~900 bp candidate enhancers, selected from the human genome with 4 different strategies for selection, and successfully screen ~300 of these, including some from psychiatric disease loci. They then validate a handful in single construct injections. The major claims are (1 ) that they developed a 'self transcribing regulatory element MPRA strategy' for mouse forebrain. and (2) and validated enhancers that worked in mouse forebrain, including one from an intronic CACNA1C block that is in linkage with disease associated regions.

Major Strengths and Weaknesses: The strengths are that they were able to successfully screen a much larger number of enhancers in a single experiment than prior approaches, and that they included careful first-pass validation of some of these by microscopy in independent animals. The particular approach chosen, PCR cloning of amplicons, also allowed for longer elements to be studied than some of the oligonucleotide based libraries. I also thought the step during validation using co-injection of the test construct with GFP with the positive control in RFP was a very elegant way to control for locus of injection. However, there are some places where the next study building on this one might be I improved; drawbacks to the approach were (1) the lack of a good set of null sequences/negative controls in the MPRA that would have helped define what basal activity looks like. This lack made the analysis steps a bit more challenging, though their relative ranking of most active to least active elements is still likely accurate. (2) As admitted by the authors, the embedding of the enhancers in the UTR can risk conflating posttranscriptional effects of the elements (e.g. Poly A signals, miRNA binding sites) with enhancer effects (though this likely does not explain the activity of their strongest, validated elements). (3) There will be room for improvement in measurement of the enhancer activity by RNA sequencing in the assays, as correlation between biological replicates was not ideal (.54-.842). The noise in the assay can likely be overcome with additional replicates or additional improvements in library delivery or recovery steps. Such improvements might also allow for an increase in the number of elements assessed in parallel.

Likely impact: I think the study demonstrates feasibility for assaying hundreds of assays in parallel for activity in the brain (and similar approaches could be taken for other tissues), and provides a good foundation for future improvements to similar approaches. As it stands, it should be useful for screening more genomic loci for the most active candidate sequences. It also validates a handful of enhancers that do have function, including some in disease associated loci. These validated enhancers are potentially useful tools: understanding the cell types they express would provide some hints as to the cell types important for the disease, and having defined them also provides an opportunity to study the impact of human risk alleles on the function of these specific enhancers.

I have some concerns on the statistical approach I would like to see clarified, but I would say it is likely these are addressable. I also have some suggestions to improve flow and impact.

Concerns:

The MPRA design would have certainly benefited from a larger number of negative controls or blanks to help define which elements had enhancer activity. Something similar to https://www.pnas.org/content/110/29/11952/tab-article-info where they used scrambled sequences to understand what random genomic sequences look like, then defined enhancers based on an activity that was higher than random. That being said, the analysis they conducted does allow them to identify the sequences with the most and least relative activity. So it is probably of use even without this control if it is not feasible to add.

I could use clarification on the statistical approaches though, to be able to fully evaluate them. Use of residuals is not unheard of, but it is a bit unusual and I have a few questions I would like clarified. Specifically:

1) I am confused on the linear model as to why they authors included GC content. If, as per line ~137, no further GC bias arose after cloning, then if the model already accounts for DNA amplicon count, then the remaining effect of GC on RNA might be biological rather than technical. Perhaps high GC content correlates with more enhancer activity or RNA stability? This has certainly been seen in other studies. The authors could rerun the model without the term for GC and see how similar the results are, and see which model better predicts their existing follow up data or some of the genomic enrichment assays.

2) Overall, the authors have 2-4 different statistical approaches to picking the hits from the data they tried. Fortunately, these correspond to each other fairly well. Still, it might be helpful to use some kind of benchmarks (e.g., enrichment for overlap with the epigenomic datasets perhaps?) to determine which ought to be the primary analysis that is presented.

3) Line 539 I could also use a more clear explanation (or reference) to explain how they convert the Z-score of the residual into a p-value.

4) And, when conducting the Wilcoxon rank sum, what are they comparing each enhancer to?

5) In the end, regardless of which statistical method they lead with, they ought to make sure to also correct for multiple testing. It is not clear this was done here.

Suggestions to improve the impact:

The library is actually made of 4 sub-libraries. Are there any differences in activity between the four sub libraries? Is there anything we can learn about the relationship of LD SNPs to function, or the efficicacy of epigenetic marks at predicting function by comparing the 'hit rates' between these 4 libraries?

It is also unclear if they ended up with multiple alleles in the cloning. They cloned from a pool of DNA, so presumably different haplotypes were present. If so, were they powered to look at the allelic effects of in the functional enhancers? Or assess any in follow up?

Finally, it would bolster their second claim a bit if they did some more in depth characterization of the enhancers they validated. Are they expressed only in neurons, or neurons and glia? Are they specific to a subpopulation of neurons (at least excitatory vs. inhibitory?)? Recent comparable AAV/enhancer screens have done more in this regard, and it could be a nice deliverable of this paper if one of these enhancers happens to target a useful subpopulation. Or if it helped us better understand the regulation of some of the psychiatric disease genes. That would improve impact and it seems they have the reagents in hand to do these studies.

Suggestions on writing:

Regarding figure 1B – are the elements with sig negative residuals thought to be repressors? It might be worth discussing. Are they enriched in any particular marks?

It is good the authors discussed some limitations to cloning candidate enhancers into the 3'-UTR position, as they might have some effects on the RNA level via post-transcriptional regulation. Also, sequences containing elements that would have obvious negative consequences (e.g., poly A signals) on the assay might need to be filtered out, especially when looking at repressive sequences.

To my fresh eyes, it sometimes feels experiments are a bit out of order. For example, some of the controls showing the method should work come after the experiment rather than before it. Specifically, it is just a suggestion, but some of the experiments like those establishing that DLX works regardless of whether it is in 3'-UTR of 5' to the promoter, or perhaps the miniMPRA, might make more sense presented before the main MPRA comes up?

I'd like to see a histogram of the count number of the elements in the DNA library. I am just curious about the range between the best and least cloned elements. Presumably the least well cloned elements are also the 25 % that were filtered out? Also, perhaps some scatter plot of DNA read depth vs. coefficient of variation (with color coding of the 25 % filtered out) might make a nice supplement, and provide a benchmark for future improvements to the method.

Reviewer #3 (Recommendations for the authors):

The article 'Parallel functional testing identifies enhancer active in early postnatal mouse brain' uses an adeno-associated virus (AAV) based high throughput approach (massively parallel reporter assay (MPRA)) to test the regulatory capacity of candidate enhancer sequences in early postnatal mouse brain. To ascertain the reproducibility of enhancer activity across MPRA studies, the authors tested four positive enhancers, two negative sequences, and orthologous mouse sequences via miniMPRA. The results these assays suggest that consistent results can be achieved through in vivo MPRA. Finally, the authors dissect regulatory elements within the CACNA1C intron that have been previously associated with disease (schizophrenia), determining that in vivo MPRA could be an efficient tool for assessing neural disease-associated regions. This paper would be of interest to the broad readership of eLife, primarily due to the use of an in vivo AAV-based brain MPRA and characterization of disease-associated regions. However, several point need to be addressed.

– A lot more information is needed about how sequences were chosen for the MPRA. The authors tested 408 sequences from four groups. How many from each group? Was there other filters to get to 408 for each group and if so what were they? Why did they shoot for this number and not more? Are there technical reasons for this? Was there overlap between the four groups and if so what was it? How did the authors choose candidate sequences in the GWAS and SNP groups? For those (GWAS and SNP) how many overlap predictive enhancer marks?

Would also mention the problem of choosing lead SNPs in text, i.e. these may not be causative SNPs, as they are just the SNPs on the genotyping array.

Would add to Figure 1A, some table/Venn diagram or other that shows the numbers of sequences that were tested for each group to make it easier for the reader to understand what was assayed.

Would mention that the fourth group was intended as a 'potential positive control', correct me if I'm wrong here.

No negative controls were tested. This is vital in most MPRAs to compare to in order to assess activity. This needs to be mentioned and discussed in detail.

– SNPs: The sequences were cloned from 'pooled human DNA'. More info on how many individuals these are, ethnicity etc. is needed here, especially as different alleles could have different function. If the applicants have variant data from their sequenced library that would be great to add. This is even more lacking in the individual testing, in particular for the CACNA1C region, where variant might affect activity. Those definitely need to have info on the haplotype actually tested d.

– The R in the correlation between RNA replicates is poor, just a little above 0.5 in some comparisons. It also appears between viral DNA and GC content. The authors need to explain this and also provide potential causes for the reproducibility of individual assays and PCR stochasticity in the library preparation and cDNA.

– Were samples 4-35 included in the subsequent studies or not? If not, what was their performance, considering the influences of the cycles in preparation. How did the authors de-duplicate reads in line 119? Were UMIs used in the experiment and if so, more info on them is needed.

– In the miniMPRA section line 188-200 it is not clear what sequences they chose? How many? What was the criteria for selecting the sequences? A brief explanation of the sequences in the main text will help the reader understand the experimental set up better. Also, more info on stats in the main text comparing between the MPRA is needed.

– Overall only a small N of sequences was tested individually, so I would reduce the tone that the individual assays validated the MPRA and make it more that these individual ones validated.

– More detailed info on the brain expression of the individual constructs is needed.

– For the CACNA1C regions, other than reporting which haplotype was tested, it would significantly improve the manuscript if both the unassociated and associated haplotype were tested to assay whether they lead to alternate function.

https://doi.org/10.7554/eLife.69479.sa1

Author response

Essential revisions:

Please either provide additional experimentation to address these two points or at least discuss why they should be done but you have not provided that data in this manuscript.

1) Additional negative controls.

We apologize that our manuscript and analysis failed to convey that we did actually include sequences that were presumed to be negative, which made up 59 % of the tested sequences. We updated text, analyses, and figures to clarify our library design and inclusion of these presumed negatives, which were human genome regions that did not harbor predictive signatures of potential brain enhancer activity. We now show in the figures the distribution of putative positive and negative amplicons across the four amplicon ascertainment groups that made up our library, and we show that these presumed negatives indeed had a lower rate of MPRA activity (see updates to Figure 2). We also clarify that our miniMPRA included negatives as well and that these also had lower MPRA activity compared to predicted positives (updates to Figure 1). Finally, in the original submission we also validated two amplicons that were identified as MPRA negative, showing that these sequences did not drive GFP in P7 mouse brain, in contrast to positives that did drive GFP (Figure 4 and 5).

Discussing this in more detail, there were two points that we needed to communicate more clearly, and we have now done so in the updated manuscript via revising text and Figure 2. First, the miniMPRA and the putative enhancer (“PutEnh”) group of the full MPRA included 4 negative control amplicons selected based on absence of H3K27ac in mouse brain. Second, and more relevant, our full library design did include a substantial proportion (59 %) that were putative “negative control or blanks” to enable comparison with amplicons that had some evidence for brain enhancer activity. These presumed negatives were included in the GWAS and LD amplicons selected based on their linkage to GWAS-associated SNPs. These amplicon sets contrast to our “FBDHS” putative positive amplicons, selected to be enriched with fetal brain DNAse hypersensitive signal (FBDHS). 64 % FBDHS set amplicons intersect with DHS peaks (those that didn’t had sub-significant signal enrichment), compared to 19 % and 13 % of the GWAS and LD amplicons, respectively. While not used for ascertainment, we also intersected our amplicons with a post-mortem ATAC-seq dataset from human neurons. We found that 25 % FBDHS amplicons intersect with adult neuron ATAC-seq, compared to 11 % of each GWAS and LD amplicon group. Thus, we used the LD and GWAS amplicons with no epigenomic or sequence signatures suggestive of activity as our presumed negative set.

Our MPRA library composition is explained now in Figure 2A , and the updated manuscript text now reads:

“We generated an MPRA library targeting 408 candidate human DNA sequences for testing, each ~900 bp in size, to assess functional enhancer activity in vivo in early postnatal mouse brain. […] Thus our final library included a balance of sequences that are expected to have no enhancer activity (59 %), as well as sequences with evidence for regulatory capacity in the brain (41 %).”

2) Assessment of CACNA1C variant activity.

We appreciate the general interest in testing allele-specific activity and the specific interest for the CACNA1C putative enhancers we identified. To address this issue, we include an all new analysis of SNP alleles in our experiments and we performed new luciferase experiments for one top candidate enhancer identified in the CACNA1C interval.

We note that we did perform allele-specific analysis in our MPRA, but omitted this in our original manuscript. We felt the allele-specific analysis of our experiments were not sufficiently powered to make firm conclusions and we did not want to distract from the more successful general investigation of enhancer activity in our MPRA study. However, we acknowledge that omitting the variant analysis completely would be likely to raise questions as to why it is not in the manuscript considering our design that included SNPs in the cloning. As a proposed middle ground, we describe our attempts to examine allele-specific differences in the results and discussion and show the results in a supplemental figure (Figure 2—figure supplement 4).

In this analysis, we show that our cloning strategy captured common SNP variants in our library and that variant allele frequency was highly correlated in the packaged AAV and in our MPRA DNA preps, though more variable in our RNA. We show how using 900bp amplicons in STARR-seq orientation with a tagmentation-based library preparation results in mostly reads that were not informative for the allele, and thus we had much reduced power for variant comparison. Despite this limitation, we include results comparing allele activity for the subset of six MPRA positive enhancers that harbored SNPs, finding that differences in allele frequencies were not significant. Finally, we performed new experiments for one positive amplicon that harbors disease-relevant SNPs, comparing the activity of the two alleles in luciferase assays and finding no differences in the two human cell lines tested. In our Discussion section, we added recommendations for improving our assay for assessing differences in enhancer activity of sequence variants in vivo.

We added the following to the Results section starting at line 370 of the updated manuscript to indicate this new analysis and our conclusions:

“Lack of strong effects and insufficient power for allele-specific activity analysis

Our MPRA library included a subset of amplicons harboring SNPs that were either lead SNPs from GWAS studies (121 amplicons), or SNPs in LD with lead SNPs within associated intervals at CACNA1C (21 amplicons) and SCN1A (62 amplicons). […] We note that despite strong DNA reference allele frequency correlation in the viral library DNA, limited allele coverage, unbalanced allele frequencies, and reduced correlation of RNA allele frequencies resulted in lack of power to detect moderate to small allelic differences in our experiments (Figure 2—figure supplement 4, Supplementary Table 7).”

We modified our Discussion to reflect this analysis as well, and we further discussed the limitations of our current study design and suggestions for future work (lines 601-614 of the updated manuscript):

“Although our design was not well powered to compare the activity of sequence variants (Figure 2—figure supplement 4), future studies using allele-balanced libraries, libraries with reduced complexity, or libraries with allele-specific barcodes could greatly improve the sensitivity and power of this assay to evaluate the effects of sequence variation on enhancer activity. […] For this reason, in vivo parallel functional assays will be critical in understanding how sequence variation within enhancers contributes to altered gene expression, including for the disease-associated CACNA1C locus.”

We have also added the following to the Methods section (lines 776-781 and 982-999):

“Allele-specific MPRA activity

Allele-specific counts were generated from deduplicated bam files using bam-readcount (https://github.com/genome/bam-readcount) with min-base-quality set to 15, and min-mapping-quality set to 20. Allelic counts were further processed using a custom R script, applying min base quality filter of 25, and min allele frequency filter of 0.01, producing a set of 440 biallelic SNPs.”

“Luciferase assay

We constructed luciferase reporter plasmids by replacing the minP promoter in pGL4.24 (Promega) with the Hsp1a minimal promoter using restriction enzymes HindIII and NcoI to make pGL4.24-HspMinP. […] Firefly luciferase activities were normalized to that of Renilla luciferase, and expression relative to the activity of the empty pGL4.24-HspMinP vector was noted.”

Reviewer #1 (Recommendations for the authors):

My specific points, in order of appearance in the manuscript:

– If I understand correctly, the authors start off with 408 regions of interest that they start to clone, but then in line 108 they say that they could upon batch cloning, only verify the presence of 345 of these regions. Why is that? Just failed PCRs for the remaining 63 regions? Is this introducing any bias that should be explained perhaps?

We attempted to clone the 408 sequences at once into our MPRA backbone using batch cloning. We considered amplicons as successfully cloned if they exceed a threshold of 200 counts in the pre-viral library, and 63 amplicons were excluded from further analysis due to batch cloning failure. This threshold was chosen because it separated the bimodal distribution of counts. Cloning failure for these sequences could have occurred at any of a number of steps, including PCR, Gibson assembly, bacterial transformation, or during clonal expansion in liquid culture. Most of this is likely due to stochastic success in batch cloning.

However, to address the reviewer question, we further investigated possible sources of cloning bias. We performed in silico PCR analysis using an adapted AmplifyDNA function from the DECIPHER R package. We found that two primer pairs did not have in silico products in this analysis, and as predicted, they do not have any reads aligning to their respective regions. Eight primer pairs produced more than one in silico PCR product, with lengths permitting packaging into an AAV vector. Four of these predicted nonspecific amplicons were excluded due to low RNA or DNA abundance. The remaining 4 were detected in our library and mapped to their intended loci (none of these were MPRA active amplicons). We did not find differences in GC content or predicted amplification efficiency between the successfully and unsuccessfully cloned amplicons. Although we did not observe a bias based on any obvious sequence features, we cannot rule out this possibility. However, we note that batch cloning factors do not impact downstream analysis. We provided a supplementary table with in silico PCR results and added this paragraph to the Methods section:

“To consider an amplicon as present in our MPRA library, we set a threshold of at least 200 read counts per amplicon in our library sample, resulting in a reduced dataset from 408 to 345 amplicons. This threshold was established based on the distribution of library counts. No GC or sequence bias was detected following in silico PCR analysis when present and missing amplicons were compared. Two primer pairs, expected to produce PCR products in our original design using the hg19 reference genome had no predicted products in our in-silico approach, as well as in our MPRA library aligned to the GRCh38 genome (Supplementary Table 9). In silico PCR analysis was performed using a custom R script and DECIPHER R package79.”

Perhaps also good to keep in perspective that 408 regions of interest is rather small, and finding 41 of them as active is not a lot (also given that there were several layers of pre-selection for putative regulatory regions to reach to the 408 regions); Just in general, one might think how many enhancers are required, to put on the label "massively parallel reporter assay"; a question that perhaps also remains unanswered in the manuscript is how likely the author find it that such an approach can be further up-scaled, to investigate the many millions of putative enhancers or GWAS SNPs that are there for human brain?

There are a couple points raised here, but the main ones that are not addressed in other reviewer points seems to be: what should the bar be for MPRA status regarding number of sequences assayed and a suggestion to better discuss possibility and limitations for scaling up. First, we do not have an opinion on where the bar should be for MPRA designation, but do feel that our approach is consistent with the concept (i.e. large numbers of candidates tested in parallel) and that use of another term or qualifier would just lead to more confusion. Our “hit rate” of 41/345 (~12 %) is generally in line with other MPRA enhancer studies. We note that we included a high proportion of putative negatives so “pre-selection” included both positive and negative signatures, as described in depth in other reviewer points. Nonetheless, we acknowledge that our study is on the smaller end of MPRAs, and we are open to other descriptors though our opinion is that using the term MPRA will be of greatest utility for literature searches. Second, there are very different considerations for parallel testing (i.e. MPRA) in the context of a heterogeneous mammalian tissue, made further challenging by methods for delivery that yield sparse cell uptake. We did updated our discussion to highlight these limitations.

– In line 94-99, the authors describe the different groups from which their regions of interest were selected. Amongst this, DNaseI hypersensitivity sites were chosen. In line 170, the authors then mention that amongst the region they found active, there was a significant enrichment for loci characterized by DNaseI hypersensitivity. How much of that is explained by the selection bias of selecting those regions in the first place? Same for the putative enhancer group that was used; I assume the reason that these enhancers were putative enhancers in the first place, was their epigenome landscape. So is significant enrichment for H3K4me1/3 then really surprising? The authors should comment on that potential bias, and explain whether their statistics correct for this. Especially given the weight that the authors put on these findings in their conclusive sentence in line 184-186. Some of this information might be in Figure 2C, but is not really clear to me.

We apologize that we failed to convey design considerations for sequences to include. This point is also relevant to essential revision #1, as our amplicon ascertainment groups each had a different set of criteria. To answer this specific point, only one group of our test sequences was selected based on Fetal brain DNAseI hypersensitivity (FBDHS) signal. This group (labeled FBDHS) made up 129/408 (31.6 %). We also included a small set of sequences that were screened in our miniMPRA, which included presumed positive/negatives based on presence/absence of mouse forebrain H3K27ac. The other two ascertained groups were selected based on overlap with lead SNPs (171/408, 41.9 %) or SNPs in two LD blocks (92/408, 22.5 %) from GWAS studies of schizophrenia and epilepsy. For these SNP-based amplicon sets, most do not overlap with epigenomic signatures indicating brain enhancer activity such as fetal brain DHS (81.3 % for GWAS; 87.0 % for LD) or adult neuron ATAC-seq (88.9 % for GWAS; 89.1 % for LD). This design enabled us to compare sequences that had FBDHS (and other relevant epigenomic datasets, e.g. H3K4me1, etc.) with those that did not have these signatures. Hopefully this response and our revised manuscript better explain our study design and why there is no bias in the MPRA versus epigenomic comparisons. We believe that the question of how MPRA-based activity compares to FBDHS and other enhancer signatures is fair and of great interest.

– In the section "Confirmation of in vivo P7 cortex MPRA enhancer results" the authors do a couple of validation experiments, to show that their STARR-seq active enhancers show cell type specific enhancer activity. Although important, I feel that the number of enhancers studied in here is very small, so I am not sure that the data will allow to justify broadly extrapolated conclusions from this handful tested sequences. Preferably, the authors should extend this analysis, including more enhancer loci, and perhaps also include experiments in which they test the orientation of the enhancer insertion into the STARR-seq plasmid. E.g., does it affect the activity and cell type specific expression if an enhancer is cloned in the sense or antisense orientation in the STARR-seq cassette? This would also be important to know if one would apply this assay at a larger scale (e.g., testing millions of sequences at ones, in which bulk cloning would probably cause different enhancer orientations in the library). In a way, it doesn't seem so surprising that a reporter assay that has already been validated in many studies (e.g. STARR-seq), seems to behave as a bona fide reporter construct (e.g., if the authors were to clone their validation enhancers in a lacZ reporter, they would probably see the same). So despite being a good control, it is also not such a "special" finding, so I think looking at the sense or antisense orientation might be more impact full?

We appreciate the general point here, but believe that further validation work examining more loci in different orientations is outside the scope of our study. Questions of STARR-seq performance and assay validity have been looked at by others in depth in cell lines, where it is much more tractable to do these types of experimental permutations. Indeed, there is expected to be some differences in assay response depending on sequence and orientation context. Regardless, published studies have shown the validity of STARR-seq overall and shown general concordance compared with other MPRA designs. Nonetheless, we agree that there remain questions of how MPRA results are impacted by construct design and experimental methods but argue that those questions are outside the focus and scope of our study.

We felt it most critical to address two basic items specifically relevant to in vivo activity in our validation experiments. The first, was whether cell-type specificity in vivo in brain was preserved in the STARR-seq orientation, which we did using a well characterized Dlx5/6 interneuron enhancer (now shown in Figure 1). The second was to show that our MPRA library results reproduce when sequences were tested individually for both positive and negative sequences (now shown in Figures 3 and 4). These findings are not presented as particularly special findings, just as experimental support for rigor and reproducibility. Significant effort went into these validation experiments. Each candidate sequence that is tested individually requires cloning, packaging, injection to neonatal mouse brain, tissue collection and processing, and image analysis. We feel that our study goes further towards validation experiments than many similar studies. We validated similar performance between orientations for one enhancer and we validated three positive and two negative predictions. Adding one or two more validation experiments to look at a small number of additional known or putative enhancers would require significant time and resources and does not seem like it would address this reviewer point that there are likely to be sequences that show more inter-assay differences or assay bias than the ones we tested.

We have updated the manuscript to limit the scope of our conclusions based on our validation experiments, and we have also moved our testing of 5’ vs. 3’ orientation to a new section titled “Developing and Validating AAV-MPRA strategy” where we describe our efforts to validate the appropriateness of the STARR-seq strategy to the in vivo mouse brain context before cloning the library (lines 85-178 in the updated manuscript). We also revised discussion text to frame these issues, describing the need to extend studies to characterization of different enhancer orientations and interactions with different minimal promoters.

The passage in the results referenced by Reviewer #1 in the above comment now reads:

“To validate enhancer activity in the mouse brain at P7, we next cloned individual amplicons for selected positive and negative hits from the MPRA into the same HspMinP-EGFP 3’-UTR oriented reporter and generated scAAV9 for each construct. […] These results demonstrate that our MPRA could accurately reflect enhancer activity of particular candidate sequences in vivo.”

– In the discussion (line 373-374), the authors say that their assay was not designed to directly compare sequence variants for activity; I don't see why? Would that not be something to include, for example for the CACNA1C locus, to test activity of variants in some of the SNPs in the enhancer regions they found? By including such an experiments, I think the author could increase the impact of their study, and this could be "the icing on the cake", which I feel in the current manuscript is still a bit lacking to some extent.

See our response in Essential Revision #2 for details. Briefly, we agree that allele-based enhancer activity differentiation is of high interest and is an exciting potential application of our MPRA methods. We indeed included alleles in our library which potentially could have displayed allele-specific differences in activity, and we explored this when analyzing our MPRA data. We note that our primary goal was to test for enhancer activity overall, with the hope that we would have power to compare alleles as well. We feel that our power for allelic comparison was not sufficient to highlight findings. However, we now include this analysis of allele composition and allele-based differences in enhancer activity and have added luciferase testing of one candidate amplicon comparing reference and variant alleles. These analyses and findings are now summarized in the text and in new Figure 2—figure supplement 4.

– Technical question: in line 494-495 the authors describes the STARR-seq library prep. In the original STARR-seq protocol from Stark and colleagues, and also later used by us and others, there is a two-step PCR protocol, with the first set of primers spanning an intron in the STARR-seq plasmid (thereby repressing any potential remaining plasmid DNA contamination) followed by a second round of nested PCR to amplify the amplicons themselves; both PCRs together have <30 cycles. In this paper, the authors use a single set of primers, and amplify both RNA and DNA samples 30x; why is that? Did they modify the original STARR-seq cassette in their adenoviral application, to no longer contain that intron design? What is the reason behind that? Could this be a reason for only a modest number of regions where they find an increased RNA/plasmid ratio? Please explain.

We did not include the intron in our vector design. Rather than use a nested PCR strategy to eliminate viral DNA contamination from the RNA samples, we treated these samples with DNase during the RNA purification procedure. We elected not to include the 130-230 bp chimeric intron in the vector design so that we could maximize the length of amplicons in our library (~900 bp). Eliminating the two-step PCR also enabled us to use a higher cycle count PCR program to amplify the target viral DNA and RNA sequences because the mouse brain samples included many cells that were not transduced by the library vector and therefore genomic DNA and cellular RNA represented the majority of molecules in our DNA and RNA preparations. We found we needed the higher cycle count to obtain DNA or cDNA amplicons with high enough concentration and purity for sequencing. We do not believe that viral DNA contamination of our RNA preparations explains the modest number of significant MPRA hits; as we have clarified above, the majority of our test library was expected to be negative for enhancer activity.

We have added specific details about how our vector backbone was designed and cloned in the updated manuscript:

“We replaced the CMV-EGFP expression cassette in scAAV-CMV-EGFP71 with a synthesized gBlocks Gene Fragment (IDT) containing the Hsp1a minimal promoter followed by the gene for EGFP with a cloning site inserted into the 3’-UTR. This cloning site consisted of two restriction sites flanked by two 35 bp sequences used for Gibson homology. The resulting plasmid, pscAAV-HspMinP-EGFP, housed the 3’-UTR-oriented enhancer reporter construct between the ITR sequences necessary for packaging scAAV particles (Figure 1—figure supplement 1). Unlike other STARR-seq constructs24,72,73, we did not include an intronic sequence in the reporter ORF, as we preferred to reduce construct length and complexity considering the novel implementation in an scAAV context and 900 bp amplicon sequence.”

Reviewer #2 (Recommendations for the authors):

I have some concerns on the statistical approach I would like to see clarified, but I would say it is likely these are addressable. I also have some suggestions to improve flow and impact.

Concerns:

The MPRA design would have certainly benefited from a larger number of negative controls or blanks to help define which elements had enhancer activity. Something similar to https://www.pnas.org/content/110/29/11952/tab-article-info where they used scrambled sequences to understand what random genomic sequences look like, then defined enhancers based on an activity that was higher than random. That being said, the analysis they conducted does allow them to identify the sequences with the most and least relative activity. So it is probably of use even without this control if it is not feasible to add.

See our Essential Revisions (1) for this point. Briefly, we did indeed include a presumed negative set, but obviously failed in communicating this in our original submission. We added a new figure panel and updated the text to address this.

I could use clarification on the statistical approaches though, to be able to fully evaluate them. Use of residuals is not unheard of, but it is a bit unusual and I have a few questions I would like clarified. Specifically:

1) I am confused on the linear model as to why they authors included GC content. If, as per line ~137, no further GC bias arose after cloning, then if the model already accounts for DNA amplicon count, then the remaining effect of GC on RNA might be biological rather than technical. Perhaps high GC content correlates with more enhancer activity or RNA stability? This has certainly been seen in other studies. The authors could rerun the model without the term for GC and see how similar the results are, and see which model better predicts their existing follow up data or some of the genomic enrichment assays.

We agree that GC content could theoretically impact enhancer activity or RNA stability, which was why we looked at it in the first place. On balance, we thought including the term had more value, as we were more concerned with bias arising from a GC impact that was not associated with enhancer activity (e.g. impact on RNA stability). That said, the inclusion or exclusion of a GC term in the regression model had very little impact on which amplicons were deemed statistically significant using model residuals.

As requested, we include a comparison of linear models constructed with and without the GC content, we found that the addition of the GC content is a significant covariate in the model (P < 1.42e-08), and inclusion increases the adjusted R2 from 0.7012 to 0.7369 and reduces the Bayesian information criteria (BIC) from 626.2976 to 599.1863. The P value of the GC content is 1.416e-08 (F = 34.446), whereas the P value for the log2 DNA proportions is < 2.2e-16 (F = 659.440), indicating that although statistically significant, the GC content term has minor effect on the model. We opted to leave GC in our modeling but include Figure 2–Supplementary Figure 3 panel G and H now, demonstrating a linear model with and without the GC content covariate.

“GC content had a significant impact, but low effect size, on the model as demonstrated by higher P values, 1.416e-08 vs < 2.2e-16, and lower F statistics, 34.446 vs 659.44, when log2 DNA proportions were compared with the GC content. Inclusion of the GC content was justified based on the reduction of Bayesian information criterion (BIC), from 626.2976 to 599.1863, for models without and with GC content, respectively.”

In the figure caption, we explained the justification for including the GC content in the model:

“(G) Linear model summary for estimate of background activity without the GC content covariate. (H) Linear model summary for estimate of background activity including the GC content covariate. Addition of the GC content covariate reduces the Bayesian information criteria (BIC) from 626 to 599 indicating that GC content is a valid covariate in the model.”

2) Overall, the authors have 2-4 different statistical approaches to picking the hits from the data they tried. Fortunately, these correspond to each other fairly well. Still, it might be helpful to use some kind of benchmarks (e.g., enrichment for overlap with the epigenomic datasets perhaps?) to determine which ought to be the primary analysis that is presented.

We apologize for any confusion here. Our intent was to increase statistical rigor via using multiple methods for determining MPRA activity. We note that we think the regression model residuals method is the strongest and most conservative (i.e. fewer positive calls), as this method effectively corrects for input DNA and makes no assumption of equal baseline representation in DNA and RNA libraries (which ratiometric comparison does). Nonethless, we also used the simple ratiometric method, as this method is more widespread in published MPRAs and enables inclusion of the biological replicates.

We consider the in vivo single-candidate validation of our MPRA to be the strongest validation of our approach, and we hoped that using multiple statistical approaches for MPRA activity calculation would make our analysis more transparent to readers coming from different backgrounds or with different familiarity with MPRA methods. As the reviewer notes, the two methods were overall highly concordant. We revised the section of the results to try to better explain our use of the two methods and our explanation that the resulting active amplicons were very similar.

Regarding benchmarking, the residual and ratiometric methods were similar enough and our MPRA library small enough that there wasn’t really anything learned by comparing statistical methods. That said, we did follow the advice to try benchmarking against the epigenomic signatures via including these in our regression model and seeing if the epigenomic evidence improves prediction of MPRA RNA levels. We used the full dataset of 308 amplicons that met the minimum read count thresholds. We then added epigenetic features, one at a time, in addition to the GC content covariate, and tested if the epigenetic covariate tests as significant in the model. The improvement of the model prediction was tested by assessing the reduction of the bayesian information criterion (BIC) upon the addition of an epigenomic mark.

The following fragment in the Results section addresses the predictive power of epigenomic covariates in our MPRA:

“We also examined each epigenetic mark as a co-variate of a general linear model, to see if it improved a regression model predicting cDNA levels considering all amplicons (N = 308). If cDNA levels in our MPRA are reflective of true enhancer activity, we would expect epigenomic signatures associated with enhancer elements to improve prediction of cDNA expression. Indeed, Fetal DNase hypersensitivity (P = 0.0002), H3K4me1 (P = 0.0002), adult neuron ATAC-seq (P = 0.0073), and fetal brain TF footprints (P = 0.04) were found to be significant, while signatures not relevant to enhancer activity in brain were not (Supplementary Table 5 and 6). Fetal brain DNase hypersensitivity, H3K4me1, and human neuron ATAC-seq improved the regression model (reduced BIC), and fetal brain DNase hypersensitivity and H3K4me1 signatures had the strongest predictive power of amplicon activity (0.2 and 0.178 point-biserial correlations, respectively).”

Supplementary Table 5 includes the details of this analysis.

Methods section has been updated as well:

“The predictive power of epigenomic signatures for MPRA activity was tested by adding the dichotomous Boolean variables of epigenomic intersection to a general linear model (GLM) predicting the level of RNA from DNA, and a GC content covariate, one epigenomic intersection at a time using the R stats glm() function. T test P values were extracted from model summary, and models with and without the epigenomic covariates were compared using BIC. In addition to the GLM, we calculated point-biserial correlations between the Z-scaled residuals from our original lm model, and each epigenomic signature intersection using the biserial.cor() function from the ltm R package82.”

3) Line 539 I could also use a more clear explanation (or reference) to explain how they convert the Z-score of the residual into a p-value.

Z-score and resulting P-values were generated using standard approaches where residual values were centered and scaled using mean and standard deviation, and then p-values were calculated using an assumed normal distribution. We have clarified our methodology in the updated manuscript by adding the underlined fragments and a citation:

“The model was applied to the complete dataset, and one-tailed p-values were calculated from the Z-scaled distribution of residuals, using the normal distribution implemented using R stats pnorm()80.”

4) And, when conducting the Wilcoxon rank sum, what are they comparing each enhancer to?

We compared the proportion reads in RNA libraries in each biological replicate to the proportion reads in DNA for each library using the standard rank sum test to compare these distributions. We added multiple testing correction as suggested by the reviewer. We clarified this approach in the updated manuscript. This section (lines 825-828) now reads:

“We also compared our linear model with a Wilcoxon rank sum, a non-parametric test comparing DNA vs RNA proportions, using Benjamini-Hochberg FDR < 0.05 for multiple testing correction.”

5) In the end, regardless of which statistical method they lead with, they ought to make sure to also correct for multiple testing. It is not clear this was done here.

We would like to thank Reviewer #2 for bringing up this concern. We now include multiple testing corrections, using an empirical tail area-based FDR for the residual model method and Benjamini-Hochberg FDR for the ratiometric wilcoxon tests. See results text and figures for inclusion of FDR values. We updated our methods with the following text:

“Empirical tail area-based FDR q values, estimated from Z scores, were calculated using fdrtool R package81, with threshold adjustment for one-tail testing.”

Suggestions to improve the impact:

The library is actually made of 4 sub-libraries. Are there any differences in activity between the four sub libraries?

As expected, the four amplicon groups in our MPRA library (FBDHS, GWAS, LD, PutEnh) differ in average amplicon activity, with the highest values reported for the PutEnh group, followed by the FBDHS group. In contrast, the amplicons ascertained just on presence of relevant SNP had lower activity. The effect of the amplicon group on the activity is statistically significant (One-way ANOVA, P = 0.006), and persists after excluding the PutEnh group (One-way ANOVA, P = 0.011).

We added the following sentence to emphasize the effect of amplicon group:

“Amplicon group was confirmed to be a significant predictor of mean amplicon activity (One-way ANOVA, P = 0.006), with the highest activity reported for the PutEnh and FBDHS groups.”

Is there anything we can learn about the relationship of LD SNPs to function, or the efficicacy of epigenetic marks at predicting function by comparing the 'hit rates' between these 4 libraries?

We respectfully direct Reviewer #2 to our response to Essential Revision #1 and #2, and Figure 2 Supplement 4 explaining how SNPs and allelic analysis were handled. See also above for our analysis on the relationship between epigenetic marks and MPRA activity.

It is also unclear if they ended up with multiple alleles in the cloning. They cloned from a pool of DNA, so presumably different haplotypes were present. If so, were they powered to look at the allelic effects of in the functional enhancers? Or assess any in follow up?

Please see our response to Essential Revisions #2. Briefly, in the updated manuscript we have added this new allelic analysis, discussed the limitations of our current study for determining allele-based enhancer activity differences, and presented suggestions for future library designs that may address these limitations (Figure 2—figure supplement 4, lines 370-386, and 601-614 of the updated manuscript). Note that we agree that this is an important future direction for this work, and we regret that our current study design was not well powered to address it.

Finally, it would bolster their second claim a bit if they did some more in depth characterization of the enhancers they validated. Are they expressed only in neurons, or neurons and glia? Are they specific to a subpopulation of neurons (at least excitatory vs. inhibitory?)? Recent comparable AAV/enhancer screens have done more in this regard, and it could be a nice deliverable of this paper if one of these enhancers happens to target a useful subpopulation. Or if it helped us better understand the regulation of some of the psychiatric disease genes. That would improve impact and it seems they have the reagents in hand to do these studies.

We agree that cell type specific characterization of enhancer activity is of significant interest, particularly to find new drivers or resolve disease-variant specificity. However, this work requires substantial experiments and further improvements in MPRA deployment in vivo in brain. As such, while of great interest, these types of studies are outside the scope of this manuscript, though we are working on these questions both at the library and single enhancer level. We note that there have been some beautiful studies recently published in these areas that we reference, and these publications represent years of work and still face issues with library complexity, relationship between epigenomic prediction to enhancer activity, and cell-type and brain region specificity.

Suggestions on writing:

Regarding figure 1B – are the elements with sig negative residuals thought to be repressors? It might be worth discussing. Are they enriched in any particular marks?

It is possible that these elements are repressors, but we hesitate to draw that conclusion because of our particular study design, which inserts a relatively long sequence into the 3’-UTR and could conceivably lead to higher rates of mRNA degradation. We have expanded on this in our Discussion, which now reads:

“We detected a number of amplicons that had reduced RNA compared to expected background levels, which may represent sequences with silencer or repressor activity such as those that have been reported in recent MPRA screens68. However, because of both the insertion of these amplicons into the 3’-UTR and the length of the amplicons, the modified 3’-UTR may have caused the transcripts to be subject to RNA degradation. Thus, we are hesitant to draw conclusions about these amplicons with lower than expected activity without further characterization. Nevertheless, while artifacts from inserting sequences into the 3’-UTR may impact the STARR-seq design and our results, we show that orthologous mouse sequences exhibited correlated presence and absence of activity and similarly that amplicons with absence or presence of MPRA activity across replicates consistently reproduced in independent single deliveries to the brain.”

It is good the authors discussed some limitations to cloning candidate enhancers into the 3'-UTR position, as they might have some effects on the RNA level via post-transcriptional regulation. Also, sequences containing elements that would have obvious negative consequences (e.g., poly A signals) on the assay might need to be filtered out, especially when looking at repressive sequences.

We followed a few lines of analysis of sequence motifs in our amplicons, but we feel that we are vastly underpowered to look at motif analysis within the 900 bp amplicons considering the small number of high (e.g. presumed enhancers) and low (e.g. potential repressors) activity amplicons we found, and so do not report in the manuscript. We describe here to address the reviewer question.

First, we divided our data set into low, medium, and highly active amplicons and performed motif enrichment analysis using HOMER. However, no motifs were singularly associated with any activity group or were present across most amplicons in the set.

We also looked for the presence of the canonical polyA signal AATAAA in our amplicon sequences. 167/308 of our amplicons include this sequence, including 27/41 of the significantly active amplicons. We did not find a significant association between the presence of a polyA signal and the significance of amplicon activity (See distribution of PolyA-containing amplicons in Author response image 1, p-value = 0.1302, Fisher’s exact test, two-sided), and it was not a significant covariate when included in the linear model either together with the GC content (p = 0.76475), or without the GC content covariate (p = 0.0608).

Author response image 1

To my fresh eyes, it sometimes feels experiments are a bit out of order. For example, some of the controls showing the method should work come after the experiment rather than before it. Specifically, it is just a suggestion, but some of the experiments like those establishing that DLX works regardless of whether it is in 3'-UTR of 5' to the promoter, or perhaps the miniMPRA, might make more sense presented before the main MPRA comes up?

We wish to thank Reviewer #2 for this suggestion. We had struggled some with how to organize our manuscript, and so opted to reorganize following the reviewer’s suggestion. We have rewritten the manuscript, presenting the mouse miniMPRA and the Dlx 3’ and 5’ orientation validation experiments as Figure 1 before moving to the full MPRA for the rest of the manuscript. In addition to more logically flowing from proof-of-principle to full experiment, this provided us more space to discuss our vector and library design. The reorganization resulted in a change in Figure numbering, and many figure panels have been moved or altered to fit this narrative flow. The first section of our Results now reads (lines 85-178):

“Developing and Validating AAV-MPRA strategy

We used a modified STARR-seq24 MPRA orientation, in which the candidates of interest were cloned in the 3’ untranslated region (3’-UTR) of a reporter gene (here EGFP) driven by the Hsp1a minimal promoter (HspMinP; formerly referred to as Hsp68 minP26). […] Based on this proof-of-principle experiment, we moved forward with scaling up our scAAV STARR-seq strategy to testing hundreds of sequences in mouse brain.”

I'd like to see a histogram of the count number of the elements in the DNA library. I am just curious about the range between the best and least cloned elements. Presumably the least well cloned elements are also the 25 % that were filtered out? Also, perhaps some scatter plot of DNA read depth vs. coefficient of variation (with color coding of the 25 % filtered out) might make a nice supplement, and provide a benchmark for future improvements to the method.

As the reviewer expects, being poorly represented in the pre-viral library (indicating a failure in cloning) corresponded with poor representation in the DNA read counts, justifying the exclusion of these amplicons from further analysis. Please see the . (Author response image 2).

Author response image 2

Reviewer #3 (Recommendations for the authors):

The article 'Parallel functional testing identifies enhancer active in early postnatal mouse brain' uses an adeno-associated virus (AAV) based high throughput approach (massively parallel reporter assay (MPRA)) to test the regulatory capacity of candidate enhancer sequences in early postnatal mouse brain. To ascertain the reproducibility of enhancer activity across MPRA studies, the authors tested four positive enhancers, two negative sequences, and orthologous mouse sequences via miniMPRA. The results these assays suggest that consistent results can be achieved through in vivo MPRA. Finally, the authors dissect regulatory elements within the CACNA1C intron that have been previously associated with disease (schizophrenia), determining that in vivo MPRA could be an efficient tool for assessing neural disease-associated regions. This paper would be of interest to the broad readership of eLife, primarily due to the use of an in vivo AAV-based brain MPRA and characterization of disease-associated regions. However, several point need to be addressed.

Major comments:

– A lot more information is needed about how sequences were chosen for the MPRA. The authors tested 408 sequences from four groups. How many from each group? Was there other filters to get to 408 for each group and if so what were they? Why did they shoot for this number and not more? Are there technical reasons for this? Was there overlap between the four groups and if so what was it? How did the authors choose candidate sequences in the GWAS and SNP groups? For those (GWAS and SNP) how many overlap predictive enhancer marks?

Would also mention the problem of choosing lead SNPs in text, i.e. these may not be causative SNPs, as they are just the SNPs on the genotyping array.

Would add to Figure 1A, some table/Venn diagram or other that shows the numbers of sequences that were tested for each group to make it easier for the reader to understand what was assayed.

Would mention that the fourth group was intended as a 'potential positive control', correct me if I'm wrong here.

No negative controls were tested. This is vital in most MPRAs to compare to in order to assess activity. This needs to be mentioned and discussed in detail.

We did not adequately explain our library design in our original manuscript, as is demonstrated by all three reviewers asking for clarification about the different groups of amplicons and our negative controls. We have addressed these concerns in our response to Essential Revision #1. Briefly, we have updated the manuscript with more detail on how amplicons were chosen and that many amplicons in the GWAS and LD groups were expected to be negative for enhancer activity because disease-associated SNPs may not be causal. We thank Reviewer #3 for the suggestion to add a figure panel showing this, and we have included a new figure panel (Figure 2A) which shows the relative proportions of the amplicon groups and the proportion of each group which overlaps signatures of open chromatin suggestive of potential enhancer activity in the brain. We have also clarified that the “putative enhancer” group (PutEnh) contains both positive and negative controls.

– SNPs: The sequences were cloned from 'pooled human DNA'. More info on how many individuals these are, ethnicity etc. is needed here, especially as different alleles could have different function. If the applicants have variant data from their sequenced library that would be great to add. This is even more lacking in the individual testing, in particular for the CACNA1C region, where variant might affect activity. Those definitely need to have info on the haplotype actually tested d.

We have clarified the specific pools of DNA we used in the updated manuscript. For LD amplicons, we have further clarified the alleles of each GWAS SNP that we tested in our single-candidate validation experiments. This section of our methods now reads:

“Clones were confirmed by PCR and Sanger sequencing. Amplicon #2 included the following alleles: A at rs4765904, A at rs1108221, and C at rs1108222. Amplicon #3 included the following alleles: G at rs1108075 and G at rs11062166. Amplicon #6 included the following alleles: C at rs2159100 and T at rs12315711.”

– The R in the correlation between RNA replicates is poor, just a little above 0.5 in some comparisons. It also appears between viral DNA and GC content. The authors need to explain this and also provide potential causes for the reproducibility of individual assays and PCR stochasticity in the library preparation and cDNA.

We have adjusted the text in our Results section to highlight the replicate-specific dropout of amplicons that limits the correlations between cDNA samples, and we have added more information on how GC content influences our model-based activity signal. We have also further addressed reproducibility of our assay in the Discussion section (lines 526-542 in the updated manuscript).

The passage in the Discussion now reads:

“Based on MPRA and validation experiments, the main sources of variation across MPRA replicates were AAV transduction rate and injection site (Figure 1—figure supplement 4, Figure 3—figure supplement 2, Figure 4—figure supplement 3), highlighting the need for delivery controls to ensure reproducibility when applying MPRAs in vivo via viral delivery. […] While not an obvious issue here, effect of PCR-based clonal amplification bias in other MPRAs has been shown to be reduced with the addition of barcodes and unique molecular identifiers to the reporter construct60.”

– Were samples 4-35 included in the subsequent studies or not? If not, what was their performance, considering the influences of the cycles in preparation.

Sample 4-35 was not included in the calculation of mean RNA proportions, which is stated in the methods section:

“For downstream analysis, mean amplicon proportions for genomic DNA (“DNA”) and cDNA (“RNA”) were calculated across all samples excluding the technical replicate Sample 4-35.”

And

“Mean ratiometric activity and standard deviation was then calculated for each amplicon across all biological replicates, excluding the technical replicate Sample 4-35.”

Sample 4-35 demonstrated the highest correlation with its technical replicate subjected to 30 PCR cycles (Figure 2 —figure supplement 1).

How did the authors de-duplicate reads in line 119?

The process of read deduplication of reads is explained in the Methods section:

“Aligned SAM files were processed using Picard tools (v.2.23.3)53: aligned reads were sorted by coordinates and converted to BAM files using SortSam, optical duplicates were flagged by MarkDuplicates, and BAM file indexing was conducted using BuildBamIndex. The number of de-duplicated reads within in silico PCR amplicon coordinates was counted using samtools view -F 0x400 -c (v1.7)78 and reported as the raw amplicon count.”

The -F 0x400 in samtools view -F 0x400 -c instructs samtools view counts to ignore any reads marked as duplicates by Picard tools MarkDuplicates.

Were UMIs used in the experiment and if so, more info on them is needed.

Our experimental strategy was not designed for UMI inclusion, as we sequenced the variable amplicon using a tagmentation approach.

– In the miniMPRA section line 188-200 it is not clear what sequences they chose? How many? What was the criteria for selecting the sequences? A brief explanation of the sequences in the main text will help the reader understand the experimental set up better. Also, more info on stats in the main text comparing between the MPRA is needed.

We performed the miniMPRA as a proof-of-concept in advance of our full MPRA experiment. When we initially wrote the manuscript, we wanted to focus on the full MPRA. This led us to neglect to include important details about the miniMPRA design and rationale. We have restructured the updated manuscript, which now fully describes this experiment (see lines 162-179 in the updated manuscript). For more details on this restructuring, see our response to Reviewer #2 above.

– Overall only a small N of sequences was tested individually, so I would reduce the tone that the individual assays validated the MPRA and make it more that these individual ones validated.

We have adjusted our language in the Results and Discussion sections to limit the scope of our claims.

The Discussion section now contains the following text:

“These results suggest that the AAV MPRA implementation was effective at screening putative regulatory sequences for in vivo activity in the brain, with reporter expression concordant between MPRA results and single candidate tests of EGFP expression in P7 mouse forebrain for five individually validated amplicons, and that these results could be extended to study activity in later development.”

– More detailed info on the brain expression of the individual constructs is needed.

We have added some more detailed observations on the brain expression of the individual reporter constructs. This statement has been added to the Results section (lines 447-450):

“EGFP expression driven by this amplicon was seen wherever the brain had been exposed to the virus (based on mRuby3 expression), with a trend toward greater density of EGFP+ cells observed in the lower-middle layers of the cortex (lower half of Layer IV and Layer V).”

– For the CACNA1C regions, other than reporting which haplotype was tested, it would significantly improve the manuscript if both the unassociated and associated haplotype were tested to assay whether they lead to alternate function.

We agree with Reviewer #3 that this is an exciting application of the MPRA method. See our detailed response in Essential Revision #2. Briefly, when we initially designed our study, we hoped that cloning from a pooled sample of DNA representing human population variation would enable us to analyze allelic differences in amplicon activity. However, our data was not well powered for this application, largely because the number of sequencing reads that were allele-informative were a small subset of the data and the minor allele frequencies for SNPs in our library were quite low. That said, we did do this analysis and describe our results. We also performed allele-specific luciferase for one of the CACNA1C amplicons, finding no difference between the two alleles. We have also included a new supplementary figure (Figure 2—figure supplement 4) which details our analysis of allele-specific activity of amplicons in our MPRA.

We have expanded our discussion of these limitations, along with suggestions for future MPRA experiments specifically designed for the study of alleles, in the updated manuscript.

https://doi.org/10.7554/eLife.69479.sa2

Article and author information

Author details

  1. Jason T Lambert

    1. Department of Psychiatry and Behavioral Sciences, University of California, Davis, Davis, United States
    2. Department of Neurobiology, Physiology and Behavior, University of California, Davis, Davis, United States
    Contribution
    Conceptualization, Formal analysis, Investigation, Methodology, Project administration, Validation, Visualization, Writing – original draft, Writing – review and editing
    Contributed equally with
    Linda Su-Feher
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-0264-9482
  2. Linda Su-Feher

    1. Department of Psychiatry and Behavioral Sciences, University of California, Davis, Davis, United States
    2. Department of Neurobiology, Physiology and Behavior, University of California, Davis, Davis, United States
    Contribution
    Conceptualization, Formal analysis, Investigation, Methodology, Project administration, Visualization, Writing – original draft, Writing – review and editing
    Contributed equally with
    Jason T Lambert
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-0660-7925
  3. Karol Cichewicz

    1. Department of Psychiatry and Behavioral Sciences, University of California, Davis, Davis, United States
    2. Department of Neurobiology, Physiology and Behavior, University of California, Davis, Davis, United States
    Contribution
    Data curation, Formal analysis, Investigation, Methodology, Software, Visualization, Writing – original draft, Writing – review and editing
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0001-5926-3663
  4. Tracy L Warren

    1. Department of Psychiatry and Behavioral Sciences, University of California, Davis, Davis, United States
    2. Department of Neurobiology, Physiology and Behavior, University of California, Davis, Davis, United States
    Contribution
    Formal analysis, Investigation, Visualization, Writing – review and editing
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0001-5125-0868
  5. Iva Zdilar

    1. Department of Psychiatry and Behavioral Sciences, University of California, Davis, Davis, United States
    2. Department of Neurobiology, Physiology and Behavior, University of California, Davis, Davis, United States
    Contribution
    Investigation, Methodology
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-0080-3132
  6. Yurong Wang

    1. Department of Psychiatry and Behavioral Sciences, University of California, Davis, Davis, United States
    2. Department of Neurobiology, Physiology and Behavior, University of California, Davis, Davis, United States
    Contribution
    Data curation, Investigation, Methodology, Software, Writing – review and editing
    Competing interests
    No competing interests declared
  7. Kenneth J Lim

    1. Department of Psychiatry and Behavioral Sciences, University of California, Davis, Davis, United States
    2. Department of Neurobiology, Physiology and Behavior, University of California, Davis, Davis, United States
    Contribution
    Software
    Competing interests
    No competing interests declared
  8. Jessica L Haigh

    1. Department of Psychiatry and Behavioral Sciences, University of California, Davis, Davis, United States
    2. Department of Neurobiology, Physiology and Behavior, University of California, Davis, Davis, United States
    Contribution
    Formal analysis, Investigation, Methodology, Validation, Writing – original draft, Writing – review and editing
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-9518-4003
  9. Sarah J Morse

    1. Department of Psychiatry and Behavioral Sciences, University of California, Davis, Davis, United States
    2. Department of Neurobiology, Physiology and Behavior, University of California, Davis, Davis, United States
    Contribution
    Investigation, Methodology, Validation
    Competing interests
    No competing interests declared
  10. Cesar P Canales

    1. Department of Psychiatry and Behavioral Sciences, University of California, Davis, Davis, United States
    2. Department of Neurobiology, Physiology and Behavior, University of California, Davis, Davis, United States
    Contribution
    Formal analysis, Investigation, Methodology, Validation
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0003-2505-8367
  11. Tyler W Stradleigh

    1. Department of Psychiatry and Behavioral Sciences, University of California, Davis, Davis, United States
    2. Department of Neurobiology, Physiology and Behavior, University of California, Davis, Davis, United States
    Contribution
    Investigation, Methodology, Validation
    Competing interests
    No competing interests declared
  12. Erika Castillo Palacios

    1. Department of Psychiatry and Behavioral Sciences, University of California, Davis, Davis, United States
    2. Department of Neurobiology, Physiology and Behavior, University of California, Davis, Davis, United States
    Contribution
    Formal analysis, Investigation, Validation
    Competing interests
    No competing interests declared
  13. Viktoria Haghani

    1. Department of Psychiatry and Behavioral Sciences, University of California, Davis, Davis, United States
    2. Department of Neurobiology, Physiology and Behavior, University of California, Davis, Davis, United States
    Contribution
    Formal analysis, Investigation, Validation
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-3700-4027
  14. Spencer D Moss

    1. Department of Psychiatry and Behavioral Sciences, University of California, Davis, Davis, United States
    2. Department of Neurobiology, Physiology and Behavior, University of California, Davis, Davis, United States
    Contribution
    Formal analysis, Investigation, Validation
    Competing interests
    No competing interests declared
  15. Hannah Parolini

    1. Department of Psychiatry and Behavioral Sciences, University of California, Davis, Davis, United States
    2. Department of Neurobiology, Physiology and Behavior, University of California, Davis, Davis, United States
    Contribution
    Formal analysis, Investigation, Validation
    Competing interests
    No competing interests declared
  16. Diana Quintero

    1. Department of Psychiatry and Behavioral Sciences, University of California, Davis, Davis, United States
    2. Department of Neurobiology, Physiology and Behavior, University of California, Davis, Davis, United States
    Contribution
    Formal analysis, Investigation, Validation
    Competing interests
    No competing interests declared
  17. Diwash Shrestha

    1. Department of Psychiatry and Behavioral Sciences, University of California, Davis, Davis, United States
    2. Department of Neurobiology, Physiology and Behavior, University of California, Davis, Davis, United States
    Contribution
    Formal analysis, Investigation, Validation
    Competing interests
    No competing interests declared
  18. Daniel Vogt

    Department of Pediatrics and Human Development, Grand Rapids Research Center, Michigan State University, Grand Rapids, United States
    Contribution
    Formal analysis, Investigation, Methodology, Validation, Writing – original draft, Writing – review and editing
    Competing interests
    No competing interests declared
  19. Leah C Byrne

    1. Helen Wills Neuroscience Institute, University of California, Berkeley, Berkeley, United States
    2. Departments of Ophthalmology and Neurobiology, University of Pittsburgh, Pittsburgh, United States
    Contribution
    Investigation, Methodology, Validation, Writing – original draft, Writing – review and editing
    Competing interests
    No competing interests declared
  20. Alex S Nord

    1. Department of Psychiatry and Behavioral Sciences, University of California, Davis, Davis, United States
    2. Department of Neurobiology, Physiology and Behavior, University of California, Davis, Davis, United States
    Contribution
    Conceptualization, Data curation, Funding acquisition, Project administration, Resources, Supervision, Writing – original draft, Writing – review and editing
    For correspondence
    asnord@ucdavis.edu
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0003-4259-7514

Funding

National Institutes of Health (R35GM119831)

  • Alex S Nord

National Institutes of Health (T32-GM008799)

  • Linda Su-Feher

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

Sequencing was performed at the UC San Francisco and UC Davis DNA cores. We thank John Flannery for advice and providing space in which we performed preliminary experiments. We also thank the lab of Lin Tian at UC Davis for generously gifting us AAV helper and rep/cap plasmids. This work was supported by NIH/NIGMS R35GM119831. L.S.-F. was supported by the UC Davis Floyd and Mary Schwall Fellowship in Medical Research, the UC Davis Emmy Werner and Stanley Jacobsen Fellowship, and by grant number T32-GM008799 from NIGMS-NIH.

Ethics

All procedures were performed in accordance with the ARVO statement for the Use of Animals in Ophthalmic and Vision Research and were approved by the University of California Animal Care and Use Committee (AUP #R200-0913BC). Surgery was performed under anesthesia, and all efforts were made to minimize suffering.

Senior Editor

  1. Kathryn Song Eng Cheah, The University of Hong Kong, Hong Kong

Reviewing Editor

  1. Genevieve Konopka, University of Texas Southwestern Medical Center, United States

Reviewers

  1. Stefan Barakat
  2. Joseph D Dougherty, Washington University in St. Louishington University, United States

Publication history

  1. Preprint posted: January 15, 2021 (view preprint)
  2. Received: April 19, 2021
  3. Accepted: October 2, 2021
  4. Accepted Manuscript published: October 4, 2021 (version 1)
  5. Version of Record published: November 9, 2021 (version 2)

Copyright

© 2021, Lambert et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Metrics

  • 1,964
    Page views
  • 260
    Downloads
  • 3
    Citations

Article citation count generated by polling the highest count across the following sources: Crossref, PubMed Central, Scopus.

Download links

A two-part list of links to download the article, or parts of the article, in various formats.

Downloads (link to download the article as PDF)

Open citations (links to open the citations from this article in various online reference manager services)

Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)

  1. Jason T Lambert
  2. Linda Su-Feher
  3. Karol Cichewicz
  4. Tracy L Warren
  5. Iva Zdilar
  6. Yurong Wang
  7. Kenneth J Lim
  8. Jessica L Haigh
  9. Sarah J Morse
  10. Cesar P Canales
  11. Tyler W Stradleigh
  12. Erika Castillo Palacios
  13. Viktoria Haghani
  14. Spencer D Moss
  15. Hannah Parolini
  16. Diana Quintero
  17. Diwash Shrestha
  18. Daniel Vogt
  19. Leah C Byrne
  20. Alex S Nord
(2021)
Parallel functional testing identifies enhancers active in early postnatal mouse brain
eLife 10:e69479.
https://doi.org/10.7554/eLife.69479
  1. Further reading

Further reading

    1. Genetics and Genomics
    2. Plant Biology
    Simon Snoeck, Bradley W Abramson ... Adam D Steinbrenner
    Research Article Updated

    As a first step in innate immunity, pattern recognition receptors (PRRs) recognize the distinct pathogen and herbivore-associated molecular patterns and mediate activation of immune responses, but specific steps in the evolution of new PRR sensing functions are not well understood. We employed comparative genomic and functional analyses to define evolutionary events leading to the sensing of the herbivore-associated peptide inceptin (In11) by the PRR inceptin receptor (INR) in legume plant species. Existing and de novo genome assemblies revealed that the presence of a functional INR gene corresponded with ability to respond to In11 across ~53 million years (my) of evolution. In11 recognition is unique to the clade of Phaseoloid legumes, and only a single clade of INR homologs from Phaseoloids was functional in a heterologous model. The syntenic loci of several non-Phaseoloid outgroup species nonetheless contain non-functional INR-like homologs, suggesting that an ancestral gene insertion event and diversification preceded the evolution of a specific INR receptor function ~28 my ago. Chimeric and ancestrally reconstructed receptors indicated that 16 amino acid differences in the C1 leucine-rich repeat domain and C2 intervening motif mediate gain of In11 recognition. Thus, high PRR diversity was likely followed by a small number of mutations to expand innate immune recognition to a novel peptide elicitor. Analysis of INR evolution provides a model for functional diversification of other germline-encoded PRRs.

    1. Chromosomes and Gene Expression
    2. Genetics and Genomics
    Evgeniya N Andreyeva, Alexander V Emelyanov ... Dmitry V Fyodorov
    Research Article

    Asynchronous replication of chromosome domains during S phase is essential for eukaryotic genome function, but the mechanisms establishing which domains replicate early versus late in different cell types remain incompletely understood. Intercalary heterochromatin domains replicate very late in both diploid chromosomes of dividing cells and in endoreplicating polytene chromosomes where they are also underrelicated. Drosophila SNF2-related factor SUUR imparts locus-specific underreplication of polytene chromosomes. SUUR negatively regulates DNA replication fork progression; however, its mechanism of action remains obscure. Here we developed a novel method termed MS-Enabled Rapid protein Complex Identification (MERCI) to isolate a stable stoichiometric native complex SUMM4 that comprises SUUR and a chromatin boundary protein Mod(Mdg4)-67.2. Mod(Mdg4) stimulates SUUR ATPase activity and is required for a normal spatiotemporal distribution of SUUR in vivo. SUUR and Mod(Mdg4)-67.2 together mediate the activities of gypsy insulator that prevent certain enhancer-promoter interactions and establish euchromatin-heterochromatin barriers in the genome. Furthermore, SuUR or mod(mdg4) mutations reverse underreplication of intercalary heterochromatin. Thus, SUMM4 can impart late replication of intercalary heterochromatin by attenuating the progression of replication forks through euchromatin/heterochromatin boundaries. Our findings implicate a SNF2 family ATP-dependent motor protein SUUR in the insulator function, reveal that DNA replication can be delayed by a chromatin barrier and uncover a critical role for architectural proteins in replication control. They suggest a mechanism for the establishment of late replication that does not depend on an asynchronous firing of late replication origins.