Massively parallel reporter assay for mapping gene-specific regulatory regions at single nucleotide resolution

  1. Department of Genetics, Harvard Medical School, Boston, United States
  2. Howard Hughes Medical Institute, Harvard Medical School, Boston, United States
  3. Ocular Genomics Institute, Mass Eye and Ear and Department of Ophthalmology, Harvard Medical School, Boston, United States
  4. Department of Ophthalmology, Harvard Medical School, Boston, United States

Peer review process

Not revised: This Reviewed Preprint includes the authors’ original preprint (without revision), an eLife assessment, public reviews, and a provisional response from the authors.


Editors

  • Reviewing Editor
    H Efsun Arda
    National Cancer Institute, Bethesda, United States of America
  • Senior Editor
    Lori Sussel
    University of Colorado Anschutz Medical Campus, Aurora, United States of America

Reviewer #1 (Public review):

MPRAs are a high-throughput and powerful tool for assaying the regulatory potential of genomic sequences. However, linking MPRA-nominated regulatory sequences to their endogenous target genes and identifying the more specific functional regions within these sequences can be challenging. MPRAs that tile a genomic region, and saturation mutagenesis-based MPRAs, can help to address these challenges. In this work, Tulloch et al. describe a streamlined MPRA system for the identification and investigation of the regulatory elements surrounding a gene of interest with high resolution. The use of BACs covering a locus of interest to generate MPRA libraries allows for an unbiased and high-coverage assessment of a particular region. Follow-up degenerate MPRAs, where each nucleotide in the nominated sequences is systematically mutated, can then point to key motifs driving their regulatory activity. The authors present this MPRA platform as straightforward, easily customizable, and less time- and resource-intensive than traditional MPRA designs. They demonstrate the utility of their design in the context of the developing mouse retina, where they first use the LS-MPRA to identify active regulatory elements for select retinal genes, followed by d-MPRA, which allowed them to dissect the functional regions within those elements and nominate important regulatory motifs. These assays were able to recapitulate some previously known cis-regulatory modules (CRMs), as well as identify some new potential regulatory regions. Follow-up experiments assessing co-localization of the gene of interest with the CRM-linked GFP reporter in the target cells, and CUT&RUN assays to confirm transcription factor binding to nominated motifs, provided support linking these CRMs to the genes of interest. Overall, this method appears flexible and could be an easy-to-implement tool for other investigators aiming to study their locus of interest with high resolution.

Strengths:

(1) The method of fragmenting BACs allows for high, overlapping coverage of the region of interest.

(2) The d-MPRA method was an efficient way to identify key functional transcription factor motifs and nominate specific transcription factor-driven regulatory pathways that could be studied further.

(3) Additional assays like co-expression analyses using the endogenous gene promoter, and use of the Notch inhibitor in the case of Olig2, helped correlate the activity of the CRMs to the expression of the gene of interest, and distinguish false positives from the initial MPRA.

(4) The use of these assays across different time points, tissues, and even species demonstrated that they can be used across many contexts to identify both common and divergent regulatory mechanisms for the same gene.

Weaknesses:

The LS-MPRA assay most strongly identified promoters, which are not usually novel regulatory elements you would try to discover, and the signal-to-noise ratio for more TSS-distal, non-promoter regulatory elements was usually low, making it difficult to discriminate lower activity CRMs, like enhancers, from the background. For example, NR2 and NR3 in Figure 3 have very minimal activity peaks (NR3 seems non-existent). The ex vivo data in Figure 2 are similarly noisy. Is there a particular metric or calculation that was or could be used to quantitatively or statistically call a peak above the background? The authors mention in the discussion some adjustments that could reduce the noise, such as increased sequencing depth, which I think is needed to make these initial LS-MPRA results and the benchmarking of this assay more convincing and impactful.

Reviewer #2 (Public review):

Summary:

In this study, Tulloch et al. developed two modified massively parallel reporter assays (MPRAs) and applied them to identify the cis-regulatory modules (CRMs; genomic regions that activate gene expression) controlling retinal gene expression. These CRMs usually function at specific developmental stages and in distinct cell types to orchestrate retinal development. Studying them provides insights into how retinal progenitor cells give rise to various retinal cell types.

The first assay, named locus-specific MPRA (LS-MPRA), tests all genomic regions within 150-300 kb of the gene of interest, rather than relying on previously predicted candidate regulatory elements. This approach reduces potential bias introduced during candidate selection, lowers the cost of synthesizing a library of candidate sequences, and simplifies library preparation. The LS-MPRA libraries were electroporated into mouse retinas in vivo or ex vivo. To benchmark the method, the authors first applied LS-MPRA near stably expressed retinal genes (e.g., Rho, Cabp5, Grm6, and Vsx2), and successfully identified both known and novel CRMs. They then used LS-MPRA to identify CRMs in embryonic mouse retinas, near Olig2 and Ngn2, genes expressed in subsets of retinal progenitor cells. Similar experiments were conducted in chick retinas and postnatal mouse retinas, revealing some CRMs with conserved activity across species and developmental stages.

Although the study identified CRMs with robust reporter activity in Olig2+ or Ngn2+ cells, the data do not provide sufficient evidence to support the claims that these CRMs regulate Olig2 or Ngn2, rather than other nearby genes, in a cell-type-specific manner. For example, the authors propose that three regions (NR1/2/3) regulate Olig2 specifically in retinal progenitor cells based on: (1) the three regions are close to Olig2, (2) increased Olig2 expression and NR1/2/3 activity upon Notch inhibition, and (3) reporter activity observed in Olig2+ cells (though also present in many Olig2- cells). While these are promising findings, they do not directly support the claims.

The second assay, called degenerate MPRA (d-MPRA), introduces random point mutations into CRMs via error-prone PCR to assess the impact of sequence variations on regulatory activity. This approach was used on NR1/2/3 to identify mutations that alter CRM activity, potentially by influencing transcription factor binding. The authors inferred candidate transcription factors, such as Mybl1 and Otx2, through motif analysis, co-expression with Olig2 (based on single-cell RNA-seq), and CUT&RUN profiling. While some transcription factors identified in this way overlapped with the d-MPRA results, others did not. This raises questions about how well d-MPRA complements other methods for identifying transcriptional regulators.
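To make the idea of reading out per-nucleotide effects from randomly mutated CRM copies concrete, the sketch below shows one simple way such data could be summarized. It is an assumption-laden illustration, not the authors' analysis: each variant is represented as a hypothetical pair of (set of mutated positions, measured activity), and the effect at a position is taken as the mean activity of variants mutated there minus the mean activity of variants that are not.

```python
from statistics import mean


def per_position_effect(variants, length):
    """Estimate the effect of mutating each position of a CRM.

    variants: list of (mutated_positions, activity) pairs, where
        mutated_positions is a set of 0-based positions (hypothetical format)
        and activity is a reporter readout such as log2(RNA/DNA).
    length: number of nucleotides in the CRM.
    Returns one value per position, or None when a position cannot be
    compared (never mutated, or mutated in every variant).
    """
    effects = []
    for pos in range(length):
        mutated = [act for muts, act in variants if pos in muts]
        unmutated = [act for muts, act in variants if pos not in muts]
        if not mutated or not unmutated:
            effects.append(None)
        else:
            effects.append(mean(mutated) - mean(unmutated))
    return effects


# Toy data: mutating position 0 lowers activity, position 1 has no effect,
# and position 2 is never mutated in this small library.
toy_variants = [({0}, 1.0), ({1}, 3.0), (set(), 3.0), ({0, 1}, 1.0)]
toy_effects = per_position_effect(toy_variants, length=3)
```

Because error-prone PCR can introduce several mutations per molecule, a simple difference of means like this confounds nearby positions. A regression over all positions, or restricting to singly mutated variants, would be reasonable alternatives.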

Strengths:

(1) The study introduces two technically robust MPRA protocols that offer advantages over standard methods, such as avoiding reliance on predefined candidate regions, reducing cost and labor, and minimizing selection bias.

(2) The identified regulatory elements and transcription factors contribute to our understanding of gene regulation in retinal development and may have translational potential for cell-type-specific gene delivery into developing retinas.

Weaknesses:

(1) The claims for gene-specific and cell type-specific CRMs would benefit from further validation using complementary approaches, such as CRISPR interference or Prime editing.

Reviewer #3 (Public review):

Summary:

Use of reporter assays to understand the regulatory mechanisms controlling gene expression moves beyond simple correlations of cis-regulatory sequence accessibility, evolutionary sequence conservation, and epigenetic status with gene expression, instead quantifying regulatory sequence activity for individual elements. Tulloch et al. provide a systematic characterization of two new reporter assay techniques (LS-MPRA and d-MPRA) to comprehensively identify cis-regulatory sequences contained within genomic loci of interest during retinal development. The authors then apply LS-MPRA and d-MPRA to identify putative cis-regulatory sequences controlling Olig2 and Ngn2 expression, including potential regulatory motifs that known retinal transcription factors may bind. Transcription factor binding to regulatory sequences is then assessed via CUT&RUN. The broader utility of the techniques is then highlighted by performing the assays across development, across species, and across tissues.

Strengths:

(1) The authors validate the reporter assays on retinal loci for which the regulatory sequences are known (Rho, Vsx2, Grm6, Cabp5) mostly confirming known regulatory sequence activity but highlighting either limitations of the current technology or discrepancies of previous reporter assays and known biology. The techniques are then applied to loci of interest (Olig2 and Ngn2) to better understand the regulatory sequences driving expression of these transcription factors across retinal development within subsets of retinal progenitor cells, identifying novel regulatory sequences through comprehensive profiling of the region.

(2) LS-MPRA provides broad coverage of loci of interest.

(3) d-MPRA identifies sequence features that are important for cis-regulatory sequence activity.

(4) The authors take into account transcript and protein stability when determining the correlation of putative enhancer sequence activity with target gene expression.

Weaknesses:

(1) In the manuscript's current form, many important controls that are standard for other MPRA experiments are not shown or not performed, limiting the interpretations of the utility of the techniques. This includes limited controls for basal-promoter activity, limited information about sequence saturation and reproducibility of individual fragments across different barcode sequences, limitations in cloning and assay delivery, and sequencing requirements. Additional quantitative metrics, including locus coverage and number of barcodes/fragments, would be beneficial throughout the manuscript.

(2) There are no statistical metrics for calling a region/sequence 'active'. This is especially important given that NR3 for Olig2 seems to have a small 'peak' and has non-significant activity in Figure 4.

(3) The authors present correlational data for identified cis-regulatory sequences with target gene expression. Additionally, the significance of transcription factor binding to the putative regulatory sequences is not currently tested, only correlated based on previous single-cell RNA-sequencing data. While putative regulatory sequences with potential mechanisms of regulation are identified/proposed, the lack of validation (and discrepancies with previous literature) makes it hard to decipher the utility of the techniques.

(4) While accounting for the dynamic regulation of Olig2 mRNA/protein expression improved the proportions of cells that co-expressed CRM-regulated GFP and Olig2, alternate explanations (some noted) are just as likely. First, the electroporation isn't specific to Olig2+ progenitors. Also, the tested, short CRM fragments may have activating signals outside of Olig2 neurogenic cells because chromatin conformation, histone modifications, and DNA methylation are not present on plasmids to precisely control plasmid activity. Alternatively, repressive elements that control Olig2 expression are not contained in the reporter vectors.

(5) It is unclear why the d-MPRA uses a different barcoding strategy, placing a second copy of the cis-regulatory sequence in the 3' UTR. As acknowledged by the authors, this will change the transcript stability by changing the 3' UTR sequence. Because of this, comparisons of sequence activity between the LS-MPRA and d-MPRA should not be performed as the experiments are not equivalent.

(6) Furthermore, details of the mutational burden in d-MPRA experiments are not provided, limiting the interpretations of these results.

(7) Many figures are IGV screenshots that suffer from low resolution. Many figures could be consolidated.

Author response:

We thank the Reviewers for their thorough reviews and helpful suggestions. We will provide additional quantification as requested for several aspects of the study.

The methods that we developed were meant to provide candidates for regulatory elements for a gene of interest. These candidates could be used to further understand the regulation of a gene, a complex and difficult task, especially for dynamically regulated genes in the context of development. These candidates could also, or instead, be used to drive gene expression specifically in a target cell of interest for applications such as gene therapy or perturbations that need this type of specificity. In the first case, to use the candidates to understand the regulation of a gene, one would need to validate the candidates using the types of methods typically employed for this purpose, most rigorously in the in vivo genomic context. We did not pursue this level of validation as it would encompass a great deal of work outside the scope of the current study. However, by initially testing loci and CRMs which have been studied by several groups (Rho, Grm6, Vsx2, and Cabp5), and at least in the cases of Rho and Vsx2, shown to be relevant in the genomic context in vivo, we provide evidence that the LS-MPRA can identify relevant CRMs. These data show that the method is worth using for loci of interest, particularly when only one or a few loci are of interest, i.e. one does not need to use genome-wide approaches. It is also apparent that our methods are not perfect and that the LS-MPRA does not pick up all CRMs. We do not know of a method that has been shown to do so.

Some of the statistical and quantitative data asked for by the Reviewers will be provided. However, it is important to note that the types of statistics using peak callers asked for regarding candidate choice will be of limited value. If one is testing a library in a single cell type in vitro, and/or running genome-wide assays, these statistics could aid in the choice of candidates. However, here we are electroporating a complex and dynamic set of cells, present at very different frequencies. In addition, at least for Olig2 and Ngn2, their expression is very transient, and each is expressed in only a small subset of cells. An additional confound is that the level of expression of each gene that one might test is variable. All of these variables render a statistical prediction of strong candidates less valuable than one might hope, and might lead one to miss those CRMs of interest. Instead, we suggest that one use one’s own level of interest and knowledge in choosing CRM candidates. We provide several examples of experimental, rather than purely statistical, approaches that might help in one’s choice of candidates. We used a functional read-out of CRM activity (Notch perturbation), carried out in the context of the entire LS-MPRA library, as one method. Co-expression in single cells of candidate regulators identified by the d-MPRA is another. One can of course use chromatin structure and sequence conservation, as used in many studies of regulatory regions, as other ways to narrow down candidates. The d-MPRA predictions also can be viewed in light of previous genetic studies, i.e. mutations in TFs that affect the cell type of interest or the regulation of the gene of interest, as we were able to do here for CRMs predicted to be regulated by Otx2.

If one wishes to use a candidate CRM to drive gene expression in a targeted cell type, one needs to establish specificity. In particular, specificity needs to be established in the context of the vector that is being used. Non-integrated vs integrated vectors, different types of viral vectors with their own confounding regulatory sequences, and copy number can all affect specificity. We provided a double in situ hybridization method for the examination of specificity for some of the novel candidate CRMs. It was quite difficult in the case of Olig2 and Ngn2 as their RNAs and proteins are unstable. We would need to provide further evidence should we wish to use these candidate CRMs for directing expression specifically in Olig2- or Ngn2-expressing cells. We suggest that an investigator can choose the vector and method for establishing specificity depending upon the goals of the application.
