Advances in single-cell sequencing technologies have provided novel insights into the dynamics of gene expression throughout development, been used to characterize somatic variation and heterogeneity within tissues, and are currently enabling the construction of transcriptomic cell atlases. However, despite these remarkable advances, linking anatomical information to transcriptomic data and positively identifying the cell types that correspond to gene expression clusters in single-cell sequencing data sets remains a challenge. We describe a straightforward genetic barcoding approach that takes advantage of the powerful genetic tools available in Drosophila to allow in vivo tagging of defined cell populations. This method, called Targeted Genetically-Encoded Multiplexing (TaG-EM), involves inserting a DNA barcode just upstream of the polyadenylation site in a Gal4-inducible UAS-GFP construct so that the barcode sequence can be read out during single-cell sequencing, labeling a cell population of interest. By creating many such independently barcoded fly strains, TaG-EM will enable a number of potential applications that will improve the quality and information content of single-cell transcriptomic data including positive identification of cell types in cell atlas projects, identification of multiplet droplets, and barcoding of experimental timepoints, conditions, and replicates. Furthermore, we demonstrate that the barcodes from TaG-EM fly lines can be read out using next-generation sequencing to facilitate population-scale behavioral measurements. Thus, TaG-EM has the potential to enable large-scale behavioral screens in addition to improving the ability to reliably annotate cell atlas data, expanding the scope, and improving the robustness of single-cell transcriptomic experiments.
This useful study presents a genetically encoded barcoding system that could not only advance transcriptomic studies but also has potential further applications, such as high-throughput population-scale behavioral measurements. The evidence supporting the authors' claims is currently inadequate to demonstrate that the method is indeed greatly superior to existing approaches in behavioral and transcriptomic studies.
Spatially and temporally regulated gene expression patterns are a hallmark of multicellular life and function to orchestrate patterning, growth, and differentiation throughout development (Ingham, 1988; Reeves et al., 2006). In mature organisms, spatial expression patterns both in tissues and within cells define functionally distinct compartments and determine many aspects of cellular and organismal physiology (Martin and Ephrussi, 2009). In addition, such expression patterns differentiate healthy and diseased tissue and impact disease etiology (Marusyk et al., 2012). Spatial and temporal expression patterns, which can be used to distinguish between cell types and provide insight into cellular function, also provide a means to understand the organization and physiology of complex tissues such as the brain (Thompson et al., 2014). Thus, robust and scalable tools for measuring spatial and temporal gene expression patterns at a genome-wide scale and at high resolution would be transformative research tools across many biological disciplines.
Single-cell sequencing technologies have provided insights into the dynamics of gene expression throughout development, been used to characterize somatic variation and heterogeneity within tissues, and are currently enabling the construction of transcriptomic cell atlases (Klein et al., 2015; Macosko et al., 2015; Zheng et al., 2017). However, linking anatomical information to transcriptomic data and positively identifying the cell types that correspond to gene expression clusters in single-cell sequencing data sets remains a challenge. The cellular identities of gene expression clusters identified in cell atlas data sets are typically inferred from the expression of distinctive gene sets (Hung et al., 2020; Li et al., 2022; Ma et al., 2021), and the lack of positive identification of gene expression clusters introduces an element of uncertainty in this analysis. Moreover, this process of manual annotation is labor-intensive and often requires additional experiments to determine or confirm the expression patterns of marker genes. Emerging spatial genomics technologies hold promise in linking anatomical and transcriptomic information (Lee et al., 2015; Lein et al., 2017). Several of the emerging commercial spatial genomics technologies rely on in situ sequencing of marker genes allowing droplet-based single-cell transcriptomic data to be mapped onto a tissue. However, these technologies currently suffer from constraints related to cost, content, or applicability to specific model systems.
In addition to descriptive cell atlas projects, studies that involve multiple experimental timepoints throughout development and aging, or that assess the effects of experimental exposures or genetic manipulations, would benefit from an increased ability to multiplex samples. Given the fixed costs of droplet-based single-cell sequencing, generating data for single-cell transcriptomic time courses or experimental manipulations can be costly. These costs are also a barrier to including replicates to assess biological variability; consequently, a lack of replicates is a common shortcoming of single-cell sequencing experiments. Antibody-based cell hashing or feature barcoding approaches have been developed to allow multiplexing of samples in droplet-based single-cell sequencing reactions (Stoeckius et al., 2018, 2017). In addition, other multiplexing strategies for single-cell sequencing based on alternative methods for tagging cells (Cheng et al., 2021) or making use of natural genetic variation have been used (Kurmangaliyev et al., 2020). While such approaches can reduce per-sample costs, samples are typically barcoded at a population level and thus these approaches do not enable labeling of cell subpopulations within a sample.
We have developed a straightforward genetic barcoding approach that takes advantage of the powerful genetic tools available in Drosophila to allow deterministic in vivo tagging of defined cell populations. This method, called Targeted Genetically-Encoded Multiplexing (TaG-EM), involves inserting a DNA barcode just upstream of the poly-adenylation site in a Gal4-inducible UAS-GFP construct so that the barcode sequence can be read out during droplet-based single-cell sequencing, labeling a cell population of interest.
Genetic barcoding approaches have been employed in many unicellular systems, cell culture, and viral transfection, to facilitate high throughput screening using sequencing-based readouts (Bhang et al., 2015; Smith et al., 2009; van Opijnen et al., 2009). In multicellular animals, techniques such as GESTALT have enabled lineage tracing by using CRISPR to create unique barcodes in differentiating tissue (McKenna et al., 2016), and barcode sequencing has also been employed to map connectivity in the brain (Chen et al., 2019). In addition to improving single-cell sequencing-based measurements, genetically barcoded fly lines can enable highly multiplexed behavioral assays read out using high throughput sequencing. Flies carrying TaG-EM barcodes can be exposed to different experimental perturbations and then run through assays where flies, larvae, or embryos are fractionated based on behavioral outcomes or other phenotypes. Thus, TaG-EM has the potential to enable large-scale next-generation sequencing (NGS)-based behavioral or other fractionation screens analogous to BAR-Seq or Tn-Seq approaches employed in microbial organisms.
TaG-EM: A novel genetic barcoding strategy for multiplexed behavioral and single-cell transcriptomics
We cloned a fragment containing a PCR handle sequence and a diverse 14 bp barcode sequence into the SV40 3’ untranslated region (UTR) sequence just upstream of the polyadenylation sites in the 10xUAS-myr::GFP (pJFRC12, (Pfeiffer et al., 2010)) backbone (Figure 1A). A pool containing 29 unique barcode-containing plasmids was injected into Drosophila embryos for PhiC31-mediated integration into the attP2 landing site (Groth et al., 2004) and transgenic lines were isolated and confirmed by Sanger sequencing (Supplemental Figure 1). We recovered 20 distinctly barcoded Drosophila lines, with some barcodes recovered from multiple crosses (Supplemental Figure 1). Such barcoded fly lines have the potential to enable population behavioral measurements, where different exposures, experimental timepoints, and genetic or neural perturbations can be multiplexed and analyzed by measuring barcode abundance in sequencing data (Figure 1B). In addition, the barcodes, which reside on a Gal4-inducible UAS-GFP construct, can be expressed tissue-specifically and read out during droplet-based single-cell sequencing, labeling a cell population and/or an experimental condition of interest (Figure 1C).
Testing the accuracy and reproducibility of TaG-EM behavioral measurements using structured pools
We conducted initial experiments to optimize amplification of the genetic barcodes using primers targeting the PCR handle inserted just upstream of the 14 bp barcode sequence and PCR primers downstream of the TaG-EM barcode in the SV40 3’ UTR sequence (Supplemental Figure 2). To test the accuracy and reproducibility of sequencing-based measurements of TaG-EM barcodes, we constructed structured pools containing defined numbers of flies pooled either evenly with each of the 20 barcode constructs comprising 5% of the pool, or in a staggered manner with sets of barcodes differing in abundance in 2-fold increments (Figure 2A). To examine the impact of technical steps such as DNA extraction and PCR amplification on TaG-EM barcode measurements, even pools were made and extracted in triplicate and amplicon sequencing libraries were made in triplicate for each independently extracted DNA sample for both the even and staggered pools. The resulting data indicated that TaG-EM measurements are highly accurate and reproducible. Technical replicates (indicated by error bars in Figure 2B-E) showed minimal variability. Likewise, the three independently extracted replicates of the even pools produced consistent data with all 20 barcodes detected at levels close to the expected 5% abundance (Figure 2B-C). Barcode abundance values for the staggered structured pools were generally consistent with the input values and in most cases, the 2-fold differences between the different groups of barcodes could be distinguished (Figure 2D-E). For the staggered pools, abundances correlated well with the expected values, particularly when multiple barcodes for an input level were averaged, in which case R2 values were >0.99 (Figure 2D-E, inset plots). This indicates that a high level of quantitative accuracy can be attained using sequencing-based analysis of TaG-EM barcode abundance, particularly when averaging data for three to four independent barcodes for an experimental condition.
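The comparison between expected and observed barcode abundance in a structured pool can be illustrated with a short sketch. This is not the analysis code used in the study; the function names and the example inputs are hypothetical, and R2 is computed here as the squared Pearson correlation, which is an assumption about how the reported values were derived.

```python
from statistics import mean


def barcode_fractions(read_counts):
    """Convert raw per-barcode read counts into fractions of the pool."""
    total = sum(read_counts.values())
    return {bc: n / total for bc, n in read_counts.items()}


def r_squared(expected, observed):
    """Squared Pearson correlation between two equal-length sequences."""
    mx, my = mean(expected), mean(observed)
    sxy = sum((x - mx) * (y - my) for x, y in zip(expected, observed))
    sxx = sum((x - mx) ** 2 for x in expected)
    syy = sum((y - my) ** 2 for y in observed)
    return sxy ** 2 / (sxx * syy)


def mean_by_level(expected_by_barcode, observed_by_barcode):
    """Average observed fractions over barcodes that share an input level."""
    groups = {}
    for bc, level in expected_by_barcode.items():
        groups.setdefault(level, []).append(observed_by_barcode[bc])
    levels = sorted(groups)
    return levels, [mean(groups[level]) for level in levels]


if __name__ == "__main__":
    # Hypothetical staggered pool: two barcodes per input level (made-up counts).
    expected = {"BC1": 1, "BC2": 1, "BC3": 2, "BC4": 2, "BC5": 4, "BC6": 4}
    reads = {"BC1": 90, "BC2": 110, "BC3": 210, "BC4": 190, "BC5": 380, "BC6": 420}
    observed = barcode_fractions(reads)
    levels, means = mean_by_level(expected, observed)
    print(r_squared(levels, means))
```

Averaging over the barcodes assigned to each input level before computing the correlation mirrors the observation above that agreement improves when several independent barcodes represent one condition.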
TaG-EM measurements of phototaxis behavior correlate well with video-based measurements
Next, we tested whether TaG-EM could be used to measure a phototaxis behavior. A mixture of barcoded wild type and blind norpA mutant flies was run together through a phototaxis assay. At the end of a period of light exposure, test tubes facing toward or away from the light were capped, DNA was isolated, and barcodes were amplified and sequenced for each tube. Raw read counts were scaled in proportion to the number of flies per tube and a preference index was calculated for each barcode (Figure 3A). In parallel, individual preference indices were calculated based on manual scoring of videos recorded for each line (Figure 3B). Preference indices calculated for the pooled, NGS-based TaG-EM measurements were nearly identical to conventional behavioral measurements for both wild type and norpA mutants (Figure 3A-B).
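As an illustration of the scaling step, the sketch below converts per-tube read counts into estimated fly numbers per barcode and then into a preference index. The exact formula is not stated in the text; the index used here, (light − dark) / (light + dark), is a common convention and is an assumption, as are the function names and example values.

```python
def estimated_flies(read_counts, n_flies):
    """Scale per-barcode read counts so they sum to the flies counted in a tube."""
    total = sum(read_counts.values())
    if total == 0:
        return {bc: 0.0 for bc in read_counts}
    return {bc: n_flies * n / total for bc, n in read_counts.items()}


def preference_index(light_reads, dark_reads, n_light, n_dark):
    """Per-barcode preference index, assuming PI = (light - dark) / (light + dark)."""
    light = estimated_flies(light_reads, n_light)
    dark = estimated_flies(dark_reads, n_dark)
    index = {}
    for bc in sorted(set(light) | set(dark)):
        toward = light.get(bc, 0.0)
        away = dark.get(bc, 0.0)
        denom = toward + away
        # A barcode absent from both tubes has no defined preference.
        index[bc] = (toward - away) / denom if denom > 0 else float("nan")
    return index


if __name__ == "__main__":
    # Hypothetical read counts for one wild type and one norpA barcode.
    light_reads = {"wt_BC1": 800, "norpA_BC2": 200}
    dark_reads = {"wt_BC1": 200, "norpA_BC2": 800}
    print(preference_index(light_reads, dark_reads, n_light=10, n_dark=10))
```

Scaling by the number of flies per tube matters because each tube is amplified and sequenced separately, so raw read counts are not directly comparable between tubes.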
TaG-EM measurement of oviposition behavior and age-dependent fecundity
We next tested whether NGS-based pooled measurements of egg laying could be made. Fertilized females from each of the 20 barcode lines were placed together in egg laying cups, embryos were collected, aged for 12 hours to enable cell numbers to stabilize in the developing eggs, and then DNA was extracted from both the pooled adult flies and the embryos. In general, TaG-EM measurements of oviposition correlated with fly numbers, with the exception of barcode 14 which had reduced barcode abundance across multiple trials (Supplemental Figure 3). This suggests that despite the fact that the genetic barcode constructs are inserted in a common landing site, differences with respect to specific behaviors may exist among the lines and thus one should test to make sure given lines are appropriate to use in specific behavioral assays.
To determine whether TaG-EM could be used to measure age-dependent fecundity, we collected flies from twelve different TaG-EM barcode lines at four time points separated by one week (three barcode lines per timepoint). We collected eggs from these fly lines individually and scored the number of viable eggs per female. Next, we pooled the barcoded flies from all timepoints and collected eggs from the pooled flies. These eggs were aged, DNA was extracted, and the TaG-EM barcodes were amplified and sequenced. While measurements from individual barcode lines were noisy, both for manual counts and sequencing based measurements, there was a general trend toward declining fecundity over time (Supplemental Figure 4), consistent with published reports (David et al., 1975). Manually scored viable egg numbers and TaG-EM barcode abundances were well correlated across two independent experimental trials (R2 values of 0.52-0.61 for Trial 1 and 0.74-0.84 for Trial 2). When barcodes from each individual timepoint were averaged, R2 values for the correlation between manual and sequencing-based measurements were 0.95 for Trial 1 and 0.99 for Trial 2 (Figure 3C, Supplemental Figure 4).
Tissue-specific expression of TaG-EM GFP constructs
To facilitate representation of the TaG-EM barcodes in single-cell sequencing data, genetic barcodes were placed just upstream of the polyadenylation signal sequences and polyA cleavage sites (Figure 4A). To verify that the inserted sequences did not interfere with Gal4-driven GFP expression, we crossed each of the barcoded TaG-EM lines to decapentaplegic-Gal4 (dpp-Gal4). We observed GFP expression in the expected characteristic central stripe (Teleman and Cohen, 2000) in the wing imaginal disc for 19/20 lines at similar expression levels to the base pJFRC12 UAS-myr::GFP construct inserted in the same landing site (Figure 4B, Supplemental Figure 6). No GFP expression was visible for TaG-EM barcode number 8. Gal4-driven expression levels of TaG-EM barcoded GFP constructs were also similar to that of the pJFRC12 base construct for multiple driver lines (Supplemental Figure 7) indicating that the presence of the barcode does not generally impair expression of GFP.
Boosting the GFP signal of TaG-EM constructs to enable robust cell sorting
While with some driver lines, expression of the myr::GFP from the TaG-EM construct may be too weak to allow robust enrichment of the tagged cells, adding an additional hexameric GFP construct (Shearin et al., 2014) could boost expression of weak driver lines to levels that are sufficient for robust detection of labeled flies or larvae (Figure 4C) and for labeling of dissociated cells for flow cytometry (Supplemental Figure 8). Stocks with an additional UAS hexameric GFP construct recombined onto the same chromosome as the TaG-EM construct have been established for all 20 TaG-EM barcode lines.
Correlation between expression of TaG-EM barcodes and intestinal cell marker genes in single-cell sequencing data
To test whether we could detect TaG-EM barcodes in single-cell sequencing data, we crossed three TaG-EM barcode lines to two different gut Gal4 driver lines (Ariyapala et al., 2020), one expressing in the enterocytes (EC-Gal4) and the other in intestinal precursor cells (PC-Gal4), including stem cells and enteroblasts (EBs). Due to weak GFP expression with the EC-Gal4 driver, we did not see visible GFP positive cells for this driver line. The PC-Gal4 driver line contained an additional UAS-Stinger (2xGFP) construct and expressed GFP at a level sufficient for flow sorting when crossed to the TaG-EM line (Supplemental Figure 7). Larval guts were dissected, dissociated, stained with propidium iodide (PI) to label dead cells, and flow sorted to recover PI negative and GFP positive cells. Approximately 10,000 cells were loaded into a 10x Genomics droplet generator and a single-cell library was prepared and sequenced. Two clusters were observed in the resulting sequencing data, one of which had high read counts from mitochondrial genes, suggesting that this cluster consisted of mitochondria, debris, or dead and dying cells. After filtering the cells with high mitochondrial reads, a single cluster remained (Supplemental Figure 9). This cluster expressed known intestinal precursor cell markers such as escargot (esg), klumpfuss (klu), and Notch pathway genes like E(spl)mbeta-HLH (Supplemental Figure 9). Expression of all three PC-Gal4 driven TaG-EM barcodes was observed in this cluster (Supplemental Figure 9), indicating that TaG-EM barcodes can be detected in single-cell sequencing data. Interestingly, TaG-EM barcode 8, for which no GFP expression was observed, was represented in the single-cell sequencing data, indicating that the insertion of this specific barcode sequence may affect GFP translation, but not mRNA expression, for this line.
A previous study used droplet-based single-cell sequencing to characterize the cell types that make up the adult midgut (Hung et al., 2020). This study took advantage of two fluorescent protein markers, an escargot (esg)-GFP fusion protein and a prospero (pros)-Gal4 driven RFP to label the intestinal stem cells (ISCs) and enteroendocrine cells (EECs), respectively (Hung et al., 2020). The authors compared the resulting clusters to a list of known marker genes in the literature, including antibody staining, GFP, LacZ, and Gal4 reporter expression patterns to classify the cells in individual clusters, and also found that the esg-GFP expression was present in a broader subset of cells than anticipated. Thus, most of these cell classifications relied upon inference as opposed to direct positive labeling. Recently, a large collection of split-Gal4 lines was screened for expression in the adult and larval gut (Ariyapala et al., 2020). These include pan-midgut driver lines, split-Gal4 lines specific for the EBs, ECs, EECs, and ISC/EBs, as well as driver lines with regionalized gene expression. We crossed four different TaG-EM barcode lines with the pan-midgut driver (PMG-Gal4), and one barcode line to each of the precursor cell (PC-Gal4), enterocyte (EC-Gal4), enteroblast (EB-Gal4) and enteroendocrine (EE-Gal4) drivers (Ariyapala et al., 2020). Larval guts were dissected from these lines and cells were dissociated, flow sorted as described above to select live, GFP positive cells, and approximately 30,000 cells were loaded into a 10x Genomics droplet generator for single-cell sequencing. Using the additional hexameric GFP construct to boost GFP expression resulted in visible fluorescent signal for all eight barcode-Gal4 line combinations.
An advantage of cell barcoding both for cell hashing (Stoeckius et al., 2018) and for TaG-EM in vivo barcoding is that such labeling facilitates the identification and removal of multiplets, which are an artifact of droplet-based single-cell sequencing approaches. After filtering and removing cells with a high percentage of mitochondrial or ribosomal reads, we searched for cells that co-expressed multiple TaG-EM barcodes. Out of 8,847 cells, we identified and removed 854 cells (9.65%) that expressed two different TaG-EM barcodes (Figure 5A, Supplemental Figure 10). We additionally attempted amplifying the TaG-EM barcodes in parallel to preparation of the gene expression library, following a modified cell hashing enrichment protocol. However, this enrichment was unsuccessful (resulting sequencing reads were predominantly off-target) and thus amplification of TaG-EM barcodes will require further troubleshooting.
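The multiplet filter described above can be sketched as a simple rule on a per-cell table of barcode counts. This is an illustrative sketch rather than the pipeline used in the study; the input layout, the function name, and the `min_umis` detection threshold are hypothetical.

```python
def assign_barcodes(barcode_counts, min_umis=1):
    """Split cells into singlets, multiplets, and unlabeled cells.

    barcode_counts maps cell ID -> {barcode: UMI count}. A barcode is called
    "detected" at or above min_umis (an assumed threshold). Cells with exactly
    one detected barcode are singlets; cells with two or more are flagged as
    multiplets; cells with none are left unlabeled.
    """
    singlets, multiplets, unlabeled = {}, [], []
    for cell, counts in barcode_counts.items():
        detected = sorted(bc for bc, n in counts.items() if n >= min_umis)
        if len(detected) == 1:
            singlets[cell] = detected[0]
        elif len(detected) > 1:
            multiplets.append(cell)
        else:
            unlabeled.append(cell)
    return singlets, multiplets, unlabeled


if __name__ == "__main__":
    cells = {
        "cell_1": {"BC4": 3},
        "cell_2": {"BC4": 1, "BC6": 2},  # two barcodes: flagged as a multiplet
        "cell_3": {},
    }
    singlets, multiplets, unlabeled = assign_barcodes(cells)
    print(singlets, multiplets, unlabeled)
    # Fraction of flagged cells reported in the text: 854 of 8,847.
    print(round(854 / 8847 * 100, 2))
```

Because barcode transcripts are not detected in every cell, a filter of this kind only catches multiplets in which both contributing cells carry detectable, distinct barcodes.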
After doublet removal, the remaining cells were clustered (Figure 5B) and analyzed using ScanPy (Wolf et al., 2018). Analysis of differentially expressed genes identified clusters expressing marker genes previously reported for adult gut cell types (Hung et al., 2020). These included genes associated with precursor cells (Notch pathway genes), enterocytes (trypsins, serine proteases, amylase, mannosidases), and enteroendocrine cells (neuropeptides and neuropeptide receptors) (Supplemental Figure 11, Supplemental Figure 12).
As expected, the barcodes expressed by the pan-midgut driver were broadly distributed across the cell clusters (Supplemental Figure 13). In contrast, expression of the cell type-specific barcodes showed more restricted patterns of expression among the cell clusters and were co-localized with known marker genes for these cell types (Figure 5C-N). For instance, TaG-EM barcode 4 expression (Figure 5F), which was driven by the EC-Gal4 line, was seen in clusters that expressed enterocyte markers such as the serine protease Jon99Ciii (Figure 5C) and other enterocyte marker genes such as the amylase, Amy-d (not shown). TaG-EM barcode 4 did not label other classes of enterocyte cells such as beta-Trypsin or LManVI positive enterocytes (Figures 5D-E). TaG-EM barcode 6 (Figure 5J), driven by the EB-Gal4 line, was strongly expressed in cells that also expressed precursor cell markers such as esg (Figure 5G), klu (Figure 5H), and Notch pathway genes such as E(spl)mbeta-HLH (Figure 5I) and E(spl)m3-HLH (not shown). Finally, expression of TaG-EM barcode 9 (Figure 5N), which was expressed using the EE-Gal4 driver line, was observed in clusters of cells that also expressed enteroendocrine cell derived neuropeptide genes such as Dh31 (Figure 5K) and other enteroendocrine markers such as IA-2, a tyrosine phosphatase involved in the secretion of insulin-like peptide (Figure 5L). As with the enterocyte labeling, the EE-Gal4 expressed TaG-EM barcode 9 did not label all classes of enteroendocrine cells, and other clusters of presumptive enteroendocrine cells expressing other neuropeptides such as Orcokinin (Figure 5M), AstA (not shown), and AstC (not shown), or neuropeptide receptors such as CCHa2 (not shown), were also observed.
The EE-Gal4 driver uses Dh31 regulatory elements, so it is not surprising that the TaG-EM barcodes specifically labeled Dh31-positive enteroendocrine cells and this result further highlights the ability to target specific genetically defined cell types using TaG-EM based on in vivo cell labeling. Taken together, these results demonstrate that TaG-EM can be used to label specific cell populations in vivo for subsequent identification in single-cell sequencing data.
Advances in next-generation sequencing, as well as single-cell and spatial genomics are enabling new types of detailed analyses to study important biological processes such as development and nervous system function. Here we describe TaG-EM, a genetic barcoding strategy that enables novel capabilities in several different experimental contexts (Figure 1).
We demonstrate that the genetic barcodes can be quantified from mixtures of barcoded fly lines using next-generation sequencing. Analysis of structured pools of flies with defined inputs suggests that TaG-EM barcode measurements are highly accurate and reproducible, particularly in cases where multiple barcodes are used to label an experimental condition and averaged (Figure 2). Sequencing-based TaG-EM measurements recapitulated more laborious, one-at-a-time measurements in both a phototaxis assay and an age-dependent fecundity assay, demonstrating that TaG-EM can be used to measure behavior or other phenotypes in multiplexed, pooled populations (Figure 3). We did note that one line (TaG-EM barcode 14) exhibited poor performance in oviposition assays, suggesting that barcode performance should be verified for a specific assay, though other strategies such as averaging across multiple barcode lines or permutation of barcode assignment across replicates could also mitigate such deficiencies. Currently, up to twenty conditions can be multiplexed in a single pooled experiment with existing TaG-EM lines, but because sequencing indices can be added after amplification in a separate indexing PCR step, hundreds or even thousands of such experiments can be multiplexed in a single sequencing run.
In addition, we show that TaG-EM barcodes can be expressed by tissue-specific Gal4 drivers and used to tag specific cell populations upstream of single-cell sequencing (Figures 4-5). This capability will allow for positive identification of cell clusters in cell atlas projects and will facilitate multiplexing of single-cell sequencing experiments. Recently, a conceptually similar approach called RABID-Seq (Clark et al., 2021) has been described, which allows trans-synaptic labeling of neural circuits using barcoded viral transcripts. However, one distinction between the two approaches is that RABID-Seq relies on stochastic viral infection of mammalian cells while TaG-EM allows reproducible targeting of defined cell populations allowing unambiguous cell identification and potentially allowing the same cell populations to be assessed at different timepoints or in the context of different experimental manipulations. One current limitation is that TaG-EM barcodes are not observed in all cells in single-cell gene expression data and low expressing driver lines may result in particularly sparse labelling. In antibody-conjugated oligo cell hashing approaches, this sparsity of barcode representation is overcome by spiking in an additional primer at the cDNA amplification step and amplifying the hashtag oligo by PCR. A similar enrichment approach could be employed to generate dense labeling of the TaG-EM barcodes, improving upon the current mapping of barcode-containing transcripts observed in the gene expression data, and facilitating comparison between replicates and robust detection of multiplets.
In the future, generation of additional TaG-EM lines will enable higher levels of multiplexing. In addition, while the original TaG-EM lines were made using a membrane-localized myr::GFP construct, variants that express GFP in other cell compartments such as the cytoplasm or nucleus could be constructed to enable increased expression levels or purification of nuclei. Nuclear labeling could also be achieved by co-expressing a nuclear GFP construct with existing TaG-EM lines in analogy to the use of hexameric GFP described above.
In summary, combined with the large collections of Gal4 and split-Gal4 lines that have been established in Drosophila that enable precise targeting of a high proportion of cell types (Ariyapala et al., 2020; Aso et al., 2014; Davis et al., 2020; Gohl et al., 2011; Namiki et al., 2018; Pfeiffer et al., 2010, 2008; Venken et al., 2011), TaG-EM provides a means to target and label cells in vivo for subsequent detection in single-cell sequencing. Moreover, these genetic barcodes can be used to multiplex behavioral or other phenotypic measurements. Thus, TaG-EM provides a flexible system for barcoding cells and organisms.
Drosophila stocks and maintenance
Drosophila stocks were grown at 22°C on cornmeal agar unless otherwise indicated. The following stocks were used in this study:
Design and cloning of TaG-EM constructs
A gBlock with the following sequence containing a part of the SV40 3’ UTR with a PCR handle (uppercase, below) and a 14 bp randomer sequence just upstream of the SV40 polyadenylation site (bold and underlined, below) was synthesized (Integrated DNA Technologies, IDT):
caaaggaaaaagctgcactgctatacaagaaaattatggaaaaatatttgatgtatagtgccttgactagagatcataatcagccataccacatttgtagaggttttacttgctttaaaaaacctcccacacctccccctgaacctgaaacataaaatgaatgcaattgttgttgttaacttgtttattgcagcttataaCTTCCAACAACCGGAAGTGANNNNNNNNNNNNNNtggttacaaataaagcaatagcatcacaaatttcacaaataaagcatttttttcactgcattctagttgtggtttgtccaaactcatcaatgtatcttatcatgtctggatcgatctggccggccgtttaaacgaattcttgaagacgaaagggcctcgtgatacgcctatttttataggttaatgtcatgataataatg
The gBlock was resuspended in 20 µl EB, incubated at 50°C for 20 minutes and then cut with PsiI and EcoRI (New England Biolabs, NEB) using the following reaction conditions: 4 µl gBlock DNA (35 ng), 2 µl 10x CutSmart buffer (NEB), 1 µl EcoRI enzyme (NEB), 1 µl PsiI enzyme (NEB), and 12 µl nuclease-free water were mixed and incubated at 37°C for 1 hour followed by 65°C for 20 minutes to heat inactivate the restriction enzymes. pJFRC12-10XUAS-IVS-myr::GFP plasmid (Addgene, Plasmid #26222) (Pfeiffer et al., 2010) was digested with the following reaction conditions: 5 µl pJFRC12-10XUAS-IVS-myr::GFP plasmid DNA (∼3 µg), 5 µl 10x CutSmart buffer (NEB), 1 µl PsiI enzyme (NEB), 1 µl EcoRI enzyme (NEB), and 38 µl nuclease-free water were mixed and incubated at 37°C for 1 hour, followed by addition of 1 µl of CIP and incubation for an additional 30 minutes. The digested vector backbone was gel purified using the QiaQuick Gel Purification Kit (Qiagen).
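Because the 14 bp randomer sits directly between the PCR handle and a fixed stretch of SV40 sequence, a barcode can in principle be recovered from a sequencing read by anchoring on those flanks. The sketch below illustrates the idea using the flanking sequences from the gBlock above; the function names are hypothetical, the choice of a 15-base downstream anchor is arbitrary, and reads in the reverse-complement orientation or with sequencing errors in the flanks are not handled.

```python
import re
from collections import Counter

# PCR handle (uppercase in the gBlock) and the first 15 bases after the randomer.
HANDLE = "CTTCCAACAACCGGAAGTGA"
DOWNSTREAM = "TGGTTACAAATAAAG"
BARCODE_PATTERN = re.compile(HANDLE + r"([ACGT]{14})" + DOWNSTREAM)


def extract_barcode(read):
    """Return the 14 bp barcode between the handle and the SV40 flank, or None."""
    match = BARCODE_PATTERN.search(read.upper())
    return match.group(1) if match else None


def count_barcodes(reads):
    """Tally barcodes across reads, skipping reads without a clean match."""
    counts = Counter()
    for read in reads:
        barcode = extract_barcode(read)
        if barcode is not None:
            counts[barcode] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical read carrying a made-up barcode.
    read = "cagcttataa" + HANDLE + "ACGTACGTACGTAC" + "tggttacaaataaagcaatag"
    print(extract_barcode(read))
    print(count_barcodes([read, read, "acgtacgt"]))
```

Requiring both flanks makes the match strict; a more tolerant approach would allow mismatches in the flanks or compare extracted barcodes against the list of known TaG-EM barcodes.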
The digested gBlock was ligated into the digested pJFRC12-10XUAS-IVS-myr::GFP backbone using the following reaction conditions: 4 µl T4 ligase buffer (10x) (NEB), 20 µl plasmid backbone DNA (0.005 pmol), 5 µl gBlock digest DNA (0.03 pmol), 2 µl of T4 DNA ligase (NEB), and 9 µl nuclease-free water were mixed and incubated at 22°C for 2 hours. 2 µl of the ligation reaction was transformed into 50 µl of TOP10 competent cells (Invitrogen), and the cells were incubated on ice for 30 minutes, then heat shocked at 42°C for 30 seconds, and incubated on ice for 5 minutes. 250 µl SOC was added and the cells were plated on LB+ampicillin plates and incubated overnight at 37°C. DNA was isolated from 36 pJFRC12-gBlock colonies using a QIAprep Spin MiniPrep kit (Qiagen). Expected construct size was verified by diagnostic digest with EcoRI and ApaLI. DNA concentration was determined using a Quant-iT PicoGreen dsDNA assay (Thermo Fisher Scientific) and the randomer barcode for each of the constructs was determined by Sanger sequencing using the following primers:
Generation of TaG-EM transgenic lines
29 sequence verified constructs were normalized, pooled evenly, and injected as a pool into embryos (Rubin and Spradling, 1982) expressing PhiC31 integrase and carrying the attP2 landing site (BDSC #25710). Injected flies were outcrossed to w- flies, up to three white+ progeny per cross were identified, and the transgenic lines were homozygosed. DNA was extracted (GeneJET genomic DNA purification Kit, Thermo Scientific) and the region containing the DNA barcode was amplified with the following PCR reaction: 2.5 µl 1:10 diluted template DNA, 2 µl 10x Reaction Buffer (Qiagen), 0.2 µl dNTP mix (10 µM), 1 µl SV40_5F primer (10 µM), 1 µl SV40_post_R primer (10 µM), 0.8 µl MgCl2 (3 mM), 0.1 µl Taq polymerase (Qiagen), 12.4 µl nuclease-free water. Reactions were amplified using the following cycling conditions: 95°C for 5 minutes, followed by 30 cycles of 94°C for 30 seconds, 55°C for 30 seconds, 72°C for 30 seconds, followed by 72°C for 5 minutes. PCR products were treated with Exo-CIP using the following reaction conditions: 5 µl PCR product, 1 µl Exo-CIP Tube A (NEB), 1 µl Exo-CIP Tube B (NEB) were mixed and incubated at 37°C for 4 minutes, followed by 80°C for 1 minute. The barcode sequence for each of the independent transgenic lines was determined by Sanger sequencing using the SV40_5F and SV40_post_R primers. Transgenic lines containing 20 distinct DNA barcodes were recovered (Supplemental Figure 1).
Optimizing amplification of TaG-EM barcodes for next-generation sequencing
The following primers were evaluated to amplify the TaG-EM barcodes prior to NGS. For the forward primer pool, four primers with frameshifting bases (to increase library sequence diversity in the initial sequencing cycles) were normalized to 10 µM and pooled evenly to make a B2_3’F1_Nextera_0-6 primer pool:
The following two reverse primers were tested:
SV40_pre_R_Nextera is designed to produce a shorter amplicon (200 bp with Illumina adapters and indices added) and SV40_post_R_Nextera is designed to produce a longer amplicon (290 bp with Illumina adapters and indices added).
An initial test was performed with three different polymerases (Qiagen Q5, KAPA HiFi, and Qiagen Taq) at two different annealing temperatures and with both the B2_3’F1_Nextera/SV40_pre_R_Nextera and B2_3’F1_Nextera/SV40_post_R_Nextera primer sets to determine whether the primers amplify as expected (Supplemental Figure 1). Two different samples were tested:
Pool of 8 putative transformant samples (5 µl each of 1:10 diluted sample, pooled)
OreR (wild type, diluted 1:10)
The following PCR reactions were set up:
Q5 polymerase
2.5 µl template DNA, 1 µl forward primer (10 µM), 1 µl reverse primer (10 µM), 10 µl 2x Q5 Master Mix (Qiagen), 5.5 µl nuclease-free water. Reactions were amplified using the following cycling conditions: 98°C for 30 seconds, followed by 30 cycles of 98°C for 20 seconds, 55°C or 60°C for 15 seconds, 72°C for 30 seconds, followed by 72°C for 5 minutes.
KAPA HiFi polymerase
2.5 µl template DNA, 1 µl forward primer (10 µM), 1 µl reverse primer (10 µM), 10 µl 2x KAPA HiFi ReadyMix (Roche), 5.5 µl nuclease-free water. Reactions were amplified using the following cycling conditions: 95°C for 5 minutes, followed by 30 cycles of 98°C for 20 seconds, 55°C or 60°C for 15 seconds, 72°C for 30 seconds, followed by 72°C for 5 minutes.
Taq polymerase
2.5 µl template DNA, 2 µl 10x Reaction Buffer (Qiagen), 0.2 µl dNTP mix (10 µM), 1 µl forward primer (10 µM), 1 µl reverse primer (10 µM), 0.8 µl MgCl2 (3 mM), 0.1 µl Taq polymerase (Qiagen), 12.4 µl nuclease-free water. Reactions were amplified using the following cycling conditions: 95°C for 5 minutes, followed by 30 cycles of 94°C for 30 seconds, 55°C or 60°C for 30 seconds, 72°C for 30 seconds, followed by 72°C for 5 minutes. Samples were run on a 2% agarose gel to verify amplification products (Supplemental Figure 1).
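The 20 µl reaction compositions listed above can be checked and scaled to a master mix with simple arithmetic. The following is a minimal Python sketch, not part of the published protocol: the per-reaction volumes are copied from the text, while the function name `scale_mix`, the 10% pipetting overage, and the choice to leave template DNA out of the master mix are illustrative assumptions.

```python
# Per-reaction volumes in µl, copied from the text (Q5 and KAPA HiFi reactions
# share the same layout: template, two primers, 2x master mix, water).
HIFI_MIX = {
    "template DNA": 2.5,
    "forward primer (10 µM)": 1.0,
    "reverse primer (10 µM)": 1.0,
    "2x master mix": 10.0,
    "nuclease-free water": 5.5,
}
TAQ_MIX = {
    "template DNA": 2.5,
    "10x reaction buffer": 2.0,
    "dNTP mix": 0.2,
    "forward primer (10 µM)": 1.0,
    "reverse primer (10 µM)": 1.0,
    "MgCl2": 0.8,
    "Taq polymerase": 0.1,
    "nuclease-free water": 12.4,
}


def scale_mix(per_reaction, n_reactions, overage=0.10, exclude=("template DNA",)):
    """Scale per-reaction volumes to a master mix.

    `overage` (assumed 10%) compensates for pipetting loss; components listed
    in `exclude` are added to each tube individually and are left out.
    """
    factor = n_reactions * (1 + overage)
    return {
        name: round(volume * factor, 2)
        for name, volume in per_reaction.items()
        if name not in exclude
    }


if __name__ == "__main__":
    for label, mix in (("HiFi", HIFI_MIX), ("Taq", TAQ_MIX)):
        print(label, "total per reaction (µl):", round(sum(mix.values()), 2))
    print(scale_mix(HIFI_MIX, n_reactions=12))
```

Both recipes sum to 20 µl per reaction, which is a quick sanity check when transcribing volumes into a plate layout.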
Next, the TaG-EM barcode lines were pooled in either an even or staggered manner. To optimize reaction conditions for the barcode measurements, either 5 ng or 50 ng of DNA was amplified in triplicate for each pool for 20, 25, or 30 cycles with KAPA HiFi, using the B2_Nextera_F 0-6 forward primer pool together with either the SV40_pre_R_Nextera or the SV40_post_R_Nextera reverse primer. Next, PCR reactions were diluted 1:100 in nuclease-free water and amplified in the following indexing reactions: 3 µl PCR 1 (1:100 dilution), 1 µl indexing primer 1 (5 µM), 1 µl indexing primer 2 (5 µM), and 5 µl 2x Q5 master mix. The following indexing primers were used (X indicates the positions of the 8 bp indices):
Forward indexing primer:
Reverse indexing primer:
Reactions were amplified using the following cycling conditions: 98°C for 30 seconds, followed by 10 cycles of 98°C for 20 seconds, 55°C for 15 seconds, 72°C for 1 minute, followed by 72°C for 5 minutes. Amplicons were then purified and normalized using a SequalPrep normalization plate (Thermo Fisher Scientific), followed by elution in 20 µl of elution buffer. An even volume of the normalized libraries was pooled and concentrated using 1.8x AmpureXP beads (Beckman Coulter). Pooled libraries were quantified using a Qubit dsDNA high sensitivity assay (Thermo Fisher Scientific) and libraries were normalized to 2 nM for sequencing on the Illumina MiSeq (see below).
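Normalizing a pooled library to 2 nM requires converting the Qubit mass concentration to molarity. The sketch below uses the standard approximation of 660 g/mol per base pair of double-stranded DNA; the function names, the 20 µl final volume, and the example concentration are hypothetical, and the 290 bp length is the adapter-ligated amplicon size given above for the SV40_post_R_Nextera primer.

```python
def library_molarity_nM(conc_ng_per_ul, mean_length_bp, bp_mass=660.0):
    """Convert a dsDNA concentration (ng/µl) to molarity (nM).

    Uses the common approximation of 660 g/mol per base pair.
    """
    return conc_ng_per_ul * 1e6 / (bp_mass * mean_length_bp)


def dilution_to_target(conc_nM, target_nM=2.0, final_volume_ul=20.0):
    """Return (library µl, diluent µl) needed to reach `target_nM`.

    The 20 µl final volume is an arbitrary illustrative default.
    """
    if conc_nM < target_nM:
        raise ValueError("library is already below the target concentration")
    library_ul = final_volume_ul * target_nM / conc_nM
    return library_ul, final_volume_ul - library_ul


if __name__ == "__main__":
    # Hypothetical Qubit reading for a pool of 290 bp amplicons.
    molarity = library_molarity_nM(conc_ng_per_ul=1.914, mean_length_bp=290)
    library_ul, diluent_ul = dilution_to_target(molarity)
    print(f"{molarity:.2f} nM: mix {library_ul:.2f} µl library with {diluent_ul:.2f} µl diluent")
```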
Structured fly pool experiments
Male or female flies from TaG-EM barcode lines were pooled in either an even or staggered manner (Figure 2A). For the even pools, three independently pooled samples were constructed in order to assess sample-to-sample variability. DNA was extracted from these structured pools using a protocol adapted from Huang et al. (2009), using homemade SPRI beads (DeAngelis et al., 1995) in the last purification step, and amplified in triplicate using 2.5 µl template DNA (50 ng), 1 µl B2_Nextera_F 0-6 primers (10 µM), 1 µl SV40_post_R_Nextera (10 µM), 10 µl 2x KAPA HiFi ReadyMix (Roche), 5.5 µl nuclease-free water. Reactions were amplified using the following cycling conditions: 98°C for 5 minutes, followed by 30 cycles of 98°C for 20 seconds, 60°C for 15 seconds, 72°C for 30 seconds, followed by 72°C for 5 minutes. Amplicons were indexed, normalized, quantified, and prepared for sequencing as described above.
A pair of white LED strip lights with Muzata LED Strip Light Diffusers (U1SW WW 1M, LU1) were mounted within a light-tight box and controlled using a Vilros Uno Rev 3 microcontroller. Test tubes containing flies were held in place with Acoustic Foam Panels (1”x12”x12”, ALPOWL). Videos and images were acquired using an Arducam 1080P Day & Night Vision USB Camera with an IR filter and using Photo Booth software (Apple). Wild type and norpA flies carrying one of 4 different TaG-EM barcodes were tested in three independent experimental replicates. 20 male flies of each genotype were transferred into 25 mm x 150 mm glass test tubes, incubated at 34°C for 10 minutes, and then run in the phototaxis assay, where a light at one end of the chamber was turned on for 30 seconds. Videos of all tests were recorded through the end of the 30 second light pulse. Videos were independently scored by two observers to determine the number of flies in the light-facing or dark-facing tubes and the results were averaged. A preference index (P.I.) was calculated using the following formula: [(number of flies in light tube) - (number of flies in dark tube)]/(total number of flies).
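The preference index formula can be written out as a short function. This is a minimal Python sketch rather than the authors' scoring code: the function names are hypothetical, the total is assumed to be the sum of the flies in the two tubes, and the two observers' counts are assumed to be averaged before the index is computed (the text does not specify whether counts or indices were averaged).

```python
from statistics import mean


def preference_index(n_light, n_dark):
    """P.I. = (light - dark) / total, ranging from -1 (all dark) to +1 (all light)."""
    total = n_light + n_dark
    if total == 0:
        raise ValueError("no flies were scored")
    return (n_light - n_dark) / total


def scored_preference_index(observer_counts):
    """Average (n_light, n_dark) counts across observers, then compute the P.I.

    Averaging counts first is an assumption made for this sketch.
    """
    n_light = mean(counts[0] for counts in observer_counts)
    n_dark = mean(counts[1] for counts in observer_counts)
    return preference_index(n_light, n_dark)


if __name__ == "__main__":
    # Hypothetical scores from two observers for one tube of 20 flies.
    print(scored_preference_index([(18, 2), (17, 3)]))
```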
For TaG-EM barcode-based phototaxis measurements, the following genotypes were consolidated into a single test tube:
These pools were individually incubated at 34°C for 10 minutes and then run in the phototaxis assay. Videos of all tests were recorded, and at the end of a 30 second light pulse the two test tubes were quickly separated and capped. Flies in each of these tubes were counted, then DNA was extracted from the flies from the light-facing or dark-facing tubes and amplified using 2.5 µl template DNA (50 ng), 1 µl B2_Nextera_F 0-6 primers (10 µM), 1 µl SV40_post_R_Nextera (10 µM), 10 µl 2x KAPA HiFi ReadyMix (Roche), 5.5 µl nuclease-free water. Reactions were amplified using the following cycling conditions: 95°C for 5 minutes, followed by 30 cycles of 98°C for 20 seconds, 60°C for 15 seconds, 72°C for 30 seconds, followed by 72°C for 5 minutes. Amplicons were indexed, normalized, quantified, and prepared for sequencing as described above.
Newly hatched flies (males and females) from three barcode lines were collected at one-week intervals during 4 consecutive weeks (12 barcode lines in total). Fresh fly food was provided every 3-4 days. Ten days after the last collection, 10 females from each barcode line were taken and pooled together in a collection cage (10 females x 12 barcode lines = 120 females). The remaining females from each barcode line were separated from the males and put in individual collection cages. Two days later, the experiment was started and run for 3 consecutive days. Each day a 1-1.5 hour pre-collection was followed by a 6 hour collection, both at 25°C. 100 embryos from each individual collection plate were transferred to new plates and incubated for 2 days at 18°C. The number of hatched larvae was counted and used to calculate the egg survival rate. The pooled collection plate was also incubated at 18°C and the next day the embryos were dechorionated and frozen. The 12 individual collection plates were kept at 4°C and the number of embryos was counted over the following days. For the barcode measurements, DNA was extracted from the embryos and amplified using 2.5 µl template DNA (50 ng), 1 µl B2_Nextera_F 0-6 primers (10 µM), 1 µl SV40_post_R_Nextera (10 µM), 10 µl 2x KAPA HiFi ReadyMix (Roche), 5.5 µl nuclease-free water. Reactions were amplified using the following cycling conditions: 95°C for 5 minutes, followed by 30 cycles of 98°C for 20 seconds, 60°C for 15 seconds, 72°C for 30 seconds, followed by 72°C for 5 minutes. Amplicons were indexed, normalized, quantified, and prepared for sequencing as described above.
Cell dissociation and isolation
Midguts from 3rd instar larvae were dissected in phosphate-buffered saline (PBS) and transferred to microcentrifuge tubes on ice containing PBS + 30% normal goat serum (NGS). After dissection, 150 µL of 2.7 mg/mL elastase was added to each sample tube. The tubes were then incubated at 27°C for one hour. During incubation, samples were mixed by pipetting ∼30 times every 15 minutes to improve elastase dissociation of the cells. Samples were then filtered through a 40 µm FlowMi tip filter (Bel-Art) to reduce debris. Afterwards, the samples were quantified on the LUNA-FL Dual Fluorescence Cell Counter (Logos Biosystems) using 9 µL of sample to 1 µL AO/PI dye to ensure there were enough viable cells for flow sorting.
Once quantified, the samples were brought up to a volume of ∼1.1 mL with the PBS + 30% NGS solution to facilitate flow sorting. The samples were then fluorescently sorted on a FACSAria II Cell Sorter (BD Biosciences) to isolate GFP+ cells. Following sorting, samples were centrifuged at 300 x g for 10 minutes to concentrate the cells. The supernatant was aspirated off until 50 µL cell concentrate remained in each sample. Then, the samples were carefully resuspended using wide bore pipette tips before being combined into one sample tube. This sample was quantified on the LUNA-FL Dual Fluorescence Cell Counter (Logos Biosystems) as described above. If necessary, cells were centrifuged, concentrated, and re-counted.
Preparation of single-cell sequencing libraries
The resulting pool was prepared for sequencing following the 10X Genomics Single Cell 3’ protocol (version CG000315 Rev C). After cDNA creation, custom nested primers were used to attempt PCR enrichment of the TaG-EM barcodes. The remaining cDNA followed the 10X 3’ protocol without modification, while the enriched TaG-EM portion followed BioLegend’s “TotalSeq-A Antibodies and Cell Hashing with 10X Single Cell 3’ Reagents Kit v3 or v3.1” protocol starting at step III. The resulting 3’ gene expression library and TaG-EM enrichment library were sequenced together following Scenario 1 of the BioLegend protocol. Later, the TaG-EM library was sequenced alone following Scenario 2 from the same protocol.
Libraries for TaG-EM barcode analysis from structured pools or from phototaxis or oviposition experiments were denatured with NaOH and prepared for sequencing according to the protocols described in the Illumina MiSeq Denature and Dilute Libraries Guides. Single-cell libraries were sequenced on the Illumina NextSeq 2000 or Illumina NovaSeq 6000.
Demultiplexed fastq files were generated using bcl2fastq or bcl-convert. TaG-EM barcode data was analyzed using custom R and Python scripts and BioPython (Cock et al., 2009). Leading primer sequences were trimmed using cutadapt (Martin, 2011) and the first 14 bp of each trimmed read were compared to a barcode reference file, with a maximum of 2 mismatches allowed, using a custom script that is available via GitHub: https://github.com/darylgohl/TaG-EM.
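The barcode-matching step can be illustrated with a short, self-contained sketch. This is not the script from the repository above; it is a minimal Python illustration of comparing the first 14 bp of a trimmed read to a set of reference barcodes with at most 2 mismatches. The function names, the example barcode sequences, and the decision to leave reads that match two barcodes equally well unassigned are all assumptions made for this example.

```python
from collections import Counter


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def assign_barcode(read, reference, barcode_len=14, max_mismatches=2):
    """Return the name of the best-matching barcode, or None if unassigned.

    `reference` maps barcode names to sequences of at least `barcode_len` bp.
    Ties between equally good matches are left unassigned (an assumption).
    """
    query = read[:barcode_len].upper()
    if len(query) < barcode_len:
        return None
    best_name, best_dist, tied = None, max_mismatches + 1, False
    for name, seq in reference.items():
        dist = hamming(query, seq[:barcode_len].upper())
        if dist < best_dist:
            best_name, best_dist, tied = name, dist, False
        elif dist == best_dist and best_name is not None:
            tied = True
    return None if tied else best_name


def count_barcodes(reads, reference):
    """Tally barcode assignments; unmatched reads are counted as 'unassigned'."""
    return Counter(assign_barcode(read, reference) or "unassigned" for read in reads)


if __name__ == "__main__":
    # Hypothetical 14 bp barcodes, not the real TaG-EM sequences.
    reference = {"BC01": "ACGTACGTACGTAC", "BC02": "TTTTGGGGCCCCAA"}
    reads = ["ACGTACGTACGTACGGGG", "ACGAACGTACGTTCGGGG", "TTTTGGGGCCCCAATTTT"]
    print(count_barcodes(reads, reference))
```

In practice the reads would come from the cutadapt-trimmed fastq files and the reference from the barcode fasta file mentioned in the text.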
Data was initially mapped and analyzed using the Cellranger software package (version 7.0.1, 10X Genomics). A custom reference Drosophila genome was created by starting with the BDGP6.28 reference genome assembly and Ensembl gene annotations. Custom gene definitions for each of the TaG-EM barcodes were added to the fasta genome file and gtf gene annotation file, and a Cellranger reference package was generated with the cellranger mkref command. Subsequent analysis was performed using Scanpy (version 1.9.3) (Wolf et al., 2018). Cells expressing fewer than 200 genes and genes expressed in fewer than three cells were filtered from the expression matrix. Next, cells expressing more than 4,000 genes, cells expressing more than 15% mitochondrial transcripts, and cells expressing more than 30% ribosomal (RpL) genes were filtered out. Expression counts were log normalized and highly variable genes were identified (min_mean=0.0125, max_mean=3, min_disp=0.5). Cells expressing multiple TaG-EM barcodes were identified and removed, and cells were clustered using the Leiden algorithm (Traag et al., 2019) and projected with UMAP. Scanpy was used to call and visualize differentially expressed genes.
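The cell-level filtering thresholds can be summarized as a single predicate. The sketch below is plain Python for illustration only and does not use the Scanpy API; the dictionary keys, the function name, and the rule that any non-zero barcode count means the barcode is "expressed" are assumptions. It also applies the multi-barcode check together with the quality filters, whereas the text applies it after normalization.

```python
def passes_qc(cell, min_genes=200, max_genes=4000,
              max_pct_mito=15.0, max_pct_ribo=30.0):
    """Return True if a cell survives the filters described in the text.

    `cell` is a dict with hypothetical keys:
      n_genes        number of genes detected in the cell
      pct_mito       percent of counts from mitochondrial transcripts
      pct_ribo       percent of counts from ribosomal (RpL) genes
      barcode_counts mapping of TaG-EM barcode name to UMI count
    """
    # Assumption: a barcode counts as expressed if it has at least one UMI.
    n_barcodes = sum(1 for count in cell["barcode_counts"].values() if count > 0)
    return (
        min_genes <= cell["n_genes"] <= max_genes
        and cell["pct_mito"] <= max_pct_mito
        and cell["pct_ribo"] <= max_pct_ribo
        and n_barcodes <= 1  # cells with multiple barcodes are removed
    )


if __name__ == "__main__":
    # Hypothetical cells for illustration.
    cells = [
        {"n_genes": 1500, "pct_mito": 5.0, "pct_ribo": 10.0,
         "barcode_counts": {"BC01": 12, "BC02": 0}},
        {"n_genes": 1500, "pct_mito": 5.0, "pct_ribo": 10.0,
         "barcode_counts": {"BC01": 12, "BC02": 7}},
    ]
    print([passes_qc(cell) for cell in cells])
```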
Availability of data, code, and materials
Sequencing data for this project is available through the National Center for Biotechnology Information (NCBI) Sequence Read Archive under BioProject PRJNA912199. Fly stocks containing each of the 20 TaG-EM barcodes together with an additional UAS hexameric GFP expression construct will be available from the Bloomington Drosophila Stock Center. The TaG-EM barcode analysis script and barcode reference fasta file are available via GitHub: https://github.com/darylgohl/TaG-EM.
Competing interests: None to declare.
This study was supported by a grant from the Winston and Maxine Wallin Neuroscience Discovery Fund.
J.B.M. conceived and designed experiments, conducted experiments, analyzed data, and revised the manuscript; M.D. conducted experiments and revised the manuscript; L.G. conducted experiments and revised the manuscript; B.A. conducted experiments and revised the manuscript; J.G. analyzed data and revised the manuscript; D.M.G. conceived and designed experiments, conducted experiments, analyzed data, and wrote and revised the manuscript.
We thank our colleagues in the University of Minnesota Genomics Center (RRID:SCR_012413), in particular Aaron Becker, Dylan Cole, and Logan Silber for help with DNA sequencing, Emma Stanley, Fernanda Rodriguez, and Patrick Grady for assistance and advice on single-cell sequencing, and Kenneth Beckman, Andrew Alegria, Troy Louwagie, and Aaron Barnes for helpful feedback and discussions. This work was supported by the resources and staff at the University of Minnesota University Imaging Centers (RRID:SCR_020997). The authors acknowledge the Minnesota Supercomputing Institute (MSI) at the University of Minnesota for providing resources that contributed to the research results reported within this paper. URL: http://www.msi.umn.edu. Stocks obtained from the Bloomington Drosophila Stock Center (NIH P40OD018537) were used in this study.
- Identification of split-GAL4 drivers and enhancers that allow regional cell type manipulations of the drosophila melanogaster intestine. Genetics 216:891–903. https://doi.org/10.1534/genetics.120.303625
- The neuronal architecture of the mushroom body provides a logic for associative learning. eLife 3. https://doi.org/10.7554/ELIFE.04577
- Studying clonal dynamics in response to cancer therapy using high-complexity barcoding. Nature Medicine 21:440–448. https://doi.org/10.1038/nm.3841
- High-Throughput Mapping of Long-Range Neuronal Projection Using In Situ Sequencing. Cell 179:772–786. https://doi.org/10.1016/j.cell.2019.09.023
- Multiplexing Methods for Simultaneous Large-Scale Transcriptomic Profiling of Samples at Single-Cell Resolution. Advanced Science 8. https://doi.org/10.1002/advs.202101229
- Barcoded viral tracing of single-cell interactions in central nervous system inflammation. Science 372.
- Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics (Oxford, England) 25:1422–3. https://doi.org/10.1093/bioinformatics/btp163
- The variability between individuals as a measure of senescence: A study of the number of eggs laid and the percentage of hatched eggs in the case of Drosophila melanogaster. Experimental Gerontology 10:17–25. https://doi.org/10.1016/0531-5565(75)90011-X
- A genetic, genomic, and computational resource for exploring neural circuit function. eLife 9.
- Solid-phase reversible immobilization for the isolation of PCR products. Nucleic Acids Res 23:4742–4743.
- A versatile in vivo system for directed dissection of gene expression patterns. Nature Methods 8:231–237. https://doi.org/10.1038/nmeth.1561
- Construction of transgenic Drosophila by using the site-specific integrase from phage phiC31. Genetics 166:1775–82. https://doi.org/10.1534/genetics.166.4.1775
- Quick preparation of genomic DNA from Drosophila. Cold Spring Harb Protoc 2009. https://doi.org/10.1101/pdb.prot5198
- A cell atlas of the adult Drosophila midgut. Proceedings of the National Academy of Sciences of the United States of America 117:1514–1523. https://doi.org/10.1073/pnas.1916820117
- The molecular genetics of embryonic pattern formation in Drosophila. Nature 335:25–34. https://doi.org/10.1038/335025a0
- Droplet barcoding for single-cell transcriptomics applied to embryonic stem cells. Cell 161:1187–1201. https://doi.org/10.1016/j.cell.2015.04.044
- Transcriptional Programs of Circuit Assembly in the Drosophila Visual System. Neuron 108:1045–1057. https://doi.org/10.1016/J.NEURON.2020.10.006
- Fluorescent in situ sequencing (FISSEQ) of RNA for gene expression profiling in intact cells and tissues. Nature Protocols 10:442–458. https://doi.org/10.1038/nprot.2014.191
- The promise of spatial transcriptomics for neuroscience in the era of molecular cell typing. Science 358:64–69. https://doi.org/10.1126/science.aan6827
- Fly Cell Atlas: A single-nucleus transcriptomic atlas of the adult fruit fly. Science 375. https://doi.org/10.1126/science.abk2432
- A transcriptomic taxonomy of drosophila circadian neurons around the clock. eLife 10:1–19. https://doi.org/10.7554/eLife.63056
- Highly Parallel Genome-wide Expression Profiling of Individual Cells Using Nanoliter Droplets. Cell 161:1202–1214. https://doi.org/10.1016/j.cell.2015.05.002
- mRNA localization: gene expression in the spatial dimension. Cell 136:719–30. https://doi.org/10.1016/j.cell.2009.01.044
- Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal 17:10–12.
- Intra-tumour heterogeneity: a looking glass for cancer? Nature Reviews Cancer 12:323–334. https://doi.org/10.1038/nrc3261
- Whole-organism lineage tracing by combinatorial and cumulative genome editing. Science 353. https://doi.org/10.1126/science.aaf7907
- The functional organization of descending sensory-motor pathways in drosophila. eLife 7.
- Tools for neuroanatomy and neurogenetics in Drosophila. Proceedings of the National Academy of Sciences of the United States of America 105:9715–20. https://doi.org/10.1073/pnas.0803697105
- Refinement of tools for targeted gene expression in Drosophila. Genetics 186:735–55. https://doi.org/10.1534/genetics.110.119917
- Quantitative Models of Developmental Pattern Formation. Developmental Cell 11:289–300. https://doi.org/10.1016/J.DEVCEL.2006.08.006
- Genetic transformation of Drosophila with transposable element vectors. Science (New York, NY) 218:348–53. https://doi.org/10.1126/science.6289436
- Hexameric GFP and mCherry reporters for the Drosophila GAL4, Q, and LexA transcription systems. Genetics 196:951–960. https://doi.org/10.1534/GENETICS.113.161141/-/DC1/GENETICS.113.161141-5.PDF
- Quantitative phenotyping via deep barcode sequencing. Genome research 19:1836–42. https://doi.org/10.1101/gr.093955.109
- Simultaneous epitope and transcriptome measurement in single cells. Nature Methods 14:865–868. https://doi.org/10.1038/nmeth.4380
- Cell Hashing with barcoded antibodies enables multiplexing and doublet detection for single cell genomics. Genome Biology 19:1–12. https://doi.org/10.1186/s13059-018-1603-1
- Dpp gradient formation in the drosophila wing imaginal disc. Cell 103:971–980. https://doi.org/10.1016/S0092-8674(00)00199-9
- A High-Resolution Spatiotemporal Atlas of Gene Expression of the Developing Mouse Brain. Neuron 83:309–323. https://doi.org/10.1016/J.NEURON.2014.05.033
- From Louvain to Leiden: guaranteeing well-connected communities. Sci Rep 9. https://doi.org/10.1038/s41598-019-41695-z
- Tn-seq: high-throughput parallel sequencing for fitness and genetic interaction studies in microorganisms. Nature methods 6:767–72. https://doi.org/10.1038/nmeth.1377
- MiMIC: a highly versatile transposon insertion resource for engineering Drosophila melanogaster genes. Nature methods 8:737–43.
- SCANPY: large-scale single-cell gene expression data analysis. Genome Biology 19. https://doi.org/10.1186/s13059-017-1382-0
- Massively parallel digital transcriptional profiling of single cells. Nature Communications 8:1–12. https://doi.org/10.1038/ncomms14049