Gene regulatory network reconstruction using single-cell RNA sequencing of barcoded genotypes in diverse environments

Abstract
eLife digest
Introduction
Results
Discussion
Materials and methods
Data availability
References
Article and author information
Metrics

Abstract

Understanding how gene expression programs are controlled requires identifying regulatory relationships between transcription factors and target genes. Gene regulatory networks are typically constructed from gene expression data acquired following genetic perturbation or environmental stimulus. Single-cell RNA sequencing (scRNAseq) captures the gene expression state of thousands of individual cells in a single experiment, offering advantages in combinatorial experimental design, large numbers of independent measurements, and accessing the interaction between the cell cycle and environmental responses that is hidden by population-level analysis of gene expression. To leverage these advantages, we developed a method for scRNAseq in budding yeast (Saccharomyces cerevisiae). We pooled diverse transcriptionally barcoded gene deletion mutants in 11 different environmental conditions and determined their expression state by sequencing 38,285 individual cells. We benchmarked a framework for learning gene regulatory networks from scRNAseq data that incorporates multitask learning and constructed a global gene regulatory network comprising 12,228 interactions.

eLife digest

Organisms switch their genes on and off to adapt to changing environments. This takes place thanks to complex networks of regulators that control which genes are actively ‘read’ by the cell to create the RNA molecules that are needed at the time. Piecing together these networks is key to fully understand the inner workings of living organisms, and how to potentially modify or artificially create them.

Single-cell RNA sequencing is a powerful new tool that can measure which genes are turned on (or ‘expressed’) in an individual cell. Datasets with millions of gene expression profiles for individual cells now exist for organisms such as mice or humans. Yet, it is difficult to use these data to reconstruct networks of regulators; this is partly because scientists are not sure if the computational methods normally used to build these networks also work for single-cell RNA sequencing data.

One way to check if this is the case is to use the methods on single-cell datasets from organisms where the networks of regulators are already known, and check whether the computational tools help to reach the same conclusion. Unfortunately, the regulatory networks in the organisms for which scientists have a lot of single-cell RNA sequencing data are still poorly known. There are living beings in which the networks are well characterised – such as yeast – but it has been difficult to do single-cell sequencing in them at the scale seen in other organisms.

Jackson, Castro et al. first adapted a system for single-cell sequencing so that it would work in yeast. This generated a gene expression dataset of over 40,000 yeast cells. They then used a computational method (called the Inferelator) on these data to construct networks of regulators, and the results showed that the method performed well. This allowed Jackson, Castro et al. to start mapping how different networks connect, for example those that control the response to the environment and cell division. This is one of the benefits of single-cell RNA methods: cell division for example is not a process that can be examined at the level of a population, since the cells may all be at different life stages. In the future, the dataset will also be useful to scientists to benchmark a variety of single cell computational tools.

Introduction

Elucidating relationships between genes, and the products they encode, remains one of the central challenges in experimental and computational biology. A gene regulatory network (GRN) is a directed graph in which regulators of gene expression are connected to target gene nodes by interaction edges. Regulators of gene expression include transcription factors (TF) which can act as activators and repressors, RNA binding proteins, and regulatory RNAs. Identifying regulatory relationships between transcriptional regulators and their targets is essential for understanding biological phenomena ranging from cell growth and division to cell differentiation and development (Davidson, 2012). Reconstruction of GRNs is required to understand how gene expression dysregulation contributes to cancer and complex heritable diseases (Barabási et al., 2011; Hu et al., 2016).

Genome-scale methods provide an efficient means of identifying gene regulatory relationships. Efforts of the past two decades have resulted in the development of a variety of experimental and computational methods that leverage advances in technology and machine learning for constructing GRNs. Previously, we developed a method for inferring transcriptional regulatory networks based on regression with regularization that we have called the Inferelator (Bonneau et al., 2006; Ciofani et al., 2012). This method takes as inputs gene expression data and sources of prior information, and outputs regulatory relationships between transcription factors and their target genes that explain the observed gene expression levels. Subsequent work has enhanced this approach by selecting regulators for each gene more effectively (Madar et al., 2010), incorporating orthogonal data types that can be used to generate constraints on network structure (Greenfield et al., 2013), and explicitly estimating latent biophysical parameters including transcription factor activity (Arrieta-Ortiz et al., 2015; Fu et al., 2011) and mRNA decay rates (Tchourine et al., 2018). We have successfully applied this approach to construct GRNs from gene expression data acquired from variation across time, conditions, and genotypes in microbes (Arrieta-Ortiz et al., 2015; Tchourine et al., 2018), plants (Wilkins et al., 2016), and mammalian cells (Ciofani et al., 2012; Miraldi et al., 2019).

Recently, single-cell RNA sequencing (scRNAseq) has exploded in popularity with the development of droplet systems for rapid encapsulation and labeling of thousands of cells in parallel. The DROP-seq system (Macosko et al., 2015) based on bead capture, and the inDrop (Zilionis et al., 2017) and 10x Genomics (Zheng et al., 2017) systems based on hydrogel beads, provide a facile means of generating RNA sequencing data for tens of thousands of individual cells. Although scRNAseq has primarily been used for defining cell types and states, this technology holds great potential for efficient construction of GRNs (Hwang et al., 2018). By combining genetic perturbation of transcriptional regulators using CRISPR/Cas9 with scRNAseq, mixtures of genetic perturbations can be assayed in a single reaction (Adamson et al., 2016; Dixit et al., 2016; Jaitin et al., 2016). This approach, known as Perturb-seq, presents a new opportunity for efficiently inferring GRNs from thousands of individual cells in which different regulators have been disrupted. There are considerable advantages in both scalability and detection of intra-sample heterogeneity with Perturb-seq, but quantifying the the effectiveness of CRISPR/Cas9 targeting in individual cells and distinguishing gene expression variability from noise inherent to mRNA undersampling in scRNAseq (Brennecke et al., 2013; Grün et al., 2014) present technical challenges. Computational methods to take advantage of scRNAseq data for inferring GRNs are under active development (Aibar et al., 2017; Chan et al., 2017; van Dijk et al., 2018). However, benchmarking these methods is difficult; in the absence of a known GRN, model performance is often estimated using simulated data (Chen and Mar, 2018), and issues regarding the appropriate experimental and computational approaches to GRN construction from scRNAseq data remain unresolved.

The budding yeast Saccharomyces cerevisiae is ideally suited to constructing GRNs from experimental data and benchmarking computational methods. Decades of work have provided a plethora of transcriptional regulatory data comprising functional and biochemical information (de Boer and Hughes, 2012; Teixeira et al., 2018). As a result, yeast is well suited to constructing GRNs using methods that leverage the rich available information and for assessing the performance of those methods by comparison to experimentally validated interactions (Ma et al., 2014; Tchourine et al., 2018). Budding yeast presents several technical challenges for single cell analysis, and as a result scRNAseq methods for budding yeast reported to date (Gasch et al., 2017; Nadal-Ribelles et al., 2019) yield far fewer individual cells (~10²) than are now routinely generated for mammalian studies (>10⁴). The limitations of existing scRNAseq methods for budding yeast cells limits our ability to investigate eukaryotic cell biology as many signaling and regulatory pathways are highly conserved in yeast (Carmona-Gutierrez et al., 2010; Gray et al., 2004), including the Ras/protein kinase A (PKA), AMP Kinase (AMPK) and target of rapamycin (TOR) pathways (González and Hall, 2017; Loewith and Hall, 2011). However, recent work has successfully established single cell sequencing in the fission yeast Schizosaccharomyces pombe (Saint et al., 2019).

In budding yeast, the TOR complex 1 (TORC1 or mTORC1 in human) coordinates the transcriptional response to changes in nitrogen sources (Godard et al., 2007; Rødkaer and Faergeman, 2014). Controlling this response are four major TF groups, which are regulated by diverse post-transcriptional processes. The Nitrogen Catabolite Repression (NCR) pathway, which is regulated principally by TORC1, consists of the TFs GAT1, GLN3, DAL80, and GZF3 (Hofman-Bang, 1999), and is responsible for suppressing the utilization of non-preferred nitrogen sources when preferred nitrogen sources are available. Gat1 and Gln3 are localized to the cytoplasm until activation results in relocalization to the nucleus (Cox et al., 2000), where they then compete with Dal80 and Gzf3 for DNA binding motifs (Georis et al., 2009). The General Amino Acid Control (GAAC) pathway consists of the TF GCN4 (Hinnebusch, 2005), and is responsible for activating the response to amino acid starvation, as detected by increases in uncharged tRNA levels. Gcn4 activity is translationally controlled by ribosomal pausing at upstream open reading frames in the 5’ untranslated region (Mueller and Hinnebusch, 1986). The retrograde pathway, consisting of the TF heterodimer RTG1 and RTG3, is responsible for altering expression of metabolic and biosynthetic genes in response to mitochondrial stress (Jia et al., 1997; Liao and Butow, 1993) or environmental stress (Ruiz-Roig et al., 2012). The Rtg1/Rtg3 complex is localized to the cytoplasm until activation, upon which they translocate into the nucleus (Komeili et al., 2000). The Ssy1-Ptr3-Ssy5-sensing (SPS) pathway (Ljungdahl, 2009), consists of the TFs STP1 and STP2, and is responsible for altering transporter expression (Didion et al., 1998; Iraqui et al., 1999) in response to changes in extracellular environment. Stp1 and Stp2 are anchored to the plasma membrane until the SPS sensor triggers proteolytic cleavage of their anchoring domain and releases them for nuclear import (Andréasson and Ljungdahl, 2002).

Construction of GRNs based on the transcription factors in these pathways has had mixed success; the high redundancy of the NCR pathway has proven challenging to deconvolute (Milias-Argeitis et al., 2016). The GAAC pathway is more straightforward, although separating direct and indirect regulation remains difficult, even with high-quality experimental data (Mittal et al., 2017). As a result, a comprehensive GRN for nitrogen metabolism has remained elusive, despite successes in identifying genes that respond to changes in environmental nitrogen (Airoldi et al., 2016) and identification of post-transcriptional control mechanisms that underlie these changes (Miller et al., 2018).

Many signalling regulators involved in environmental response interact with cell cycle programs (Johnston et al., 1977; Talarek et al., 2017), including the TOR pathway (Zinzalla et al., 2007); however, how regulation of the mitotic cell cycle and environmentally responsive gene expression is coordinated is unknown. The regulation of nitrogen responsive gene expression in yeast is well-suited to the development of generalizable methods as the degree of TF redundancy, post-transcriptional regulation of TF activity, which precludes straightforward relationships between TF abundance and target expression, and multifactorial impact on gene expression, including intrinsic and extrinsic processes and stimuli, provide a tractable model system for addressing these challenges in higher eukaryotes.

Here, we have developed a method for scRNAseq in budding yeast using Chromium droplet-based single cell encapsulation (10x Genomics). We engineered TF gene deletions by precisely excising the entire TF open reading frame and introducing a unique transcriptional barcode that enables multiplexed analysis of genotypes using scRNAseq. We pooled 72 different strains, corresponding to 12 different genotypes, and determined their gene expression profiles in 11 conditions using scRNAseq analysis of 38,000 cells. We show that our method enables identification of cells from complex mixtures of genotypes in asynchronous cultures that correspond to specific mutants, and to specific stages of the cell cycle. Identification of mutants can be used to identify differentially expressed genes between genotypes providing an efficient means of multiplexed gene expression analysis. We used scRNAseq data in yeast to benchmark computational aspects of GRN reconstruction, and show that multi-task learning integrates information across environmental conditions without requiring complex normalization, resulting in improved GRN reconstruction. We find that imputation of missing data does not improve GRN reconstruction and can lead to prediction of spurious interactions. Using scRNAseq data, we constructed a global GRN for budding yeast comprising 12,228 regulatory interactions. We discover novel regulatory relationships, including previously unknown connections between regulators of cell cycle gene expression and nitrogen responsive gene expression. Our study provides a generalizable framework for GRN reconstruction from scRNAseq, a rich data set that will enable benchmarking of future computational methods, and establishes the use of droplet-based scRNAseq analysis of multiplexed genotypes in yeast.

Results

Engineering a library of Prototrophic, Transcriptionally-Barcoded Gene Deletion Strains

The yeast gene knockout collection (Giaever et al., 2002) facilitates pooled analysis of mutants using unique DNA barcode sequences that identify each gene deletion strain, but these barcodes are only present at the DNA level, precluding their use with scRNAseq. Therefore, we constructed an array of prototrophic, diploid yeast strains with homozygous deletions of TFs that control distinct regulatory modules: 1) NCR, 2) GAAC, 3) SPS-sensing, and 4) the retrograde pathway that coordinately control nitrogen-related gene expression in yeast. We engineered eleven different TF knockout genotypes, using six independently constructed biological replicates for each genotype. In addition, we constructed six biological replicates of the wild-type control in which we deleted the neutral HO locus. Genes were deleted using a modified kanMX cassette such that each of the 72 strains contains a unique transcriptional barcode in the 3’ untranslated region (UTR) of the G418 resistance gene, that can be recovered by RNA sequencing (Figure 1, Figure 1—figure supplement 1A). Homozygous diploids were constructed by mating to a strain containing the same TF knockout marked with a nourseothricin drug resistance cassette. On rich media plates, the 72 strains have an approximately wild-type growth; under nutritional stress, some TF knockouts exhibit growth advantages or disadvantages (Figure 1—figure supplement 1B).

Figure 1 with 1 supplement see all

Download asset Open asset

Single-Cell RNA-Seq Experimental Workflow in *Saccharomyces Cerevisiae.*

Schematic workflows for: (A) Growth of a transcriptionally-barcoded pool of 11 nitrogen metabolism transcription factor (TF) knockout strains and a wild-type control strain each analyzed with six biological replicates (B) Synthesis in microfluidic droplets of single-cell cDNA with a cell-specific index sequence (IDX) attached to the oligo-dT primer, and a common template switch oligo (TSO). cDNA is processed for whole-transcriptome libraries, to quantify gene expression. In parallel, PCR products are amplified containing the genotype-specific transcriptional barcode (BC) encoded on the Kan^R antibiotic resistance marker mRNA, to identify cell genotype. Expression DNA libraries and PCR products are separately indexed for multiplexed sequencing (C) Processing of single-cell sequencing data using Unique Molecular Identifiers (UMI) into a count matrix which is used to learn a gene regulatory network using multi-task network inference from several different growth conditions.

Single-Cell RNA sequencing of pooled libraries in diverse growth conditions

ScRNAseq in yeast presents several challenges: cells are small (40–90 µm³), enclosed in a polysaccharide-rich cell wall, and contain fewer mRNAs per cell (40 k-60k) than higher eukaryotes. We developed and validated a protocol using the droplet-based 10x genomics chromium platform, and it used it to perform scRNAseq of the pool of TF knockouts in eleven growth conditions that provide a range of metabolic challenges (Table 1). In addition to variable nitrogen sources in minimal media with excess [MM] and limiting [NLIM-NH₄, NLIM-GLN, NLIM-PRO, NLIM-UREA] nitrogen, some conditions result in fermentative metabolism of glucose in rich media [YPD], and inhibition of the TOR signaling pathway in rich media by the small molecule rapamycin [RAPA]. We also studied conditions that require respiratory metabolism of ethanol in rich [YPEtOH] and minimal media [MMEtOH], and in rich media after sugars had been fully metabolized to ethanol and cells have undergone the diauxic shift [DIAUXY]. We tested two different starvation conditions, carbon [CSTARVE] and nitrogen starvation [NSTARVE]; however, the latter condition did not pass quality control during single-cell transcriptome library preparation and was discarded.

Table 1

Environmental Growth Conditions.

Environmental growth conditions are listed with their respective nitrogen and carbon sources. Yeast Extract + Peptone (YP) is a rich, complex nitrogen source. YP + Dextrose [YPD] is standard yeast rich media. Minimal media contains a standard base of trace metals, vitamins, and salts. All cultures were harvested 4 hr after inoculation, except for the culture harvested after the diauxic shift [DIAUXY], which was harvested 10 hr after inoculation. Rapamycin was added to YPD in the [RAPA] culture 30 min prior to harvest. Specific media formulations are listed in Supplementary file 1-Supplemental Table 4.

Growth condition	Abbrv.	Nitrogen source	Carbon source
Yeast Extract, Peptone, Glucose	YPD	YP	D-Glucose
YPD (Harvested after Post-Diauxic Shift)	DIAUXY	YP	D-Glucose
YPD + 200 ng/mL Rapamycin	RAPA	YP	D-Glucose
Yeast Extract, Peptone, Ethanol	YPEtOH	YP	Ethanol
Minimal Media (Glucose)	MMD	20 mM (NH₄)₂SO₄	D-Glucose
Minimal Media (Ethanol)	MMEtOH	20 mM (NH₄)₂SO₄	Ethanol
Nitrogen Limited Minimal Media (with Glutamine)	NLIM-GLN	0.8 mM L-Glutamine	D-Glucose
Nitrogen Limited Minimal Media (with Proline)	NLIM-PRO	0.8 mM L-Proline	D-Glucose
Nitrogen Limited Minimal Media (with NH₄)	NLIM-NH₄	0.8 mM (NH₄)₂SO₄	D-Glucose
Nitrogen Limited Minimal Media (with Urea)	NLIM-UREA	0.8 mM Urea	D-Glucose
Carbon Starvation	CSTARVE	1 mM (NH₄)₂SO₄	None

Cells from the eleven different conditions were sequenced and processed using cellranger (10x genomics) and our custom analysis pipeline (fastqToMat0), yielding a digital expression matrix (Source code 1) in which each cell is annotated with the environmental growth condition and genotype. Genotype-specific barcodes facilitate identification and removal of droplets that have multiple cells (doublets) by determining cell IDs that have more than one annotated genotype. Using our pool of 72 strains, we detect and remove 98.5% of doublets. PCR artifacts and duplicates are removed using Unique Molecular Identifiers (UMIs) (Kivioja et al., 2012) to quantify gene expression as unique transcript reads (counts). Following sequence processing, quality control, removal of doublets, and assigning metadata, we recovered 83,703,440 transcript counts from a total of 38,225 individual cells.

To initially assess the quality of our data, we examined the expression of genes that are characteristic of different metabolic states. Consistent with our expectations, the core fermentative (anaerobic) genes PDC1 and ENO2 are expressed in cells in fermenting culture conditions only, and the core respirative (aerobic) gene ADH2 is expressed in cells in respiring culture conditions (Figure 2A). The number of cells recovered varies by over an order of magnitude between conditions; stressful conditions of low nitrogen have lower cell yields overall. The yeast stress response is linked to increased resistance to zymolyase digestion (Nagarajan et al., 2014), which may be reflected in decreased cell yield during single-cell sequencing. Each of the 72 strains is found in each of the 11 conditions, although the number of each strain and genotype varies by environmental condition (Figure 2—figure supplement 1A), and some strains are disproportionately affected. However, the number of transcripts per cell is generally equivalent between strains even when they differ in representation within libraries (Figure 2—figure supplement 1B). By contrast, we find that total transcript counts per cell are highly linked to environmental growth conditions (Figure 2—figure supplement 1C), which is consistent with decreased total transcriptome pool size in suboptimal conditions (Athanasiadou et al., 2019). For cells growing in rich medium (YPD) we recover a median of 2250 unique transcripts per cell, from a median of 695 distinct genes, indicating a capture rate of approximately 3–5% of total transcripts from each cell. The strain genotype does not strongly influence transcript counts per cell (Figure 2—figure supplement 1D). There is a high correlation between single-cell expression data and bulk RNA expression data (spearman correlation 0.941) for wild-type cells grown in YPD (Figure 2—figure supplement 2) indicating that the effect of technical bias caused by single-cell processing is minimal. We also find good correlation to other published single-cell yeast data sets, and a comparable published bulk RNAseq experiment,.

Figure 2 with 4 supplements see all

Download asset Open asset

Gene expression of single Yeast Cells Cluster Based on Environmental Growth Condition.

(A) Normalized density histograms of raw UMI counts of the core glycolytic genes *ENO2* and *PDC1*, and the alcohol respiration gene *ADH2* in each environmental growth condition. Mean UMI count for each of the 12 different strain genotypes within each growth condition are plotted as dots on the X axis. (**B–C**) Uniform Manifold Approximation and Projection (UMAP) projection of log-transformed and batch-normalized scRNAseq data. Axes are dimensionless variables V1 and V2 with arbitrary units, here omitted. Individual cells are colored by environmental growth condition (B) or by strain genotype (C). Growth conditions are abbreviated as in Table 1.

Mapping the digital expression matrix into two-dimensional space with a Uniform Manifold Approximation and Projection [UMAP] results in clear separation of individual cells into groups based on environmental condition (Figure 2B). Cells from different minimal media or nitrogen-limited growth conditions localize near each other, and cells grown in different rich nitrogen sources are clearly separate from each other. Within environmentally-determined grouping there appears to be no strong ordering by genotype (Figure 2C). These clusters are not driven by sequencing depth (Figure 2—figure supplement 3B), although there are some stress conditions which have subpopulations which are downregulated for ribosomal genes and upregulated for induced environmental stress response (iESR) genes (Figure 2—figure supplement 3C–F). The increased relative abundance of ribosomal related gene expression in rich media conditions is consistent with previously-observed correlation of ribosomal gene expression and cellular growth rate (Brauer et al., 2008). Some measures of variance per-gene differ in different growth conditions (Figure 2—figure supplement 4). Interactive figures are provided (http://shiny.bio.nyu.edu/YeastSingleCell2019/) facilitating exploration of expression levels for all genes.

The mitotic cell cycle underlies heterogeneity in single cell gene expression

To identify sources of gene expression differences between cells within environments, we clustered single cells within each environmental condition separately by constructing a Shared Nearest Neighbor graph (Xu and Su, 2015) and clustering using the Louvain method (Blondel et al., 2008). Genes with known roles in mitotic cell cycle are highly represented among the most differentially expressed genes between clusters (Figure 3—figure supplement 1A). Overlaying the expression of three of these genes (PIR1, DSE2, and HTB1/HTB2) on UMAP plots illustrates cell cycle effects on single cell gene expression (Figure 3A and Figure 3B). PIR1 expression, a marker for early G1 (Spellman et al., 1998), is diagnostic of a distinct cluster. DSE2 is expressed only in daughter cells (Colman-Lerner et al., 2001), which allows daughter cells in G1 to be distinguished from mother cells in G1. Cells that have high expression of the histone 2B genes, which are upregulated in S-phase (Eriksson et al., 2012), are localized together in the UMAP plots (Figure 3B).

Figure 3 with 2 supplements see all

Download asset Open asset

Cells Within Conditions Cluster According to Cell Cycle Genes.

(A) Cells from each growth condition were separately normalized and transformed into 2-dimensional space using UMAP. The log-transformed, normalized expression for each cell of (i) the G1-phase specific marker *PIR1*, (ii) the G1-phase daughter-cell specific marker *DSE2*, (iii) the S-phase specific marker histone 2B (*HTB*) is shown; (iv) the genotype and (v) the cluster membership of each cell. (B) Summary of clustered single cell expression within the YPD and RAPA growth conditions (i) Proportion of cells from a specific strain genotype within each cluster (ii) The mean log-transformed, normalized expression of the G1- and S-phase marker genes, as well as a hexokinase gene *HXK2* for each cluster (C) Schematic of the mitotic cell cycle with expression of *DSE2*, *PIR1*, and *HTB* genes annotated.

For each cluster of cells within a growth condition we plotted the proportion of cells belonging to each TF deletion genotype, and the mean expression of several cell cycle genes (Figure 3B). Some clusters predominantly contain cells from a single TF deletion genotype; for example, cells deleted for GLN3 (gln3Δ) form a separate cluster in YPD and RAPA conditions, as do cells deleted for one of the RTG heterodimer components (rtg1Δ and rtg3Δ). However, differences in expression due to genotype do not appear to be a primary source of expression differences within conditions, as most clusters show a uniform distribution of genotypes (Figure 3B, Figure 3—figure supplement 1B). Similarly, we do not find that differences in expression of metabolic genes underlie overall differences in expression (e.g. HXK2) suggesting that the yeast metabolic cycle (Silverman et al., 2010; Tu et al., 2005) is not readily identifiable in single cells using scRNAseq. Three of the high-stress growth conditions (NLIM-GLN, NLIM-PRO, and MMEtOH) have clusters that are separate from the majority of the cells analyzed in those conditions. We find that these clusters have higher levels of stress response genes and lower levels of ribosomal genes than other cells in these conditions (Figure 3—figure supplement 2B–C) These clusters may reflect cells undergoing early entry into quiescence and provide evidence for a heterogeneous response to stressful conditions.

Deletion of Transcription Factors causes gene expression changes that differ between growth conditions

To assess our ability to determine differential gene expression between TF knockout strains, we examine the expression of genes known to respond to nitrogen signalling. GAP1 (General Amino acid Permease) is a transporter responsible for importing amino acids under conditions of nitrogen limitation; GAP1 expression is regulated by the NCR activators GAT1 and GLN3, the NCR repressors GZF3 and DAL80 (Stanbrough and Magasanik, 1995), and potentially GCN4 (Natarajan et al., 2001). We identify differing degrees of dysregulation of GAP1 expression when these TFs are deleted (Figure 4A). The effect of deleting TFs varies by condition: GAP1 is not expressed in YPD and its expression increases in nitrogen-limited media and in response to rapamycin. Deletion of GAT1 results in decreased expression in nitrogen limiting media, but deletion of GLN3 does not affect GAP1 expression. By contrast, in the presence of rapamycin deletion of GLN3 results in reduced GAP1 expression. Deletion of GCN4 only impacts GAP1 expression in the presence of urea. MEP2 and GLN1 are also responsive to nitrogen TFs, and are dysregulated when certain TFs are deleted; expression of the glycolytic gene HXK2 decreases when GLN3, GCN4, or RTG1/RTG3 are deleted, but only in conditions of nitrogen limitation (Figure 4—figure supplement 1A). These environmentally dependent impacts of genotype on gene expression demonstrate the importance of exploration of variable conditions for studying genotypic effects on expression.

Figure 4 with 1 supplement see all

Download asset Open asset

Impact of Deleting Transcription Factors on Gene Expression.

(A) Violin plots of the log₂ batch-normalized expression of the general amino acid permease gene *GAP1* in YPD, RAPA, ammonium-limited media, and urea-limited media. (B) Count of differentially expressed genes in each combination of growth condition and strain genotype. Data were transformed to pseudobulk values by summing all counts for each the six biological replicates for each genotype and then analyzed for differential gene expression using DESeq2 [1.5-fold change; p.adj <0.05]. (C) Log₂(fold change) of genes differentially expressed in TF knockout strains compared to wildtype, when grown in YPD. Asterisks denote statistically significant differences in gene expression [1.5-fold change; p.adj <0.05].

A variety of statistical methods have been proposed and benchmarked for testing different expression of scRNAseq data (Soneson and Robinson, 2018). Our experimental design allows single-cell measurements to be collapsed into a total count (pseudobulk) measurement by summing counts across all cells that correspond to each of the six individual replicates of each genotype within a condition. When we analyze this data using standard approaches to RNAseq analysis (DESeq2) we detect several genes with significant (adjusted p-value<0.05) differences in expression (fold change >1.5) between wild-type and TF deletion strains (Figure 4B) that are consistent with known regulatory pathways. There are considerably fewer changes in gene expression as a result of TF deletions compared to the hundreds of genes that change expression between different conditions (Figure 4—figure supplement 1B). However, in cells grown in rich media [YPD], we found 96 genes that are differentially expressed in TF deletion strains compared to wildtype (Figure 4C), and expression of 160 genes are perturbed in TF deletion strains compared to wildtype when exposed to rapamycin [RAPA] (Figure 4—figure supplement 1C). Many of these differentially expressed genes are annotated as functioning in amino acid metabolism and biosynthesis.

Optimal modeling parameters for network inference from Single-Cell yeast data

Differential gene expression in a TF knockout strain is not sufficient evidence of a direct regulatory relationship as many significant changes in gene expression upon deleting a TF are indirect, and many direct effects may be subtle. Therefore, we constructed a gene regulatory network using the Inferelator, a regression-based network inference method which is based on three main modeling assumptions. First, we assume that Transcription Factor Activity (TFA) is a latent biophysical parameter that represents the effect of a TF binding to DNA and modulating its transcription activity (Arrieta-Ortiz et al., 2015; Fu et al., 2011). The TFA values are not directly measured, and instead must be estimated as a relative value based on prior knowledge of a regulatory network of TF and target relationships. This TFA estimation is essential as many TFs are post-transcriptionally regulated, or are expressed at levels that are not reliably detected by scRNAseq (Filtz et al., 2014). Second, we assume that expression of a gene can be described as a weighted sum of the activities of TFs (Bonneau et al., 2006) using an additive model in which activators and repressors increase or decrease the expression of targets linearly. Finally, we assume that each gene is regulated by a small number of TFs, and that regularization of gene expression models is required to enforce this biologically relevant property of target regulation. Saccharomyces cerevisiae, as a preeminent model organism in systems biology, has a well defined set of known interactions that are of considerably higher quality than is available for more complex eukaryotes providing a validated gold standard for testing model performance (Tchourine et al., 2018).

To evaluate the performance of data processing methods and model parameter selections within the Inferelator on scRNAseq data, we perform ten cross-validations using the existing gold standard network. During cross-validation, we infer a GRN using half of the gold standard target genes as priors, then evaluate performance based on recovery of TF-target gene interactions for gold standard interactions that are left out of the priors. We tested preprocessing and prior selection options by inferring networks using gene expression models that are regularized by best subset regression to minimize Bayesian Information Criterion (Arrieta-Ortiz et al., 2015; Greenfield et al., 2013) and quantified performance in predicting TF-target interactions using the area under the precision-recall curve (AUPR). As negative controls, we employed the same procedure after shuffling priors and after simulating scRNAseq data in which all variance is due to sampling noise. The negative control with shuffled priors establishes a random classifier baseline AUPR of 0.02; the negative control with simulated data establishes a circular recovery baseline AUPR of 0.06 (Figure 5A). Performance of the Inferelator on our scRNAseq data far exceeds these baselines, with a mean AUPR of 0.20. This performance from our single dataset is comparable to that of a GRN constructed from 2577 experimental observations using bulk gene expression data (Tchourine et al., 2018).

Figure 5 with 1 supplement see all

Download asset Open asset

Model Performance and Impact of Data Imputation, Prior Selection, and Multitask learning on Network Inference using the Inferelator.

(A) Model performance of Inferelator (TFA-BBSR) network inference after shuffling priors [Neg.Shuffled], on a simulated negative data set [Neg. Data], on the unaltered count matrix [No Imputation], and after imputing missing data from the count matrix using the MAGIC, ScImpute, and VIPER packages. Model performance is shown using area under the precision-recall curve [AUPR], as well as the number of network edges using a precision (>0.5) cutoff, and the number of network edges using a confidence (>0.95) cutoff. Each point plotted in gray is a separate cross-validation analysis, with mean +/- one standard deviation plotted in black (n = 10). (B) Median AUPR after cross-validation (n = 10) and resampling to different numbers of cells, for priors extracted from the gold standard [GS], the YEASTRACT database, Bussemaker et al, priors predicted from ATAC-seq data and motif searching, and no prior data. (C) AUPR of separate cross-validation network inference using cells from all growth conditions, or from individual conditions separately. Each cross-validation (n = 10) was downsampled to the same number of cells. (D) Cross-validation (n = 10) using the YEASTRACT prior data. Networks are learned for all conditions together [BBSR (ALL) ●], for all conditions individually with TFA-BBSR followed by combination [BBSR (BY TASK) ▲], and for all conditions together in multi-task learning followed by combination [AMuSR (MTL) ◆]. Models are evaluated by (i) AUPR on the aggregate, final network and (ii) AUPR for each task-specific subnetwork from BBSR (BY TASK) (▲) and AMuSR (MTL) (◆).

The sparsity of data for each cell acquired using scRNAseq may negatively impact its utility in GRN construction. A commonly used technique to address missing data is data imputation. We tested the impact of several imputation packages on network inference: MAGIC (van Dijk et al., 2018), ScImpute (Li and Li, 2018), and VIPER (Chen and Zhou, 2018). Whereas these methods can enhance separation of gene expression states in low-dimensionality projections (Figure 5—figure supplement 1A), we find that they are either ineffective or detrimental to network inference (Figure 5A). When the GRN is reconstructed from interactions selected at a precision threshold of 0.5, which takes into account how many interactions are correct according to the gold standard, no imputation method increases the number of recovered interactions compared to unmodified data. Data imputation with MAGIC increases the total number of confidently predicted (confidence >0.95) interactions, but recovers fewer interactions that are correct according to the gold standard.

Selection of priors for inference from Single-Cell yeast data

Algorithms for network inference perform poorly when making predictions based only on expression data (Greenfield et al., 2010). Including prior knowledge of regulatory relationships and network topology improves model selection, and allows approximation of latent variables like TFA. Priors can be generated from regulatory interactions defined using methods such as chromatin immunoprecipitation sequencing (ChIP-seq) or analysis of transposase-accessible chromatin (ATAC-seq) and TF binding motifs, or from curated databases of interactions derived from literature. The source and processing of prior knowledge has a substantial effect on the size and accuracy of the learned network (Azizi et al., 2018; Siahpirani and Roy, 2017). We tested the impact on GRN reconstruction of prior data derived from literature, and from high-throughput experimental assays that encompass interactions between the entire yeast genome and the majority of known TFs (Figure 5B). The best performance is obtained using a curated set of known TF-gene interactions obtained from YEASTRACT (Teixeira et al., 2018). Generating priors using motif searching within open chromatin regions determined by ATAC-seq (Castro et al., 2019; Miraldi et al., 2019), and by modeling TF-DNA affinities in promoters (Ward and Bussemaker, 2008) provides a considerable improvement over GRN reconstruction from TF expression without priors, but have lower performance than priors derived from curated data.

Multi-task learning improves network inference and enables reconstruction of a unified Gene Regulatory Network from multiple conditions

Numerous methods exist for integrating information across different conditions and experiments that aim to reduce technical variation while retaining biologically meaningful differences (Hicks et al., 2018; Leek et al., 2010). The appropriate approach to integrating scRNAseq data for the purpose of GRN reconstruction remains unknown. We find that when we separate data based on environmental conditions and infer GRNs we obtain unique networks of differing quality (Figure 5C). Learning a single network from all conditions by first combining the data can be compromised by technical variability and imbalance in the number of cells between conditions. Furthermore, normalizing batches to equal transcript depth risks suppressing differences which are true biological variability. An alternative approach is to treat the cells from each environmental condition as separate tasks. Separate tasks can be learned independently, without sharing information between tasks (implemented as BBSR (BY TASK)). This entails learning networks from each task, and then combining task-specific networks into a global network. Alternatively, networks can be learned together in a multitask learning (MTL) framework (Lam et al., 2016), sharing information between tasks while they are learned, which we have implemented as Adaptive Multiple Sparse Regression (AMuSR) (Castro et al., 2019). We find that, compared to network inference using all data simultaneously [BBSR (ALL)], treating conditions as separate network inference tasks provides a considerable improvement in performance (Figure 5Di). This is likely due to the retention of environmentally specific interactions that would otherwise be obscured using methods for normalizing data prior to GRN construction. The performance of the information sharing network inference approach [AMuSR (MTL)] and the non-sharing network inference approach [BBSR (BY TASK)] are very similar overall. We find that some individual tasks had modest improvements in model performance with AMuSR and others with BBSR (Figure 5Dii).

We constructed a global gene regulatory network using the YEASTRACT priors (as determined above) and our multi-task network inference (AMuSR) procedure. Eleven GRNs were jointly learned from each of the eleven environmental growth conditions; for each task a confidence score for each regulator-target interaction was calculated. GRNs learned for each condition were combined by rank summing condition-specific confidence scores to create a global confidence score for each potential interaction. All potential interactions are ranked by global confidence score, and a global GRN is constructed from interactions that meet the precision threshold of 0.5, as measured by recovery of known interactions (Figure 6A, Source code 2). The resulting GRN comprises 6114 new interactions and 6114 interactions present in the priors, resulting in a total of 12,228 regulator-target interactions. We find that 5372 interactions from the priors are not recovered (recall of 0.532). The global GRN comprises an identified regulator for approximately half of all known genes (Figure 6—figure supplement 1A). There is a positive correlation between expression level for a gene and the number of regulators for that gene (Figure 6—figure supplement 1B) and 90% of the identified interactions are predicted to have activating effects (Figure 6—figure supplement 1C). Many condition-specific networks have uniquely identified interactions (Figure 6—figure supplement 1D), but more than 75% of the final network is composed of TF-gene interactions found in multiple conditions (Figure 6—figure supplement 1E). Of the novel learned interactions (i.e. those not in the prior data), 60% have evidence of a TF-gene regulatory relationship when compared to the YEASTRACT database (Figure 6—figure supplement 1F). 573 learned TF-gene interactions have evidence for physical localization of the TF to the target gene, and 2957 learned TF-gene interactions have evidence of expression changes when the TF is perturbed.

Figure 6 with 1 supplement see all

Download asset Open asset

Reconstruction of a Gene Regulatory Network Identifies New Regulatory Relationships.

A network inferred from the single-cell expression data using multi-task learning and the YEASTRACT TF-gene interaction prior, with a cutoff at precision >0.5. (A) Network graph with known interaction edges from the prior in gray and new inferred interaction edges in red (B) Network graph of the 11 nitrogen-responsive transcription factors with known edges from the prior in gray and new edges in red (C) The number of interactions for each TF; interaction edges present in the prior that are not in the final network are included in black. The nitrogen TFs knocked out in this work are labeled in blue, and TFs with gene ontology annotations for mitotic cell cycle are annotated in green (D) Gene ontology classification of network interactions by the GO slim biological process terms annotated for the target gene and the regulatory TF (the GO term transcription from RNA pol II is omitted from the annotations for regulatory TFs).

Within the nitrogen-regulated TF subnetwork comprising the 11 deleted TFs (Figure 6B) we identify 885 regulator-target interactions, of which 447 are novel, and 438 are present in the priors. This subnetwork contains many features consistent with expectations including co-regulation of targets by the NCR TFs. Overall, the global GRN has the largest number of target genes for general TFs (including ABF1, RAP1, CBF1, and SFP1), but we also define regulatory relationships for a total of 129 of the predicted 207 yeast TFs (Figure 6C). The poorest recovery of prior data is found for TFs that regulate environmental responses not included in our experimental design, such as the stress response TF MSN2 and the mating TF STE12, highlighting the necessity of exploration of condition space for complete network reconstruction. Regulators and target genes can be mapped to Gene Ontology (GO) biological process slim terms, which are broad categorizations that facilitate pathway analysis. Ordering GO slim terms by the number of interactions in the learned GRN, we find that for target genes eight of the top ten GO slim terms are metabolism-related (Figure 6D i); in contrast, for regulatory TFs, five of the top ten GO slim terms are stress response related (Figure 6D ii).

Identification of coregulation by cell cycle and environmental response TFs

Analysis of single cell expression in asynchronous cultures allows detection of cell cycle regulated relationships. The learned global GRN contains 257 genes that are regulated both by nitrogen TFs and by cell cycle TFs (Figure 7A). Many of these regulatory connections are novel; likely due to the fact that identifying interactions between metabolism and cell cycle are challenging in asynchronous cultures without single-cell techniques. Of these genes, 38 are annotated with the amino acid metabolic biological process GO term and 20 are annotated with the ion or transmembrane transport biological process GO term. Only 11 are annotated with the mitotic cell cycle biological process GO term, indicating that the majority of the interconnection between cell cycle and nitrogen response genes is due to regulation of metabolism-related genes by cell cycle TFs.

Figure 7 with 1 supplement see all

Download asset Open asset

Coordinated regulation of Nitrogen Response and Cell Cycle.

(A) A gene regulatory network showing target genes that are regulated by at least one nitrogen TF (blue) and at least one cell cycle TF (green). Target gene nodes are colored by GO slim term. Newly inferred regulatory edges are red and known regulatory edges from the prior are in gray. Transcription factor activity (TFA) is calculated from the learned network and then scaled to a z-score over all cells which do not have that TF deleted (e.g. *gcn4Δ* cells are omitted from the calculation for *GCN4* TFA). The mean TFA z-score for four selected conditions is inset for GAAC and NCR TFs (B) TFA for cell cycle TFs for each cell in the YPD growth condition.

We estimated the TFA for every TF in each cell, using the learned GRN and the single-cell expression matrix. The TFA of nitrogen responsive TFs is principally linked to growth condition as these TFs vary in activity between conditions (Figure 7A), but are generally similar within condition (Figure 7—figure supplement 1A). As expected, we find that cells grown in rich media (YPD) have low TFA for the NCR TFs GLN3 and the GAAC TF GCN4. The TFA for these TFs increases substantially upon treatment with rapamycin. By contrast, the estimated TFAs of cell cycle TFs varies within condition (Figure 7B); and are concordant with cell cycle responsive gene expression (Figure 7—figure supplement 1B–D).

Discussion

A robust scRNAseq and transcriptional barcoding method in yeast

Since the inception of single-cell RNA sequencing (Tang et al., 2009), technological advances have resulted in the scale of datasets increasing from tens of cells to tens of thousands in a diversity of organisms. However, the number of cells recovered during scRNAseq in budding yeast has been comparatively limited in studies published to date (Gasch et al., 2017; Nadal-Ribelles et al., 2019). We present here the first report of droplet-based scRNAseq in this widely used model eukaryotic cell. Using a diverse library of transcriptionally barcoded gene deletion strains we were able to efficiently analyze the gene expression state of 38,255 cells using 11 experiments. In addition to facilitating multiplexed analysis of genotypes, transcriptional barcoding provides a facile means of identifying doublet cells within droplets thereby increasing the accuracy of single cell analysis.

Consistent with our understanding of global gene expression variation first characterized in foundational studies of the transcriptome (DeRisi et al., 1997; Gasch et al., 2000), we find that environmental condition is the primary determinant of the gene expression state of individual yeast cells. However, we observe significant heterogeneity in individual cell gene expression within conditions. Much of this variation can be explained by the mitotic cell-cycle. It is important to note that we do not remove or suppress this cell-cycle driven variance. The cell cycle is itself driven by transcriptional regulators, and our goal is to build a network that integrates cell-cycle regulation with regulated responses to the environment. The ability to access the crosstalk between signalling pathways and the cell cycle program is a key advantage to performing single-cell sequencing in asynchronous cultures, which bypasses many of the limitations of synchronized bulk sequencing experiments. It is also important to note that in several stressful growth conditions, we see heterogeneous cellular responses; some cells appear to be proliferative, while other cells have downregulated translational machinery and upregulated stress response genes. This is an interesting outcome by itself, as it is further evidence of bet-hedging strategies (Levy et al., 2012), and we expect that the presence of multiple distinct transcriptional states between cells in the same environmental condition is advantageous for network inference. Model performance, as measured by AUPR, can vary considerably when learning networks from any single growth condition (Figure 5C). Cells in rich YPD media do not require many anabolic pathways to be active, and primarily express genes required for the cell-cycle, translation, and glycolysis; in contrast, cells in minimal MMD media must express these pathways plus many anabolic pathways to synthesize nitrogenous bases, cofactors and amino acids. We find that this increased transcriptional diversity results in better overall performance. Nonetheless, the largest performance gain comes from aggregating networks from cells in different conditions (Figure 5D), which demonstrates a general advantage to learning GRNs from heterogeneous data.

Deletion of specific transcription factors results in changes in single cell gene expression for some TFs in some conditions. However, genotypic effects are comparatively minor. We believe that this is due to multiple factors including functional redundancy between TFs, physiological adaptation to the genetic perturbation and the conditional specificity of TFs. It is likely that perturbations that are transiently induced, and result in increased TF activity (McIsaac et al., 2013) may be effective in eliciting detectable responses in gene expression, facilitating causal inference. The use of precise gene deletions does provide several advantages over the use of CRISPR/Cas9-based perturbations as engineered deletions are unambiguous whereas the efficiency of perturbation by CRISPR/Cas9 varies for different guide RNAs.

A generalizable framework for GRN construction using scRNAseq

Constructing GRNs from single cell gene expression data is a universal goal in all organisms. A yeast single-cell expression matrix has several beneficial properties for design and testing of gene regulatory network inference models as there exist high quality known interactions and TF binding motifs. The issues of data sparsity and low sampling rates are likely to be problems common in experiments in any organism using scRNAseq. We find that techniques that have been developed for normalization and imputation do not improve performance of the additive linear model-based inference of the Inferelator algorithm (Figure 5). However, there are significant opportunities for development of smoothing techniques that would enhance network inference, perhaps targeting latent biophysical parameters like transcription factor activity. It seems reasonable to assume that these biophysical parameters should be stable within the local neighborhood of samples, and the activity calculation that we have used is ill-conditioned and potentially unstable. This is of particular concern when working with undersampled single-cell data and we are actively addressing this issue.

We find that the application of multitask learning is well suited to GRN reconstruction from scRNAseq data. Jointly learning multiple related tasks improves generalization accuracy, especially in scenarios in which datasets are undersampled (Caruana, 1998), and has the desirable side benefit of mitigating the need for complex batch-correction techniques that aim to address technical variation between experiments. Removing batch-effect technical noise from data without suppressing interbatch biological variability remains an unsolved problem, and therefore application of multitask learning approaches to network inference from single-cell data is likely to be generally applicable to integrating scRNAseq data from different cell types and conditions.

A global GRN for budding yeast

Using our scRNAseq dataset, we reconstructed a global GRN with several novel regulatory relationships. Among the most novel of these interactions are those between cell-cycle associated TFs and targets and nitrogen TFs and target genes. The cell cycle and metabolism are, by necessity, interconnected, and the mechanism of rapamycin in arresting cell cycle through TOR is well-established (Heitman et al., 1991). Several studies have identified metabolic cycling patterns which are believed to be driven by the cell cycle (Burnetti et al., 2016; Slavov and Botstein, 2011; Tu et al., 2005). Although regulatory connections between environmental sensors, metabolism, and the cell cycle have been previously reported, a comprehensive regulatory network does not exist, in large part because of the difficulty of experimentally perturbing cell cycle without confounding metabolic changes. Our study provides a valuable first step in identifying specific regulatory connections that were previously inaccessible, and which are necessary to create a complete map of the yeast regulome.

Incorporation of additional information into the network inference process, including information about interactions between transcription factors such as functional redundancy and heterodimerization, would likely improve learning of the network. We note that several TFs have few learned targets reflecting the requirement for surveying conditions in which particular TFs are active. For example, STE12 and TEC1 are mating-related TFs that we expect to be entirely inactive in our diploid cells; MSN2 and YAP1 are stress-responsive TFs that respond to specific stimuli that were not tested in our study. Targeted analysis of the GRN with rationally designed genetic perturbations and environmental conditions will maximize the additional information that can be recovered from future experiments.

Conclusion

Single-cell sequencing is a transformative method for systems biology. To date, scRNAseq has been widely applied to the problem of defining different cell types. However, the ability to simultaneously study the expression of hundreds of genotypes in different conditions, and sample the expression state of thousands of cells, presents a rich source of information for the purpose of GRN reconstruction. Our study implements this approach in budding yeast, the workhorse of systems biology, and establishes a generalizable framework for GRN reconstruction from scRNAseq data in any organism.

Materials and methods

Requests for strains and reagents should be directed to David Gresham (dgresham@nyu.edu). Requests related to computational analysis and code should be directed to Richard Bonneau (rb133@nyu.edu). There are no restrictions on the materials or the code used in this work. All materials are released under CC-BY 4.0 and all code is available under the permissive MIT or BSD licenses.

Yeast strain construction and growth

All yeast strains were generated from the prototrophic FY4 (MATa) or FY5 (MATɑ) background strains. Yeast were transformed using the standard lithium acetate transformation protocol (Gietz and Schiestl, 2007). E. coli were transformed using the standard chemically competent transformation protocol. Plasmid constructions were confirmed by sanger sequencing. Yeast genotypes, plasmid sequences, and oligonucleotide sequences are provided as Supplementary file 1-supplemental tables 1-3. Media formulations are provided as Supplementary file 1-supplemental table 4.

Share this article

Cite this article

Single-Cell RNA-Seq Experimental Workflow in Saccharomyces Cerevisiae.

Environmental Growth Conditions.

Gene expression of single Yeast Cells Cluster Based on Environmental Growth Condition.

Cells Within Conditions Cluster According to Cell Cycle Genes.

Impact of Deleting Transcription Factors on Gene Expression.

Model Performance and Impact of Data Imputation, Prior Selection, and Multitask learning on Network Inference using the Inferelator.

Reconstruction of a Gene Regulatory Network Identifies New Regulatory Relationships.

Coordinated regulation of Nitrogen Response and Cell Cycle.

Author details

Christopher A Jackson

Contribution

Contributed equally with

Competing interests

Dayanne M Castro

Contribution

Contributed equally with

Competing interests

Giuseppe-Antonio Saldi

Contribution

Competing interests

Richard Bonneau

Contribution

Contributed equally with

For correspondence

Competing interests

David Gresham

Contribution

Contributed equally with

For correspondence

Competing interests

Citations by DOI

Downloads (link to download the article as PDF)

Open citations (links to open the citations from this article in various online reference manager services)

Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)

Categories and tags

Research organism