1. Computational and Systems Biology
Download icon

Gene regulatory network reconstruction using single-cell RNA sequencing of barcoded genotypes in diverse environments

  1. Christopher A Jackson
  2. Dayanne M Castro
  3. Giuseppe-Antonio Saldi
  4. Richard Bonneau  Is a corresponding author
  5. David Gresham  Is a corresponding author
  1. New York University, United States
  2. Simons Foundation, United States
Research Article
  • Cited 5
  • Views 6,620
  • Annotations
Cite this article as: eLife 2020;9:e51254 doi: 10.7554/eLife.51254

Abstract

Understanding how gene expression programs are controlled requires identifying regulatory relationships between transcription factors and target genes. Gene regulatory networks are typically constructed from gene expression data acquired following genetic perturbation or environmental stimulus. Single-cell RNA sequencing (scRNAseq) captures the gene expression state of thousands of individual cells in a single experiment, offering advantages in combinatorial experimental design, large numbers of independent measurements, and accessing the interaction between the cell cycle and environmental responses that is hidden by population-level analysis of gene expression. To leverage these advantages, we developed a method for scRNAseq in budding yeast (Saccharomyces cerevisiae). We pooled diverse transcriptionally barcoded gene deletion mutants in 11 different environmental conditions and determined their expression state by sequencing 38,285 individual cells. We benchmarked a framework for learning gene regulatory networks from scRNAseq data that incorporates multitask learning and constructed a global gene regulatory network comprising 12,228 interactions.

eLife digest

Organisms switch their genes on and off to adapt to changing environments. This takes place thanks to complex networks of regulators that control which genes are actively ‘read’ by the cell to create the RNA molecules that are needed at the time. Piecing together these networks is key to fully understand the inner workings of living organisms, and how to potentially modify or artificially create them.

Single-cell RNA sequencing is a powerful new tool that can measure which genes are turned on (or ‘expressed’) in an individual cell. Datasets with millions of gene expression profiles for individual cells now exist for organisms such as mice or humans. Yet, it is difficult to use these data to reconstruct networks of regulators; this is partly because scientists are not sure if the computational methods normally used to build these networks also work for single-cell RNA sequencing data.

One way to check if this is the case is to use the methods on single-cell datasets from organisms where the networks of regulators are already known, and check whether the computational tools help to reach the same conclusion. Unfortunately, the regulatory networks in the organisms for which scientists have a lot of single-cell RNA sequencing data are still poorly known. There are living beings in which the networks are well characterised – such as yeast – but it has been difficult to do single-cell sequencing in them at the scale seen in other organisms.

Jackson, Castro et al. first adapted a system for single-cell sequencing so that it would work in yeast. This generated a gene expression dataset of over 40,000 yeast cells. They then used a computational method (called the Inferelator) on these data to construct networks of regulators, and the results showed that the method performed well. This allowed Jackson, Castro et al. to start mapping how different networks connect, for example those that control the response to the environment and cell division. This is one of the benefits of single-cell RNA methods: cell division for example is not a process that can be examined at the level of a population, since the cells may all be at different life stages. In the future, the dataset will also be useful to scientists to benchmark a variety of single cell computational tools.

Introduction

Elucidating relationships between genes, and the products they encode, remains one of the central challenges in experimental and computational biology. A gene regulatory network (GRN) is a directed graph in which regulators of gene expression are connected to target gene nodes by interaction edges. Regulators of gene expression include transcription factors (TF) which can act as activators and repressors, RNA binding proteins, and regulatory RNAs. Identifying regulatory relationships between transcriptional regulators and their targets is essential for understanding biological phenomena ranging from cell growth and division to cell differentiation and development (Davidson, 2012). Reconstruction of GRNs is required to understand how gene expression dysregulation contributes to cancer and complex heritable diseases (Barabási et al., 2011; Hu et al., 2016).

Genome-scale methods provide an efficient means of identifying gene regulatory relationships. Efforts of the past two decades have resulted in the development of a variety of experimental and computational methods that leverage advances in technology and machine learning for constructing GRNs. Previously, we developed a method for inferring transcriptional regulatory networks based on regression with regularization that we have called the Inferelator (Bonneau et al., 2006; Ciofani et al., 2012). This method takes as inputs gene expression data and sources of prior information, and outputs regulatory relationships between transcription factors and their target genes that explain the observed gene expression levels. Subsequent work has enhanced this approach by selecting regulators for each gene more effectively (Madar et al., 2010), incorporating orthogonal data types that can be used to generate constraints on network structure (Greenfield et al., 2013), and explicitly estimating latent biophysical parameters including transcription factor activity (Arrieta-Ortiz et al., 2015; Fu et al., 2011) and mRNA decay rates (Tchourine et al., 2018). We have successfully applied this approach to construct GRNs from gene expression data acquired from variation across time, conditions, and genotypes in microbes (Arrieta-Ortiz et al., 2015; Tchourine et al., 2018), plants (Wilkins et al., 2016), and mammalian cells (Ciofani et al., 2012; Miraldi et al., 2019).

Recently, single-cell RNA sequencing (scRNAseq) has exploded in popularity with the development of droplet systems for rapid encapsulation and labeling of thousands of cells in parallel. The DROP-seq system (Macosko et al., 2015) based on bead capture, and the inDrop (Zilionis et al., 2017) and 10x Genomics (Zheng et al., 2017) systems based on hydrogel beads, provide a facile means of generating RNA sequencing data for tens of thousands of individual cells. Although scRNAseq has primarily been used for defining cell types and states, this technology holds great potential for efficient construction of GRNs (Hwang et al., 2018). By combining genetic perturbation of transcriptional regulators using CRISPR/Cas9 with scRNAseq, mixtures of genetic perturbations can be assayed in a single reaction (Adamson et al., 2016; Dixit et al., 2016; Jaitin et al., 2016). This approach, known as Perturb-seq, presents a new opportunity for efficiently inferring GRNs from thousands of individual cells in which different regulators have been disrupted. There are considerable advantages in both scalability and detection of intra-sample heterogeneity with Perturb-seq, but quantifying the the effectiveness of CRISPR/Cas9 targeting in individual cells and distinguishing gene expression variability from noise inherent to mRNA undersampling in scRNAseq (Brennecke et al., 2013; Grün et al., 2014) present technical challenges. Computational methods to take advantage of scRNAseq data for inferring GRNs are under active development (Aibar et al., 2017; Chan et al., 2017; van Dijk et al., 2018). However, benchmarking these methods is difficult; in the absence of a known GRN, model performance is often estimated using simulated data (Chen and Mar, 2018), and issues regarding the appropriate experimental and computational approaches to GRN construction from scRNAseq data remain unresolved.

The budding yeast Saccharomyces cerevisiae is ideally suited to constructing GRNs from experimental data and benchmarking computational methods. Decades of work have provided a plethora of transcriptional regulatory data comprising functional and biochemical information (de Boer and Hughes, 2012; Teixeira et al., 2018). As a result, yeast is well suited to constructing GRNs using methods that leverage the rich available information and for assessing the performance of those methods by comparison to experimentally validated interactions (Ma et al., 2014; Tchourine et al., 2018). Budding yeast presents several technical challenges for single cell analysis, and as a result scRNAseq methods for budding yeast reported to date (Gasch et al., 2017; Nadal-Ribelles et al., 2019) yield far fewer individual cells (~102) than are now routinely generated for mammalian studies (>104). The limitations of existing scRNAseq methods for budding yeast cells limits our ability to investigate eukaryotic cell biology as many signaling and regulatory pathways are highly conserved in yeast (Carmona-Gutierrez et al., 2010; Gray et al., 2004), including the Ras/protein kinase A (PKA), AMP Kinase (AMPK) and target of rapamycin (TOR) pathways (González and Hall, 2017; Loewith and Hall, 2011). However, recent work has successfully established single cell sequencing in the fission yeast Schizosaccharomyces pombe (Saint et al., 2019).

In budding yeast, the TOR complex 1 (TORC1 or mTORC1 in human) coordinates the transcriptional response to changes in nitrogen sources (Godard et al., 2007; Rødkaer and Faergeman, 2014). Controlling this response are four major TF groups, which are regulated by diverse post-transcriptional processes. The Nitrogen Catabolite Repression (NCR) pathway, which is regulated principally by TORC1, consists of the TFs GAT1, GLN3, DAL80, and GZF3 (Hofman-Bang, 1999), and is responsible for suppressing the utilization of non-preferred nitrogen sources when preferred nitrogen sources are available. Gat1 and Gln3 are localized to the cytoplasm until activation results in relocalization to the nucleus (Cox et al., 2000), where they then compete with Dal80 and Gzf3 for DNA binding motifs (Georis et al., 2009). The General Amino Acid Control (GAAC) pathway consists of the TF GCN4 (Hinnebusch, 2005), and is responsible for activating the response to amino acid starvation, as detected by increases in uncharged tRNA levels. Gcn4 activity is translationally controlled by ribosomal pausing at upstream open reading frames in the 5’ untranslated region (Mueller and Hinnebusch, 1986). The retrograde pathway, consisting of the TF heterodimer RTG1 and RTG3, is responsible for altering expression of metabolic and biosynthetic genes in response to mitochondrial stress (Jia et al., 1997; Liao and Butow, 1993) or environmental stress (Ruiz-Roig et al., 2012). The Rtg1/Rtg3 complex is localized to the cytoplasm until activation, upon which they translocate into the nucleus (Komeili et al., 2000). The Ssy1-Ptr3-Ssy5-sensing (SPS) pathway (Ljungdahl, 2009), consists of the TFs STP1 and STP2, and is responsible for altering transporter expression (Didion et al., 1998; Iraqui et al., 1999) in response to changes in extracellular environment. Stp1 and Stp2 are anchored to the plasma membrane until the SPS sensor triggers proteolytic cleavage of their anchoring domain and releases them for nuclear import (Andréasson and Ljungdahl, 2002).

Construction of GRNs based on the transcription factors in these pathways has had mixed success; the high redundancy of the NCR pathway has proven challenging to deconvolute (Milias-Argeitis et al., 2016). The GAAC pathway is more straightforward, although separating direct and indirect regulation remains difficult, even with high-quality experimental data (Mittal et al., 2017). As a result, a comprehensive GRN for nitrogen metabolism has remained elusive, despite successes in identifying genes that respond to changes in environmental nitrogen (Airoldi et al., 2016) and identification of post-transcriptional control mechanisms that underlie these changes (Miller et al., 2018).

Many signalling regulators involved in environmental response interact with cell cycle programs (Johnston et al., 1977; Talarek et al., 2017), including the TOR pathway (Zinzalla et al., 2007); however, how regulation of the mitotic cell cycle and environmentally responsive gene expression is coordinated is unknown. The regulation of nitrogen responsive gene expression in yeast is well-suited to the development of generalizable methods as the degree of TF redundancy, post-transcriptional regulation of TF activity, which precludes straightforward relationships between TF abundance and target expression, and multifactorial impact on gene expression, including intrinsic and extrinsic processes and stimuli, provide a tractable model system for addressing these challenges in higher eukaryotes.

Here, we have developed a method for scRNAseq in budding yeast using Chromium droplet-based single cell encapsulation (10x Genomics). We engineered TF gene deletions by precisely excising the entire TF open reading frame and introducing a unique transcriptional barcode that enables multiplexed analysis of genotypes using scRNAseq. We pooled 72 different strains, corresponding to 12 different genotypes, and determined their gene expression profiles in 11 conditions using scRNAseq analysis of 38,000 cells. We show that our method enables identification of cells from complex mixtures of genotypes in asynchronous cultures that correspond to specific mutants, and to specific stages of the cell cycle. Identification of mutants can be used to identify differentially expressed genes between genotypes providing an efficient means of multiplexed gene expression analysis. We used scRNAseq data in yeast to benchmark computational aspects of GRN reconstruction, and show that multi-task learning integrates information across environmental conditions without requiring complex normalization, resulting in improved GRN reconstruction. We find that imputation of missing data does not improve GRN reconstruction and can lead to prediction of spurious interactions. Using scRNAseq data, we constructed a global GRN for budding yeast comprising 12,228 regulatory interactions. We discover novel regulatory relationships, including previously unknown connections between regulators of cell cycle gene expression and nitrogen responsive gene expression. Our study provides a generalizable framework for GRN reconstruction from scRNAseq, a rich data set that will enable benchmarking of future computational methods, and establishes the use of droplet-based scRNAseq analysis of multiplexed genotypes in yeast.

Results

Engineering a library of Prototrophic, Transcriptionally-Barcoded Gene Deletion Strains

The yeast gene knockout collection (Giaever et al., 2002) facilitates pooled analysis of mutants using unique DNA barcode sequences that identify each gene deletion strain, but these barcodes are only present at the DNA level, precluding their use with scRNAseq. Therefore, we constructed an array of prototrophic, diploid yeast strains with homozygous deletions of TFs that control distinct regulatory modules: 1) NCR, 2) GAAC, 3) SPS-sensing, and 4) the retrograde pathway that coordinately control nitrogen-related gene expression in yeast. We engineered eleven different TF knockout genotypes, using six independently constructed biological replicates for each genotype. In addition, we constructed six biological replicates of the wild-type control in which we deleted the neutral HO locus. Genes were deleted using a modified kanMX cassette such that each of the 72 strains contains a unique transcriptional barcode in the 3’ untranslated region (UTR) of the G418 resistance gene, that can be recovered by RNA sequencing (Figure 1, Figure 1—figure supplement 1A). Homozygous diploids were constructed by mating to a strain containing the same TF knockout marked with a nourseothricin drug resistance cassette. On rich media plates, the 72 strains have an approximately wild-type growth; under nutritional stress, some TF knockouts exhibit growth advantages or disadvantages (Figure 1—figure supplement 1B).

Figure 1 with 1 supplement see all
Single-Cell RNA-Seq Experimental Workflow in Saccharomyces Cerevisiae.

Schematic workflows for: (A) Growth of a transcriptionally-barcoded pool of 11 nitrogen metabolism transcription factor (TF) knockout strains and a wild-type control strain each analyzed with six biological replicates (B) Synthesis in microfluidic droplets of single-cell cDNA with a cell-specific index sequence (IDX) attached to the oligo-dT primer, and a common template switch oligo (TSO). cDNA is processed for whole-transcriptome libraries, to quantify gene expression. In parallel, PCR products are amplified containing the genotype-specific transcriptional barcode (BC) encoded on the KanR antibiotic resistance marker mRNA, to identify cell genotype. Expression DNA libraries and PCR products are separately indexed for multiplexed sequencing (C) Processing of single-cell sequencing data using Unique Molecular Identifiers (UMI) into a count matrix which is used to learn a gene regulatory network using multi-task network inference from several different growth conditions.

Single-Cell RNA sequencing of pooled libraries in diverse growth conditions

ScRNAseq in yeast presents several challenges: cells are small (40–90 µm3), enclosed in a polysaccharide-rich cell wall, and contain fewer mRNAs per cell (40 k-60k) than higher eukaryotes. We developed and validated a protocol using the droplet-based 10x genomics chromium platform, and it used it to perform scRNAseq of the pool of TF knockouts in eleven growth conditions that provide a range of metabolic challenges (Table 1). In addition to variable nitrogen sources in minimal media with excess [MM] and limiting [NLIM-NH4, NLIM-GLN, NLIM-PRO, NLIM-UREA] nitrogen, some conditions result in fermentative metabolism of glucose in rich media [YPD], and inhibition of the TOR signaling pathway in rich media by the small molecule rapamycin [RAPA]. We also studied conditions that require respiratory metabolism of ethanol in rich [YPEtOH] and minimal media [MMEtOH], and in rich media after sugars had been fully metabolized to ethanol and cells have undergone the diauxic shift [DIAUXY]. We tested two different starvation conditions, carbon [CSTARVE] and nitrogen starvation [NSTARVE]; however, the latter condition did not pass quality control during single-cell transcriptome library preparation and was discarded.

Table 1
Environmental Growth Conditions.

Environmental growth conditions are listed with their respective nitrogen and carbon sources. Yeast Extract + Peptone (YP) is a rich, complex nitrogen source. YP + Dextrose [YPD] is standard yeast rich media. Minimal media contains a standard base of trace metals, vitamins, and salts. All cultures were harvested 4 hr after inoculation, except for the culture harvested after the diauxic shift [DIAUXY], which was harvested 10 hr after inoculation. Rapamycin was added to YPD in the [RAPA] culture 30 min prior to harvest. Specific media formulations are listed in Supplementary file 1-Supplemental Table 4.

Growth conditionAbbrv.Nitrogen sourceCarbon source
Yeast Extract, Peptone, GlucoseYPDYPD-Glucose
YPD (Harvested after Post-Diauxic Shift)DIAUXYYPD-Glucose
YPD + 200 ng/mL RapamycinRAPAYPD-Glucose
Yeast Extract, Peptone, EthanolYPEtOHYPEthanol
Minimal Media (Glucose)MMD20 mM (NH4)2SO4D-Glucose
Minimal Media (Ethanol)MMEtOH20 mM (NH4)2SO4Ethanol
Nitrogen Limited Minimal Media (with Glutamine)NLIM-GLN0.8 mM L-GlutamineD-Glucose
Nitrogen Limited Minimal Media (with Proline)NLIM-PRO0.8 mM L-ProlineD-Glucose
Nitrogen Limited Minimal Media (with NH4)NLIM-NH40.8 mM (NH4)2SO4D-Glucose
Nitrogen Limited Minimal Media (with Urea)NLIM-UREA0.8 mM UreaD-Glucose
Carbon StarvationCSTARVE1 mM (NH4)2SO4None

Cells from the eleven different conditions were sequenced and processed using cellranger (10x genomics) and our custom analysis pipeline (fastqToMat0), yielding a digital expression matrix (Source code 1) in which each cell is annotated with the environmental growth condition and genotype. Genotype-specific barcodes facilitate identification and removal of droplets that have multiple cells (doublets) by determining cell IDs that have more than one annotated genotype. Using our pool of 72 strains, we detect and remove 98.5% of doublets. PCR artifacts and duplicates are removed using Unique Molecular Identifiers (UMIs) (Kivioja et al., 2012) to quantify gene expression as unique transcript reads (counts). Following sequence processing, quality control, removal of doublets, and assigning metadata, we recovered 83,703,440 transcript counts from a total of 38,225 individual cells.

To initially assess the quality of our data, we examined the expression of genes that are characteristic of different metabolic states. Consistent with our expectations, the core fermentative (anaerobic) genes PDC1 and ENO2 are expressed in cells in fermenting culture conditions only, and the core respirative (aerobic) gene ADH2 is expressed in cells in respiring culture conditions (Figure 2A). The number of cells recovered varies by over an order of magnitude between conditions; stressful conditions of low nitrogen have lower cell yields overall. The yeast stress response is linked to increased resistance to zymolyase digestion (Nagarajan et al., 2014), which may be reflected in decreased cell yield during single-cell sequencing. Each of the 72 strains is found in each of the 11 conditions, although the number of each strain and genotype varies by environmental condition (Figure 2—figure supplement 1A), and some strains are disproportionately affected. However, the number of transcripts per cell is generally equivalent between strains even when they differ in representation within libraries (Figure 2—figure supplement 1B). By contrast, we find that total transcript counts per cell are highly linked to environmental growth conditions (Figure 2—figure supplement 1C), which is consistent with decreased total transcriptome pool size in suboptimal conditions (Athanasiadou et al., 2019). For cells growing in rich medium (YPD) we recover a median of 2250 unique transcripts per cell, from a median of 695 distinct genes, indicating a capture rate of approximately 3–5% of total transcripts from each cell. The strain genotype does not strongly influence transcript counts per cell (Figure 2—figure supplement 1D). There is a high correlation between single-cell expression data and bulk RNA expression data (spearman correlation 0.941) for wild-type cells grown in YPD (Figure 2—figure supplement 2) indicating that the effect of technical bias caused by single-cell processing is minimal. We also find good correlation to other published single-cell yeast data sets, and a comparable published bulk RNAseq experiment,.

Figure 2 with 4 supplements see all
Gene expression of single Yeast Cells Cluster Based on Environmental Growth Condition.

(A) Normalized density histograms of raw UMI counts of the core glycolytic genes ENO2 and PDC1, and the alcohol respiration gene ADH2 in each environmental growth condition. Mean UMI count for each of the 12 different strain genotypes within each growth condition are plotted as dots on the X axis. (B–C) Uniform Manifold Approximation and Projection (UMAP) projection of log-transformed and batch-normalized scRNAseq data. Axes are dimensionless variables V1 and V2 with arbitrary units, here omitted. Individual cells are colored by environmental growth condition (B) or by strain genotype (C). Growth conditions are abbreviated as in Table 1.

Mapping the digital expression matrix into two-dimensional space with a Uniform Manifold Approximation and Projection [UMAP] results in clear separation of individual cells into groups based on environmental condition (Figure 2B). Cells from different minimal media or nitrogen-limited growth conditions localize near each other, and cells grown in different rich nitrogen sources are clearly separate from each other. Within environmentally-determined grouping there appears to be no strong ordering by genotype (Figure 2C). These clusters are not driven by sequencing depth (Figure 2—figure supplement 3B), although there are some stress conditions which have subpopulations which are downregulated for ribosomal genes and upregulated for induced environmental stress response (iESR) genes (Figure 2—figure supplement 3C–F). The increased relative abundance of ribosomal related gene expression in rich media conditions is consistent with previously-observed correlation of ribosomal gene expression and cellular growth rate (Brauer et al., 2008). Some measures of variance per-gene differ in different growth conditions (Figure 2—figure supplement 4). Interactive figures are provided (http://shiny.bio.nyu.edu/YeastSingleCell2019/) facilitating exploration of expression levels for all genes.

The mitotic cell cycle underlies heterogeneity in single cell gene expression

To identify sources of gene expression differences between cells within environments, we clustered single cells within each environmental condition separately by constructing a Shared Nearest Neighbor graph (Xu and Su, 2015) and clustering using the Louvain method (Blondel et al., 2008). Genes with known roles in mitotic cell cycle are highly represented among the most differentially expressed genes between clusters (Figure 3—figure supplement 1A). Overlaying the expression of three of these genes (PIR1, DSE2, and HTB1/HTB2) on UMAP plots illustrates cell cycle effects on single cell gene expression (Figure 3A and Figure 3B). PIR1 expression, a marker for early G1 (Spellman et al., 1998), is diagnostic of a distinct cluster. DSE2 is expressed only in daughter cells (Colman-Lerner et al., 2001), which allows daughter cells in G1 to be distinguished from mother cells in G1. Cells that have high expression of the histone 2B genes, which are upregulated in S-phase (Eriksson et al., 2012), are localized together in the UMAP plots (Figure 3B).

Figure 3 with 2 supplements see all
Cells Within Conditions Cluster According to Cell Cycle Genes.

(A) Cells from each growth condition were separately normalized and transformed into 2-dimensional space using UMAP. The log-transformed, normalized expression for each cell of (i) the G1-phase specific marker PIR1, (ii) the G1-phase daughter-cell specific marker DSE2, (iii) the S-phase specific marker histone 2B (HTB) is shown; (iv) the genotype and (v) the cluster membership of each cell. (B) Summary of clustered single cell expression within the YPD and RAPA growth conditions (i) Proportion of cells from a specific strain genotype within each cluster (ii) The mean log-transformed, normalized expression of the G1- and S-phase marker genes, as well as a hexokinase gene HXK2 for each cluster (C) Schematic of the mitotic cell cycle with expression of DSE2, PIR1, and HTB genes annotated.

For each cluster of cells within a growth condition we plotted the proportion of cells belonging to each TF deletion genotype, and the mean expression of several cell cycle genes (Figure 3B). Some clusters predominantly contain cells from a single TF deletion genotype; for example, cells deleted for GLN3 (gln3Δ) form a separate cluster in YPD and RAPA conditions, as do cells deleted for one of the RTG heterodimer components (rtg1Δ and rtg3Δ). However, differences in expression due to genotype do not appear to be a primary source of expression differences within conditions, as most clusters show a uniform distribution of genotypes (Figure 3B, Figure 3—figure supplement 1B). Similarly, we do not find that differences in expression of metabolic genes underlie overall differences in expression (e.g. HXK2) suggesting that the yeast metabolic cycle (Silverman et al., 2010; Tu et al., 2005) is not readily identifiable in single cells using scRNAseq. Three of the high-stress growth conditions (NLIM-GLN, NLIM-PRO, and MMEtOH) have clusters that are separate from the majority of the cells analyzed in those conditions. We find that these clusters have higher levels of stress response genes and lower levels of ribosomal genes than other cells in these conditions (Figure 3—figure supplement 2B–C) These clusters may reflect cells undergoing early entry into quiescence and provide evidence for a heterogeneous response to stressful conditions.

Deletion of Transcription Factors causes gene expression changes that differ between growth conditions

To assess our ability to determine differential gene expression between TF knockout strains, we examine the expression of genes known to respond to nitrogen signalling. GAP1 (General Amino acid Permease) is a transporter responsible for importing amino acids under conditions of nitrogen limitation; GAP1 expression is regulated by the NCR activators GAT1 and GLN3, the NCR repressors GZF3 and DAL80 (Stanbrough and Magasanik, 1995), and potentially GCN4 (Natarajan et al., 2001). We identify differing degrees of dysregulation of GAP1 expression when these TFs are deleted (Figure 4A). The effect of deleting TFs varies by condition: GAP1 is not expressed in YPD and its expression increases in nitrogen-limited media and in response to rapamycin. Deletion of GAT1 results in decreased expression in nitrogen limiting media, but deletion of GLN3 does not affect GAP1 expression. By contrast, in the presence of rapamycin deletion of GLN3 results in reduced GAP1 expression. Deletion of GCN4 only impacts GAP1 expression in the presence of urea. MEP2 and GLN1 are also responsive to nitrogen TFs, and are dysregulated when certain TFs are deleted; expression of the glycolytic gene HXK2 decreases when GLN3, GCN4, or RTG1/RTG3 are deleted, but only in conditions of nitrogen limitation (Figure 4—figure supplement 1A). These environmentally dependent impacts of genotype on gene expression demonstrate the importance of exploration of variable conditions for studying genotypic effects on expression.

Figure 4 with 1 supplement see all
Impact of Deleting Transcription Factors on Gene Expression.

(A) Violin plots of the log2 batch-normalized expression of the general amino acid permease gene GAP1 in YPD, RAPA, ammonium-limited media, and urea-limited media. (B) Count of differentially expressed genes in each combination of growth condition and strain genotype. Data were transformed to pseudobulk values by summing all counts for each the six biological replicates for each genotype and then analyzed for differential gene expression using DESeq2 [1.5-fold change; p.adj <0.05]. (C) Log2(fold change) of genes differentially expressed in TF knockout strains compared to wildtype, when grown in YPD. Asterisks denote statistically significant differences in gene expression [1.5-fold change; p.adj <0.05].

A variety of statistical methods have been proposed and benchmarked for testing different expression of scRNAseq data (Soneson and Robinson, 2018). Our experimental design allows single-cell measurements to be collapsed into a total count (pseudobulk) measurement by summing counts across all cells that correspond to each of the six individual replicates of each genotype within a condition. When we analyze this data using standard approaches to RNAseq analysis (DESeq2) we detect several genes with significant (adjusted p-value<0.05) differences in expression (fold change >1.5) between wild-type and TF deletion strains (Figure 4B) that are consistent with known regulatory pathways. There are considerably fewer changes in gene expression as a result of TF deletions compared to the hundreds of genes that change expression between different conditions (Figure 4—figure supplement 1B). However, in cells grown in rich media [YPD], we found 96 genes that are differentially expressed in TF deletion strains compared to wildtype (Figure 4C), and expression of 160 genes are perturbed in TF deletion strains compared to wildtype when exposed to rapamycin [RAPA] (Figure 4—figure supplement 1C). Many of these differentially expressed genes are annotated as functioning in amino acid metabolism and biosynthesis.

Optimal modeling parameters for network inference from Single-Cell yeast data

Differential gene expression in a TF knockout strain is not sufficient evidence of a direct regulatory relationship as many significant changes in gene expression upon deleting a TF are indirect, and many direct effects may be subtle. Therefore, we constructed a gene regulatory network using the Inferelator, a regression-based network inference method which is based on three main modeling assumptions. First, we assume that Transcription Factor Activity (TFA) is a latent biophysical parameter that represents the effect of a TF binding to DNA and modulating its transcription activity (Arrieta-Ortiz et al., 2015; Fu et al., 2011). The TFA values are not directly measured, and instead must be estimated as a relative value based on prior knowledge of a regulatory network of TF and target relationships. This TFA estimation is essential as many TFs are post-transcriptionally regulated, or are expressed at levels that are not reliably detected by scRNAseq (Filtz et al., 2014). Second, we assume that expression of a gene can be described as a weighted sum of the activities of TFs (Bonneau et al., 2006) using an additive model in which activators and repressors increase or decrease the expression of targets linearly. Finally, we assume that each gene is regulated by a small number of TFs, and that regularization of gene expression models is required to enforce this biologically relevant property of target regulation. Saccharomyces cerevisiae, as a preeminent model organism in systems biology, has a well defined set of known interactions that are of considerably higher quality than is available for more complex eukaryotes providing a validated gold standard for testing model performance (Tchourine et al., 2018).

To evaluate the performance of data processing methods and model parameter selections within the Inferelator on scRNAseq data, we perform ten cross-validations using the existing gold standard network. During cross-validation, we infer a GRN using half of the gold standard target genes as priors, then evaluate performance based on recovery of TF-target gene interactions for gold standard interactions that are left out of the priors. We tested preprocessing and prior selection options by inferring networks using gene expression models that are regularized by best subset regression to minimize Bayesian Information Criterion (Arrieta-Ortiz et al., 2015; Greenfield et al., 2013) and quantified performance in predicting TF-target interactions using the area under the precision-recall curve (AUPR). As negative controls, we employed the same procedure after shuffling priors and after simulating scRNAseq data in which all variance is due to sampling noise. The negative control with shuffled priors establishes a random classifier baseline AUPR of 0.02; the negative control with simulated data establishes a circular recovery baseline AUPR of 0.06 (Figure 5A). Performance of the Inferelator on our scRNAseq data far exceeds these baselines, with a mean AUPR of 0.20. This performance from our single dataset is comparable to that of a GRN constructed from 2577 experimental observations using bulk gene expression data (Tchourine et al., 2018).

Figure 5 with 1 supplement see all
Model Performance and Impact of Data Imputation, Prior Selection, and Multitask learning on Network Inference using the Inferelator.

(A) Model performance of Inferelator (TFA-BBSR) network inference after shuffling priors [Neg.Shuffled], on a simulated negative data set [Neg. Data], on the unaltered count matrix [No Imputation], and after imputing missing data from the count matrix using the MAGIC, ScImpute, and VIPER packages. Model performance is shown using area under the precision-recall curve [AUPR], as well as the number of network edges using a precision (>0.5) cutoff, and the number of network edges using a confidence (>0.95) cutoff. Each point plotted in gray is a separate cross-validation analysis, with mean +/- one standard deviation plotted in black (n = 10). (B) Median AUPR after cross-validation (n = 10) and resampling to different numbers of cells, for priors extracted from the gold standard [GS], the YEASTRACT database, Bussemaker et al, priors predicted from ATAC-seq data and motif searching, and no prior data. (C) AUPR of separate cross-validation network inference using cells from all growth conditions, or from individual conditions separately. Each cross-validation (n = 10) was downsampled to the same number of cells. (D) Cross-validation (n = 10) using the YEASTRACT prior data. Networks are learned for all conditions together [BBSR (ALL) ●], for all conditions individually with TFA-BBSR followed by combination [BBSR (BY TASK) ▲], and for all conditions together in multi-task learning followed by combination [AMuSR (MTL) ◆]. Models are evaluated by (i) AUPR on the aggregate, final network and (ii) AUPR for each task-specific subnetwork from BBSR (BY TASK) (▲) and AMuSR (MTL) (◆).

The sparsity of data for each cell acquired using scRNAseq may negatively impact its utility in GRN construction. A commonly used technique to address missing data is data imputation. We tested the impact of several imputation packages on network inference: MAGIC (van Dijk et al., 2018), ScImpute (Li and Li, 2018), and VIPER (Chen and Zhou, 2018). Whereas these methods can enhance separation of gene expression states in low-dimensionality projections (Figure 5—figure supplement 1A), we find that they are either ineffective or detrimental to network inference (Figure 5A). When the GRN is reconstructed from interactions selected at a precision threshold of 0.5, which takes into account how many interactions are correct according to the gold standard, no imputation method increases the number of recovered interactions compared to unmodified data. Data imputation with MAGIC increases the total number of confidently predicted (confidence >0.95) interactions, but recovers fewer interactions that are correct according to the gold standard.

Selection of priors for inference from Single-Cell yeast data

Algorithms for network inference perform poorly when making predictions based only on expression data (Greenfield et al., 2010). Including prior knowledge of regulatory relationships and network topology improves model selection, and allows approximation of latent variables like TFA. Priors can be generated from regulatory interactions defined using methods such as chromatin immunoprecipitation sequencing (ChIP-seq) or analysis of transposase-accessible chromatin (ATAC-seq) and TF binding motifs, or from curated databases of interactions derived from literature. The source and processing of prior knowledge has a substantial effect on the size and accuracy of the learned network (Azizi et al., 2018; Siahpirani and Roy, 2017). We tested the impact on GRN reconstruction of prior data derived from literature, and from high-throughput experimental assays that encompass interactions between the entire yeast genome and the majority of known TFs (Figure 5B). The best performance is obtained using a curated set of known TF-gene interactions obtained from YEASTRACT (Teixeira et al., 2018). Generating priors using motif searching within open chromatin regions determined by ATAC-seq (Castro et al., 2019; Miraldi et al., 2019), and by modeling TF-DNA affinities in promoters (Ward and Bussemaker, 2008) provides a considerable improvement over GRN reconstruction from TF expression without priors, but have lower performance than priors derived from curated data.

Multi-task learning improves network inference and enables reconstruction of a unified Gene Regulatory Network from multiple conditions

Numerous methods exist for integrating information across different conditions and experiments that aim to reduce technical variation while retaining biologically meaningful differences (Hicks et al., 2018; Leek et al., 2010). The appropriate approach to integrating scRNAseq data for the purpose of GRN reconstruction remains unknown. We find that when we separate data based on environmental conditions and infer GRNs we obtain unique networks of differing quality (Figure 5C). Learning a single network from all conditions by first combining the data can be compromised by technical variability and imbalance in the number of cells between conditions. Furthermore, normalizing batches to equal transcript depth risks suppressing differences which are true biological variability. An alternative approach is to treat the cells from each environmental condition as separate tasks. Separate tasks can be learned independently, without sharing information between tasks (implemented as BBSR (BY TASK)). This entails learning networks from each task, and then combining task-specific networks into a global network. Alternatively, networks can be learned together in a multitask learning (MTL) framework (Lam et al., 2016), sharing information between tasks while they are learned, which we have implemented as Adaptive Multiple Sparse Regression (AMuSR) (Castro et al., 2019). We find that, compared to network inference using all data simultaneously [BBSR (ALL)], treating conditions as separate network inference tasks provides a considerable improvement in performance (Figure 5Di). This is likely due to the retention of environmentally specific interactions that would otherwise be obscured using methods for normalizing data prior to GRN construction. The performance of the information sharing network inference approach [AMuSR (MTL)] and the non-sharing network inference approach [BBSR (BY TASK)] are very similar overall. We find that some individual tasks had modest improvements in model performance with AMuSR and others with BBSR (Figure 5Dii).

We constructed a global gene regulatory network using the YEASTRACT priors (as determined above) and our multi-task network inference (AMuSR) procedure. Eleven GRNs were jointly learned from each of the eleven environmental growth conditions; for each task a confidence score for each regulator-target interaction was calculated. GRNs learned for each condition were combined by rank summing condition-specific confidence scores to create a global confidence score for each potential interaction. All potential interactions are ranked by global confidence score, and a global GRN is constructed from interactions that meet the precision threshold of 0.5, as measured by recovery of known interactions (Figure 6A, Source code 2). The resulting GRN comprises 6114 new interactions and 6114 interactions present in the priors, resulting in a total of 12,228 regulator-target interactions. We find that 5372 interactions from the priors are not recovered (recall of 0.532). The global GRN comprises an identified regulator for approximately half of all known genes (Figure 6—figure supplement 1A). There is a positive correlation between expression level for a gene and the number of regulators for that gene (Figure 6—figure supplement 1B) and 90% of the identified interactions are predicted to have activating effects (Figure 6—figure supplement 1C). Many condition-specific networks have uniquely identified interactions (Figure 6—figure supplement 1D), but more than 75% of the final network is composed of TF-gene interactions found in multiple conditions (Figure 6—figure supplement 1E). Of the novel learned interactions (i.e. those not in the prior data), 60% have evidence of a TF-gene regulatory relationship when compared to the YEASTRACT database (Figure 6—figure supplement 1F). 573 learned TF-gene interactions have evidence for physical localization of the TF to the target gene, and 2957 learned TF-gene interactions have evidence of expression changes when the TF is perturbed.

Figure 6 with 1 supplement see all
Reconstruction of a Gene Regulatory Network Identifies New Regulatory Relationships.

A network inferred from the single-cell expression data using multi-task learning and the YEASTRACT TF-gene interaction prior, with a cutoff at precision >0.5. (A) Network graph with known interaction edges from the prior in gray and new inferred interaction edges in red (B) Network graph of the 11 nitrogen-responsive transcription factors with known edges from the prior in gray and new edges in red (C) The number of interactions for each TF; interaction edges present in the prior that are not in the final network are included in black. The nitrogen TFs knocked out in this work are labeled in blue, and TFs with gene ontology annotations for mitotic cell cycle are annotated in green (D) Gene ontology classification of network interactions by the GO slim biological process terms annotated for the target gene and the regulatory TF (the GO term transcription from RNA pol II is omitted from the annotations for regulatory TFs).

Within the nitrogen-regulated TF subnetwork comprising the 11 deleted TFs (Figure 6B) we identify 885 regulator-target interactions, of which 447 are novel, and 438 are present in the priors. This subnetwork contains many features consistent with expectations including co-regulation of targets by the NCR TFs. Overall, the global GRN has the largest number of target genes for general TFs (including ABF1, RAP1, CBF1, and SFP1), but we also define regulatory relationships for a total of 129 of the predicted 207 yeast TFs (Figure 6C). The poorest recovery of prior data is found for TFs that regulate environmental responses not included in our experimental design, such as the stress response TF MSN2 and the mating TF STE12, highlighting the necessity of exploration of condition space for complete network reconstruction. Regulators and target genes can be mapped to Gene Ontology (GO) biological process slim terms, which are broad categorizations that facilitate pathway analysis. Ordering GO slim terms by the number of interactions in the learned GRN, we find that for target genes eight of the top ten GO slim terms are metabolism-related (Figure 6D i); in contrast, for regulatory TFs, five of the top ten GO slim terms are stress response related (Figure 6D ii).

Identification of coregulation by cell cycle and environmental response TFs

Analysis of single cell expression in asynchronous cultures allows detection of cell cycle regulated relationships. The learned global GRN contains 257 genes that are regulated both by nitrogen TFs and by cell cycle TFs (Figure 7A). Many of these regulatory connections are novel; likely due to the fact that identifying interactions between metabolism and cell cycle are challenging in asynchronous cultures without single-cell techniques. Of these genes, 38 are annotated with the amino acid metabolic biological process GO term and 20 are annotated with the ion or transmembrane transport biological process GO term. Only 11 are annotated with the mitotic cell cycle biological process GO term, indicating that the majority of the interconnection between cell cycle and nitrogen response genes is due to regulation of metabolism-related genes by cell cycle TFs.

Figure 7 with 1 supplement see all
Coordinated regulation of Nitrogen Response and Cell Cycle.

(A) A gene regulatory network showing target genes that are regulated by at least one nitrogen TF (blue) and at least one cell cycle TF (green). Target gene nodes are colored by GO slim term. Newly inferred regulatory edges are red and known regulatory edges from the prior are in gray. Transcription factor activity (TFA) is calculated from the learned network and then scaled to a z-score over all cells which do not have that TF deleted (e.g. gcn4Δ cells are omitted from the calculation for GCN4 TFA). The mean TFA z-score for four selected conditions is inset for GAAC and NCR TFs (B) TFA for cell cycle TFs for each cell in the YPD growth condition.

We estimated the TFA for every TF in each cell, using the learned GRN and the single-cell expression matrix. The TFA of nitrogen responsive TFs is principally linked to growth condition as these TFs vary in activity between conditions (Figure 7A), but are generally similar within condition (Figure 7—figure supplement 1A). As expected, we find that cells grown in rich media (YPD) have low TFA for the NCR TFs GLN3 and the GAAC TF GCN4. The TFA for these TFs increases substantially upon treatment with rapamycin. By contrast, the estimated TFAs of cell cycle TFs varies within condition (Figure 7B); and are concordant with cell cycle responsive gene expression (Figure 7—figure supplement 1B–D).

Discussion

A robust scRNAseq and transcriptional barcoding method in yeast

Since the inception of single-cell RNA sequencing (Tang et al., 2009), technological advances have resulted in the scale of datasets increasing from tens of cells to tens of thousands in a diversity of organisms. However, the number of cells recovered during scRNAseq in budding yeast has been comparatively limited in studies published to date (Gasch et al., 2017; Nadal-Ribelles et al., 2019). We present here the first report of droplet-based scRNAseq in this widely used model eukaryotic cell. Using a diverse library of transcriptionally barcoded gene deletion strains we were able to efficiently analyze the gene expression state of 38,255 cells using 11 experiments. In addition to facilitating multiplexed analysis of genotypes, transcriptional barcoding provides a facile means of identifying doublet cells within droplets thereby increasing the accuracy of single cell analysis.

Consistent with our understanding of global gene expression variation first characterized in foundational studies of the transcriptome (DeRisi et al., 1997; Gasch et al., 2000), we find that environmental condition is the primary determinant of the gene expression state of individual yeast cells. However, we observe significant heterogeneity in individual cell gene expression within conditions. Much of this variation can be explained by the mitotic cell-cycle. It is important to note that we do not remove or suppress this cell-cycle driven variance. The cell cycle is itself driven by transcriptional regulators, and our goal is to build a network that integrates cell-cycle regulation with regulated responses to the environment. The ability to access the crosstalk between signalling pathways and the cell cycle program is a key advantage to performing single-cell sequencing in asynchronous cultures, which bypasses many of the limitations of synchronized bulk sequencing experiments. It is also important to note that in several stressful growth conditions, we see heterogeneous cellular responses; some cells appear to be proliferative, while other cells have downregulated translational machinery and upregulated stress response genes. This is an interesting outcome by itself, as it is further evidence of bet-hedging strategies (Levy et al., 2012), and we expect that the presence of multiple distinct transcriptional states between cells in the same environmental condition is advantageous for network inference. Model performance, as measured by AUPR, can vary considerably when learning networks from any single growth condition (Figure 5C). Cells in rich YPD media do not require many anabolic pathways to be active, and primarily express genes required for the cell-cycle, translation, and glycolysis; in contrast, cells in minimal MMD media must express these pathways plus many anabolic pathways to synthesize nitrogenous bases, cofactors and amino acids. We find that this increased transcriptional diversity results in better overall performance. Nonetheless, the largest performance gain comes from aggregating networks from cells in different conditions (Figure 5D), which demonstrates a general advantage to learning GRNs from heterogeneous data.

Deletion of specific transcription factors results in changes in single cell gene expression for some TFs in some conditions. However, genotypic effects are comparatively minor. We believe that this is due to multiple factors including functional redundancy between TFs, physiological adaptation to the genetic perturbation and the conditional specificity of TFs. It is likely that perturbations that are transiently induced, and result in increased TF activity (McIsaac et al., 2013) may be effective in eliciting detectable responses in gene expression, facilitating causal inference. The use of precise gene deletions does provide several advantages over the use of CRISPR/Cas9-based perturbations as engineered deletions are unambiguous whereas the efficiency of perturbation by CRISPR/Cas9 varies for different guide RNAs.

A generalizable framework for GRN construction using scRNAseq

Constructing GRNs from single cell gene expression data is a universal goal in all organisms. A yeast single-cell expression matrix has several beneficial properties for design and testing of gene regulatory network inference models as there exist high quality known interactions and TF binding motifs. The issues of data sparsity and low sampling rates are likely to be problems common in experiments in any organism using scRNAseq. We find that techniques that have been developed for normalization and imputation do not improve performance of the additive linear model-based inference of the Inferelator algorithm (Figure 5). However, there are significant opportunities for development of smoothing techniques that would enhance network inference, perhaps targeting latent biophysical parameters like transcription factor activity. It seems reasonable to assume that these biophysical parameters should be stable within the local neighborhood of samples, and the activity calculation that we have used is ill-conditioned and potentially unstable. This is of particular concern when working with undersampled single-cell data and we are actively addressing this issue.

We find that the application of multitask learning is well suited to GRN reconstruction from scRNAseq data. Jointly learning multiple related tasks improves generalization accuracy, especially in scenarios in which datasets are undersampled (Caruana, 1998), and has the desirable side benefit of mitigating the need for complex batch-correction techniques that aim to address technical variation between experiments. Removing batch-effect technical noise from data without suppressing interbatch biological variability remains an unsolved problem, and therefore application of multitask learning approaches to network inference from single-cell data is likely to be generally applicable to integrating scRNAseq data from different cell types and conditions.

A global GRN for budding yeast

Using our scRNAseq dataset, we reconstructed a global GRN with several novel regulatory relationships. Among the most novel of these interactions are those between cell-cycle associated TFs and targets and nitrogen TFs and target genes. The cell cycle and metabolism are, by necessity, interconnected, and the mechanism of rapamycin in arresting cell cycle through TOR is well-established (Heitman et al., 1991). Several studies have identified metabolic cycling patterns which are believed to be driven by the cell cycle (Burnetti et al., 2016; Slavov and Botstein, 2011; Tu et al., 2005). Although regulatory connections between environmental sensors, metabolism, and the cell cycle have been previously reported, a comprehensive regulatory network does not exist, in large part because of the difficulty of experimentally perturbing cell cycle without confounding metabolic changes. Our study provides a valuable first step in identifying specific regulatory connections that were previously inaccessible, and which are necessary to create a complete map of the yeast regulome.

Incorporation of additional information into the network inference process, including information about interactions between transcription factors such as functional redundancy and heterodimerization, would likely improve learning of the network. We note that several TFs have few learned targets reflecting the requirement for surveying conditions in which particular TFs are active. For example, STE12 and TEC1 are mating-related TFs that we expect to be entirely inactive in our diploid cells; MSN2 and YAP1 are stress-responsive TFs that respond to specific stimuli that were not tested in our study. Targeted analysis of the GRN with rationally designed genetic perturbations and environmental conditions will maximize the additional information that can be recovered from future experiments.

Conclusion

Single-cell sequencing is a transformative method for systems biology. To date, scRNAseq has been widely applied to the problem of defining different cell types. However, the ability to simultaneously study the expression of hundreds of genotypes in different conditions, and sample the expression state of thousands of cells, presents a rich source of information for the purpose of GRN reconstruction. Our study implements this approach in budding yeast, the workhorse of systems biology, and establishes a generalizable framework for GRN reconstruction from scRNAseq data in any organism.

Materials and methods

Requests for strains and reagents should be directed to David Gresham (dgresham@nyu.edu). Requests related to computational analysis and code should be directed to Richard Bonneau (rb133@nyu.edu). There are no restrictions on the materials or the code used in this work. All materials are released under CC-BY 4.0 and all code is available under the permissive MIT or BSD licenses.

Yeast strain construction and growth

All yeast strains were generated from the prototrophic FY4 (MATa) or FY5 (MATɑ) background strains. Yeast were transformed using the standard lithium acetate transformation protocol (Gietz and Schiestl, 2007). E. coli were transformed using the standard chemically competent transformation protocol. Plasmid constructions were confirmed by sanger sequencing. Yeast genotypes, plasmid sequences, and oligonucleotide sequences are provided as Supplementary file 1-supplemental tables 1-3. Media formulations are provided as Supplementary file 1-supplemental table 4.

Construction of barcoded deletion cassettes

Request a detailed protocol

The deletion cassette plasmid was constructed by amplifying pTEF::KANR from pUG6 (Euroscarf) and tTEF from pUG6, with an overlapping junction between KANR and tTEF containing two BbsI sites for golden-gate mediated barcode cloning. These pieces were assembled into pUC19 using gibson isothermal assembly to generate DGP304. This plasmid was then modified by linearizing with BamHI and XbaI, amplifying a bacterial GFP expression cassette from pWS158 (Addgene), and assembled using gibson isothermal assembly to generate DGP306.

Gene deletion barcodes were created by synthesizing an oligonucleotide containing flanking PCR handles (M13F and M13R), flanking BbsI sites for golden gate cloning, and the degenerate sequence caNNgNNgtNNgNNgtNNgNNgt. The mixture of oligonucleotides was double-stranded using E. coli DNA Polymerase I, Large (Klenow) fragment. Klenow buffer (1x NEB Buffer 2.1 [NEB #B7202S]) was mixed with 250 nM barcode oligonucleotide, 250 nM M13R primer, 200 nM/each dNTP [NEB #N0447S], incubated at 80°C and slowly cooled to room temperature. The DNA Polymerase I, Large (Klenow) Fragment (NEB #M0210S) was added to 0.1 U/µL and the reaction was incubated at 37°C for 30 min. The polymerase was heat-inactivated by placing the reaction at 75°C for 20 min. The resulting dsDNA cassette was used with no further cleanup.

The barcode was inserted into the 3’ untranslated region of the pTEF::KANR::tTEF yeast selection marker cassette in DGP306 by BbsI-mediated golden gate cloning. A golden gate reaction was prepared with 1x Thermo FastDigest Buffer [Thermo #ER1011], 1 mM ATP, 10 mM DTT, 2 U/µL T4 DNA Ligase [NEB #M0202S], 1 U/µL BpiI [Thermo #ER1011], 10 ng/µL DGP306, 25 nM barcode dsDNA, and incubated in a thermocycler using the following program: 37°C 20 min; 25x cycles of 37°C for 5 min and 16°C for 5 min; 37°C for 20 min; 80°C for 20 min. An additional 1 U/µL BpiI was then added to the reaction mix and incubated at 37°C for 30 min to linearize any remaining uncloned plasmid.

The golden gate cloning reaction was transformed into One Shot TOP10 E. coli (ThermoFisher #C404003). Cloning and transformation efficiency was estimated by plating 2% of the transformation onto LB + ampicillin plate and counting GFP+ and GFP- colonies. The remainder of the reaction was inoculated into 200 mL molten LB + ampicillin + 0.6% (w/v) SeaPrep Agarose (Lonza 50302) media, thoroughly mixed, snap cooled in an ice bucket, and incubated overnight at 37°C. The soft agar culture was then collected by centrifugation, washed with PBS, and resuspended in 2 mL 50% glycerol. 100 µL of this mixture was used to inoculate a culture of 100 mL LB + ampicillin and the remainder stored at −80°C in aliquots. The 100 mL culture was grown for 8 hr at 37°C, harvested, washed with PBS, and stored at −20°C until midiprepped (Qiagen) according to the manufacturer’s protocol.

Construction of a barcoded Transcription Factor deletion array

Request a detailed protocol

The degenerate barcoded plasmid was used as template for PCR using primers containing gene-specific targeting homology arms (1x NEB Q5 Master Mix #M0494S, 1 ng template plasmid, 250 nM/each oligo). The PCR amplicon was then transformed into FY4 and plated on YPD+G418 to select transformants. Transformants containing the gene deletion were confirmed using colony PCR and gene-specific primers and a KANR primer. PCR products of correct transformants were cleaned using silica spin columns (Qiagen) according to the manufacturer’s protocol and the barcode identified by Sanger sequencing. At least six uniquely barcoded strains (i.e. biological replicates) were generated for each genotype, with the criteria that each barcode had to differ by at least three bases, ensuring that the probability of barcode collisions is extremely low.

The plasmid DGP328 (pTEF::NATR::tTEF) was used as template for PCR using primers containing the same gene-specific targeting homology. The PCR amplicon was transformed into FY5 and plated on YPD+nourseothricin. Positive transformants were confirmed using colony PCR with gene-specific primers and a NATR primer.

FY4-derived MATa strains were arrayed in a 96-well plate (Corning 3788) and then pinned (V and P Scientific #VP407FP12) onto YPD in an OmniTray (Nunc 165218). FY5-derived MATɑ strains were arrayed in a 96-well plate so that the same gene was disrupted in matching wells of the MATa and MATɑ plates and then pinned onto YPD. These arrays are grown overnight at 30°C. The MATa array and MATɑ array were then pinned to the same YPD plate to create spots where MATa and MATɑ strains were overlaid. The plate layout was designed so that some locations had only MATa strains, only MATɑ strains, or no strains, to control for mating, contamination, and the efficacy of diploid selection. The mating array was grown overnight at 30°C to allow mating to occur and then pinned to a YPD+G418+nourseothricin plate to select for MATa/MATɑ diploids. This diploid selection plate was grown overnight and then pinned to a YPD+G418+nourseothricin plate for a second round of diploid selection. The second diploid selection plate was grown overnight at 30°C and then pinned to a YPD+G418+nourseothricin plate for a third round of diploid selection at 30°C. This plate was then pinned to several replicate 96 well round-bottom plates containing 200 µL YPD+G418+nourseothricin in each well. These plates were cultured with shaking overnight at 30°C, then centrifuged and the media aspirated. The cells were resuspended in 50% glycerol and the plates stored at −80°C.

Culturing and harvest

Request a detailed protocol

The barcoded deletion array was pinned from a frozen stock plate at −80°C onto a YPD plate for recovery and grown overnight at 30°C. The first recovery plate was then pinned to a second recovery YPD plate and grown overnight at 30°C. The second recovery plate was pinned to a 96 well round-bottom plate containing 200 µL YPD in each well and grown overnight at 30°C. The cultures from this plate were pooled, washed 2x with 50 mL PBS, and then resuspended in 1 mL PBS. 250 µL of the washed cells were used to inoculate 50 mL of the relevant media for the specific experimental condition in a shake flask. These flasks were grown for 4 hr. The experiment grown to diauxic shift was grown for 10 hr. We confirmed that glucose in the media was exhausted between hour 9 and hour 10 using a hexokinase-based assay (R-Biopharm #10716251035). All other steps of harvesting cells were identical to the 4 hr experiments. The experiment treated with rapamycin was grown for 3 hr and 30 min in YPD, and then 10 µL of rapamycin stock (1 mg/mL Millipore #553210 in ethanol) was added to a final concentration of 200 ng/mL. Cells were then harvested at 4 hr (after 30 min in rapamycin).

Cell count per mL at harvest was determined using a Beckman Coulter Z2 Particle Counter #6605700. Cell density (cells/mL) for each condition at harvest was as follows: (YPD 1.4e7; RAPA 1.2e7; YPEtOH 1.0e7; NLIM-GLN 0.5e7; NLIM-NH4 0.8e7; NLIM-PRO 0.4e7; NLIM-UREA 0.5e7; MMD 1.1e7; MMEtOH 0.7e7; CSTARVE 0.1e7) A volume of culture containing ~108 cells was collected and the cells pelleted by centrifugation. These cells were immediately resuspended in 1 mL RNALater (Qiagen #76104), washed 2x with 1 mL RNALater and resuspended in a final volume of 500 µL RNALater. This suspension was stored at −20°C for 12 to 72 hr.

Library preparation and sequencing

All steps below used RNAse-free reagents.

Single cell library preparation

Request a detailed protocol

Cells stored in RNALater were removed from −20°C and ~107 cells were washed 2x with 1 mL spheroplasting buffer (50 mM Sodium Phosphate pH 7.5, 1M Sorbitol, 10 mM EDTA, 2 mM DTT, 100 µg/mL BSA). Cells from fermentative phase growth cultures were then resuspended in 100 µL spheroplasting buffer + 0.1 U Zymolyase 100T (Zymo Research #E1004). Cells from respiratory phase growth cultures or starvation cultures were resuspended in 100 µL spheroplasting buffer + 0.25U Zymolyase 100T. The spheroplasting reaction was incubated at 37°C for exactly 20 min, and then the spheroplasted cells were pelleted and resuspended in 500 µL RNALater for 5 min on ice. After this incubation the spheroplasted cells were pelleted and washed 3x with 1 mL wash buffer (10 mM TRIS pH 8, 1M Sorbitol, 100 µg/mL BSA) and resuspended in 1 mL wash buffer. The cells were visualized to confirm spheroplasting and counted using a hemocytometer. A dilution equal to ~5×106 cells/mL in wash buffer was prepared and then immediately used for single cell isolation.

Single cell library preparation was done using the 10x Genomics Chromium 3’ v2 Single Cell Gene Expression Kit (10x Genomics #120237), following the kit protocol. 66.2 µL of single-cell master mix was prepared to which 27.7 µL H2O was added. The microfluidic Chromium Single Cell A Chip (10x Genomics #120236) was then prepared for use. 6 µL of prepared spheroplast cell suspension was added to the single-cell master mix, and then immediately transferred to the microfluidics chip. Hydrogel beads and partitioning oil were added according to the manufacturer’s protocol, and the cells were encapsulated with hydrogel beads using the 10x Genomics Chromium Controller. Following emulsification, reverse transcription and cleanup was performed according to the manufacturer’s protocol. Whole transcriptome amplification was performed using a total of 10 cycles of PCR. Cleanup, fragmentation, adapter ligation, and indexing was performed according to the manufacturer’s protocol, using 8–10 cycles of PCR for the indexing reaction.

Transcribed barcodes were amplified from the whole transcriptome amplification prior to fragmentation. The KANR transcript containing the genotype barcode was amplified in a reaction(1x KAPA HiFi Hotstart Readymix [Kapa #KK2602], 200 nM/each primer, 1 µL 10x whole-transcriptome DNA), using 6 cycles of PCR (98°C for 3:00; 6 cycles of 98°C for 0:20, 63°C for 0:20, and 72°C for 0:20 min; 72°C for 1:00 min). The amplicon pool was then purified with 1x volume of SPRIselect beads (Beckman Coulter #B23317) and eluted into 24 µL H2O. To this eluate, 25 µL of 2x KAPA HiFi Hotstart Readymix was added, as well as 200 nM/each indexing primers. The indexing reaction was cycled for 8–10 cycles of PCR, using the 10x Genomics indexing PCR reaction settings (98°C for 0:45; 8-10x cycles of 98°C for 0:20, 54°C for 0:30, and 72°C for 0:20; 72°C for 1:00).

Library fragment sizes were determined using a High Sensitivity D1000 Screentape (Agilent #5067–5584) and quantified with the KAPA illumina library quantification system (Roche #KK4953) on a Roche lightcycler 480. Libraries from each condition were pooled so that 99% of the pool consisted of the single-cell transcriptome library and 1% of the pool consisted of the genotype barcode amplicon. Samples were then pooled for multiplex sequencing on an Illumina NextSeq 500 with the NextSeq 500/550 v2.5 High Output 150 Cycle kit (Illumina #20024907), using the sequencing parameters recommended by 10x Genomics (Read 1: 26 bp, Read 2: 98 bp, Index 2: 8 bp) and standard illumina read and indexing primers.

Bulk RNA library preparation

Request a detailed protocol

Each of the six wild-type yeast strains (MAT a/ɑ Δho::KanMX/Δho::NatMX) were separately grown overnight in YPD at 30°C.~108 cells (0.5 mL) of overnight culture was subcultured into separate 50 mL flasks of pre-warmed YPD and cultured with shaking for 4 hr at 30°C. At 4 hr, for each culture flask,~108 cells were pelleted by centrifugation and immediately transferred to a microfuge tube, then snap-frozen in liquid nitrogen for storage at −80°C.

For each of six wild-type samples snap-frozen in liquid nitrogen and stored at −80°C, cell pellets were removed from −80°C storage and immediately resuspended in 1 mL TRIZOL (ThermoFisher #15596026), which is an organic extraction reagent with phenol and the chaotropic salt guanidinium thiocyanate (Chomczynski and Sacchi, 2006). After sitting at RT for 5 min, 200 µL chloroform was added and tubes were mixed by inversion. Organic and aqueous phases were separated by centrifugation at 4°C. The aqueous phase was re-extracted with 500 µL acid phenol:chloroform (ThermoFisher #AM9720), then the aqueous phase from that extraction was re-extracted with 500 µL chloroform. 1:10th volume 5M NH4OAc (ThermoFisher #AM9070G) and 2.5x volumes of ice-cold absolute ethanol were added to the aqueous phase from the chloroform extraction, and RNA was precipitated overnight at −80°C. After precipitation, the RNA pellet was washed with ice-cold 70% ethanol and dissolved into 100 µL RNA elution buffer (10 mM TRIS pH8, 0.05% TWEEN-20). RNA was quantified by Qbit (ThermoFisher #Q10210) and a working stock of 5 ng/µL RNA was prepared for each sample.

15 µL of reverse transcription mix (5 µL 5x Maxima RT Buffer, 5 µL 20% (w/v) Ficoll PM-400 [GE Life Sciences #17030010], 2.5 µL 10 mM/each dNTP [New England Biolabs #N0447S], 0.5 µL Lucigen NxGen RNase Inhibitor [Lucigen #30281–1], 0.5 µL 50 µM Template Switch Oligo [IDT], 0.5 µL 50 µM Barcode/UMI/poly-dT Oligo [IDT], 0.5 µL Maxima H Minus Reverse Transcriptase [ThermoFisher #EP0752], 0.5 µL H2O) was added to 50 ng (10 µL) of RNA. Each reaction contained a separate barcoded poly-dT oligo such that each of the six biological replicate samples contain a unique, identifiable barcode sequence. Reverse transcription was carried out at 53°C for 1 hr, followed by heat inactivation at 85°C for 5 min. 98 µL RLT Buffer [Qiagen] and 2 µL MyOne Silane beads [ThermoFisher #37002D] were added, mixed, and allowed to sit at RT for 10 min. cDNA was then isolated by magnetic separation of beads, followed by 2x washes with 200 µL 80% ethanol. Beads were pooled together and all cDNA was eluted into 40 µL of DNA elution buffer (10 mM TRIS pH8, 0.05% TWEEN-20, 1 mM DTT). 60 µL WTA master mix (50 µL 2x KAPA HiFi Hotstart Readymix, 1 µL 100 µM Forward Oligo, 1 µL 100 µM Reverse Oligo, 8 µL H2O) was added and whole transcriptomes were amplified using 12 cycles of PCR (98°C for 3:00; 12 cycles of 98°C for 0:20, 55°C for 0:20, and 72°C for 1:15 min; 72°C for 3:00 min). The amplified pool was then purified with 0.6x volume of SPRIselect beads and eluted into 25 µL DNA elution buffer. Amplified DNA was quantified using a high sensitivity D5000 ScreenTape (Agilent #5067–5592).

Amplified whole-transcriptome DNA was tagmented with a nextera XT kit (Illumina #FC-131–1096) as follows. 3 ng of DNA was diluted to a total volume of 10 µL with DNA elution buffer. 20 µL TD buffer and 10 µL ATM was added and DNA was tagemented at 55C for 10 min. The reaction was halted with 10 µL NT buffer, and the fragment pool was indexed by adding 30 µL NPM buffer, 5 µL illumina index 2 (i7) adapter primer, 5 µL 5 µM DG1954 (no-index primer) and amplifying using 12 cycles of PCR (95°C for 0:30; 12 cycles of 95°C for 0:10, 55°C for 0:30, and 72°C for 0:30 min; 72°C for 5:00 min). Libraries were purified by double-sided SPRI selection. 55 µL SPRIselect beads (0.55x) were added to the nextera indexing reaction, and the unbound supernatant was transferred to a clean tube. 20 µL SPRIselect beads were added (0.75x total), and after binding and washing, DNA was eluted into 20 µL DNA elution buffer. Libraries were checked for size with a High Sensitivity D1000 Screentape, and quantified with the KAPA illumina library quantification system on a Roche lightcycler 480. Libraries were sequenced on an Illumina NextSeq 500 with the NextSeq 500/550 v2.5 High Output 150 Cycle kit, using the sequencing parameters recommended by 10x Genomics (Read 1: 26 bp, Read 2: 98 bp, Index 2: 8 bp) and standard illumina read and indexing primers.

Processing sequencing data

Request a detailed protocol

Sequencing results were analyzed using the Cellranger pipeline (10x Genomics) v2.1.0 and custom python scripts written for this project, which are located in the fastqTomat0 GitHub repository (https://github.com/flatironinstitute/fastqToMat0). The reference genome was obtained from Ensembl (Version R64-1-1) as a FASTA file, and the reference annotations were obtained from Ensembl (Version R64-1-1.93) in GTF format. The reference transcript annotations were altered to incorporate 5’ and 3’ untranslated regions using data from generated using TIF-seq (Pelechano et al., 2013) and the gffAnnotate.py script from fastqTomat0. The antibiotic resistance marker cassettes was added to both the FASTA and GTF files using command line tools. A STAR reference genome was then created from the modified GTF and FASTA files using cellranger mkref.

Raw single-cell sequencing reads were converted into FASTQ files using cellranger mkfastq and a 10x Genomics Index CSV file. These FASTQ reads were then aligned to the reference genome and counted using cellranger count. The FASTQ files for indexes not corresponding to the 10x single-cell transcriptome library were processed with the fastqBCLinker.py script from fastqTomat0, which identifies the genotype for each single-cell read and creates a TSV file mapping cell barcodes to genotypes. The count data from cellranger count and the barcode data from fastqBCLinker.py was combined by the tenXtomatrix.py script from fastqTomat0. This processing step discards doublet single-cell reads, which are identified by removing ‘cells’ which map to more than one of the 72 genotype-specific barcodes. We expect that 1/72 of these doublets will have the same barcode, and so we expect that ~ 98.5% of doublets will be removed and ~1.5% will be retained. This script produces a dense TSV matrix of counts per gene per cell that can be imported with python’s pandas.read_table() or R’s read.table(). This matrix is provided as Source code 2. This final data matrix is assembled from 11 independent single-cell sequencing batches, each corresponding to a single shake flask with a different growth condition.

Raw bulk RNA sequencing reads were converted into FASTQ files using bcl2fastq. These FASTQ reads were then aligned to the reference genome and counted using cellranger count, after adding the appropriate custom chemistry configuration and barcode whitelist to cellranger. The count data from cellranger count was processed by the tenXtomatrix.py script from fastqTomat0 into a TSV matrix of counts per gene per sample, which is included in Source code 1.

Network inference

Inferelator

Request a detailed protocol

Network inference with the Inferelator consists of three major steps; data preprocessing and filtering, estimation of transcription factor activities, and regularized regression. Cross-validation of network inference parameters was performed by randomly selecting half of the genes in the gold standard network and removing them. To prevent circularity, any genes that were used in the gold standard were removed from the prior data during cross-validation; for tests where the gold standard network was also used as a prior, this meant that half of the genes in the gold standard network were retained and defined as the gold standard, and half of the genes in the gold standard network were used as priors. A summary table of the cross-validation results is provided as Supplementary file 1-Supplemental Table 5.

The randomized negative control was performed by randomly reassigning gene names in the prior data. Transcription factor labels and expression values were otherwise unchanged. The simulated negative control was performed using simulated data by constructing a probability distribution for the yeast transcriptome from estimates of absolute mRNA abundances (Lahtvee et al., 2017) and randomly sampling this distribution using the synthesize_data.py script from the fastqToMat0 package. Metadata and total UMI count for each cell were retained in this negative control; only the individual gene counts were synthesized from the simulated control probability distribution.

See below for details on each step of the network inference procedure.

Single-Cell preprocessing and filtering

Request a detailed protocol

Single cell data was loaded as an integer UMI count matrix (Cells x Genes). Genes with a variance of 0 for all cells were removed. The count matrix was then transformed by log scaling using log2(x+1). For data sets that had already undergone library normalization and transformation as a result of an imputation method, this transformation preprocessing step was skipped.

Single-Cell imputation

Request a detailed protocol

All imputation methods used the untransformed integer UMI count matrix (Cells x Genes) in which genes with a variance of 0 had been removed. For MAGIC, count data was library size normalized with the library.size.normalize() function from the Rmagic package, then transformed by square-root, and subjected to imputation with the magic() function from the Rmagic package. For VIPER, count data was subjected directly to the VIPER() function from the VIPER package, using the parameters recommended by the VIPER authors for 10x genomics UMI count data. For ScImpute, count data was normalized by the method included in the ScImpute package and then subjected to imputation with the imputation_wlabel_model8() function from the ScImpute package. The R script to perform these imputations is included with Source code 1.

Construction of known prior TF-Gene networks

Request a detailed protocol

Construction of the gold standard prior network has been previously described (Tchourine et al., 2018); this gold standard network consists of 1403 signed [−1, 0, 1] interactions, for which sign represents activation (+) or inhibition (-), in a 998 genes by 98 transcription factors regulatory matrix. YEASTRACT priors were retrieved from the YEASTRACT database (Teixeira et al., 2018) using the generate regulation matrix tool. Both activation and inhibition interactions were included, but only those that are supported by both DNA binding and expression evidence. The YEASTRACT prior network consists of 11486 unsigned [0, 1] interactions in a 3912 genes by 152 transcription factors regulatory matrix. Construction of the ATAC-motif priors has been previously described (Castro et al., 2019; Miraldi et al., 2019), and are built from chromatin accessibility data and known transcription factor binding motifs. The ATAC-motif prior network consists of 71,865 signed integer interactions with a range of [−11,. .., 26], for which sign represents activation (+) or inhibition (-) and absolute values represent the number of motif occurrences, in a 5551 genes by 138 transcription factors regulatory matrix. Bussemaker-priors were generated from modeling transcription factor affinities for regulatory DNA motifs (Ward and Bussemaker, 2008). The Bussemaker prior network consists of unsigned floating-point values [0, 20] that reflect estimated binding affinities in a dense 6516 genes by 123 transcription factors regulatory matrix.

Estimating transcription factor activities (TFA)

Request a detailed protocol

Log-transformed single-cell data was transposed into matrix X, in which columns are individual cells and rows are genes. P is the connectivity matrix of known prior regulatory interactions between transcription factors (in columns) and genes (in rows). Pi,k is zero if there is no known regulatory interaction between transcription factor k and gene i. A is the activity matrix, where the columns are the individual cells as in X and rows are the transcription factors. We model the expression of gene i in individual cell j as a linear combination of the activities of the a priori known regulators of gene i in individual cell j (1). In practice, this means that we use the known targets of a transcription factor to derive its activity.

(1) Xi,j=kTFsPi,kAk,j

In matrix form, Equation 1 can be written as X=PA. This is an overdetermined system, meaning that there are more equations than unknowns and therefore there is no solution if all equations are linearly independent. We approximate A by finding  that minimizes PÂ-X22. If a transcription factor has no prior targets present in P, we use the expression of that transcription factor as a proxy for its activity.

Inferring regulatory interactions, single-task (Bayesian Best Subset Regression)

Request a detailed protocol

We utilize a bayesian best-subset regression (BBSR) method, previously described (Greenfield et al., 2013), for single-task network inference. At steady state, we model the expression of a gene i in individual cell j as a linear combination of the activities of its regulators in individual cell j (2). For each gene i, we limit the number of potential regulators Ri to the ten with the highest context likelihood of relatedness, calculated from the mutual information between all regulators and the gene i (Madar et al., 2010), in addition to any a priori known regulator of gene i. Limiting the regulators is necessary before best subset regression, when we find the least squares solution to all possible combinations of predictors in set Ri. Because we expect a limited number of transcription factors to regulate a particular gene, our goal is to find a sparse solution for β, in which non-zero entries define both the strength and direction (activation or repression) of a regulatory relationship.

(2) Xi,j=kRiβi,kAk,j

Prior knowledge can be incorporated using Zellner’s g-prior on the regression parameters β; in this work, we include prior interactions in the set of predictors to be modeled by best subset regression, but we do not further bias the predictors chosen with a g-prior on the regression parameters. We select the model with the lowest Bayesian Information Criterion, which adds a theoretically derived penalty term to the training error to account for model complexity and thereby reduce generalization error. After this step, the output is a matrix of inferred regression parameters β, where each entry corresponds to a regulatory relationship between transcription factor k and gene i.

Inferring regulatory interactions, multi-task (AMuSR)

Request a detailed protocol

The multitask approach used here entails a joint inference of regulatory networks across multiple expression datasets. In addition to the linear assumption, in which gene expression is a linear function of the activities of regulators, we also assume that much of the underlying regulatory network is shared among related datasets (conditions). Here, we extend a previous version of the Inferelator that implements Adaptive Multiple Sparse Regression (AMuSR), which is designed to leverage cross-dataset commonalities while preserving relevant differences (Castro et al., 2019).

There are multiple ways of dividing the existing yeast data into multiple network data subsets, which we refer to as tasks. Within our experimental design, cells are processed and sequenced as batches, which are taken from separate environmental growth conditions. Differences between these batches are a combination of technical and biological variation. The technical variation can come from batch effect due to stress and energy-source differences associated with differing growth conditions (for example via direct effects on cell wall and thus cell lysis/yield), as well as from differences in sample preparation and sequencing. Differences in growth condition also generate biologically significant variation in gene expression due to differences in regulatory program activation. Removing technical variation while retaining biological variation through batch normalization is not feasible, and therefore these individual sample batches from separate growth conditions are taken as individual tasks for the network inference. Thus, the index ‘d’, below, ranges from 1 to 11 and is an index over the separate datasets corresponding to growth conditions. This separation into tasks results in the joint learning of 11 networks (one for each growth condition), followed by combination into a single global network.

Briefly, the network model is represented as a matrix W for each target gene (where columns are individual single-cell batches d and rows are potential regulators k) with signed entries corresponding to strength and type of regulation. We then decompose the model coefficient matrix W into a dataset-specific component S and a conserved component B to enable us to penalize dataset-unique and conserved interactions separately for each target gene; this separation captures differences in regulatory networks across datasets. Specifically, we apply an l1/l penalty to the B component to encourage similarity between network models, and an l1/ l1 penalty to the other to accommodate differences to S (Jalali et al., 2010). Regularization parameters ƛs and ƛb, representing the strength of each penalty, were chosen via Extended Bayesian Information Criterion (Chen and Chen, 2008). We set ƛb to cbd log pn, where d is the number of tasks, n is the mean number of samples per task, and p is the number of predictors. We then search for cb in the log interval [0.1, 10.0] with 20 steps. We then set ƛs such that 1d<ƛsƛb<1, where d is the number of tasks and csƛs=ƛb. We search for cs in the linear interval [1d+ 0.01, 0.99] with 10 steps.

We can incorporate prior knowledge by using adaptive weights (ɸƛs) when penalizing different coefficients in the l1/ l1 penalty (Zou, 2006). In this work, however, we chose not to bias predictors to the priors using adaptive weights, and set ɸ to 1. For each gene, we minimize the following function (Castro et al., 2019):

(3) argminS, B=d|| Xi(d)A^dT(S,d+B,d)||22+λsk,d|Φk,dSk,d|+λb||B||1,
output: W=S + B

Ranking interactions and data resampling

Request a detailed protocol

Interactions were ranked by both the overall performance of the model for each gene i and the proportion of variance explained by each βi,k. The output of this procedure is a matrix S where Si,k is the confidence score on the interaction between transcription factor k and gene i. In order to avoid overfitting and sampling biases, we repeat this procedure N times by resampling the input data matrix with replacement. Finally, we rank combine the confidence scores generated by running the above inference procedure on each of the N bootstrapped datasets and obtain a final matrix of combined confidence scores for the possible interactions between transcription factors (columns) and genes (rows).

Network combination

Request a detailed protocol

Individual task networks were assembled into a global network by rank combining the confidence scores generated for each possible interaction between transcription factors and genes, obtaining a final matrix of combined confidence scores for the global network. Global interactions were ordered by combined confidence score, and the top interactions were kept to a threshold defined by precision = 0.5, as determined by recovery of the priors.

Statistical analysis and differential gene expression

To analyze all growth conditions together, the raw single-cell count matrix was normalized using multiBatchNorm from the scater package in R (McCarthy et al., 2017). In short, this calculates size factors that are used to scale cells from different environmental condition batches so that each batch is of approximately the same mean UMI count. Cells were then library size normalized within batches and the normalized data was log-transformed with log2(x+1) to give a transformed and normalized count matrix.

Visualizing single cell expression data

Request a detailed protocol

This normalized count matrix was reduced to 50 principal components by principal component analysis (PCA) with multiBatchPCA from the scater package in R. MultiBatchPCA is standard PCA with the modification that each environmental condition batch contributes equally to the covariance matrix, even when batches are imbalanced in cell count. These principal components were projected into two dimensional space by Uniform Manifold Approximation and Projection (UMAP) (McInnes et al., 2018) and plotted.

To analyze each growth condition separately, the cells corresponding to a growth condition were selected from the raw count matrix, library-size normalized and log2(x+1) transformed, and reduced to 50 principal components with PCA. These principal components were projected into two dimensional space by UMAP for plotting, and also used to generate a shared nearest-neighbor (sNN) graph, which is used to cluster cells using the Louvain clustering method. Each growth condition was processed and plotted separately.

Pseudobulk differential gene expression

Request a detailed protocol

The raw, unmodified UMI counts of all cells from each biological replicate (with the same strain barcode) within a specific environmental growth condition were summed, resulting in 72 samples per condition (six biological replicates for each of the 12 transcription factor deletions). Summed pseudobulk expression data was then tested with DESeq2 (Love et al., 2014) for differential gene expression (testing against a null hypothesis of Fold Change < 1.5 and considering changes significant when p<0.05 at a false discovery rate of 0.1) with no additional processing or normalization.

Gene categorization

Request a detailed protocol

Cell-cycle associated genes are categorized using the Spellman annotations (Spellman et al., 1998). Ribosomal genes, ribosomal biogenesis genes, and induced environmental stress response genes are categorized using the Gasch annotations (Gasch et al., 2017). Gene category annotations are included as Supplementary file 1-Supplemental Table 6.

Gene Ontology

Request a detailed protocol

The number of interactions was determined for each gene and each transcription factor. Interactions are considered Learned (new) if they are present in the learned network and not in the prior network; Learned (In Prior) if they are present in the learned network and not in the prior; and, Not Learned (In Prior) if they are present in the prior and not in the learned network. Each gene was mapped to Gene Ontology (GO) slim terms using the YeastGenome slim mapping (https://downloads.yeastgenome.org/curation/literature/go_slim_mapping.tab), which is a curated gene ontology mapping of high-level, broad GO terms. Interactions for all genes annotated with a GO term were summed. The generic terms ‘biological_process’, ‘not_yet_annotated’, and ‘other’ are removed from both target genes and regulatory transcription factors, and the common term ‘transcription from RNA polymerase II promoter’ was removed from regulatory transcription factors GO annotations. The 25 remaining terms with the highest number of learned (new) interactions were plotted separately for both the target genes and the regulatory transcription factors.

Correlation plots

Request a detailed protocol

Gene expression data was derived experimentally in this work (FY4/FY5) or obtained from GEO. All samples are from early-log phase growth in YPD. Single-cell yeast data sets are from GSE122392 (BY4741) (Nadal-Ribelles et al., 2019) and GSE102475 (BY4741) (Gasch et al., 2017). A comparable bulk RNA control is from GSE135430 (BY4741) (Scholes and Lewis, 2019). Genes were ranked by expression in each cell or sample. All cells or samples from a specific experiment were rank-combined and ranks were pairwise plotted for each experiment with GGally in R.

Variability plots

Request a detailed protocol

Coefficient of variation (mean over standard deviation) is calculated for each gene in each growth condition. Pearson residuals (model residual over expected standard deviation) are calculated for each gene in each cell and then the mean of the pearson residuals is taken for each growth condition. This calculation is done with the vst function from the sctransform package in R (Hafemeister and Satija, 2019). In short, this builds for each gene a regularized negative binomial model, which is then used to calculate pearson residuals for each cell compared to the model. This is done separately for each growth condition.

Data and software availability

Sequencing data

Request a detailed protocol

Raw sequencing data, the output from the cellranger pipeline to count reads, and the output from the fastqToMat0 pipeline to extract and attach genotype metadata to the count matrix are available in NCBI GEO under the accession number GEO: GSE125162.

Single-Cell processing pipeline

Request a detailed protocol

The cellranger pipeline is available from 10x Genomics under the MIT license (https://github.com/10XGenomics/cellranger). The fastqToMat0 pipeline is available from GitHub (https://github.com/flatironinstitute/fastqToMat0Jackson, 2020; copy archived at https://github.com/elifesciences-publications/fastqToMat0) and is released under the MIT license. Genome sequence and annotations are included as Source code 4.

Network inference

Request a detailed protocol

The Inferelator is implemented in Python, with dependencies on the widely-distributed scientific packages Numpy (van der Walt et al., 2011), Scipy (Virtanen et al., 2020), Pandas (McKinney, 2010), and Scikit-learn (Pedregosa et al., 2011). Scaling to a high-performance computing cluster is implemented with dask (Rocklin, 2015). All network inference in this work was performed with the inferelator v0.3.0, using Python v3.7.3, Numpy v1.16.2, Pandas v0.24.2, Scikit-learn v0.20.3, Scipy v1.2.1, and dask v1.1.4. The inferelator package is available under the Simplified BSD licence and can be installed from PyPI (https://pypi.org/project/inferelator/) or cloned from GitHub (https://github.com/flatironinstitute/inferelatorJackson and Gibbs, 2020; copy archived at https://github.com/elifesciences-publications/inferelator).

Figure construction

Request a detailed protocol

Figure 1 and Figure 1—figure supplement 1 are constructed using Adobe Illustrator. Figures 27 and accompanying supplementary figures are constructed with R. The R (v3.5.1) (R Development Core Team, 2018) packages used are as follows: for plotting, ggplot2 (v3.1.0) (Wickham, 2016), cowplot (v0.9.4) (Wilke, 2019), ggridges (v0.5.1) (Wilke, 2018), ggrastr (v0.1.7) (Petukhov, 2019), GGally (v1.4.0) (Schloerke et al., 2018), viridis (v0.5.1) (Garnier, 2018), RColorBrewer (v1.1–2) (Neuwirth, 2014), and scales (v1.0.0) (Wickham, 2018a); for data manipulation, dplyr (v0.7.8) (Wickham et al., 2018), data.table (v1.12.0) (Dowle and Srinivasan, 2019), reshape2 (v1.4.3) (Wickham, 2007), and stringr (v1.3.1) (Wickham, 2018b); and for single-cell analysis, scater (v1.10.1) (McCarthy et al., 2017), scran (v1.10.2) (Lun et al., 2016), umap (R: v0.2.0.0, python: v0.3.6) (Konopka, 2018), igraph (v1.2.2) (Csardi and Nepusz, 2006), DESeq2 (1.22.2) (Love et al., 2014), corpcor (v1.6.9) (Schafer et al., 2017), and sctransform (v0.2.0) (Hafemeister and Satija, 2019). The R scripts to generate these figures and all required data are included with Source code 1. Network illustrations in Figures 6 and 7 were generated using Gephi 0.9.2 from the inferelator output network (gefx formatted); the layouts used are Force Atlas 2, Noverlap and Label Adjust. Figures were minimally modified from R outputs to enhance layout and aesthetics using Adobe Illustrator.

Interactive figures

Request a detailed protocol

Interactive versions of several panels from Figures 14 are available as Shiny (Chang et al., 2018) apps online at http://shiny.bio.nyu.edu/YeastSingleCell2019/. Source code for the Shiny app is available upon request under the MIT license.

References

  1. 1
  2. 2
  3. 3
  4. 4
  5. 5
  6. 6
  7. 7
  8. 8
  9. 9
  10. 10
  11. 11
  12. 12
  13. 13
  14. 14
  15. 15
    Multitask Learning
    1. R Caruana
    (1998)
    In: S Thrun, L Pratt, editors. Learning to Learn. Boston: Springer. pp. 95–133.
    https://doi.org/10.1007/978-1-4615-5529-2_5
  16. 16
  17. 17
  18. 18
  19. 19
  20. 20
  21. 21
  22. 22
  23. 23
  24. 24
  25. 25
  26. 26
    The igraph software package for complex network research InterJournal
    1. G Csardi
    2. T Nepusz
    (2006)
    Complex Systems, 1695.
  27. 27
  28. 28
  29. 29
  30. 30
  31. 31
  32. 32
    data.table: Extension of `data.frame`
    1. M Dowle
    2. A Srinivasan
    (2019)
    data.table: Extension of `data.frame`.
  33. 33
  34. 34
  35. 35
  36. 36
    viridis: Default Color Maps from matplotlib
    1. S Garnier
    (2018)
    viridis: Default Color Maps from matplotlib.
  37. 37
  38. 38
  39. 39
  40. 40
  41. 41
  42. 42
  43. 43
  44. 44
  45. 45
  46. 46
  47. 47
  48. 48
  49. 49
  50. 50
  51. 51
  52. 52
  53. 53
  54. 54
  55. 55
  56. 56
  57. 57
  58. 58
  59. 59
    A dirty model for Multi-task learning
    1. A Jalali
    2. S Sanghavi
    3. C Ruan
    4. PK Ravikumar
    (2010)
    Advances in Neural Information Processing Systems. pp. 964–972.
  60. 60
  61. 61
  62. 62
  63. 63
  64. 64
    umap: Uniform Manifold Approximation and Projection
    1. T Konopka
    (2018)
    umap: Uniform Manifold Approximation and Projection.
  65. 65
  66. 66
  67. 67
  68. 68
  69. 69
  70. 70
  71. 71
  72. 72
  73. 73
  74. 74
  75. 75
  76. 76
  77. 77
  78. 78
  79. 79
  80. 80
  81. 81
    Data structures for statistical computing in Python
    1. WO McKinney
    (2010)
    Proceedings of the 9th Python in Science Conference. pp. 51–56.
  82. 82
  83. 83
  84. 84
  85. 85
  86. 86
  87. 87
  88. 88
  89. 89
  90. 90
    RColorBrewer: ColorBrewer Palettes
    1. E Neuwirth
    (2014)
    RColorBrewer: ColorBrewer Palettes.
  91. 91
    Scikit-learn: machine learning in Python
    1. F Pedregosa
    2. G Varoquaux
    3. A Gramfort
    4. V Michel
    5. B Thirion
    6. O Grisel
    7. M Blondel
    8. P Prettenhofer
    9. R Weiss
    10. V Dubourg
    11. J Vanderplas
    12. A Passos
    13. D Cournapeau
    14. M Brucher
    15. M Perrot
    16. D É
    (2011)
    Journal of Machine Learning Research : JMLR 12:2825–2830.
  92. 92
  93. 93
    ggrastr: Raster layers for ggplot2
    1. V Petukhov
    (2019)
    ggrastr: Raster layers for ggplot2.
  94. 94
    R: A Language and Environment for Statistical Computing
    1. R Development Core Team
    (2018)
    R: A Language and Environment for Statistical Computing, Vienna, Austria, http://www.r-project.org.
  95. 95
    Dask: parallel computation with blocked algorithms and task scheduling
    1. M Rocklin
    (2015)
    Proceedings of the 14th Python in Science Conference. pp. 126–132.
  96. 96
  97. 97
  98. 98
  99. 99
    corpcor: Efficient Estimation of Covariance and (Partial) Correlation
    1. J Schafer
    2. R Opgen-Rhein
    3. V Zuber
    4. M Ahdesmaki
    5. APD Silva
    6. K Strimmer
    (2017)
    corpcor: Efficient Estimation of Covariance and (Partial) Correlation.
  100. 100
    GGally: Extension to “ggplot2.
    1. B Schloerke
    2. J Crowley
    3. D Cook
    4. F Briatte
    5. M Marbach
    6. E Thoen
    7. A Elberg
    8. J Larmarange
    (2018)
    GGally: Extension to “ggplot2..
  101. 101
  102. 102
  103. 103
  104. 104
  105. 105
  106. 106
  107. 107
  108. 108
  109. 109
  110. 110
  111. 111
  112. 112
  113. 113
  114. 114
  115. 115
  116. 116
  117. 117
  118. 118
    Ggplot2: Elegant Graphics for Data Analysis
    1. H Wickham
    (2016)
    Springer-Verlag.
  119. 119
    dplyr: A Grammar of Data Manipulation
    1. H Wickham
    2. R François
    3. L Henry
    4. K Müller
    (2018)
    dplyr: A Grammar of Data Manipulation.
  120. 120
    scales: Scale Functions for Visualization
    1. H Wickham
    (2018)
    scales: Scale Functions for Visualization.
  121. 121
    stringr: Simple Consistent Wrappers for Common String Operations
    1. H Wickham
    (2018)
    stringr: Simple Consistent Wrappers for Common String Operations.
  122. 122
    ggridges: Ridgeline Plots in ggplot2
    1. CO Wilke
    (2018)
    ggridges: Ridgeline Plots in ggplot2.
  123. 123
    cowplot: Streamlined Plot Theme and Plot Annotations for ggplot2
    1. CO Wilke
    (2019)
    cowplot: Streamlined Plot Theme and Plot Annotations for ggplot2.
  124. 124
  125. 125
  126. 126
  127. 127
  128. 128
  129. 129
    The adaptive lasso and its oracle properties
    1. H Zou
    (2006)
    Journal of the American Statistical Association 101:1418–1429.
    https://doi.org/10.1198/016214506000000735

Decision letter

  1. Naama Barkai
    Reviewing Editor; Weizmann Institute of Science, Israel
  2. Detlef Weigel
    Senior Editor; Max Planck Institute for Developmental Biology, Germany

In the interests of transparency, eLife publishes the most substantive revision requests and the accompanying author responses.

Acceptance summary:

In your article, you present individual transcriptomes of diploid yeast cells, applying your method to monitor transcriptional changes across an array of newly generated barcoded deletion mutants across a panel of stresses. Beyond the development of the method, the main and most original point of the manuscript is that you infer gene regulatory networks based on the transcriptomes retrieved from barcoded genotypes in diverse conditions. The manuscript is well written and together with related reports will become one of the golden standards for yeast single cells. You are commended for providing a complete and user-friendly dataset (deposited and interactive through a shiny app), which will be a valuable resource for the yeast community.

Decision letter after peer review:

Thank you for submitting your work entitled "Gene regulatory network reconstruction using single-cell RNA sequencing of barcoded genotypes in diverse environments" for consideration by eLife. Your article has been reviewed by three peer reviewers, and the evaluation has been overseen by a Reviewing Editor and a Senior Editor. The reviewers have opted to remain anonymous.

Our decision has been reached after consultation between the reviewers. Based on these discussions and the individual reviews below. As you will see, the reviewers found the paper to be a bit preliminary in the interpretation and the meaningfulness of the presented data. The reviewers did find the work to be potentially suitable for eLife and will be happy to look at a revised version, provided that you can fully address all comments raised in the individual reviews.

Reviewer #1:

Here, Jackson et al. apply for the first time a 10x Genomics protocol for scRNAseq in barcoded S. cerevisiae deletion strains. The work analyzed 11 transcription-factor deletion strains pooled together and responding to 11 different conditions related to nitrogen metabolism. The authors then applied a gene-regulatory network (GRN) inference method to predict a nitrogen metabolism regulatory network.

The pluses of the work are that it is a hot topic and, although not the first scRNAseq in fungi, the first to look very broadly at tens of thousands of yeast cells. The barcoding method is interesting, although unclear how it's different from the barcode sequencing that is part of the standard 10x genomics pipeline and has been used before in other systems.

The weaker points for me were in the analysis. The authors have expertise in GRN inference, but I had a hard time understanding from the main text what was novel here and specific for single cell data versus previously published methods applied to data pooled across cells/conditions. I will leave it to network inference modelers to dissect those details. But in a broader sense, I was left wanting more follow-up to show that the methods produced new insights. The authors predicted a network but as there is no biological validation and little computational validation it's unclear how big of an advance this is. I was also left wondering about the biological insights that can be gleaned from having single cell data (beyond variation in cell-cycle stage, which is readily identifiable in all scRNAseq studies). I suspect there is interesting biology in the heterogeneity in the response data but there was not much addressed on that topic. In my opinion, this is a great new method with a potentially powerful dataset. But since there are many GRN methods and this overall approach seems similar to Perturb-seq and other methods, the results and impact for me fall below the bar of eLife.

Some specific points are outlined below.

1) The authors report reads from 38,000 cells, which to date is the most cells studied in fungi. But the median number of genes covered is only <700. Since the paper focuses on re-bulked data (to call differentially expressed genes by DESeq2 and, I think, for their main GRN network inference?), I was left wondering how many genes are measured in the re-bulked data per condition. I was surprised how few genes were called by DESeq (Figure 4B), but it's unclear how many genes are actually measured in >1 cell condition/mutant. I was also curious what fraction of known targets (e.g. based on prior studies or ChIP-seq datasets) were called for measured genes.

2) It would be useful to know how well the 10x protocol works for cells and if the aggregated wild-type data recapitulates bulk profiles in conditions that have been previously measured. I was a little concerned at how the cells were collected, which appeared to take live cells and wash them several times in RNALater buffer – does that immediately kill cells? If not, I was wondering if that is inducing a response. I was also left wondering if the protocol captures only the most abundant transcripts. Perhaps I missed this on the supplement, but I was wondering how the% cells in which a transcript was measured compared to RPKM from bulk measurements. Clearly more abundant transcripts will be more easily captured, but some more analysis here would be useful for a new method.

3) The GRN modeling was not clear to me from the main text. It appears that the authors are using their published Inferelator method that takes priors based on ChIP-seq data, and re-bulks the scRNAseq data (at least for the multi-task inference). Perhaps their point is that the 10x approach allows pooling of many genotypes and conditions, but for me the analysis missed the potential power of having single cell data. The authors make statements on the networks in Figure 6 and Figure 7 about the number of "novel" regulatory connections – but I saw no validation of those predictions, including by computation. How do we know that any of these are real and that the method is producing new insights? AUROCs comparing to known data is not enough to say that new regulatory connections were discovered. This was especially true for Figure 7 – how do we know these new predictions mean anything about a connection to cell cycle without some validation?

4) I had some quibbles with part of the Discussion.

i) First, while the authors cite several recent S. cerevisiae scRNAseq datasets, saying that this is the "first report of large-scale scRNAseq" in yeast is not accurate – it's true they measured 38,000 cells but at a depth of only <700 genes per cell, which is far fewer than 2500-3000 of other studies in several hundred cells. A fairer sentence is required, also citations of recent Sz. pombe scRNAseq (Saint et al., 2019) should be included.

ii) "We observe significant heterogeneity in individual cells.… Much of this variation can be explained by the mitotic cell cycle" – that statement is not true, there does not seem to be an attempt to quantify heterogeneity over most genes. That they see heterogeneity in cell cycle stage as expected does not mean that cell cycle stage explains heterogeneity in the rest of the response, which was not reported on here.

Reviewer #2:

In the manuscript submitted by Jackson et al., the authors proposed to profile individual transcriptomes of diploid yeast cells optimizing the 10x genomics pipeline. The manuscript demonstrates the feasibility and robustness of the method and the authors applied it to monitor the transcriptional networks across an array of newly generated barcoded deletion mutants across a panel of stresses. The manuscript is scientifically sound and of outstanding quality, it's well written and together with previous reports this year will become one of the golden standards for yeast single cells.

The authors provide a complete and user-friendly dataset (deposited and interactive through a shiny app) that will be a valuable resource for the yeast community. All in all, I think this is a very solid manuscript that with minor corrections I would strongly support for publication in eLife.

Beyond the development of the method, the main and most original point of the manuscript is that the authors aim to generate infer gene regulatory networks based on the transcriptomes retrieved from barcoded genotypes in diverse conditions. I have some questions and comments (of varying levels of concern) that I feel should be addressed in the current version of the manuscript.

The authors leverage in their previous experience in GRN reconstruction and propose this approach has allowed them to discover novel regulatory relationships between cell cycle-regulated gene expression in response to changes in nitrogen source. In my opinion, this part of the manuscript needs to be reinforced and some of the conclusions driven from the GRN should be and some representative novel regulatory relationships experimentally demonstrated. As well the authors should provide context to their findings as they often read a bit disconnected.

In terms of data quality, the number of mitochondrial and ribosomal reads per genotype and condition should be plotted, as a quality metric and given that a lot of ribosomal gene expression is cell cycle regulated. This could be relevant to understand separate clusters within conditions which the authors do not mention in the results and/or discussion and given the fact that different zymolyase concentrations were used for cell lysis.

The optical density of the cells at the harvesting (besides the total cell number) should be provided to ease the reproducibility for other labs.

If I understand correctly, cell cycle clusters within conditions in Figure 3 is confusing. For example expression of DSE2 or PIR1 seem to be highest in the green and grey clusters respectively (Figure 3A) however in Figure 3B panel I the highest expression is assigned to the grey-yellow for DSE2 and PIR1 to the green cluster.

As well, the UMAPS from Figure 3A, the authors claim the clustering is mainly condition-dependent and genotype-independent. However, conditions like MMEtOH, CSTARVE, NLIM-PRO have clear clusters that do not seem cell cycle-dependent and these might be biologically relevant. The authors should at least comment or those or run a DE analysis to see what these are.

How do the newly generated data compare to Gasch et al., 2017 and Nadal-Ribelles et al., 2019?

The number of differentially expressed genes even in the YPD condition seems a bit low. How do these pseudobulk compare to the deletome data (Holstege lab) or other published datasets?

As well, the authors use DESeq in Figure 4B, but these results are contradictory with what is shown in Figure S4Bi which is done by Welch testing. This is a bit confusing and does not add much to the reader. I would suggest to run DE between conditions using DESeq or provide the reasoning as to why these two different approaches are used and done?

Why do transcription factor activities FKH1 and FKH2 and SWI4 SWI5 do not overlap (they almost seem mutually exclusive), one would expect them to have similar profiles. Similarly, NDD1 regulates S-phase genes but the TFA does not overlap with the HTB expression shown in Figure 3.

Reviewer #3:

This paper by Gresham, Bonneau and colleagues presents to date the largest single cell RNA-seq datasets in yeast using a novel deletion and barcoding strategy that enables them to measure individual cells under different genetic perturbations of transcription factors and environmental conditions. The study is focused on better understanding the regulatory network in nitrogen starvation however it could be broadly applied across multiple conditions. Some findings are: the genotypes tend to be generally uniformly present in all conditions except RTG1 and 3 and GLN3; there is a co-regulated set of genes regulated by cell cycle and nitrogen TFs; multi-task learning is a viable approach for network inference in scRNAseq data. The paper is a significant contribution to the field providing a novel dataset to yeast and general gene regulation community.

I have some comments, which I think are minor and can strengthen the messages of the paper.

1) In Figure 5C, is the single task network inferred by merging all the data and learning a single network or by learning separate networks with single tasking and aggregating the results? If not, how does the single task per condition followed by aggregation perform?

2) The authors don't get much into the context-specificity of the inferred networks. They interpret only the final aggregated network. It would be useful to know how similar the individual condition-specific networks are and if there is a conserved core used by multiple conditions. The only comparison of context specificity is being done at the level of AUPR, it might be helpful to do this comparison just by comparing the inferred networks.

3) Some discussion about the variation in the AUPRs would be helpful. Is it because the gold standard is biased towards the conditions on which the AUPR is high. It seems the AUPR for the MMEtOH network is close to what is inferred by the multi-task learning, and some explanation of why this might be is helpful.

4) A comparison of the AUPR of the single task condition-specific networks and the multi-task condition specific network could further show the advantage of the multi-task learning framework.

5) It would be helpful to emphasize if and how the multi-task learning approach used here was from extended from the Castro, 2019 paper.

6) The Discussion could be strengthened. The authors present some results about the interplay between cell cycle and nitrogen response. It was not clear why this is interesting to study beyond that there is a shared regulatory program. This might be worth bringing up in the Discussion to tie back to the initial goal of inferring a network for nitrogen metabolism and the TOR signaling pathway and the general role of cell cycle and stress response.

7) The authors don't find a substantial impact of TF knockout on gene expression under different conditions (I assume the comparisons were done while controlling for the conditions). How does this compare to bulk data? How much of this observation could be due to the sparsity of scRNAseq data versus the redundancy of TFs.

https://doi.org/10.7554/eLife.51254.sa1

Author response

We have revised the manuscript based on the feedback that we have received from reviewers. To address the reviewer concerns we have undertaken additional experimentation and analyses. Below, we summarize the major changes made to the text.

1) A common concern of the three reviewers was the quality and utility of the single-cell RNAseq data. To address this concern we performed bulk RNAseq on wild-type cells from the YPD rich media condition. Our analysis of this experiment used the same computational pipeline as used for our single-cell reads and indicates that scRNAseq in yeast produces data that are very comparable to that produced from bulk RNAseq. In addition, we have performed comparisons to other published single-cell data and to another bulk sequencing experiment from GEO, which support this conclusion. This analysis is included as Figure 2—figure supplement 2. We have added additional QC metrics suggested by the reviewers as Figure 2—figure supplement 3. We believe that these results validate the quality of data produced using this method and obviate any concerns that our sample processing introduces biases in the data.

2) A second common concern was validation of the network predictions made by our inference technique. To address this concern, we have compared the novel predictions made in this work to the YEASTRACT database. More than half of the predictions we make are supported by some type of evidence (either changes in target gene expression or physical localization by ChIP). As this evidence was not included in the modeling priors we believe that this provides orthogonal support for at least 50% of the interactions that we discovered in our study.

3) To address concerns regarding the computational approaches, we have regenerated the gene regulatory network results reported in Figures 5-7 incorporating the following changes:

a) We updated the Inferelator package for python to version 0.3.0 (some minor software bugs were fixed and the package is now compatible with Python 3; this upgrade did not change any modeling results).

b) We updated the YEASTRACT prior data to match the 2019 release of the YEASTRACT database; as a result, ~2000 TF-gene interactions were removed as being poorly supported, and ~3500 TF-gene interactions were added relative to the 2018 release used in our initial submission. Updating the modeling priors with these changes improved many of our modeling results.

c) While preparing this revision, we identified an issue that was causing our network inference to make spurious predictions due to genes that have a mean count value in at least one task group that is very close to zero (but not zero). We have mitigated this problem by filtering out genes that don’t have a minimum mean count of 0.05 (at least 1 read per 20 cells). For multi-task learning, this filter is applied task-wise; genes filtered from one task are still modeled in other tasks. This approach results in removal of the spurious interactions.

d)We have performed additional cross-validation for Figure 5B (now 20 runs).

4) The Materials and methods section has been substantially updated to incorporate these additions to the manuscript and several minor typos were corrected.

5) One reviewer identified a figure which had a labeling error introduced during panelling with Adobe Illustrator. This motivated us to include as a supplemental data file (Source code 5) an HTML document generated with Rmarkdown that contains Figures 2-7 (including all figure supplements). The figures produced in this way have not been subject to any subsequent adjustments in Illustrator and thus provide a reproducible record of the data analysis and presentation that can be regenerated by running the Rmarkdown script (which is included in Source code 1).

Reviewer #1:

[…] The pluses of the work are that it is a hot topic and, although not the first scRNAseq in fungi, the first to look very broadly at tens of thousands of yeast cells. The barcoding method is interesting, although unclear how it's different from the barcode sequencing that is part of the standard 10x genomics pipeline and has been used before in other systems.

The weaker points for me were in the analysis. The authors have expertise in GRN inference, but I had a hard time understanding from the main text what was novel here and specific for single cell data versus previously published methods applied to data pooled across cells/conditions. I will leave it to network inference modelers to dissect those details. But in a broader sense, I was left wanting more follow-up to show that the methods produced new insights. The authors predicted a network but as there is no biological validation and little computational validation it's unclear how big of an advance this is. I was also left wondering about the biological insights that can be gleaned from having single cell data (beyond variation in cell-cycle stage, which is readily identifiable in all scRNAseq studies). I suspect there is interesting biology in the heterogeneity in the response data but there was not much addressed on that topic. In my opinion, this is a great new method with a potentially powerful dataset. But since there are many GRN methods and this overall approach seems similar to Perturb-seq and other methods, the results and impact for me fall below the bar of eLife.

We thank the reviewer for recognizing the novelty of our study. As we emphasize in our manuscript, adopting the 10x Genomics systems for budding yeast (and microbes in general) is non-trivial and we have developed a robust methodology that we are convinced will be widely adopted. We are aware that the Perturb-seq method does express a barcode that is informative about the gRNA for CRISPR/Cas9 screens; however, our method expresses a barcode from the drug resistant cassette that is used for deleting the gene of interest and thus provides a more direct readout of the cells’ genotypes. The successful application of 10x Genomics to microbial cells and the method of multiplexing genotypes make this paper of broad interest to the readers of eLife.

With respect to the novelty of the computational approaches, we concur that there are indeed many network inference methods. However, we are convinced that there is a great deal of novelty in applying GRN inference techniques to real-world single-cell data in a testable way. We are aware of only one study that is similar in scope and goals to ours; it was released as a preprint around the same time as this work, and it has been recently accepted for publication in Science. (Norman, T.M., et al. (2019). Exploring genetic interaction manifolds constructed from rich single-cell phenotypes). The rigorous testing of unsolved aspects of GRN using scRNAseq data, such as imputation, and the application of multitask learning are key features of our manuscript that we are convinced will be of great value to the larger biological research community that aims to use scRNAseq for GRN reconstruction.

Some specific points are outlined below.

1) The authors report reads from 38,000 cells, which to date is the most cells studied in fungi. But the median number of genes covered is only <700. Since the paper focuses on re-bulked data (to call differentially expressed genes by DESeq2 and, I think, for their main GRN network inference?), I was left wondering how many genes are measured in the re-bulked data per condition. I was surprised how few genes were called by DESeq (Figure 4B), but it's unclear how many genes are actually measured in >1 cell condition/mutant. I was also curious what fraction of known targets (e.g. based on prior studies or ChIP-seq datasets) were called for measured genes.

It is critical to note that only Figure 4B, Figure 4C, and Figure 4—figure supplement 1B rely on bulked data. All other figures and analyses in our paper are based on single-cell data. The huge increase in the number of individual measurements that scRNAseq provides is particularly important for the GRN inference. We have clarified this point in the manuscript.

Cells cultured in YPD, the most commonly used rich media, are the best baseline for comparison to bulk methods. We have 11,037 cells present in this data set. A median of 684 genes are detected per cell, and 5,533 genes have at least one read across all cells (of 5,773 protein-coding genes). 5,403 genes are counted in more than 1 cell in this condition. As additional points of comparisons, 299 genes average at least one count/cell, and 1,831 genes average at least one count in every 10 cells. We believe that the low count/cell is the result of overall low sampling rates for transcripts (some of which may be technically addressable in the future), and is not the result of specific gene bias, which is supported by the high correlation to our new data using trizol extracted bulk RNAseq, which is included as Figure 2—figure supplement 2.

With respect to the fraction of known targets that are measured we compared the novel interactions identified in our study and find that about 50% of them are supported by prior studies included in the YEASTRACT database.

2) It would be useful to know how well the 10x protocol works for cells and if the aggregated wild-type data recapitulates bulk profiles in conditions that have been previously measured. I was a little concerned at how the cells were collected, which appeared to take live cells and wash them several times in RNALater buffer – does that immediately kill cells? If not, I was wondering if that is inducing a response. I was also left wondering if the protocol captures only the most abundant transcripts. Perhaps I missed this on the supplement, but I was wondering how the% cells in which a transcript was measured compared to RPKM from bulk measurements. Clearly more abundant transcripts will be more easily captured, but some more analysis here would be useful for a new method.

We thank the reviewer for raising this point; this is a very important methodological detail. It is necessary to fix cells prior to processing to prevent changes in the transcriptome. We chose to use RNAlater (saturated ammonium sulfate) in the initial development of our protocol as it is a widely used approach that has been successfully applied to yeast in the past. Snap freezing in liquid nitrogen is not a fixative but a storage solution, and it is not feasible with the single-cell workflow anyway, and we found in preliminary testing that methanol fixation introduced obvious bias after reverse transcription.

To specifically address the reviewer’s concern about transcript capture bias, we performed an RNA sequencing experiment using bulk RNA extracted from wildtype cells in YPD using a trizol-based protocol. For this bulk experiment, we used a 3’ end barcoding with UMI strategy that is directly comparable to the 10x genomics method for cDNA synthesis and library preparation, and which can be analyzed through the same computational pipeline as our single-cell data. The results of this experiment are presented in Figure 2—figure supplement 2. We find that, in aggregate, the single-cell expression data from WT cells in YPD correlates very well with the expression data from bulk RNA (spearman correlation of 0.94). In addition, despite considerable differences in technical protocols, as well as using different strains (auxotrophic haploids vs prototrophic diploids in our study), we find that the single-cell expression data from our 10x Genomics protocol also correlates well with the expression data from the yscRNAseq protocol reported by Nadal-Ribelles et al, 2019 (spearman correlation of 0.83). We believe that this convincingly shows that the 10x Genomics-based method we developed for scRNAseq does not introduce any biases in the data.

3) The GRN modeling was not clear to me from the main text. It appears that the authors are using their published Inferelator method that takes priors based on ChIP-seq data, and re-bulks the scRNAseq data (at least for the multi-task inference). Perhaps their point is that the 10x approach allows pooling of many genotypes and conditions, but for me the analysis missed the potential power of having single cell data. The authors make statements on the networks in Figure 6 and Figure 7 about the number of "novel" regulatory connections – but I saw no validation of those predictions, including by computation. How do we know that any of these are real and that the method is producing new insights? AUROCs comparing to known data is not enough to say that new regulatory connections were discovered. This was especially true for Figure 7 – how do we know these new predictions mean anything about a connection to cell cycle without some validation?

We have clarified in the text that all network modeling is based on single-cell data and does not use pseudo-bulked data. We now emphasize that a novel interaction in this context is defined as interactions we discover using our network inference method without prior knowledge. To validate these interactions we now include a comparison to data in the YEASTRACT database that was not included in our network inference method as priors. Of the 6114 novel interactions discovered in our network, YEASTRACT provides evidence of an existing regulatory relationship for more than half. We believe that this provides orthogonal validation of our newly discovered interactions.

As our data allows us to determine the expression of individual cells the true advantage is accessing expression at the single-cell level. While some high-quality work has been done to study the intersection of cell-cycle regulation and metabolic regulation, we believe that studying asynchronous cultures at the single-cell level will be immensely valuable. Studying these novel regulatory relationships will be the subject of future work.

4) I had some quibbles with part of the Discussion.

i) First, while the authors cite several recent S. cerevisiae scRNAseq datasets, saying that this is the "first report of large-scale scRNAseq" in yeast is not accurate – it's true they measured 38,000 cells but at a depth of only <700 genes per cell, which is far fewer than 2500-3000 of other studies in several hundred cells. A fairer sentence is required, also citations of recent Sz. pombe scRNAseq (Saint et al., 2019) should be included.

In response to this comment we have changed the phrase ‘large-scale’ to ‘droplet-based’ and changed instances of ‘yeast’ in this context to ‘budding yeast’. We have also noted that other work has higher read depth per cell than this work and cite the recent study by Saint et al., 2019.

ii) "We observe significant heterogeneity in individual cells.… Much of this variation can be explained by the mitotic cell cycle" – that statement is not true, there does not seem to be an attempt to quantify heterogeneity over most genes. That they see heterogeneity in cell cycle stage as expected does not mean that cell cycle stage explains heterogeneity in the rest of the response, which was not reported on here.

We provide plots of descriptive measures of variability as Figure 2—figure supplement 4. Our experimental design does not allow for a biological interpretation of these results as we are currently not able to distinguish technical variability from biological variability on a per-gene basis. We have included additional discussion about the potential for heterogeneity in some growth conditions; Figure 3—figure supplement 2 suggests that there are some cells undergoing different responses to stressful conditions. Although we are cautious about the capacity for our experimental to distinguish between biological heterogeneity and technical heterogeneity on a per-gene basis we provide these results as an additional example of the potential power of our approach.

Reviewer #2:

[…] The authors provide a complete and user-friendly dataset (deposited and interactive through a shiny app) that will be a valuable resource for the yeast community. All in all, I think this is a very solid manuscript that with minor corrections I would strongly support for publication in eLife.

Beyond the development of the method, the main and most original point of the manuscript is that the authors aim to generate infer gene regulatory networks based on the transcriptomes retrieved from barcoded genotypes in diverse conditions. I have some questions and comments (of varying levels of concern) that I feel should be addressed in the current version of the manuscript.

The authors leverage in their previous experience in GRN reconstruction and propose this approach has allowed them to discover novel regulatory relationships between cell cycle-regulated gene expression in response to changes in nitrogen source. In my opinion, this part of the manuscript needs to be reinforced and some of the conclusions driven from the GRN should be and some representative novel regulatory relationships experimentally demonstrated. As well the authors should provide context to their findings as they often read a bit disconnected.

We now include a comparison of the network which we have learned to regulatory interactions defined in the YEASTRACT database that we did not provide to our network inference method as priors. Of the 6,114 new interactions identified in our network, more than half have evidence of an existing regulatory relationship in YEASTRACT. While additional experimentation would certainly provide additional evidence for individual interactions, we feel that this result validates the overall network. Dissection of individual interactions and further functional characterization of novel regulatory relationships between the cell cycle and nitrogen metabolism is the subject of future work.

In terms of data quality, the number of mitochondrial and ribosomal reads per genotype and condition should be plotted, as a quality metric and given that a lot of ribosomal gene expression is cell cycle regulated. This could be relevant to understand separate clusters within conditions which the authors do not mention in the results and/or discussion and given the fact that different zymolyase concentrations were used for cell lysis.

This is an excellent suggestion. We performed this analysis and have added Figure 2—figure supplement 3 which contains total counts, ribosomal genes, ribosomal biogenesis genes, induced environmental stress response genes, and mitochondrial genome genes overlaid over the UMAP plots.

The optical density of the cells at the harvesting (besides the total cell number) should be provided to ease the reproducibility for other labs.

Cell concentrations were determined by cell counting (not OD); we have revised the Materials and methods section to include the concentrations at harvest (the DIAUXY cell density is not available; cells were harvested based on glucose concentration in media and the cell density was only measured after resuspension in RNAlater. Back calculation from this value suggests a culture density of ~1.0 x 108 cells/mL).

If I understand correctly, cell cycle clusters within conditions in Figure 3 is confusing. For example expression of DSE2 or PIR1 seem to be highest in the green and grey clusters respectively (Figure 3A) however in Figure 3B panel I the highest expression is assigned to the grey-yellow for DSE2 and PIR1 to the green cluster.

We thank the reviewer for pointing out this mistake. The labeling on Figure 3B was incorrect; the DSE2 and PIR1 labels were swapped (they were correct in Figure 3—figure supplement 1B). This error has been fixed.

As well, the UMAPS from Figure 3A, the authors claim the clustering is mainly condition-dependent and genotype-independent. However, conditions like MMEtOH, CSTARVE, NLIM-PRO have clear clusters that do not seem cell cycle-dependent and these might be biologically relevant. The authors should at least comment or those or run a DE analysis to see what these are.

This is an excellent suggestion. We have relabeled Figure 3A with expression of gene categories and included it as Figure 3—figure supplement 2. It appears that several stressful conditions have clusters that are upregulated for environmental stress response genes and downregulated for ribosomal and cell-cycle genes. The most likely explanation is that in higher-stress conditions, we are observing some cells that are in a quiescent state. We propose this possibility in the main text while cautioning that the static nature of our experimental design limits our ability to conclusively demonstrate this.

How do the newly generated data compare to Gasch et al., 2017 and Nadal-Ribelles et al., 2019?

We have compared our scRNAseq data to several published expression datasets in Figure 2—figure supplement 2. We find reasonable agreement with Nadal-Ribelles et al., 2019 despite differences in strain and experimental technique. The agreement with data form Gasch et al., 2017 is less similar, likely reflecting differences in genetic background, growth conditions, and experimental technique. These comparisons highlight the challenge of integrating single-cell data from separate experiments. However, several recent papers have explored integration of single-cell data (e.g. Butler et al., 2018, Stuart et al., 2019). One of the main long-term advantages of our work (that we will explore in the future) is that it could be applied to integrate multiple data sets that have been prepared differently into unified network models.

The number of differentially expressed genes even in the YPD condition seems a bit low. How do these pseudobulk compare to the deletome data (Holstege lab) or other published datasets?

The deletome data from Holstege 2014 was generated in synthetic complete media (the closest comparison in our data set is minimal media). The TFs which we have focused on are in pathways known to be dysregulated in auxotrophic strains like the BY4741 family used as a basis for the yeast deletion collection (the auxotrophies are all deletions in nitrogen anabolic pathways; our work does not use these strains, which makes direct comparisons to the standard yeast deletion collection more challenging).

As well, the authors use DESeq in Figure 4B, but these results are contradictory with what is shown in Figure S4Bi which is done by Welch testing. This is a bit confusing and does not add much to the reader. I would suggest to run DE between conditions using DESeq or provide the reasoning as to why these two different approaches are used and done?

We agree with the reviewer that this analysis adds nothing to this work and is a source of confusion. Therefore, we have removed the t-test based figures.

Why do transcription factor activities FKH1 and FKH2 and SWI4 SWI5 do not overlap (they almost seem mutually exclusive), one would expect them to have similar profiles. Similarly, NDD1 regulates S-phase genes but the TFA does not overlap with the HTB expression shown in Figure 3.

We thank the reviewer for identifying this inconsistency. While looking into this, we identified an issue with the modeling that disproportionately affected genes with extremely low average values in one modeling task resulting in them highly influencing certain TF activities. We have corrected this problem by adding a filter to remove genes with low average expression, and we have replaced NDD1 in this figure with MBP1, another cell-cycle TF.

SWI4 and SWI5 are not currently annotated on SGD as acting in the same stage of the cell cycle (SWI4 is annotated as late G1/S and SWI5 is annotated as early G1/M), and so we don’t expect them to overlap. TFs with functional redundancy like FHK1 and FHK2 would be expected to overlap, but the redundancy makes modeling regulatory relationships more challenging (this problem is one of the main reasons we have chosen nitrogen metabolism as our model, as it is also full of functional redundancies and cross-regulatory loops). Methodological approaches to functional redundancies is the subject of ongoing work.

Reviewer #3:

[…] I have some comments, which I think can strengthen the messages of the paper.

1) In Figure 5C, is the single task network inferred by merging all the data and learning a single network or by learning separate networks with single tasking and aggregating the results? If not, how does the single task per condition followed by aggregation perform?

4) A comparison of the AUPR of the single task condition-specific networks and the multi-task condition specific network could further show the advantage of the multi-task learning framework.

Both of these (point 1 and point 4) are excellent suggestions. Initially, Figure 5D was comparing merged data learning a single network to AMuSR. Figure 5D now compares all data combined to learn a single-task network [BBSR (ALL)], all data learned separately with single-tasking and then aggregated into a single network [BBSR (BY TASK)], and all data which is learned together with multitask learning [AMuSR (MTL)]. In addition, the second panel of 5D has been replaced with a comparison of the task-specific networks learned individually with BBSR (labeled as [BBSR (BY TASK)]), and the task-specific networks learned jointly with AMuSR (labeled as [AMuSR (MTL)]). We find that the majority of the performance gained by MTL is due to separating individual conditions; sharing information during regression improves performance on some task-specific networks, but decreases performance on others (overall the effect balances out on the final aggregate network).

2) The authors don't get much into the context-specificity of the inferred networks. They interpret only the final aggregated network. It would be useful to know how similar the individual condition-specific networks are and if there is a conserved core used by multiple conditions. The only comparison of context specificity is being done at the level of AUPR, it might be helpful to do this comparison just by comparing the inferred networks.

This is an excellent point. The composition of the final, aggregate network is examined in Figure 6—figure supplement 1C-E. We find that there does exist a common core of interaction which we recover in every task, most of which are in the prior. Much of the learned network corresponds to interactions found in multiple condition-specific networks, but we also find a large number of interactions that are unique to a single condition-specific network (when we examine individual condition networks we find a large number of TF-gene interactions which are present in only one network; most of these are not included in the final, aggregate network, but some are). We think that this observation provides a strong justification for learning networks from separate growth conditions with distinct transcriptional profiles and have emphasized this point in our revised manuscript.

3) Some discussion about the variation in the AUPRs would be helpful. Is it because the gold standard is biased towards the conditions on which the AUPR is high. It seems the AUPR for the MMEtOH network is close to what is inferred by the multi-task learning, and some explanation of why this might be is helpful.

This is a great point. We think that the variation is mainly due to cells in certain conditions requiring more of their transcriptome to be active. Cells in rich YPD media have mostly glycolytic, cell growth, and cell cycle expression programs active, whereas cells in minimal MMD media require all of these processes plus anabolic pathways for a number of essential compounds. Cells in MMEtOH can turn off glycolysis but need respiration and redox-related pathways to be active. We have added this interpretation to the main text.

5) It would be helpful to emphasize if and how the multi-task learning approach used here was from extended from the Castro, 2019 paper.

Although there are some changes in implementation, the multi-task learning approach here is the same conceptually as the technique introduced in Castro et al., 2019. The most important difference is that we’ve chosen not to weight prior interactions differently than interactions that are not in the prior during modeling; priors in this work are only used to calculate transcription factor activity. We have clarified this in the text.

6) The Discussion could be strengthened. The authors present some results about the interplay between cell cycle and nitrogen response. It was not clear why this is interesting to study beyond that there is a shared regulatory program. This might be worth bringing up in the Discussion to tie back to the initial goal of inferring a network for nitrogen metabolism and the TOR signaling pathway and the general role of cell cycle and stress response.

We have added a paragraph to our Discussion to more explicitly discuss the relationship between nitrogen metabolism, TOR signaling and the cell cycle.

7) The authors don't find a substantial impact of TF knockout on gene expression under different conditions (I assume the comparisons were done while controlling for the conditions). How does this compare to bulk data? How much of this observation could be due to the sparsity of scRNAseq data versus the redundancy of TFs.

Our new experiments show scRNAseq data is in good agreement with bulk RNAseq data. We think that the limited impact of TF knockouts on expression is consistent with earlier published studied (e.g. the Holstege deletome study). Therefore, we think that the primary limitations of TF deletions are 1) functional redundancy and 2) studying expression in static rather than dynamic conditions. We now include these points in our Discussion and propose alternative approaches that might elicit stronger transcriptional responses (e.g. inducible overexpression of TFs).

https://doi.org/10.7554/eLife.51254.sa2

Article and author information

Author details

  1. Christopher A Jackson

    1. Center For Genomics and Systems Biology, New York University, New York, United States
    2. Department of Biology, New York University, New York, United States
    Contribution
    Conceptualization, Resources, Data curation, Software, Formal analysis, Validation, Investigation, Visualization, Methodology, Writing - original draft, Writing - review and editing
    Contributed equally with
    Dayanne M Castro
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-8769-2710
  2. Dayanne M Castro

    Department of Biology, New York University, New York, United States
    Contribution
    Software, Formal analysis, Investigation, Visualization, Methodology, Writing - original draft, Writing - review and editing
    Contributed equally with
    Christopher A Jackson
    Competing interests
    No competing interests declared
  3. Giuseppe-Antonio Saldi

    Department of Biology, New York University, New York, United States
    Contribution
    Software, Formal analysis
    Competing interests
    No competing interests declared
  4. Richard Bonneau

    1. Center For Genomics and Systems Biology, New York University, New York, United States
    2. Department of Biology, New York University, New York, United States
    3. Courant Institute of Mathematical Sciences, Computer Science Department, New York University, New York, United States
    4. Center For Data Science, New York University, New York, United States
    5. Flatiron Institute, Center for Computational Biology, Simons Foundation, New York, United States
    Contribution
    Conceptualization, Resources, Supervision, Funding acquisition, Project administration, Writing - review and editing
    Contributed equally with
    David Gresham
    For correspondence
    bonneau@nyu.edu
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0003-4354-7906
  5. David Gresham

    1. Center For Genomics and Systems Biology, New York University, New York, United States
    2. Department of Biology, New York University, New York, United States
    Contribution
    Conceptualization, Resources, Supervision, Funding acquisition, Project administration, Writing - review and editing
    Contributed equally with
    Richard Bonneau
    For correspondence
    dgresham@nyu.edu
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-4028-0364

Funding

National Institute of Diabetes and Digestive and Kidney Diseases (R01DK103358)

  • Richard Bonneau

National Institute of General Medical Sciences (R01GM107466)

  • David Gresham

National Science Foundation (MCB1818234)

  • David Gresham

Eunice Kennedy Shriver National Institute of Child Health and Human Development (R01HD096770)

  • Richard Bonneau

National Science Foundation (IOS1546218)

  • Richard Bonneau

National Cancer Institute (R01CA229235)

  • Richard Bonneau

Flatiron Institute

  • Richard Bonneau

Simons Foundation

  • Richard Bonneau

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

We would like to thank past and present members of the Gresham and Bonneau labs, as well as Christine Vogel’s lab, for discussions and feedback. We thank our undergraduate researchers, especially Juli Miller, for help constructing strains. We thank Tara Rock, Olivia Micci-Smith, and Hana Husic from the NYU Genomics Core facility for troubleshooting suggestions and DNA sequencing services. RB acknowledges support from the Flatiron Institute, the Simons Foundation, the NIH (R01DK103358, R01HD096770, and R01CA229235) and the NSF (IOS1546218). DG is funded by the NIH (R01GM107466) and NSF (MCB1818234).

Senior Editor

  1. Detlef Weigel, Max Planck Institute for Developmental Biology, Germany

Reviewing Editor

  1. Naama Barkai, Weizmann Institute of Science, Israel

Publication history

  1. Received: August 21, 2019
  2. Accepted: January 10, 2020
  3. Accepted Manuscript published: January 27, 2020 (version 1)
  4. Version of Record published: February 6, 2020 (version 2)
  5. Version of Record updated: February 27, 2020 (version 3)

Copyright

© 2020, Jackson et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Metrics

  • 6,620
    Page views
  • 801
    Downloads
  • 5
    Citations

Article citation count generated by polling the highest count across the following sources: Crossref, PubMed Central, Scopus.

Download links

A two-part list of links to download the article, or parts of the article, in various formats.

Downloads (link to download the article as PDF)

Download citations (links to download the citations from this article in formats compatible with various reference manager tools)

Open citations (links to open the citations from this article in various online reference manager services)

Further reading

    1. Computational and Systems Biology
    2. Human Biology and Medicine
    Michael R Garvin et al.
    Short Report Updated

    Neither the disease mechanism nor treatments for COVID-19 are currently known. Here, we present a novel molecular mechanism for COVID-19 that provides therapeutic intervention points that can be addressed with existing FDA-approved pharmaceuticals. The entry point for the virus is ACE2, which is a component of the counteracting hypotensive axis of RAS. Bradykinin is a potent part of the vasopressor system that induces hypotension and vasodilation and is degraded by ACE and enhanced by the angiotensin1-9 produced by ACE2. Here, we perform a new analysis on gene expression data from cells in bronchoalveolar lavage fluid (BALF) from COVID-19 patients that were used to sequence the virus. Comparison with BALF from controls identifies a critical imbalance in RAS represented by decreased expression of ACE in combination with increases in ACE2, renin, angiotensin, key RAS receptors, kinogen and many kallikrein enzymes that activate it, and both bradykinin receptors. This very atypical pattern of the RAS is predicted to elevate bradykinin levels in multiple tissues and systems that will likely cause increases in vascular dilation, vascular permeability and hypotension. These bradykinin-driven outcomes explain many of the symptoms being observed in COVID-19.

    1. Computational and Systems Biology
    2. Neuroscience
    Omer Mazar, Yossi Yovel
    Research Article Updated

    For decades, researchers have speculated how echolocating bats deal with masking by conspecific calls when flying in aggregations. To date, only a few attempts have been made to mathematically quantify the probability of jamming, or its effects. We developed a comprehensive sensorimotor predator-prey simulation, modeling numerous bats foraging in proximity. We used this model to examine the effectiveness of a spectral Jamming Avoidance Response (JAR) as a solution for the masking problem. We found that foraging performance deteriorates when bats forage near conspecifics, however, applying a JAR does not improve insect sensing or capture. Because bats constantly adjust their echolocation to the performed task (even when flying alone), further shifting the signals' frequencies does not mitigate jamming. Our simulations explain how bats can hunt successfully in a group despite competition and despite potential masking. This research demonstrates the advantages of a modeling approach when examining a complex biological system.