1. Computational and Systems Biology
  2. Developmental Biology

Control of tissue development and cell diversity by cell cycle-dependent transcriptional filtering

  1. Maria Abou Chakra
  2. Ruth Isserlin
  3. Thinh N Tran
  4. Gary D Bader (corresponding author)
  1. The Donnelly Centre, University of Toronto, Canada
Research Article
Cite this article as: eLife 2021;10:e64951 doi: 10.7554/eLife.64951

Abstract

Cell cycle duration changes dramatically during development, starting out fast to generate cells quickly and slowing down over time as the organism matures. The cell cycle can also act as a transcriptional filter to control the expression of long gene transcripts, which are partially transcribed in short cycles. Using mathematical simulations of cell proliferation, we identify an emergent property that this filter can act as a tuning knob to control gene transcript expression, cell diversity, and the number and proportion of different cell types in a tissue. Our predictions are supported by comparison to single-cell RNA-seq data captured over embryonic development. Additionally, evolutionary genome analysis shows that fast-developing organisms have a narrow genomic distribution of gene lengths while slower developers have an expanded number of long genes. Our results support the idea that cell cycle dynamics may be important across multicellular animals for controlling gene transcript expression and cell fate.

Introduction

A fundamental question in biology is how a single eukaryotic cell (e.g., zygote, stem cell) produces the complexity required to develop into an organism. A single cell will divide and generate many progeny, diversifying in a controlled and timely manner (Mueller et al., 2015) to generate cells with very different functions than the parent, all with the same genome (Wilmut et al., 1997). Many regulatory mechanisms coordinate this process, but much remains to be discovered about how it works (Zoller et al., 2018). Here, we explore how cell cycle regulation can control gene transcript expression timing and cell fate during tissue development.

The canonical view of the cell cycle is a timely stepwise process. Typically, the complete cell cycle is divided into four phases: first gap phase (G1), synthesis phase (S), second gap phase (G2), and mitotic phase (M). The length of each phase determines how much time a cell allocates for processes associated with growth and division. However, the amount of time that is spent in each phase frequently differs from one cell type to another within the same organism. For example, some cells experience fast cell cycles, especially in early embryogenesis. Organisms such as the fruit fly (Drosophila melanogaster) and the worm (Caenorhabditis elegans) exhibit cell cycle durations as short as 8–10 min (Edgar et al., 1994; Foe, 1989). Cell cycle duration also changes over development (Figure 1 and Supplementary file 1). For example, it increases in mouse (Mus musculus) brain development from an average of 8 hr at embryonic day 11 (E11) to an average of 18 hr by E17 (Furutachi et al., 2015; Takahashi et al., 1995a).

Cell cycle duration changes during mouse development.

The data was curated from several publications (PubMed identifiers: 5859018, 14105210, 5760443, 5542640, 4041905, 7666188, 12151540, 18164540), shown in the legend as authors and (year). For other species and tissues, see Supplementary file 1.

Interestingly, cell cycle duration can act as a transcriptional filter that constrains transcription (Rothe et al., 1992; Shermoen and O'Farrell, 1991). In particular, if the cell cycle progresses relatively fast, transcription of long genes will be interrupted. In typical cells, the gene transcription rate is between 1.4 and 3.6 kb per minute (Ardehali and Lis, 2009). Thus, an 8 min cell cycle would only allow transcription of the shortest genes, on the order of 10 kb measured by genomic length, including introns and exons, whereas a 10 hr cell cycle would allow transcription of genes as long as a megabase on the genome.
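The arithmetic behind these bounds can be checked with a short sketch. It assumes, as a simplification, that a single polymerase transcribes for the entire cell cycle at a constant rate; the function name is illustrative and not taken from the article.

```python
def max_transcribable_length_kb(cycle_minutes, rate_kb_per_min):
    """Longest gene (kb) one polymerase could finish if the whole
    cell cycle were available for transcription (a simplification)."""
    return cycle_minutes * rate_kb_per_min


# Transcription rate range quoted in the text: 1.4 to 3.6 kb per minute.
for label, minutes in (("8 min", 8), ("10 hr", 10 * 60)):
    low = max_transcribable_length_kb(minutes, 1.4)
    high = max_transcribable_length_kb(minutes, 3.6)
    print(f"{label} cycle: about {low:.0f} to {high:.0f} kb")
```

Under these assumptions an 8 min cycle permits genes of roughly 11 to 29 kb, and a 10 hr cycle permits genes of roughly 0.8 to 2.2 Mb, in line with the orders of magnitude given above.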

Cell cycle-dependent transcriptional filtering has been proposed to be important in cell fate control (Bryant and Gardiner, 2016; Swinburne and Silver, 2008). Most multicellular eukaryotic animals start embryogenesis with short cell cycle durations and a limited transcription state (O'Farrell et al., 2004) with typically short zygotic transcripts (Heyn et al., 2014). These cells allocate the majority of their cycle time to S-phase (synthesis), where transcription is inhibited (Newport and Kirschner, 1982a), and M-phase (division), with little to no time for transcription in the gap phases. However, as the cell cycle slows down, time available for transcription increases (Edgar et al., 1986; Newport and Kirschner, 1982a; Newport and Kirschner, 1982b), enabling longer genes to be transcribed (Djabrayan et al., 2019; Shermoen and O'Farrell, 1991; Yuan et al., 2016).

We asked what effects cell cycle-dependent transcriptional filtering may have over early multicellular organism development. Through extensive mathematical simulations of developmental cell lineages, we identify the novel and unexpected finding that a cell cycle-dependent transcriptional filter can directly influence the generation of cell diversity and can provide fine-grained control of cell numbers and cell-type ratios in a developing tissue. Our computational model operates at single-cell resolution, enabling comparison to single-cell RNA-seq (scRNA-seq) data captured over development, supporting our model by showing similar trends. Our model also predicts genomic gene length distribution and gene transcript expression patterns that are consistent with a range of independent data. Our work provides new insight into how cell cycle parameters may be important regulators of cell-type diversity over development.

Results

Computational model of multicellular development

We model multicellular development starting from a single totipotent cell that gives rise to many progeny, each with its own transcriptome (Figure 2). We developed a single-cell resolution agent-based computational model to simulate this process (see Materials and methods). Each cell behaves according to a set of rules, and cells are influenced solely by intrinsic factors (e.g., number of genes in the genome, gene length, transcript levels, and transcription rate). We intentionally start with a simple set of rules, adding more rules as needed to test specific mechanisms. Our analysis is limited to pre-mRNA transcript expression, and we do not consider other gene expression-related factors, such as splicing, translation, or gene-gene interactions. We also omit external cues (e.g., intercellular signaling or environmental gradients) to focus on the effects of intrinsic factors.

A novel mathematical model of cell lineage generation.

(A) A single cell is defined by a given number of genes in its genome as well as their gene lengths (e.g., three genes, gene1 < gene2 < gene3). Cell cycle duration defines the time a cell has available to transcribe a gene. (B) For example, a cell with cell cycle duration = 1 hr will only enable transcription of gene1; cell cycle duration = 2 hr will enable transcription of gene1 and gene2; cell cycle duration = 3 hr enables transcription of all three genes. (C) Our model assumes that transcripts passed from parental cell to its progeny will be randomly distributed during division (M-phase). (D) Each cell is characterized by its transcriptome, represented as a vector.

In our model, each cell is characterized by a fixed genome containing a set of G genes (gene1, gene2, …, geneG), shown in Figure 2A. Each genei is defined by a length, Li (in kb), and in all our simulations each gene is assigned a different length (L1 < L2 < … < LG). Since each genei has a unique length, Li, we label genes by their length (genei of length Li is written geneLi; e.g., gene3 is a gene of length 3 kb). We assume transcription time for genei is directly proportional to its length, Li. In the model, each cellj is initialized with a cell cycle duration (Γcell), which represents the total time available for gene transcription (see Materials and methods). For example, we can initialize cell1 with a three-gene genome (gene1, gene2, gene3), where L = (1 kb, 2 kb, 3 kb), and a cell cycle duration Γ1 of 1 hr. We fix transcription rate, λ, to 1 kb/hr for all genes (though this assumption can be relaxed without changing our results; Figure 3 and Figure 3—figure supplement 1). As transcription progresses for all genes, cell1 will only express gene1. Increasing cell cycle duration, Γcell, will allocate more time for transcription, allowing longer genes to be transcribed. For example, if we initialize cell2 with a cell cycle duration Γ2 = 3 hr, cell2 will express all three genes, with time to make three copies of gene1 (Figure 2B). We assume that RNA polymerase II re-initiation occurs along the gene, a distance Ω apart (Figure 3—figure supplement 1).
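The three-gene example can be reproduced with a minimal sketch of the filter. It is not the authors' implementation: it deliberately ignores RNA polymerase II re-initiation (Ω) and assumes each gene is transcribed sequentially by one polymerase, so a gene yields one transcript per completed pass. The function name `transcribe` is hypothetical.

```python
def transcribe(gene_lengths_kb, cycle_hr, rate_kb_per_hr=1.0):
    """Completed transcripts per gene in one cell cycle.

    Simplifying assumption (not from the article): no polymerase
    re-initiation, so each gene yields floor(cycle * rate / length).
    """
    budget_kb = cycle_hr * rate_kb_per_hr
    return tuple(int(budget_kb // length) for length in gene_lengths_kb)


genome = (1.0, 2.0, 3.0)        # gene1, gene2, gene3 (lengths in kb)
print(transcribe(genome, 1))    # cell1, 1 hr cycle: only gene1 completes
print(transcribe(genome, 3))    # cell2, 3 hr cycle: all three genes complete
```

With a 3 hr cycle the sketch gives three copies of gene1 and one copy each of gene2 and gene3, which matches the example in the text even though re-initiation is left out.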

Figure 3 (with 4 supplements)
Short genes produce more transcripts than longer genes at multiple cell cycle duration lengths.

The transcriptome for each cell is subdivided into short, medium, and long gene bins, and transcript counts are averaged per bin per cell. (A) Simulations predict that short gene transcripts will be more highly expressed than long gene transcripts, irrespective of the genome size. Simulation results are shown for cell cycle durations of 1, 5, and 10 hr and gene lengths (geneL1-L10); see Figure 3—figure supplement 3 for additional simulations (other parameters: ploidy = 1, one cell division, iterations = 5,000,000, genome G = 10, geneL1-L10, transcription rate, λ = 1 kb/hr, RNA polymerase II re-initiation, Ω = 0.25 kb). Bins are defined such that genes are evenly distributed across them. (B) Single-cell microglia data obtained from GSE134707 (Geirsdottir et al., 2019) display the expected pattern, where short genes (lengths < 10 kb) have higher transcript expression than both medium genes (lengths > 10 kb) and longer genes (lengths > 25 kb) (Kolmogorov–Smirnov test, p < 10^-16, the upper bound on the p-value for all short-medium and short-long comparisons), across nine different species (age): Macaca fascicularis (3 years), Callithrix jacchus (7 years), Mus musculus (8–16 weeks), Rattus norvegicus (11–14 weeks), Mesocricetus auratus (8–16 weeks), Nannospalax galili (2–4 years), Ovis aries (18–20 months), Gallus gallus (24 weeks), and Danio rerio (4–5 months). The top part of the plot shows the total number of genes possible in each bin, given the gene length distribution of each genome. Bins are defined such that they are both consistent across all species and also approximately evenly filled with genes.

Once transcription is complete, the cell enters M-phase, during which it divides, and expressed transcripts are randomly distributed to the two progeny cells (Figure 2C). This is the main stochastic component in our model. We assume that transcription begins anew at the start of the cell cycle (i.e., all transcripts from a gene that cannot be finished in one cycle are eliminated), modeling the known degradation of incomplete nascent transcripts in M-phase (Shermoen and O'Farrell, 1991). Relaxing our assumption to include parental transcript inheritance and decay (Sharova et al., 2009), where a proportion of inherited parental transcripts remain after each cell division, does not change our overall results (Figure 3—figure supplement 2). All individual cells and their transcriptomes are tracked over the course of the simulation, enabling single-cell resolution analysis. Transcriptomes are stored as vectors containing the total number of transcripts per gene. For instance, cell2 may have a transcriptome of (3,1,1), indicating that three genes are expressed, with gene1 expressed at three transcripts per cell and the other two genes expressed at one transcript per cell (Figure 2D).
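The division step can be sketched as follows. The text states only that transcripts are randomly distributed between the two progeny; the sketch assumes each transcript goes to either daughter with equal probability, which is an assumption made here for illustration. The function name `divide` is hypothetical.

```python
import random


def divide(transcriptome, rng):
    """Split a parental transcriptome vector between two daughter cells.

    Assumption (not stated in the article): every transcript is assigned
    independently to either daughter with probability 0.5.
    """
    daughter_a = tuple(
        sum(rng.random() < 0.5 for _ in range(count)) for count in transcriptome
    )
    daughter_b = tuple(total - a for total, a in zip(transcriptome, daughter_a))
    return daughter_a, daughter_b


rng = random.Random(42)          # fixed seed only so the illustration is repeatable
parent = (3, 1, 1)               # the cell2 transcriptome from the example
daughter_a, daughter_b = divide(parent, rng)
print(daughter_a, daughter_b)    # the two vectors always sum to the parent's
```

Transcript numbers are conserved at division in this sketch; decay and inheritance across further cycles are not modeled.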

Model prediction: cell cycle duration influences transcript count – short genes generate more transcripts than longer genes

We begin by examining how a transcriptional filter impacts transcript counts, as controlled by cell cycle duration. Shorter cell cycles will interrupt long gene transcription, resulting in relatively high expression of short gene transcripts and low expression of long gene transcripts. Our computational simulations generate this expected pattern (Figure 3A). Each simulated cell transcriptome is divided into three bins containing short, medium, and long genes, and then each bin is summarized with an average transcript count. In simulations, bins with short genes exhibit the highest average transcript count levels. As cell cycle duration increases, more cells show an increase in transcript count of longer genes; the trend is consistent for various genome sizes and gene length distributions (Figure 3A and Figure 3—figure supplement 3).
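The binning step can be illustrated with a small sketch. It assumes genes are sorted by length and split into three bins of near-equal size, and it uses the simplified no-re-initiation transcript count floor(Γλ/L); both choices are illustrative assumptions rather than the authors' exact procedure.

```python
def bin_means(gene_lengths, transcript_counts, n_bins=3):
    """Average transcript count per length bin (short, medium, long).

    Genes are sorted by length and split into n_bins groups of
    near-equal size; earlier bins take any remainder.
    """
    order = sorted(range(len(gene_lengths)), key=lambda i: gene_lengths[i])
    size, extra = divmod(len(order), n_bins)
    means, start = [], 0
    for b in range(n_bins):
        stop = start + size + (1 if b < extra else 0)
        members = order[start:stop]
        means.append(sum(transcript_counts[i] for i in members) / len(members))
        start = stop
    return means


gene_lengths = list(range(1, 11))     # geneL1-L10, lengths in kb
cycle_hr, rate_kb_per_hr = 5, 1.0
counts = [int(cycle_hr * rate_kb_per_hr // length) for length in gene_lengths]
short, medium, long_ = bin_means(gene_lengths, counts)
print(short, medium, long_)
```

For a 5 hr cycle the short bin has the highest mean count and the long bin has none, which is the qualitative pattern described above.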

scRNA-seq has recently been used to profile mRNA expression of thousands of cells for one cell type (microglia) across multiple species (Geirsdottir et al., 2019) or for multiple embryonic developmental time points in one species, such as Xenopus tropicalis (Briggs et al., 2018) and Danio rerio (Kimmel et al., 1995; Wagner et al., 2018), or tissue, such as mouse neural cortex (Yuzwa et al., 2017). We analyzed these data in the same manner as our model (Figure 3B and Figure 3—figure supplement 4) and found that, in general, short genes have a higher mRNA expression level than longer genes within a cell. Thus, gene mRNA expression patterns from a range of scRNA-seq data sets, including developmental time courses, are compatible with our model prediction.

Model prediction: cell cycle duration can control cell diversity

We next asked how three major model parameters (cell cycle duration, maximum gene length, and number of genes in the genome) can influence the generation and control of cell diversity observed during normal multicellular development. We conducted simulations for a single cell division step for simplicity, but these were repeated thousands of times to model cell population effects. We compute cell diversity in two ways: first, as the number of distinct transcriptomes in the cell population (transcriptome diversity); and second, as the number of distinct transcriptomic clusters, as defined using standard single-cell transcriptomic analysis techniques (Satija et al., 2015) (see Materials and methods). Both measures model real cell types and states that are distinguished by their transcriptomes, with transcriptome diversity as an upper bound on cell-type number, and cluster number approximating a lower bound. We first ran simulations with an active transcriptional filter by varying only the cell cycle duration, Γ, for a genome with 10 genes, with genes ranging in size from 1 to 10 kb, such that it satisfies L1 = 1 ≤ Γ ≤ LG = 10. Short cell cycle duration parameter values generated a homogeneous population of cells because only short transcripts can be transcribed. As cell cycle duration was increased, transcriptome diversity also increased. Longer cell cycle duration values generated heterogeneous populations because a range of transcripts can be expressed (Figure 4A, brown line). Interestingly, cell cluster diversity peaks at intermediate cell cycle duration parameter values (Figure 4B, brown line; Figure 4C) because new genes are introduced with increasing cell cycle lengths, but eventually long cell cycles provide sufficient time for cells to make all transcripts, which leads to reduced variance between the progeny. We next repeated this experiment with the transcriptional filter turned off, by reducing the maximum gene length such that LG < Γ (Figure 4A, B, blue line). In this case, cell diversity can be generated, but it quickly saturates (Figure 4B, blue line), as all transcripts are expressed, given a cycle duration allowing the expression of the longest transcript. Thus, while cellular diversity can be generated with an active or inactive transcriptional filter, diversity is more easily controlled by cell cycle duration when the transcriptional filter is active.
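A toy version of the transcriptome diversity measurement is sketched below: one parent cell transcribes, divides many times independently, and the distinct daughter transcriptome vectors are counted. The sketch is a simplification under stated assumptions (no polymerase re-initiation, equal-probability partitioning of transcripts, ploidy of 1), and the function names are hypothetical. Clustering-based diversity is not reproduced here.

```python
import random


def transcribe(gene_lengths_kb, cycle_hr, rate_kb_per_hr=1.0):
    """Completed transcripts per gene (assumes no re-initiation)."""
    budget_kb = cycle_hr * rate_kb_per_hr
    return tuple(int(budget_kb // length) for length in gene_lengths_kb)


def divide(transcriptome, rng):
    """Assign each transcript to one of two daughters with probability 0.5."""
    a = tuple(sum(rng.random() < 0.5 for _ in range(c)) for c in transcriptome)
    b = tuple(c - x for c, x in zip(transcriptome, a))
    return a, b


def transcriptome_diversity(gene_lengths_kb, cycle_hr, n_divisions=2000, seed=0):
    """Number of distinct daughter transcriptomes over repeated single divisions."""
    rng = random.Random(seed)
    parent = transcribe(gene_lengths_kb, cycle_hr)
    seen = set()
    for _ in range(n_divisions):
        seen.update(divide(parent, rng))
    return len(seen)


genome = tuple(float(length) for length in range(1, 11))   # geneL1-L10 (kb)
for cycle_hr in (1, 2, 4, 6):
    print(cycle_hr, transcriptome_diversity(genome, cycle_hr))
```

In this simplified setting the count is bounded above by the product over genes of (transcripts + 1), since a daughter can inherit anywhere from zero to all copies of each gene's transcripts, and the bound grows as longer cycles let more genes pass the filter.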

Cell cycle duration can control cell diversity.

Simulations explore the effects of cell cycle duration, Γ, gene number, G, and gene length distribution. (A) Simulations show that cell diversity (transcriptome diversity) increases as a function of cell cycle duration. Short cell cycle durations can constrain the effects of gene number as long as a transcriptional filter is active (gene length distributions are broad, L1<(Γλ)<LG). When LG<(Γλ), cell cycle duration does not control cell diversity. Cell cycle duration effects are relative to the gene length distribution in the genome. (B) We use Seurat to cluster the simulated single-cell transcriptomes (10,000 cells) using default parameters and report the number of cell clusters over the simulations. This shows that cell diversity increases with gene number, but the number of clusters identified decreases when all the gene transcripts can be expressed similarly among all cells. (C) Representative examples (10,000 cells) of t-SNE visualizations (RunTSNE using Seurat version 3.1.2) are shown for simulations with cell cycle durations 2, 6, and 10 hr (genome G = 10, geneL1-L10, ploidy n = 1, and transcription rate, λ = 1 kb/hr, RNA polymerase II re-initiation, Ω=0.25kb).

In general, transcriptome diversity increases as a function of cell cycle duration (Γ), transcription rate (λ), and number of genes in the genome (G). In particular, transcriptome diversity $= n \prod_{i=1}^{G} \left( T/L_i + 1 \right)$, where $n$ is the genome ploidy level, $T = \sum_{a=0}^{\frac{L_i}{\Omega} - 1} f(a)$ with $f(a) \ge 0$ and $f(a) = \frac{\Gamma\lambda - a\Omega}{\lambda}$ (i.e., the maximum transcribed gene length, $T$, is restricted by the product of cell cycle duration, Γ, transcription rate, λ, and RNA polymerase II re-initiation, Ω), and $L_i$ is the length of genei. This analytical solution of cell transcriptome diversity was validated by comparing it to simulations (Supplementary file 2). While the number of genes and their length distribution can change over the course of evolution, these numbers are constant for a given species, and transcription rate is likely highly constrained (Ardehali and Lis, 2009), leaving only cell cycle duration as a controllable parameter of cell diversity during development, according to our model.

Model prediction: varying cell cycle duration over developmental time controls tissue cell proportions and number

During multicellular organism development, it is essential to generate the correct numbers of cells and cell types. Cell cycle duration changes dramatically during development, generally starting out fast to generate cells quickly and slowing down over time as the organism matures (Supplementary file 1 and Figure 1; Farrell and O'Farrell, 2014; O'Farrell et al., 2004). Clearly, cells with short cell cycles generate more progeny compared to those with longer cell cycles. However, we propose that a tradeoff exists, balancing the generation of diversity (longer cell cycle durations) with the fast generation of cells (shorter cell cycle durations; Figure 4B). To study this tradeoff, we simulated cell propagation under a ‘mixed lineage’ scenario where, after the first division, one child cell and its progeny maintain a constant cell cycle duration (Γ1 = 1 hr) and the second child cell and its progeny maintain an equal or longer constant cell cycle duration over a lineage with 20 cell division events (Figure 5, gray and blue lineages, respectively). We initialize the starting cell with no prior transcripts (naïve theoretical state) and a genome containing five genes ranging in length from 1 to 2 kb (gene1, gene1.25, gene1.5, gene1.75, gene2), setting cell cycle duration in the second lineage to range between 1 and 2 hr, controlling the transcriptional filter threshold in this lineage only. We considered three scenarios: (1) both cell lineages cycle at the same rate (Fast-Fast, Figure 5A); (2) the second (blue) lineage is slower than the first (gray) (Slow-Fast, Figure 5B); and (3) both slow and fast lineages divide asymmetrically, producing one slow and one fast cell (Slow-Fast, Figure 5C).

Cell cycle duration can control the generation of cell proportions and cell types within a population.

Simulations start with two cells and run for 18 divisions (generating 2^19 cells when cell cycles are the same). Cell1 is initialized with cell cycle duration Γ1 = 1 hr; Cell2 has cell cycle duration, Γ2, ranging from 1 to 2 hr. All progeny are tracked based on their cell cycle duration (lineage with cell cycle duration Γ1 = 1 hr, gray, or lineage with cell cycle duration Γ2, blue). Tree plots depict lineages when the cell cycle duration (A) is the same, Γ1 = Γ2 (scenario 1), or (B, C) differs, Γ1 < Γ2 (scenarios 2 and 3). Scenario 2 captures a situation when the cell cycle is determined by the parental lineage, while scenario 3 captures a situation when a cell splits asymmetrically into a fast and a slow cell, resulting in the fast lineage having just one cell. (D–F) Müller visualizations show that when the cell cycle duration is the same, both cells contribute the same number of progeny and cell proportions (%) are 50:50 (bottom left panel). The visualization is stacked, down-scaling the blue lineage slightly to reduce occlusion of the gray lineage. Cells with longer cell cycle duration (blue lineage) generate fewer progeny with respect to the cells with a short cell cycle duration of 1 hr (gray lineage). However, the slower cells contribute more to the diversity observed in the population, shown as the blue and gray transcriptome diversity bars. Thus, increasing cell cycle duration increases cell diversity, but also limits the number of progeny generated. The system can overcome the limit on cell number by using scenario 3, where more slow cells can be generated (other parameters: G = 5, gene lengths (geneL1-L2), genome = {1, 1.25, 1.5, 1.75, 2}, ploidy = 1, RNA polymerase II re-initiation, Ω = 0.25 kb).

In the simulation where both cell lineages cycle at the same rate (Figure 5A), both lineages generate the same number of progeny with the same level of diversity (Figure 5D). When cell cycle duration for the second (blue) lineage is increased across simulations (Figure 5E), the transcriptional filter acts to generate more diverse progeny, but with fewer cell numbers and progressively smaller population proportions due to the slower cell cycle (Figure 5E, blue bars). Meanwhile, the short cell cycle lineage maintains a steady, low level of diversity generation (Figure 5E, gray bars). When a fast cell can divide asymmetrically, generating one slow and one fast cell at each division, the number of slow cells in the population can increase; however, this comes with a reduction of the number of fast cells in the population (Figure 5F). Thus, our simulations show how the cell cycle duration parameter can impose a tradeoff between cell proportion and diversity generation, and mixing lineages with different cell cycle durations can generate mixed cell populations each with their own diversity levels.
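The cell number side of this tradeoff can be sketched for scenario 2 (cell cycle duration fixed within each lineage). The sketch assumes synchronous, symmetric divisions and a fixed time window of 18 hr, equal to 18 cycles of the fast lineage; these are simplifying assumptions for illustration, and the function names are hypothetical.

```python
def lineage_size(total_hr, cycle_hr):
    """Cells descended from one founder after total_hr, assuming every
    cell divides symmetrically each time a full cycle elapses."""
    return 2 ** int(total_hr // cycle_hr)


def slow_fraction(total_hr, fast_cycle_hr, slow_cycle_hr):
    """Share of the population contributed by the slow lineage."""
    fast = lineage_size(total_hr, fast_cycle_hr)
    slow = lineage_size(total_hr, slow_cycle_hr)
    return slow / (fast + slow)


window_hr = 18   # assumed window: 18 cycles of the 1 hr (gray) lineage
for slow_cycle_hr in (1.0, 1.25, 1.5, 1.75, 2.0):
    print(
        slow_cycle_hr,
        lineage_size(window_hr, slow_cycle_hr),
        round(slow_fraction(window_hr, 1.0, slow_cycle_hr), 4),
    )
```

With equal cycle durations, each founder yields 2^18 cells (2^19 in total) and the split is 50:50; doubling the slow lineage's cycle to 2 hr leaves it with 2^9 cells, a very small share of the population.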

To more faithfully simulate multicellular animal development where cell cycle duration increases over time, we next allowed progeny cells to differ in their cell cycle duration from their parents in each generation (Figure 6A). Increasing the cell cycle duration over time reveals that cell cycle dynamics can alter the number and proportions of cells as a function of time (cell generations; Figure 6B and Figure 6—figure supplement 1). To compare with a real system, we explore single-cell transcriptomics data measured over four time points of mouse cortex development (Yuzwa et al., 2017). Average cell cycle duration over mouse neural cortex development is known to increase from 8 hr at embryonic day 11 (E11) to an average of 18 hr by E17 (Furutachi et al., 2015; Takahashi et al., 1995a). Within this range, progenitor cells are, in general, expected to be characterized by fast cycles with short G1 duration and neurons by slower cell cycles with long G1 duration (Calegari et al., 2005). In our analysis of the mouse cortex scRNA-seq data, we find that genes with increasing transcript expression across the time course (E11.5 < E13.5 < E15.5 < E17.5) are associated with neural developmental (maturing cell) pathways, whereas the genes with decreasing transcript expression across time (E11.5 > E13.5 > E15.5 > E17.5) are associated with transcription and proliferation (stem and progenitor cell) pathways (Figure 6—figure supplement 2). Furthermore, we observe an overall pattern of an increasing number of cells with long cell cycle duration and a decrease in fast cycling cells (Figure 6C) following the same general trend as observed in our simulations (Figure 6A), supporting the idea that cell cycle duration dynamics could play a role in controlling cell proportions and cell diversity in a developing tissue.

Figure 6 (with 2 supplements)
Varying cell cycle duration across time affects cell-type proportions.

(A) Cell cycle duration increases after each cell division, with amount of increase defined using a Gaussian distribution. (B) Simulation of gradually increasing cell cycle duration over time, such that Γ = Gaussian (mean Γparent ± 6, standard deviation σ = 0.06), affects the relative proportion of cells with different cell cycle durations (pie charts). All cell progeny are labeled based on their cell cycle duration (inherited from parent). See Figure 6—figure supplement 1 for results using other increment rates. Parameters: genome = 10, gene lengths (geneL1-L10), λ = 1 kb/hr, 18 cell divisions, iterations = 500, ploidy n = 1, RNA polymerase II re-initiation, Ω=0.25kb. (C) Single-cell transcriptomics data from GSE107122 (Yuzwa et al., 2017) for embryonic mouse cortex development, known to exhibit increasing cell cycle duration over time. This data includes identified cell types, is a time series, and we know the average cell cycle duration at each time point; at E11.5, the average cell cycle duration is 8 hr and by E17.5 it is 18 hr (Furutachi et al., 2015; Takahashi et al., 1995a). Cells were defined as relatively fast cycling cells (apical progenitors), relatively medium cycling (intermediate progenitors), and relatively slow cycling (neurons), with cell-type annotation based on cell clustering analyses conducted in (Yuzwa et al., 2017). We show how cell proportions (pie charts) change across time, with apical progenitors (relatively fast cycling cells) decreasing in frequency as the average cell cycle duration increases.

Hypothesis: a cell cycle-dependent transcriptional filter may help control cell proportion and diversity in tissue development

Our theoretical model and its agreement with general trends in scRNA-seq data support the hypothesis that a cell cycle-dependent transcriptional filter has the potential to control cell proportion and diversity in tissue development. In this section, we use the model to generate specific questions that can be checked in real data, further supporting our model.

Organismal level

Our model suggests that organisms with long genes will need to maintain long cell cycle durations during development. Cell cycle duration measurements are not widely available, which makes directly testing this hypothesis difficult. Instead, we explored related questions. We started by asking if organisms with longer genes would also take longer to develop. We analyze gene length distributions for 12 genomes spanning budding yeast to human with a diverse range of developmental durations, as shown in Figure 7 and Supplementary file 3 (Gilbert and Barresi, 2016; Jukam et al., 2017). Non-mammalian species that we analyze are relatively fast developing, ranging from approximately 2 hr (e.g., Saccharomyces cerevisiae) to a few days (e.g., X. tropicalis and D. rerio), while mammals (M. musculus, Sus scrofa, Macaca mulatta, and Homo sapiens) are relatively slow developing (20, 114, 168, and 280 days, respectively; Supplementary file 3). These species also have different gene length distributions; to illustrate this quantitatively, using a typical transcription rate of 1.5 kb/min (Ardehali and Lis, 2009), a cell cycle duration of 1 hr can exclude up to 20% of the total genes found in relatively slow developers and not exclude any genes in fast developers (Figure 7A). In agreement with our hypothesis, the gene length distribution is narrower and left shifted (shorter genes) for fast developers and broader and right shifted (longer genes) for slower developing species. Interestingly, one seeming exception to the overall gene length distribution trend in multicellular animals is the tunicate Oikopleura dioica, which has relatively short genes, but also has a rapid gestation period of 4 hr to hatched tadpole (approximately twice as fast as C. elegans and six times faster than D. melanogaster), supporting our analysis. Broadening this analysis to 101 species, we again find an association (r = 0.74) between estimated developmental time and median gene length (Figure 7B and Figure 7—figure supplement 1).
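The 1 hr demarcation can be made concrete with a short sketch: at 1.5 kb/min, a 60 min cycle corresponds to a 90 kb threshold, and the fraction of genes longer than that is the fraction whose transcription would be interrupted. The gene lengths below are made-up toy values, not data from any genome, and the function name is hypothetical.

```python
from statistics import median


def fraction_interrupted(gene_lengths_kb, cycle_min=60, rate_kb_per_min=1.5):
    """Fraction of genes too long to finish within one cell cycle,
    assuming the whole cycle is available for transcription."""
    threshold_kb = cycle_min * rate_kb_per_min   # 90 kb for the defaults
    longer = sum(1 for length in gene_lengths_kb if length > threshold_kb)
    return longer / len(gene_lengths_kb)


# Toy gene lengths (kb), for illustration only; NOT real genomic data.
toy_fast_developer = [0.5, 1.2, 2.0, 3.5, 8.0]
toy_slow_developer = [1.5, 12.0, 40.0, 95.0, 250.0]

for name, lengths in (("fast", toy_fast_developer), ("slow", toy_slow_developer)):
    print(name, median(lengths), fraction_interrupted(lengths))
```

Applied to real per-species gene length tables, the same two summaries (median length and fraction above the threshold) are the quantities compared against developmental time in this section.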

Figure 7 (with 1 supplement)
Gene length distribution and developmental time are correlated.

(A) Model organisms exhibit a large diversity in gene length distributions over their genomes. Species that have narrower gene length distributions tend to develop faster, while slow developers (mammals) exhibit broad and right-shifted gene length distributions. Demarcating a 1 hr cell cycle duration using an average transcription rate of 1.5 kb/min illustrates the proportion of genes that would be interrupted before transcript completion for each organism. Saccharomyces cerevisiae (budding yeast), Caenorhabditis elegans (worm), Drosophila melanogaster (fruit fly), Oikopleura dioica (tunicate), Danio rerio (zebrafish), Takifugu rubripes (fugu), Xenopus tropicalis (frog), Gallus gallus (chicken), Mus musculus (mouse), Sus scrofa (pig), Macaca mulatta (monkey), and Homo sapiens (human). (B) There is a clear positive correlation between developmental time and median gene length (101 species, Figure 7—figure supplement 1). Estimated developmental time was curated from the Encyclopedia of Life or articles found in PubMed (Supplementary file 3). We used gestation time for mammals and hatching time for species who lay eggs (since it is difficult to accurately define a comparative stage for all species). We analyzed the data using a Pearson correlation test, shown as r. For each species, we calculated median gene length: all protein coding genes were downloaded from Ensembl version 95 (Yates et al., 2016) using the R Biomart package (Durinck et al., 2009; Durinck et al., 2005). The length of each gene was calculated using start_position and end_position for each gene as extracted from Ensembl data.

Our model suggests that short genes will be enriched in pathways that can function independently from long genes, and that long genes may be enriched in pathways related to mature, differentiated cell types with slower cell cycles (Figure 8B). We examined the functions of short and long genes by conducting a pathway enrichment analysis on all genes in a genome ranked by their length. In the human genome, the longest genes are enriched in processes such as neural development, muscle control, cytoskeleton, cell polarity, and extracellular matrix, and the shortest genes are enriched in processes that presumably need to be quickly activated transcriptionally (e.g., immune, translation, and environment sensing; Figure 8—figure supplement 1). We performed a similar pathway analysis for human (Figure 8 and Figure 8—figure supplements 2 and 3) and 12 other species (Figure 9) and found general agreement with these patterns, finding the longest genes (gene length in the 95% quantile) enriched for genes involved in mature cell-related processes (e.g., brain and muscle development), whereas the shortest genes (gene length in the 5% quantile) are enriched for genes involved in core processes (e.g., immune, RNA processing, and olfactory receptors).

Figure 8 (with 4 supplements)
Short genes and long genes participate in different pathways.

The plot shows the H. sapiens gene length distribution. We selected genes below the 5% quantile as a list of short genes and genes above the 95% quantile as a list of long genes. Short genes (<1.6 kb, n = 1124) are involved in immune defense, environment sensing, and olfaction, and long genes (>243 kb, n = 1125) are represented in processes involving muscle and brain development, as well as morphogenesis. For each gene group, we identified all corresponding Gene Ontology (Ashburner et al., 2000) biological process terms downloaded from the Ensembl genome database version 100 (Yates et al., 2016), grouped the terms into themes (Supplementary files 5 and 6), and visualized the resulting term frequencies as word clouds using Mathematica. Refer to Figure 8—figure supplements 2 and 3 for a more detailed analysis of the themes across all gene groups.

Short genes exhibit different pathways than long genes, and this trend is consistent across a wide species range.

We selected genes below the 5% quantile as a list of short genes (top panels in blue) and genes above the 95% quantile as a list of long genes (bottom panels in gray). Saccharomyces cerevisiae (short < 0.24 kb, long > 3.5 kb), Ashbya gossypii (short < 0.36 kb, long > 3.5 kb), Komagataella pastoris (short < 0.37 kb, long > 3.3 kb), Yarrowia lipolytica (short < 0.39 kb, long > 3.5 kb), Caenorhabditis elegans (short < 0.47 kb, long > 9.6 kb), Drosophila melanogaster (short < 0.56 kb, long > 29 kb), Danio rerio (short < 1.3 kb, long > 127 kb), Takifugu rubripes (short < 0.72 kb, long > 27 kb), Xenopus tropicalis (short < 0.93 kb, long > 83 kb), Gallus gallus (short < 0.67 kb, long > 104 kb), Mus musculus (short < 1.2 kb, long > 183 kb), and Sus scrofa (short < 0.57 kb, long > 197 kb). For each gene group, we identified all corresponding Gene Ontology biological process terms from the Ensembl genome database (version 100) and visualized the resulting term frequencies as word clouds using Mathematica.

Spatial level

Within an organism, cell cycle duration and transcript expression vary with many factors, including spatial position. We hypothesize that spatial transcript expression patterns can be initially organized by gene length. To explore this, we study the developing fruit fly embryo (D. melanogaster), where the average cell cycle rates differ spatially (Foe, 1989). At the onset of cell cycle 14, cells in different embryo regions start to divide at different rates, caused by an increase in their gap phase length, which varies from 30 min to 170 min (Foe, 1989; Foe and Alberts, 1983). Cell cycle duration lengthening is spatially organized, with anterior regions dividing faster than posterior regions and the mid-ventral region being the slowest (Figure 10). The embryo also exhibits spatial segregation patterns due to Hox gene family transcript expression (Mallo and Alonso, 2013). Overlaying the spatial patterns of Hox gene family transcript expression and cell cycle duration obtained from independent studies, we observe that fast cycling regions express the shortest Hox genes (Dfd 10.6 kb, lab 17.2 kb) and slow cycling regions express the longest Hox genes (Ubx 77.8 kb and Antp 103.0 kb) (Foe, 1989; Lemons and McGinnis, 2006), in agreement with our model.
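
As a rough worked check of this overlay, the time needed to transcribe each Hox gene can be compared with the reported range of gap phase lengths. The sketch below is illustrative only: it assumes the average transcription rate of 1.5 kb/min used for Figure 7A, a single uninterrupted polymerase pass, and that the gap phase length is the whole transcription window. The function and variable names are hypothetical.

```python
# Gene lengths (kb) as quoted in the text; the rate is the value used for Figure 7A.
HOX_LENGTH_KB = {"Dfd": 10.6, "lab": 17.2, "Ubx": 77.8, "Antp": 103.0}
RATE_KB_PER_MIN = 1.5  # assumed constant elongation rate


def transcription_minutes(length_kb, rate=RATE_KB_PER_MIN):
    """Minutes for one polymerase to traverse a gene of the given length."""
    return length_kb / rate


def completes_within(length_kb, window_min, rate=RATE_KB_PER_MIN):
    """True if one full transcript fits in the available window."""
    return transcription_minutes(length_kb, rate) <= window_min


for gene, kb in HOX_LENGTH_KB.items():
    minutes = transcription_minutes(kb)
    print(
        f"{gene}: {minutes:.1f} min; "
        f"fits in 30 min: {completes_within(kb, 30)}; "
        f"fits in 170 min: {completes_within(kb, 170)}"
    )
```

Under these assumptions, the two short genes fit within the shortest (30 min) gap phase, while the two long genes need a window closer to the longer gap phases.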

Hox gene length is correlated with spatial expression and cell cycle duration in the D. melanogaster embryo.

Drosophila Hox family genes are each represented by a colored rectangle, containing the length of the gene in base pairs. Spatial expression of a gene transcript is marked by its corresponding color on the Drosophila embryo map. Hox gene length is correlated with the cell cycle duration of the embryo location where the gene transcript is expressed, with short Hox gene transcripts expressed in regions with short mitotic cycles and long Hox gene transcripts expressed in regions of long mitotic cycles. Spatial map of cell cycle duration from Foe, 1989; Foe and Alberts, 1983 and gene transcript expression from Mallo and Alonso, 2013.

Discussion

How cellular processes support the carefully orchestrated timing of tissue development that results in a viable multicellular organism is still unclear. While a combination of many potential cell autonomous and non-autonomous mechanisms, such as cytoplasmic molecules and gradients, cell-cell communication, microenvironment signals, and effective cell size (Edgar et al., 1986; Mukherjee et al., 2020; Tabansky et al., 2013; Yoon et al., 2017), are likely important, one hypothesis is that gene length can be used as a mechanism to control transcription time in this process (Artieri and Fraser, 2014; Gubb, 1986; Keane and Seoighe, 2016; Swinburne et al., 2008). Bryant and Gardiner further hypothesize that cell cycle duration may play a role in filtering genes that influence pattern formation and regeneration (Bryant and Gardiner, 2018; Ohsugi et al., 1997) as cell cycle lengthens over development (Figure 1 and Supplementary file 1; Foe, 1989; Foe and Alberts, 1983; Newport and Kirschner, 1982b; Takahashi et al., 1995b). Early experiments using embryos suggested that cell cycle duration has a role in transcription initiation; however, these experiments lacked the temporal resolution necessary to dissociate the effects of cell cycle duration and transcriptional control from other mechanisms (Edgar et al., 1986; Edgar et al., 1994; Kimelman et al., 1987; Newport and Kirschner, 1982b; Newport and Kirschner, 1982a). It is also well known that cell cycle length changes can control cell fate and development (Coronado et al., 2013; Mummery et al., 1987; Pauklin and Vallier, 2013; Singh et al., 2013); however, this has remained observational and not linked to a mechanism. To help address these limitations, we developed an in silico cell growth model to directly study the relationship between cell cycle duration and gene transcription in a developmental context. 
The new discovery we make is that a transcriptional filter can be controlled by cell cycle duration and used to simultaneously control the generation of cell diversity, the overall cell growth rate, and cellular proportions during development (defining an emergent property of our computational model – see Appendix 1). Genomic information (gene number and gene length distribution) and cell cycle duration are critical parameters in this model. Across evolutionary time scales, cell diversity can be achieved by altering gene length (Keane and Seoighe, 2016); however, in terms of developmental time scales, we propose that cell cycle duration is an important factor that may control cell diversity and proportions within a tissue.

We predict that increasing the gene length distribution across a genome over evolution can provide more cell cycle-dependent transcriptional control in a developing system, leading to increased cellular diversity. Examining a range of genomes and associated data provides support for this novel idea. We observe that fast-developing organisms have shorter median gene lengths relative to the broad distributions, including many long genes, exhibited by slow developers (mammals). This aspect of genome structure may help explain the observed rates of cell diversity and organism complexity, as measured by number of different cell types, over a wide range of species (Figure 7—figure supplement 1; Valentine et al., 1994; Vogel and Chothia, 2006).

While we hypothesize that a cell cycle-dependent transcriptional filter is a fundamental regulatory mechanism operating during development (because gene length is fixed in the genome and transcription rate is expected to lie in a narrow range), multiple other regulatory mechanisms could modulate its effects. Furthermore, exploring these mechanisms may even lead to similar conclusions, as it can be evolutionarily advantageous to have multiple paths to the same outcome; these include, but are not limited to, silencing or deactivating genes, gene regulatory networks, blocking gene clusters (e.g., Hoxd; Rodríguez-Carballo et al., 2019), changing the transcription or re-initiation rate of RNA polymerase II (Figure 3—figure supplement 1), or inheriting long transcripts maternally at the zygote stage (Figure 3—figure supplement 2). Our current model only explores the effects of transcription and re-initiation rates of RNA polymerase II, mRNA transcript degradation rates, and maternally introduced transcripts. For the latter mechanism, we expect longer transcripts to be major contributors during the early maternal phase (Jukam et al., 2017), which agrees with zebrafish (D. rerio) experiments showing that maternal transcripts are longer and have evolutionarily conserved functions (Heyn et al., 2014). Indeed, if we add maternal transcript inheritance to our model, we see the same pattern of a small number of long transcripts present early, as expected (Figure 3—figure supplement 2). Future work would entail curating experimental data about more regulatory mechanisms in cell systems and testing their association with cell cycle duration.

Our analysis raises interesting directions for future work. We focus on development, but transcriptional filtering may be important in any process involving cell cycle dynamics, such as regeneration (Bryant and Gardiner, 2018), wound repair, immune activation, and cancer. We must also more carefully consider cell cycle phase as transcription mainly occurs in the gap phases (Bertoli et al., 2013; Newport and Kirschner, 1982b). Experiments indicate that a cell will have different fates depending on its phase (Dalton, 2013; Pauklin and Vallier, 2013; Vallier, 2015). This agrees with our model as a cell at the start of its cell cycle will have a different transcriptome in comparison to the end of the cell cycle. Induced pluripotent cell state is also associated with cell cycle phases (Dalton, 2015), and efficient reprogramming is only seen in cell subsets with fast cell cycles (Guo et al., 2014). Our model could explain these observations as slower cycling cells could express long genes that push a cell to differentiate rather than reprogram. However, our model is limited to total transcription duration for interphase (G1, S, and G2), thus a future direction would be to explore different durations for each cell cycle phase. Collecting more experimental data about cell phase in developing systems will help explore these effects. Further, it will be important to explore how cell cycle duration is controlled. Molecular mechanisms of cell cycle and cell size (Liu et al., 2018) control could be added to our model to provide a more biochemically realistic perspective on this topic. Ultimately, a better appreciation of the effects of cell cycle dynamics will help improve our understanding of a cell’s decision-making process during differentiation and may prove useful for the advancement of tools to control development, regeneration, and cancer. 
Finally, it is important to note that we have not provided experimental model support, only analyses that do not disagree with model predictions. We have also not proven the generality of the results across species. However, we hope that the hypotheses we explore here motivate new experimental studies to directly test the validity and generality of our model.

Materials and methods

Key resources table

Reagent type (species) or resource | Designation | Source or reference | Identifiers | Additional information
Software, algorithm | Wolfram, 2017 | Mathematica (Wolfram Research Inc, Mathematica Versions 11.0–12, Champaign, IL, 2017) | http://www.wolfram.com/mathematica/ |
Software, algorithm | This paper | Cell developmental model | https://github.com/BaderLab/Cell_Cycle_Theory |
Software, algorithm | PMID:25867923; Satija et al., 2015 | Seurat (3.1.2) | https://satijalab.org/seurat/ |
Software, algorithm | PMID:26687719; Yates et al., 2016 | Ensembl (95) and (100) | https://useast.ensembl.org/index.html |
Software, algorithm | PMID:10802651; Ashburner et al., 2000 | Gene Ontology | http://geneontology.org/ |
Software, algorithm | PMID:16082012; Durinck et al., 2005 | BioMart (3.10) | http://useast.ensembl.org/biomart/martview/ |
Software, algorithm | PMID:21085593; Merico et al., 2010 | Enrichment Map software (3.3.0) | https://www.baderlab.org/Software/EnrichmentMap |
Software, algorithm | PMID:14597658; Kucera et al., 2016 | AutoAnnotate App | https://baderlab.org/Software/AutoAnnotate |
Software, algorithm | PMID:14597658; Shannon et al., 2003 | Cytoscape (3.8.0) | https://cytoscape.org/ |
Software, algorithm | PMID:30664679; Reimand et al., 2019 | Baderlab pathway resource (updated June 1, 2020) | http://download.baderlab.org/EM_Genesets/ |

Mathematical model

Our mathematical model is agent and rule-based. A single cell behaves and interacts according to a fixed set of rules. Our major rule involves a gene length mechanism, where each cell is defined by a genome and a cell cycle duration. The cell cycle duration determines which gene transcripts are expressed within the cell, based on the transcription rate. All decisions are based on a cell’s autonomous information, and we omit external factors. We deliberately choose to consider this simple baseline setup to clarify the contribution of cell cycle duration to overall cell population growth.

Each cell is defined by a genome G (containing a set of genes), cell cycle duration in hours, and the transcripts inherited or recently transcribed. In the genome, each gene is defined by a length, geneLength. For example, in a genome with three genes, (gene1, gene2, gene3) represents genes of length 1, 2, and 3 kb, respectively.

Each cell can divide and make two progeny cells. This process can continue many times to simulate the growth of a cell population, and we keep track of the entire simulated cell lineage. For each cell division (one time step in the simulation), each cell i will transcribe its genes based on the time available, defined by the cell cycle duration. We assume that the time it takes to transcribe a gene depends on its length and a fixed transcription rate; although this is a simplification, there are examples where it holds. For instance, the human dystrophin gene is 2,241,765 bp long and takes about 16 hr to transcribe (Tennyson et al., 1995). Once a cell cycle is finished, the cell divides. When cells are synchronized, the first cell division occurs at T = Γi. When the cells are asynchronized, the algorithm uses the shortest cell cycle duration in the population as the time step, and each cell division will have a different duration. In this case, we keep track of the exact duration such that cells with short cell cycles, for example, Γ = 1 hr, will register 10 divisions in 10 hr, while cells with long cell cycles, for example, Γ = 10 hr, will register one division over the same time. We limited the model to two modes of division, symmetric (where the cell gives rise to identical cells, e.g., Figure 5A) and asymmetric (where the cell gives rise to a fast and a slow cell, e.g., Figure 5C). We do not consider mechanisms that reduce cell numbers (cell death). For certain experiments (e.g., Figure 6), the cell cycle duration for each progeny is allowed to diverge from the parental duration using a monotonic function (increasing or decreasing) and a stochastic variable based on a Gaussian distribution with a mean equal to Γi (parental cell cycle duration). This models a more realistic noisy distribution of cell cycle durations in the simulated cell population.
The cell cycle and division rules are repeated for all cells in the population until a set number of cell divisions have been reached.
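
The timing rules above can be sketched in a few lines. This is a minimal illustration and not the published Mathematica implementation: the function names, the drift_hr and sd_hr parameters, and the small positive floor on cycle duration are all assumptions introduced here.

```python
import random


def divisions_registered(cycle_hr, elapsed_hr):
    """Completed divisions for a cell with a fixed cell cycle duration."""
    return int(elapsed_hr // cycle_hr)


def time_step(cycle_durations_hr):
    """Asynchronous populations advance by the shortest cycle duration present."""
    return min(cycle_durations_hr)


def progeny_cycle(parent_cycle_hr, drift_hr=0.0, sd_hr=0.0, rng=random):
    """Progeny duration: Gaussian noise centered on the parental duration,
    plus a monotonic drift (positive lengthens, negative shortens).
    The floor keeps durations positive and is an assumption of this sketch."""
    return max(rng.gauss(parent_cycle_hr, sd_hr) + drift_hr, 1e-9)


def implied_rate_kb_per_min(length_bp, hours):
    """Average transcription rate implied by a gene length and a transcription time."""
    return (length_bp / 1000) / (hours * 60)


print(divisions_registered(1, 10), divisions_registered(10, 10))
print(time_step([1, 10, 4]))
print(round(implied_rate_kb_per_min(2_241_765, 16), 2))  # dystrophin example
```

The dystrophin figures quoted above imply an average rate of a little over 2 kb/min, the same order of magnitude as the 1.5 kb/min used elsewhere in the analysis.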

During a cycle, each cell contains a certain number of transcripts. The number of transcripts for each gene is calculated by a function of cell cycle duration, Γ, transcription rate, λ, re-initiation distance, Ω, and gene length, L: $\sum_{a=0}^{geneL_i/\Omega} \frac{\Gamma\lambda - a\Omega}{geneL_i}$. If the cell does not divide, then the number of transcripts reflects the current cell cycle phase, which is computed and stored. If the cell can divide within the time T = (Γλ), then it will randomly, according to a uniform distribution, assign its transcripts between its two progeny cells. Typically, simulations were conducted with λ = 1, simplifying the analysis to (Γ − aΩ)/geneL_i; however, we also explored the effects of transcript re-initiation and transcription rate on the system as shown in Figure 3—figure supplement 1.
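
A hedged sketch of this transcript-counting rule follows. It assumes that polymerase a trails the first by a distance aΩ and contributes (Γλ − aΩ)/geneL_i transcripts, which reduces to the (Γ − aΩ)/geneL_i form quoted above when λ = 1. Rounding down to whole transcripts, capping the number of polymerases at geneL_i/Ω, dropping negative terms, and splitting transcripts with one fair coin flip each are assumptions of this sketch, not details confirmed by the text.

```python
import math
import random


def transcripts_per_cycle(gene_len_kb, cycle_hr, rate_kb_per_hr=1.0, reinit_kb=None):
    """Whole transcripts of one gene completed in one cell cycle.

    reinit_kb=None models a single polymerase (no re-initiation); otherwise
    polymerase a trails the first by a * reinit_kb (assumed interpretation).
    """
    n_polymerases = 1 if reinit_kb is None else int(gene_len_kb // reinit_kb) + 1
    total = 0
    for a in range(n_polymerases):
        offset_kb = 0.0 if reinit_kb is None else a * reinit_kb
        progress_kb = cycle_hr * rate_kb_per_hr - offset_kb
        total += max(0, math.floor(progress_kb / gene_len_kb))
    return total


def partition(transcriptome, rng=random):
    """Split each gene's transcripts between two progeny, one coin flip per transcript."""
    first = [sum(rng.random() < 0.5 for _ in range(n)) for n in transcriptome]
    second = [n - k for n, k in zip(transcriptome, first)]
    return first, second


genome_kb = (1, 2, 3)  # the three-gene example genome
parent = [transcripts_per_cycle(length, cycle_hr=3) for length in genome_kb]
print(parent)             # transcript counts for gene1, gene2, gene3
print(partition(parent))  # one random split between the two progeny
```

With a 1 hr cycle, the same genome yields no transcripts of the 2 kb and 3 kb genes, which is the filtering behavior the model relies on.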

Our model tracks single cells, with each cell identified by a transcriptome and cell cycle duration. The transcriptome data resemble a scRNA-seq matrix to aid comparison between simulation and experimental data. We allow cells without any transcripts, for example, (0,0,0), to exist: because our simplified model considers a small number of genes and parental transcripts are distributed between progeny, there is a probability of 2/(the total number of transcripts) that all the transcripts will end up in only one of the new cells, leaving the other one empty (Zhou et al., 2011). Theoretically, we have no reason to omit these cells, and they may represent the most naïve theoretical state of a cell without any prior information. Early embryos, such as Xenopus stages that lack zygotic transcription, may be real systems similar to such a state (Newport and Kirschner, 1982b).

Parameters tracked for each cell i = (number of divisions, current cell cycle phase, current time in cell cycle, length until next division, relative time passed, total cell cycle duration, transcriptome list, cell name, and lineage history). All cells are set with the same genome, ploidy level, RNA polymerase II transcription rate, and RNA polymerase II re-initiation distance.

Our model was developed and simulated using Mathematica (Wolfram, 2017).

Quantification and statistical analysis

Gene length analysis

All protein coding genes were downloaded from Ensembl genome database version 95 or 100 (Yates et al., 2016) using the R (3.6.1) Biomart package version 3.10 (Durinck et al., 2005). The length of each gene was calculated using start_position and end_position for each gene, as extracted from the Ensembl database (Yates et al., 2016).
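
The length calculation and the 1 hr demarcation used in Figure 7A can be illustrated as follows. This is a sketch under stated assumptions: coordinates are treated as 1-based and inclusive (so length = end − start + 1, which the text does not state explicitly), the toy coordinates are invented for illustration, and the function names are hypothetical.

```python
from statistics import median


def gene_length_bp(start_position, end_position):
    """Gene length from start and end coordinates (inclusive, assumed)."""
    return end_position - start_position + 1


def fraction_interrupted(lengths_bp, cycle_min=60, rate_kb_per_min=1.5):
    """Fraction of genes too long to finish one transcript within the cycle."""
    limit_bp = cycle_min * rate_kb_per_min * 1000
    return sum(length > limit_bp for length in lengths_bp) / len(lengths_bp)


# Invented (start_position, end_position) pairs, not real Ensembl records.
toy_genes = [(1000, 1999), (5000, 24999), (100000, 349999), (10, 509)]
lengths = [gene_length_bp(start, end) for start, end in toy_genes]

print(lengths)
print(median(lengths))
print(fraction_interrupted(lengths))
```

At 1.5 kb/min, a 1 hr cycle corresponds to a 90 kb limit, so only the 250 kb toy gene would be interrupted.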

Single-cell analysis pipeline

Simulated data sets were preprocessed and clustered in R using the standard workflow implemented in the Seurat package version 3.1.2 (Satija et al., 2015). We used default parameters unless otherwise stated. Data were log-normalized and scaled before principal component analysis (PCA) was used to reduce the dimensionality of each data set. Due to the small number of simulated genes in our experiments, the maximum number of PCs (one fewer than the number of genes; e.g., dims = 1:3) was calculated and used in clustering. FindVariableFeatures was used with loess.span set to 0.3 unless the number of genes was less than 5, in which case 0.4, 0.7, and 1 were used for simulations with 4, 3, and 2 genes, respectively. Cells were clustered using the shared nearest neighbor (SNN)-based ‘Louvain’ algorithm implemented in Seurat with reduction set to ‘pca’. The clustering resolution was set to 1 for all experiments, and all calculated PCs were used in the downstream clustering. Data were visualized with t-SNE after clustering.

Developmental time curation

Estimated developmental time was curated from the Encyclopedia of Life or PubMed-accessible articles (Supplementary file 3). We used gestation time for mammals and hatching time for species that lay eggs (since it is difficult to accurately define a comparative stage for all species). Species were grouped based on their taxonomic class, and their developmental time was estimated by calculating the average number of days from zygote to birth or hatching.
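
The grouping step, and the Pearson correlation reported for Figure 7B, can be sketched as below. The record layout and function names are hypothetical, and the example values are invented for illustration; they are not the curated data in Supplementary file 3.

```python
from collections import defaultdict
from math import sqrt


def mean_days_by_class(records):
    """records: iterable of (taxonomic_class, days_from_zygote_to_birth_or_hatching)."""
    groups = defaultdict(list)
    for taxonomic_class, days in records:
        groups[taxonomic_class].append(days)
    return {cls: sum(values) / len(values) for cls, values in groups.items()}


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Invented example values.
toy_records = [("class A", 10), ("class A", 20), ("class B", 5)]
print(mean_days_by_class(toy_records))
print(pearson_r([1, 2, 3], [2, 4, 6]))
```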

Pathway enrichment analysis

We used Gene Set Enrichment Analysis (GSEA version 4.0.2), in pre-ranked analysis mode, to identify pathways enriched among all genes in a genome ranked by gene length (Subramanian et al., 2005). Gene ranks ran from (number of genes)/2 down to its negative equivalent and were normalized to give a ranked list from 1 to −1, with 1 specifying the shortest gene and −1 the longest. The ranked gene length list was analyzed for pathway enrichment using GSEA with parameters set to 1000 gene set permutations and gene set size between 15 and 200. Pathways used for the analysis were from Gene Ontology biological process (Ashburner et al., 2000), MSigDB c2 (Subramanian et al., 2005), WikiPathways (Slenter et al., 2017), Panther (Mi, 2004), Reactome (Croft et al., 2011), NetPath (Kandasamy et al., 2010), and Pathway Interaction database (Schaefer et al., 2009), downloaded from the Bader lab pathway resource (http://baderlab.org/GeneSets). An enrichment map was created with the EnrichmentMap Cytoscape app version 3.3.0 (Merico et al., 2010) in Cytoscape (version 3.8.0), using only enriched pathways with a p-value threshold of 0.05 and an FDR threshold of 0.01 (Reimand et al., 2019). Cross-talk (shared genes) between pathways was filtered by Jaccard similarity greater than 0.25. Pathways were automatically summarized using the AutoAnnotate App to assign pathways to themes (Kucera et al., 2016). Themes were further summarized by grouping pathways into more general themes with a mixture of automatic classification using key words and manual identification.
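
One plausible reading of the rank normalization, together with the Jaccard similarity used to filter pathway cross-talk, is sketched below. The linear spacing of ranks, the handling of ties (arbitrary order), and the function names are assumptions of this sketch; in the analysis itself, these steps were carried out with GSEA and the EnrichmentMap app.

```python
def length_rank_scores(gene_lengths):
    """gene_lengths: dict of gene -> length.

    Returns (gene, score) pairs, shortest gene first, with scores running
    linearly from +1 (shortest) to -1 (longest).
    """
    ordered = sorted(gene_lengths, key=gene_lengths.get)
    n = len(ordered)
    if n == 1:
        return [(ordered[0], 0.0)]
    half = n / 2
    scores = []
    for i, gene in enumerate(ordered):
        raw_rank = half - i * n / (n - 1)  # runs from +n/2 down to -n/2
        scores.append((gene, raw_rank / half))
    return scores


def jaccard(genes_a, genes_b):
    """Jaccard similarity between two gene sets."""
    a, b = set(genes_a), set(genes_b)
    return len(a & b) / len(a | b) if a | b else 0.0


toy_lengths = {"geneA": 1, "geneB": 2, "geneC": 3, "geneD": 4, "geneE": 5}
for gene, score in length_rank_scores(toy_lengths):
    print(gene, score)
print(jaccard({"g1", "g2", "g3"}, {"g2", "g3", "g4"}))
```

The printed gene and score pairs correspond to the two-column ranked list that a pre-ranked enrichment analysis takes as input.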

Pathway word cloud analysis

All Gene Ontology pathways (GO biological processes) were downloaded from the Ensembl genome database, version 100 (Yates et al., 2016), using the R Biomart package version 3.5 (Durinck et al., 2005). We restricted analysis to pathways with at least three genes. We grouped genes based on their gene length (see Gene length analysis for details) and identified the pathways associated with each gene. The description of each pathway was collected and the frequency of each word within the pathway name was calculated. We defined themes (Supplementary files 5 and 6) for all H. sapiens available pathways (using only GO biological processes). Common, generic, and uniformly distributed themes (such as cellular response, metabolic biosynthesis, protein processes, signaling, and transcription) were manually removed from the list. The frequencies were visualized as word clouds using Mathematica (Wolfram, 2017).
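
The grouping and word counting described here can be sketched as follows. This is an illustration only: the use of inclusive quantiles, whitespace tokenization, and the function names are assumptions, and the published word clouds were produced in Mathematica after manual theme curation, which this sketch does not reproduce.

```python
from collections import Counter
from statistics import quantiles


def split_by_length(gene_lengths):
    """Return (short, long) gene lists: below the 5% and above the 95% quantile."""
    cuts = quantiles(gene_lengths.values(), n=20, method="inclusive")
    low, high = cuts[0], cuts[-1]
    short = [g for g, length in gene_lengths.items() if length < low]
    long_ = [g for g, length in gene_lengths.items() if length > high]
    return short, long_


def word_frequencies(genes, gene_to_terms, min_genes=3):
    """Count words in the names of pathways annotated to the given genes,
    keeping only pathways that contain at least min_genes genes overall."""
    term_size = Counter(t for terms in gene_to_terms.values() for t in set(terms))
    words = Counter()
    for gene in genes:
        for term in gene_to_terms.get(gene, []):
            if term_size[term] >= min_genes:
                words.update(term.lower().split())
    return words


# Invented annotations for illustration.
toy_terms = {
    "g1": ["immune response"],
    "g2": ["immune response"],
    "g3": ["immune response", "muscle development"],
    "g4": ["muscle development"],
}
print(word_frequencies(["g1", "g3"], toy_terms))
```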

Data and code availability

Our simulation code is available at https://github.com/BaderLab/Cell_Cycle_Theory (Chakra, 2021; copy archived at swh:1:rev:7eb38b679e917ba8522b17edae5498990a221ffc).

Appendix 1

Why is cell cycle duration changing?

While defining a general mathematical representation of cell cycle kinetics for a developing system, we assembled available cell cycle length measurements from published studies for various species and tissues. Figure 1 shows measurements obtained from M. musculus. For other data, see Supplementary file 1. The data motivated us to ask ‘why is cell cycle duration changing over development?’ and propose that changes in cell cycle duration can be used to guide the progression of cell development.

We devised a simple theoretical model to test this idea by assuming:

  • Cell cycle duration can change across developmental time

  • Gene length distribution is constant among all cells in the same organism, such that we can denote the length by L

  • The difference in cell cycle can affect the time a cell spends transcribing genes

  • All active genes are transcribed and transcription rate is constant in a cell
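
Taken together, these assumptions imply a simple length filter: a gene can be fully transcribed only if its length does not exceed the distance a polymerase can travel in one cycle. The sketch below illustrates that reading with hypothetical function names and the three-gene example genome (1, 2, and 3 kb) at a rate of 1 kb/hr; it ignores re-initiation and inherited transcripts.

```python
def passes_filter(gene_len_kb, cycle_hr, rate_kb_per_hr=1.0):
    """True if one full transcript can be completed within the cell cycle."""
    return gene_len_kb <= cycle_hr * rate_kb_per_hr


def expressed_lengths(genome_kb, cycle_hr, rate_kb_per_hr=1.0):
    """Gene lengths that pass the cell cycle-dependent filter."""
    return [length for length in genome_kb if passes_filter(length, cycle_hr, rate_kb_per_hr)]


# Lengthening the cycle over developmental time admits progressively longer genes.
for cycle_hr in (1, 2, 3):
    print(cycle_hr, expressed_lengths((1, 2, 3), cycle_hr))
```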

The novel aspect of our work is the proposal that a cell cycle-dependent transcriptional filter can control cellular diversity within a tissue over development. However, some of the concepts that we build on are known and are recognized in the community to varying degrees. We bring these together for the first time to support the model and generate predictions. In particular, we list these concepts below and clarify our novel contribution.

Prior contributions:

Our novel contributions

  • Our main novel claim: we are the first to link cell cycle duration to control of cell diversity and proportions of cells in tissues.

  • We are the first to support the idea that a cell cycle-dependent transcriptional filter is a mechanism for gene transcript expression regulation that affects development using quantitative modeling.

  • We are the first to link gene length distribution in genomes of multiple species to length of organism development.

  • We are the first to show major functional differences between short and long genes in animal genomes.

  • Our single-cell transcriptomic mathematical model is novel and shared as a community resource.

Data availability

All data generated during this study are included in the manuscript and supporting files. Source file for the code is available at https://github.com/BaderLab/Cell_Cycle_Theory (copy archived at https://archive.softwareheritage.org/swh:1:rev:7eb38b679e917ba8522b17edae5498990a221ffc).

The following previously published data sets were used
    1. Yuzwa SA
    2. Borrett MJ
    3. Innes BT
    4. Voronova A
    5. Ketela T
    6. Kaplan DR
    7. Bader GD
    8. Miller FD
    (2017) NCBI Gene Expression Omnibus
    ID GSE107122. Developmental emergence of adult neural stem cells as revealed by single cell transcriptional profiling.
    1. Briggs JA
    2. Weinreb C
    3. Wagner DE
    4. Megason S
    5. Peshkin L
    6. Kirschner MW
    7. Klein AM
    (2018) NCBI Gene Expression Omnibus
    ID GSE113074. The dynamics of gene expression in vertebrate embryogenesis at single cell resolution.
    1. Wagner DE
    2. Weinreb C
    3. Collins ZM
    4. Megason SG
    5. Klein AM
    (2018) NCBI Gene Expression Omnibus
    ID GSE112294. Systematic mapping of cell state trajectories, cell lineage, and perturbations in the zebrafish embryo using single cell transcriptomics.

References

    1. Bryant SV
    2. Gardiner DM
    (2018) Regeneration: sooner rather than later
    The International Journal of Developmental Biology 62:363–368.
    https://doi.org/10.1387/ijdb.170269dg
    1. Edgar LG
    2. Wolf N
    3. Wood WB
    (1994)
    Early transcription in Caenorhabditis elegans embryos
    Development 120:443–451.
    1. Foe VE
    (1989)
    Mitotic domains reveal early commitment of cells in Drosophila embryos
    Development 107:1–22.

Decision letter

  1. Wenying Shou
    Reviewing Editor; University College London, United Kingdom
  2. Aleksandra M Walczak
    Senior Editor; École Normale Supérieure, France
  3. Wenying Shou
    Reviewer; University College London, United Kingdom
  4. David M Suter
    Reviewer; Ecole Polytechnique Fédérale de Lausanne, Switzerland

In the interests of transparency, eLife publishes the most substantive revision requests and the accompanying author responses.

Acceptance summary:

Cell cycle duration acting as a filter that constrains transcription is an idea proposed 30 years ago. Here, the authors propose that a very simple model can produce results that qualitatively echo single-cell RNA-seq data published by other labs. Overall, this study suggests that the slowing down of the cell cycle during development can act to allow longer genes to be transcribed and more cell types to be generated. Experimental tests of this hypothesis are needed in future work.

Decision letter after peer review:

[Editors’ note: the authors submitted for reconsideration following the decision after peer review. What follows is the decision letter after the first round of review.]

Thank you for submitting your work entitled "Control of tissue development by cell cycle dependent transcriptional filtering" for consideration by eLife. Your article has been reviewed by 4 peer reviewers, including Wenying Shou as the Reviewing Editor and Reviewer #1, and the evaluation has been overseen by a Senior Editor. The following individual involved in review of your submission has agreed to reveal their identity: David M Suter (Reviewer #2).

Our decision has been reached after consultation between the reviewers. Based on these discussions and the individual reviews below, we regret to inform you that your work will not be considered further for publication in eLife.

Although there was some support for your work, the amount of time required for revision will likely exceed two months. It is the policy of eLife that a manuscript requiring substantial revision should be rejected. We do invite you to resubmit if you feel that you have addressed the reviewers' comments.

Reviewer #1:

Abou Chakra et al. posed the hypothesis that cell cycle transcriptional filtering – the transcribing of long genes only when cell cycle slows down – might help control tissue development and the generation of diverse cell fates. This study is based on simple math modeling and comparing model predictions to single-cell RNA seq data.

From an outsider's point of view, the article is interesting, although very speculative. Thus, I suggest softening your statements throughout. For example, the current title can be changed to "Cell cycle dependent transcription filtering can potentially contribute to tissue development".

Scientifically, I would like to see an analysis of whether transcripts specific to development (e.g. neuronal development) tend to be longer.

Reviewer #2:

This study addresses the important and exciting topic of how the cell cycle duration gates gene expression diversity at the single cell level by limiting the expression of long genes. This work intersects between genetics, gene expression, developmental and evolutionary biology, and should therefore be of broad interest to the readership of eLife. The manuscript is well written and structured, and the findings are interesting. I have some technical and conceptual concerns about the causality links made by the authors.

1. In general the authors should provide more details about the implementation of their method in the manuscript. While Figure 1 is very clear and allows non-specialists to grasp the concept, a more detailed explanation of the model would be useful.

In particular, how is transcription re-initiation modelled? This is important for the following reason. If a gene takes 10 hours to be transcribed and the cell cycle is 10 hours, only 1 transcript can be made. However, if the cell cycle is longer, the maximum number of transcripts will depend on the distance between successive polymerases (thus depending on initiation rate), and could thus rapidly increase. The number of transcripts generated in 11 hours will thus scale with initiation rate, or inversely with the distance between polymerases on the gene. Also, how does transcriptional bursting affect their modelling?

2. The authors show that short genes tend to be expressed at higher levels than long genes. They use this data to support that elongation rates * cell cycle duration limits the expression of long genes. In this scenario, one would expect that long genes are more depleted in pol II at the end of the gene, while short genes would not. The authors could look into Pol II footprinting/ChIP-seq datasets to confirm this, and also to exclude that short genes are expressed at higher levels because of their higher initiation rates.

3. The authors mention that cell cycle duration should be broadly correlated with development time. But cell sizes and organism size will also impact this correlation. Why do the authors not consider these parameters?

4. The authors should provide an analysis of which classes of genes are enriched into different genes length bins. According to their results, more specialised genes, i.e. expressed in terminally differentiated, non-dividing cells should be longer on average.

5. Figure 6A: How can the average cell cycle duration become higher than the max cell cycle duration at the single cell level (shown as color code)?

6. What is the relative contribution of introns vs exons in long vs short genes ? Could the longer cell cycle of some species allow to accumulate more/longer introns to increase splice isoform diversity/regulatory potential ?

7. Linked to 5., one challenge here is to understand causality links. The cell cycle could be used by organisms to gate cell diversity during development, but longer cell cycles could also allow to accumulate longer genes on evolutionary time scales. The authors should comment on this.

Reviewer #3:

We've known for a very long time that cells in many animal embryos have very fast cell cycles, and that the duration of cell cycle increases as cells differentiate into progeny with specialized function. However, whether cell cycle duration directly impacts cell fate decisions via filtering transcriptional activity is not clear. Certainly it has been proposed on multiple occasions, and the ability of mitosis and DNA replication to interrupt and abort transcription has been demonstrated. This study presents an interesting attempt to address this topic via mathematically modeling the relationship of cell cycle length and the diversity of transcriptome. The authors utilized a simplified model to simulate how cell cycle length, gene length, number of genes and transcription rate affect resulting transcriptomes in cell populations. Not unexpectedly, the simulations show that increasing the length of the cell cycle can increase the proportion of mRNAs from long genes, and also the complexity of the transcriptome and (more interestingly) the diversity of transcriptomes between different cells in a population. The model is clever and the demonstration is useful, if highly over-simplified. Importantly the authors also analyze some real transcriptome data to see if it supports their conclusions. At some levels there is support, but overall the real data from single cell sequence don't align well with their hypothesis and fall short of a compelling experimental validation. Overall we felt the study was interesting in concept and could be a valuable addition to the literature devoted to cell cycle and development, but that it could be much better with comprehensive revisions as discussed below.

1. The simulations in Figure 2A show that increasing cell cycle duration will allow more relative transcription of longer genes. Unfortunately the real data from Xenopus and Danio (Figure 2B, S2) don't appear to show this same trend. The authors should check this more carefully by comparing proportions of transcripts of different lengths at the different developmental times. If there is no change in proportions then the real data may not validate the simulation's predictions. At issue might be the "contamination" of the real data with maternal transcripts. Perhaps the authors should try to remove maternal transcripts and analyze only zygotic transcripts. Furthermore, the timepoints shown for Xenopus in Figure 2B are probably too late and too closely spaced to see a trend. They should compare transcripts for very early and very late in development.

2. I was not convinced by the simulations and arguments about the generation of cell diversity. It seems that this might only occur with very low numbers of transcripts per cell; i.e. when stochastic variation came into effect. This may not be the case for real cells.

3. Figure 6AB really baffled me. Transcriptome diversity seems not to be graphed at all (check the axis labels and key), even though diversity is the point of the figure. Either it is mis-labeled or it needs more explanation. Figure 6C also requires more explanation, both within the figure and in the text (lines 219-227), which is opaque.

4. Figure 5 states that slower cell cycles would increase cell type diversity at the price of fewer progeny. However, the real data in Figure 6 don't support this idea. Slower cell cycles actually increased differentiated progeny, which suggests that the authors' simulation settings failed to capture a critical aspect of the regulation of transcription.

5. We suggest that the authors also look into the Oikopleura dioica genome (see Danks et al. 2012 "OikoBase: a genomics and developmental transcriptomics resource for the urochordate Oikopleura dioica"). This could be very interesting because this organism develops exceptionally fast (embryogenesis lasts 4-5 hours before hatching a tadpole), and it has an incredibly compact genome (most introns are no longer than 100 bp). There is also good transcriptome data from Drosophila that could be informative, especially if maternal transcripts could somehow be subtracted out. Overall, the analysis of real transcriptome data is superficial, and much more could be done here that might support the authors' conclusions much better than what is presented.

6. In the introduction (line 47), the authors should state how they envision the "transcriptional filter" works. By abortion of transcription at M phase? Or during S phase? There is data on both mechanisms and these should be cited and described explicitly. This also comes up in the discussion (line 269). The authors should be aware that the attenuation of transcription during S-phase is limited to interactions with replication forks. In fact there is a great deal of transcription in most S phases.

7. We were uncomfortable with the assumption that appears to be made on lines 86-92 and Figure 2, where the number of transcripts made during a period of time equals: synthesized transcripts=time/time to make one transcript. Does this assumption fail to take into account that multiple RNA transcription bubbles may exist on the same gene once transcription starts, thus speeding up transcript production once the first transcript is finished? If so it is inaccurate.

Reviewer #4:

This manuscript presents a theoretical model that explains potential contribution of cell cycle, as a transcription filter, to organism development. The hypothesis is not entirely new. The main contribution is that the authors defined a math/simulation framework and compared it with some real data (gene size distribution in various organisms and single-cell transcriptomic data obtained from several organisms/processes) for justification. Overall, I do not feel this work adds much new to current understanding. Real data did not strongly validate the proposed model, nor led to refinement in knowledge/hypotheses.

1. Page 4. Algorithms and parameters used in simulation were not described in sufficient detail. For example, how were single-cell RNA-seq data simulated? Was noise considered?

2. Figure 2, the trends between A and B are similar but the actual distributions are not: either mean average transcript count per cell, or the extent of spread look quite different. One could also argue the dissimilarity between the two figures. Does it reveal that the model lacks necessary accuracy? This needs to be further discussed/explained.

3. Figure 3A. The authors should be able to derive analytical solutions for the transcriptome diversity, directly from the model defined in Figure A. Not sure why they need to show these relations indirectly from simulation (Page 5). Perhaps it is the order/logic of the presentation.

4. Figure 3C. Unclear how many cells are there under each condition (when does the simulation stop?) and how do #cell affect the clustering results. Also, #cluster, as a metric is quite misleading (not comparable across conditions) without specifying the total #cell and the clustering algorithms/parameters used.

5. Page 6, the authors stated that "we expect faster developing organisms to have short cell cycles and genes whereas slower developing organisms will have longer cell cycles and genes". This is very rough. Can it be quantified and then supported by the proposed model?

6. Figure 4. How were the 11 genomes selected? Why not select more? Are there genomes that do not follow the trend, i.e., having large genome but relatively shorter genes? Also, it is unclear in what order the organisms are listed. Particularly, Fugu and zebra-fish did not follow the monotonic trend in means. Some distributions do not look statistically significantly different.

7. Figure 6. E15.5 and E17.5 have a slightly reversed trend.

In general, there needs to be more real data used to back up the theoretical models.

[Editors’ note: further revisions were suggested prior to acceptance, as described below.]

Thank you for submitting your article "Control of tissue development and cell diversity by cell cycle dependent transcriptional filtering" for consideration by eLife. Your article has been reviewed by 3 peer reviewers, including Wenying Shou as the Reviewing Editor and Reviewer #1, and the evaluation has been overseen by Aleksandra Walczak as the Senior Editor. The following individual involved in review of your submission has agreed to reveal their identity: David M Suter (Reviewer #2).

The reviewers have discussed their reviews with one another, and the Reviewing Editor has drafted this decision letter to help you prepare a revised submission.

Essential revisions:

The decision has taken a long time because as you will see, the three reviewers have disagreements. Upon further discussions, we have reached an agreement. We feel that although experiments are always desired, there should be a place in science for extracting information from published datasets, despite the varying quality of datasets. We will waive the requirements for experiments, but we do request you to address comments not related to experiments, and be very careful in stressing the limitations of the datasets you used and the conclusions you drew. For example, your model suggests a mechanism, but does not exclude other mechanisms.

Reviewer #1:

1. Figure 4: It may be more meaningful to model stem-cell like behavior where a fast cell always gives birth to a slow and a fast cell, whereas a slow cell always gives rise to two slow cells. I believe that the cell # patterns will look more realistic under this assumption.

2. Figure 7: I am not sold on this figure and the associated text. "Sensory" and "perception" (short genes) seem to be related to neural development (long genes). Also, the main text said "shortest genes… enriched for genes involved in core processes (e.g…. transcription…)", whereas in Figure 7, "transcription" is associated with long genes.
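
The division rule proposed in comment 1 above (a fast cell gives one fast and one slow daughter; a slow cell gives two slow daughters) can be sketched as a small event-driven simulation. This is only an illustration under stated assumptions, not the authors' model: the function name `simulate_lineage`, the fixed cycle durations, and the single fast founder cell are all hypothetical choices made here.

```python
import heapq

def simulate_lineage(fast_cycle_h, slow_cycle_h, horizon_h):
    """Count fast and slow cells present after horizon_h hours.

    Assumed rule (from the reviewer's suggestion): a fast cell divides into
    one fast and one slow daughter; a slow cell divides into two slow
    daughters. Cycle durations are fixed and positive (an assumption).
    """
    # Each entry is (time of next division, cell kind); start with one fast cell.
    queue = [(fast_cycle_h, "fast")]
    while queue[0][0] <= horizon_h:
        t, kind = heapq.heappop(queue)
        daughters = ("fast", "slow") if kind == "fast" else ("slow", "slow")
        for d in daughters:
            cycle = fast_cycle_h if d == "fast" else slow_cycle_h
            heapq.heappush(queue, (t + cycle, d))
    n_fast = sum(kind == "fast" for _, kind in queue)
    return n_fast, len(queue) - n_fast

# Fast cells cycle every 1 h, slow cells every 4 h, observed at 4 h.
print(simulate_lineage(1, 4, 4))  # (1, 4)
```

Under this rule the fast pool stays at a single cell while slow cells accumulate, so the fast-to-slow proportion is set entirely by the two cycle durations and the observation time.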

Reviewer #2:

I found the revised version of this manuscript improved, and the authors have adequately addressed most of the points I raised.

Notably, they now describe their methodology in more detail, and also assess how their model behaves when assuming that several RNA Pol II can be present on the same gene. I would suggest using this as a first assumption rather than "We assume RNA polymerase II re-initiation occurs once a transcript is complete". To me the latter is not supported by what we know about transcription, which occurs in bursts that can generate large numbers of RNA molecules within minutes. Also see Tantale et al., Nature Communications 2016, who show evidence of RNA Pol II convoys on actively transcribed genes.

Reviewer #3:

In this manuscript the authors use mathematical modeling to address whether cell cycle length determines cell fate using a correlation of gene transcript length. Since a longer cell cycle time allows transcription of longer genes, it could affect the cell fate of the progeny. If longer transcripts are needed for highly differentiated cells, there would be a need for longer cell cycle times. Since it has been shown in stem cells that lengthening of the G1 phase is correlated with increased differentiation of cells, this hypothesis could make a lot of sense.

Using mathematical modeling is a great approach to answer this question and is definitely one of the strengths of this manuscript. This manuscript is trying to address an important and fundamental question that has been on the minds of scientists for a long time.

The drawback of the manuscript is that validation of the hypothesis is only partially or poorly confirmed by the experimental data. Essentially, the data does not contradict the hypothesis of the authors. This is great but is it good enough? Should the data not univocally prove that the hypothesis is correct? One of the major issues is that the authors use publicly available data, which originates from different organisms, different developmental time points, and have been acquired using different platforms. Therefore, the underlying data may not be solid enough.

Rather than trying to find universal rules that apply to all organisms, tissues, and developmental time points, it may be more useful to stick to one organism. If the authors could prove that their hypothesis is correct even in only one specific cell type, this would be an important step. Sometimes taking a small step can be more important than making a giant leap that is not well supported by the data.

This manuscript is interesting and contains good hypotheses but for sure the authors had to use a number of simplifications. Whether this still allows to generalize the conclusions of this manuscript is up for debate.

I am not a mathematician and therefore I am not able to check the mathematical models that were used. Nevertheless, I will assess if the conclusions make sense in real biology.

My conclusion after reading this manuscript is that it is of interest but remains speculative. What I mean by this is that the mathematical predictions would need to be verified by experiments. Although the authors use a number of datasets, they are assembled from different organisms and different developmental timepoints. As the authors mention, the data does not contradict their hypotheses. This is ok but maybe not good enough? Should the data not unequivocally support the mathematical hypotheses in order that the readers will buy them?

Here are the main reasons:

1. Line 128: "in general, cells express more short genes than longer genes over multiple developmental time points." Although there may be a trend, I am not entirely convinced of this statement. There seems to be a lot of noise (variation), which may not support this conclusion.

2. Line 223: "While cell cycle duration measurements are not widely available, we instead ask if organisms with longer genes would also take longer to develop." Although this is understandable, I am not sure that this is a correct surrogate. The duration of development must not necessarily be dependent on cell cycle length. Nevertheless, I agree the cell cycle duration measurements are not widely available.

3. The authors use data from different organisms and from different developmental time points. Of course, the idea is that there are universal rules that apply across species. This would be ideal but is there any proof of that? The unwanted side effect is that it becomes really confusing and the authors may compare apples to oranges.

4. Then there is the issue of splicing and introns. It is not surprising that larger genes contain more introns. To some degree splice isoforms could also explain the differences between stem cells and differentiated cells. Nevertheless, I feel this is a distraction. Therefore, analyzing organisms that contain few introns would be more useful. Budding yeast is such an organism.

5. The pathway analysis of the short and long genes is not thorough enough. In addition, the authors should use random sets of genes (same number) from the intermediate genes, which are the majority of genes.

6. The time it takes to transcribe a gene is not only dependent on its size and a fixed speed. This is an oversimplification.

[Editors’ note: further revisions were suggested prior to acceptance, as described below.]

Thank you for submitting your article "Control of tissue development and cell diversity by cell cycle dependent transcriptional filtering" for consideration by eLife. Your article has been reviewed by 3 peer reviewers, including Wenying Shou as the Reviewing Editor and Reviewer #1, and the evaluation has been overseen by Aleksandra Walczak as the Senior Editor. The following individual involved in review of your submission has agreed to reveal their identity: David M Suter (Reviewer #2).

The reviewers have discussed their reviews with one another, and the Reviewing Editor has drafted this to help you prepare a revised submission.

Essential revisions:

Please revise your writing to address Reviewer 1 and 3's critiques.

Reviewer #1:

Authors have mainly addressed my comments.

Figure 9: I wonder whether you can make further statements. For example, if immune cells have short cell cycle, then its enrichment for short genes will make more sense. Also, might olfactory short genes be related to environmental sensing genes which in turn involve signal transduction pathways also used in fast-growing cells?

Figure 5 legend: 220 should be 219.

Reviewer #2:

I am happy to see that the authors successfully integrated RNA Pol II re-initiation in their model, which did not affect their conclusions. The authors have addressed my concerns adequately, and the manuscript should be ready for publication.

Reviewer #3:

The authors have invested efforts to address the issues that were raised by the reviewers. The story of this manuscript has not fundamentally changed (which probably was also not expected) and there remain shortcomings. One aspect that I wish would improve is to use more understatement rather than claiming things that the authors cannot prove.

Here are a few examples where the manuscript could be improved:

1. Line 128/129: "found that, in general, short genes have a higher expression level than longer genes within a cell." When I was reading this, I had trouble believing it but in Figure 3B, the authors show mRNA expression. This is though not mentioned in the text and the reader can be misled that this also applies to protein expression. It would be desirable that the authors are precise without using generalizations.

2. Line 179: "second child cell" I believe these are usually referred to as "daughter cells".

3. Line 233: "We started by asking if organisms with longer genes would also take longer to develop." I apologize but this question (or hypothesis) does not make a lot of sense to me. There are a million reasons why an organism takes a certain amount of time to develop and this may also be dependent on the environment. The length of the genes is surely only one of many reasons. In their conclusion on line 249, the authors call it a "strong relationship", which probably is an association and we all know that associations are weak (remember the one about the amount of chocolate consumption and the chance to win the Nobel prize?).

4. Line 266-278: I am not sure if I get the point here "cell cycle duration and gene expression vary spatially.". Not only spatially but also dependent on age, environment, nutrition, and many more factors.

5. In the discussion, the limitations (some of which are mentioned) should be discussed much more honestly.

6. In several figures (for example Figure 6—figure supplement 2 but there are others), the authors use a representation (word clouds) that is not very helpful. The authors should find a better way to bring across the point that they are trying to make.

https://doi.org/10.7554/eLife.64951.sa1

Author response

[Editors’ note: the authors resubmitted a revised version of the paper for consideration. What follows is the authors’ response to the first round of review.]

Reviewer #1:

Abou Chakra et al. posed the hypothesis that cell cycle transcriptional filtering – the transcribing of long genes only when cell cycle slows down – might help control tissue development and the generation of diverse cell fates. This study is based on simple math modeling and comparing model predictions to single-cell RNA seq data.

From an outsider's point of view, the article is interesting, although very speculative. Thus, I suggest softening your statements throughout. For example, the current title can be changed to "Cell cycle dependent transcription filtering can potentially contribute to tissue development".

Scientifically, I would like to see an analysis of whether transcripts specific to development (e.g. neuronal development) tend to be longer.

Thank you for the comments. The major change we made to address this is to focus the paper on the mathematical model and separate the speculative statements into a new section that includes discussion of the implications of the model, providing a number of new analyses to support these implications. We have also reviewed the paper as a whole to clarify speculative statements. We hope this clarifies and strengthens the overall paper.

We have also included a pathway enrichment analysis based on gene length which clearly shows that the longest genes are enriched for genes expressed in mature cells (long cell cycles) in processes such as neural and muscle development, whereas the shortest genes are enriched for genes involved in core processes (e.g. metabolism, transcription, translation) and transcription factors.

Reviewer #2:

This study addresses the important and exciting topic of how the cell cycle duration gates gene expression diversity at the single cell level by limiting the expression of long genes. This work intersects between genetics, gene expression, developmental and evolutionary biology, and should therefore be of broad interest to the readership of eLife. The manuscript is well written and structured, and the findings are interesting. I have some technical and conceptual concerns about the causality links made by the authors.

1. In general the authors should provide more details about the implementation of their method in the manuscript. While Figure 1 is very clear and allows non-specialists to grasp the concept, a more detailed explanation of the model would be useful.

Thank you. We have now included a more detailed explanation in the methods section.

In particular, how is transcription re-initiation modelled? This is important for the following reason. If a gene takes 10 hours to be transcribed and the cell cycle is 10 hours, only 1 transcript can be made. However, if the cell cycle is longer, the maximum number of transcripts will depend on the distance between successive polymerases (thus depending on initiation rate), and could thus rapidly increase. The number of transcripts generated in 11 hours will thus scale with initiation rate, or inversely with the distance between polymerases on the gene. Also, how does transcriptional bursting affect their modelling?

In our original manuscript, we assumed that a transcript must be fully transcribed before another copy can be made. That is, we didn’t allow a gene to be transcribed by multiple polymerases at the same time. Including an initiation rate would act to increase the number of transcripts, as the reviewer points out. To show how this affects our results, we performed simulations with either re-initiation (Figure S2A) or a random transcription rate (Figure S2B). The re-initiation distance is the distance in kb along the gene at which a new RNA Pol II complex can initiate (e.g. if it is set to 1, a new RNA Pol II can initiate 1 kb apart from the first). A random rate means a new rate is assigned to each gene from a Gaussian distribution around a set rate, λ. We show that changing initiation distance and transcription rate doesn’t affect the trend that short genes are expressed more than long genes (Figure S2 A-B). Furthermore, we tested under which conditions the transcriptional filter can be suppressed. We conducted simulations for a genome with genes of the same length (e.g. LG=9) and a short cell cycle duration (Γ=1 hour) (Figure S2 C-D). In this case the long genes do not have enough time to be transcribed, and changing initiation distance cannot overcome the limiting effects of the cell cycle. Only by increasing the transcription rate sufficiently (e.g. λ=4) do we see the genes completely transcribed. On the other hand, simulations for a genome with genes of the same length (e.g. LG=9) and a long cell cycle duration (Γ=10 hours) (Figure S2 E-F) show that re-initiation distance and transcription rate help make cell cycle duration more important.
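
The difference between the two assumptions discussed here can be made concrete with a minimal sketch. It is not the simulation code used for Figure S2; the function names, the deterministic (non-random) rate, and the units (kb, kb per hour, hours) are assumptions introduced only for illustration. Without re-initiation, a new transcript starts only after the previous one finishes; with re-initiation, polymerases follow each other at a fixed spacing, so after the first transcript is complete one more is completed for every additional spacing length transcribed.

```python
import math

def transcripts_without_reinitiation(gene_length_kb, rate_kb_per_h, cycle_h):
    """One polymerase at a time: transcripts = time / time per transcript."""
    return math.floor(cycle_h * rate_kb_per_h / gene_length_kb)

def transcripts_with_reinitiation(gene_length_kb, rate_kb_per_h, cycle_h, spacing_kb):
    """Polymerases follow each other spacing_kb apart (assumed fixed spacing)."""
    transcribed_kb = cycle_h * rate_kb_per_h
    if transcribed_kb < gene_length_kb:
        # The cell cycle ends before even the first polymerase finishes.
        return 0
    return 1 + math.floor((transcribed_kb - gene_length_kb) / spacing_kb)

# The reviewer's example: a gene that takes 10 hours to transcribe.
print(transcripts_without_reinitiation(10, 1, 10))    # 1
print(transcripts_without_reinitiation(10, 1, 11))    # 1
print(transcripts_with_reinitiation(10, 1, 11, 1))    # 2
print(transcripts_with_reinitiation(10, 1, 11, 0.5))  # 3
```

In this sketch the spacing changes the count only once the cell cycle is long enough to complete a first transcript, which is the sense in which a short cell cycle can still act as a filter.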

2. The authors show that short genes tend to be expressed at higher levels than long genes. They use this data to support that elongation rates * cell cycle duration limits the expression of long genes. In this scenario, one would expect that long genes are more depleted in pol II at the end of the gene, while short genes would not. The authors could look into Pol II footprinting/ChIP-seq datasets to confirm this, and also to exclude that short genes are expressed at higher levels because of their higher initiation rates.

We tried to do this by analyzing a series of Pol II footprinting data sets (we tried GSE34301 (Gaertner et al., 2012), GSE81521 (Li et al., 2016), and GSM565202 (Saha et al., 2011)), but the data did not distinguish between an elongating versus a poised RNA pol II. There were very few data sets of this type available for organism development and we couldn’t find any for early development to better match the context of our study. Also, we couldn’t be sure what phase of the cell cycle the cells were in when measured. Overall, the reviewer makes an interesting point and there do seem to be a number of technologies that may be able to support this line of inquiry (e.g. GRO-seq and related methods), so we will be watching for these data in the future.

3. The authors mention that cell cycle duration should be broadly correlated with development time. But cell sizes and organism size will also impact this correlation. Why do the authors not consider these parameters?

We agree that cell size will impact the decision of when the cell will divide. However, the data on how cell size is influenced by cell cycle duration is still unclear, and current work suggests that S-phase is important in cell size decision (Kafri et al., 2013). We have added this to the discussion.

4. The authors should provide an analysis of which classes of genes are enriched in different gene length bins. According to their results, more specialised genes, i.e. those expressed in terminally differentiated, non-dividing cells, should be longer on average.

We have now included a pathway enrichment analysis based on gene length which clearly shows that the longest genes are enriched for genes expressed in mature cells (that we expect to have long cell cycles) in processes such as neural and muscle development, whereas the shortest genes are enriched for genes involved in core processes (e.g. metabolism, transcription, translation) and transcription factors, see figures S7, S8 and S9.

5. Figure 6A: How can the average cell cycle duration become higher than the max cell cycle duration at the single cell level (shown as color code)?

Sorry for the confusion. This may have been caused by our lack of a legend in Figure 6A, which is now included.

6. What is the relative contribution of introns vs exons in long vs short genes?

To answer this question we calculated the ratio of intron length to total gene length for each human gene, and plotted the distribution for four gene length bins (all bins have roughly the same number of genes). This shows that short genes (n=1453, dashed line at 0.05 below) have relatively short or no intron regions, while the longest genes (dashed line at 0.95) have the greatest intron length-to-gene length ratio.

Author response image 1
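
The calculation described above (intron length divided by total gene length, summarised within four gene-length bins of roughly equal size) can be sketched as follows. This is a hedged illustration, not the analysis script behind the image: the gene records are invented toy values, and the names `intron_ratio` and `bin_by_length` are hypothetical.

```python
from bisect import bisect_right
from statistics import mean, quantiles

# Invented toy records: (gene name, gene length in kb, total exon length in kb).
genes = [
    ("g1", 1, 1), ("g2", 2, 2), ("g3", 5, 3), ("g4", 10, 4),
    ("g5", 20, 5), ("g6", 50, 5), ("g7", 100, 6), ("g8", 500, 8),
]

def intron_ratio(gene_length, exon_length):
    """Fraction of the gene that is intronic (0 means no introns)."""
    return (gene_length - exon_length) / gene_length

def bin_by_length(records, n_bins=4):
    """Group intron ratios into n_bins gene-length quantile bins."""
    cuts = quantiles([length for _, length, _ in records], n=n_bins)
    bins = [[] for _ in range(n_bins)]
    for _, length, exons in records:
        # bisect_right gives the index of the quantile bin for this length.
        bins[bisect_right(cuts, length)].append(intron_ratio(length, exons))
    return bins

for i, ratios in enumerate(bin_by_length(genes)):
    print(f"length bin {i}: n={len(ratios)}, mean intron ratio={mean(ratios):.2f}")
```

On real annotations the exon total would come from the merged exon intervals of each gene, a step this sketch deliberately leaves out.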

Could the longer cell cycle of some species allow the accumulation of more/longer introns to increase splice isoform diversity/regulatory potential?

This is an interesting question. We do expect this to happen. Other questions arise from this line of inquiry. For instance, how do gene length and intron/exon structure change over evolution? How do these changes affect different gene function categories? How do these relate, considering various factors including intron length, position, and number of introns? We are addressing these questions in a separate paper, so we did not include these results here.

7. Linked to 5., one challenge here is to understand causality links. The cell cycle could be used by organisms to gate cell diversity during development, but longer cell cycles could also allow to accumulate longer genes on evolutionary time scales. The authors should comment on this.

Yes, this is an interesting challenge. We have generally assumed that gene lengths and cell cycle duration dynamics coevolved across evolutionary time scales; our hypothesis is that a cell cycle dependent transcriptional filter is a fundamental regulatory mechanism operating during development (because gene length is fixed in the genome). We have now included a comment about this in the discussion.

Reviewer #3:

We've known for a very long time that cells in many animal embryos have very fast cell cycles, and that the duration of cell cycle increases as cells differentiate into progeny with specialized function. However, whether cell cycle duration directly impacts cell fate decisions via filtering transcriptional activity is not clear. Certainly it has been proposed on multiple occasions, and the ability of mitosis and DNA replication to interrupt and abort transcription has been demonstrated. This study presents an interesting attempt to address this topic via mathematically modeling the relationship of cell cycle length and the diversity of transcriptome. The authors utilized a simplified model to simulate how cell cycle length, gene length, number of genes and transcription rate affect resulting transcriptomes in cell populations. Not unexpectedly, the simulations show that increasing the length of the cell cycle can increase the proportion of mRNAs from long genes, and also the complexity of the transcriptome and (more interestingly) the diversity of transcriptomes between different cells in a population. The model is clever and the demonstration is useful, if highly over-simplified.

Importantly the authors also analyze some real transcriptome data to see if it supports their conclusions. At some levels there is support, but overall the real data from single cell sequence don't align well with their hypothesis and fall short of a compelling experimental validation.

Overall we felt the study was interesting in concept and could be a valuable addition to the literature devoted to cell cycle and development, but that it could be much better with comprehensive revisions as discussed below.

1. The simulations in Figure 2A show that increasing cell cycle duration will allow more relative transcription of longer genes. Unfortunately the real data from Xenopus and Danio (Figure 2B, S2) don't appear to show this same trend. The authors should check this more carefully by comparing proportions of transcripts of different lengths at the different developmental times. If there is no change in proportions then the real data may not validate the simulation's predictions. At issue might be the "contamination" of the real data with maternal transcripts. Perhaps the authors should try to remove maternal transcripts and analyze only zygotic transcripts. Furthermore, the timepoints shown for Xenopus in Figure 2B are probably too late and too closely spaced to see a trend. They should compare transcripts for very early and very late in development.

The reviewer makes a number of good points, but unfortunately, it seems we confused the presentation by including multiple developmental time points. We intended for this figure to just show that short genes are more expressed than long genes – irrespective of developmental time point or difference in cell cycle duration. Since we don’t know parameters such as mRNA degradation rate and how these change over time, we didn’t feel comfortable making more precise statements, such as comparing the ratio of short/long gene expression values between time points. We have simplified this figure using new data that shows one time point per species for a wider range of species. The overall message of Figure 2 is that short genes are more expressed than long genes. As the reviewer mentions, this is not unexpected, but is important to show to validate this aspect of our model.

2. I was not convinced by the simulations and arguments about the generation of cell diversity. It seems that this might only occur with very low numbers of transcripts per cell; i.e. when stochastic variation came into effect. This may not be the case for real cells.

We created additional simulations with more genes in the genome (10, 100, 1000) per cell and obtained the same results, as shown in Figure S4.

3. Figure 6AB really baffled me. Transcriptome diversity seems not to be graphed at all (check the axis labels and key), even though diversity is the point of the figure. Either it is mis-labeled or it needs more explanation. Figure 6C also requires more explanation, both within the figure and in the text (lines 219-227), which is opaque.

Thank you for pointing this out and we apologize for the confusion. The point of this figure (now Figure 5) is not about transcriptome diversity. It is that when you allow the cell cycle duration to slow down over time and more naturally (randomly) vary, closer to what we understand happens in a real developing organism, the model generates a diverse set of cell proportions (represented by pie charts). So it is the diversity of the pie charts (the cell proportion patterns) generated that we want to highlight. This is interesting because it suggests cell cycle duration could help control cell type proportions during tissue development. We have completely redesigned this figure, which we hope shows this more intuitively. In particular, we show how the cell duration parameters/cell division events that vary over time generate a diverse set of cells (cycling at different rates) shown in pie charts. We also compared these results to real mouse brain development data where we know how cell cycle changes (lengthens) over time, showing a similar trend.

The novel idea here is that just changing cell cycle dynamics can influence the proportion of fast to slow cycling cells (which we can think of as stem and differentiated cells, especially in the neural context). We have updated the text to clarify our idea.

4. Figure 5 states that slower cell cycles would increase cell type diversity at the price of fewer progeny number. However, the real data in Figure 6 don't support this idea. Slower cell cycle actually increased differentiated progenies, this suggests that the authors' simulation settings failed to capture a critical aspect of the regulation of transcription.

Sorry for the confusion between figures 5 and 6. Figure 5 (original manuscript) investigates simulations involving two cell lineages with two different, but constant cell cycle durations. This illustrates that asynchronous cell cycle duration can select a mix of cell diversity programs (not allowing cell cycle duration to increase over time).

Figure 6 (original manuscript) shows the results of a simulation where the cell cycle duration of all cells is increasing over time. This illustrates that changing cell cycle duration over time affects the proportion of cells generated with various cycling rates. We are able to compare the generation of cell proportion diversity to real mouse brain development data and show the same trend in simulated and real data, as discussed above.

We have redesigned these figures (now Figure 4 and 5) to more clearly show the different questions being asked.

We note that we did not compare the number of cell types in simulations and real data because we couldn’t guarantee a match in the range of cell durations present, and there could be artifacts with single cell RNA-seq data not accurately identifying all cell types and proportions, or sensitivity to scRNA-seq data clustering parameters. However, we do note that our simulations show that cell type diversity increases and then decreases as cell duration increases (see Figure 3B) and interestingly, we do see the same trend of gradual increase, then decrease in the real mouse brain development data (scRNA-seq) across the four time points (8 cell types at E11.5, 11 at E13.5, 16 at E15.5 and 9 at E17.5).

5. We suggest that the authors also look into the Oikopleura dioica genome (see Danks et al. 2012 "OikoBase: a genomics and developmental transcriptomics resource for the urochordate Oikopleura dioica"). This could be very interesting because this organism develops exceptionally fast (embryogenesis lasts 4-5 hours before hatching a tadpole), and they have an incredibly compact genome (most introns are no longer than 100 bp).

We have added analysis of O. dioica genome to our analysis. Interestingly, its gene length distribution is shifted globally to be shorter, in agreement with a fast development process, refer to Figure 6. Thank you for this suggestion.

There is also good transcriptome data from Drosophila that could be informative, especially if maternal transcripts could somehow be subtracted out.

Thank you. We were able to find a recent multi-organism (n=10) scRNA-seq data set to add to our analysis to address this comment and the related comments from reviewer 4.

Overall, the analysis of real transcriptome data is superficial, and much more could be done here that might support the authors' conclusions much better than what is presented.

The main challenges with published transcriptome data are the potential confounding technical factors (e.g. cell type bias from single cell dissociation workflows), low sensitivity and lack of coverage of many biological contexts. We compared to real data where we felt confident doing so and resisted in other areas. We now include discussion of the opportunity for more to be done here in the future. For now, we want to get the overall idea out there to help us connect with experimentalists who could help generate new data designed to answer specific questions raised by the model.

6. In the introduction (line 47), the authors should state how they envision the "transcriptional filter" works. By abortion of transcription at M phase? Or during S phase? There is data on both mechanisms and these should be cited and described explicitly. This also comes up in the discussion (line 269).

Thank you for raising this point. We have added these citations to address this.

– Danio rerio zygotic transcript lengths are shorter than maternally provided ones; the earliest zygotic genes are without introns. (Heyn et al., 2014; Kwasnieski et al., 2019; Shermoen and O’Farrell, 1991) https://pubmed.ncbi.nlm.nih.gov/1680567 https://pubmed.ncbi.nlm.nih.gov/24440719 https://pubmed.ncbi.nlm.nih.gov/31235656

– Cell cycle duration can limit transcripts based on their size

Short cell cycles can constrain transcription in D. melanogaster. “the length of mitotic cycles provides a physiological barrier to transcript size, and is therefore a significant factor in controlling developmental gene activity during short 'phenocritical' periods.” (Rothe et al., 1992) https://pubmed.ncbi.nlm.nih.gov/1522901

The authors should be aware that the attenuation of transcription during S-phase is limited to interactions with replication forks. In fact there is a great deal of transcription in most S phases.

We have not been able to find data on the extent of transcription in S phase. We have only found papers reporting that transcription decreases at the onset of S phase (Newport and Kirschner, 1982b; Bertoli et al., 2013).

https://pubmed.ncbi.nlm.nih.gov/7139712

https://pubmed.ncbi.nlm.nih.gov/23877564

7. We were uncomfortable with the assumption that appears to be made on lines 86-92 and Figure 2, where the number of transcripts made during a period of time equals: synthesized transcripts=time/time to make one transcript. Does this assumption fail to take into account that multiple RNA transcription bubbles may exist on the same gene once transcription starts, thus speeding up transcript production once the first transcript is finished? If so it is inaccurate.

This point (also raised by reviewer 1 and discussed above) is now addressed by adding simulations that consider multiple RNA Pol II/transcription bubbles per transcript, see Figure S2. Adding this to our simulations does not change our conclusions.
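
The difference between the two assumptions can be illustrated with a minimal sketch. It assumes a constant elongation rate and, for the multi-polymerase case, a fixed interval between initiation events. The function names, the 1.5 kb/min rate and the 1 min initiation interval are illustrative choices, not values taken from the manuscript.

```python
def transcripts_single_pol(cycle_min, gene_len_kb, rate_kb_per_min=1.5):
    """Re-initiation only after a transcript completes:
    count = cycle time / time to make one transcript (rounded down)."""
    time_per_transcript = gene_len_kb / rate_kb_per_min
    return int(cycle_min // time_per_transcript)


def transcripts_multi_pol(cycle_min, gene_len_kb, rate_kb_per_min=1.5,
                          init_interval_min=1.0):
    """Several polymerases on the gene at once: a new one initiates every
    init_interval_min (assumed), and each needs gene_len / rate to finish."""
    time_per_transcript = gene_len_kb / rate_kb_per_min
    if cycle_min < time_per_transcript:
        return 0  # even the first polymerase cannot finish in time
    # Polymerases that start early enough to finish before the cycle ends.
    return int((cycle_min - time_per_transcript) // init_interval_min) + 1


if __name__ == "__main__":
    for gene_kb in (30, 150):  # hypothetical short and long gene
        print(gene_kb,
              transcripts_single_pol(60, gene_kb),
              transcripts_multi_pol(60, gene_kb))
```

Under these assumptions, a gene whose transcription time exceeds the cycle duration yields no complete transcripts in either case, so allowing multiple polymerases raises the counts for genes that fit in the cycle without removing the length-dependent filter.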

Reviewer #4:

This manuscript presents a theoretical model that explains potential contribution of cell cycle, as a transcription filter, to organism development. The hypothesis is not entirely new. The main contribution is that the authors defined a math/simulation framework and compared it with some real data (gene size distribution in various organisms and single-cell transcriptomic data obtained from several organisms/processes) for justification. Overall, I do not feel this work adds much new to current understanding. Real data did not strongly validate the proposed model, nor led to refinement in knowledge/hypotheses.

The novel aspect of our work is the proposal that a cell cycle dependent transcriptional filter can control cellular diversity within a tissue over development. However, some of the concepts that we build on are known and are recognized in the community to varying degrees. We bring these together for the first time to support the model and generate predictions. In particular, we list these concepts in the Appendix and clarify our novel contribution.

Our work is the first to tie these diverse ideas together into one unified model. Further, our model enabled us to uncover a novel emergent property, that a cell cycle controlled transcriptional filter can be used as a mechanism to control generation of diversity of cells within a developing system (one formed by successive rounds of cell division). Finally, we are the first to apply mathematical modeling to this topic, which enables us to propose testable predictions that we hope will interest experimental biologists to investigate.

Our comparison to real data cannot prove our model, and this is not used to justify, prove or validate our results. Instead, it serves to show that real data does not refute our predictions. In discussions with diverse experimental biologists, we have found comparison with real data in this way necessary to move forward with discussions about our predictions, and that is what we hope to accomplish by including it here.

We have edited the text to clarify the novelty and we have included the information in the Appendix.

1. Page 4. Algorithms and parameters used in simulation were not described in sufficient detail. For example, how were single-cell RNA-seq data simulated? Was noise considered?

Yes, noise is considered by the model in the cell division step, with random assortment of transcripts to child cells. We have clarified this and other aspects of the method in the manuscript, as also requested by reviewer 1.
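
As a rough illustration of how random assortment at division introduces noise, the sketch below sends each transcript to one of two child cells independently. The equal 50/50 probability and the names `divide`, `parent` and `rng` are assumptions made for this example; the manuscript's exact partitioning scheme may differ.

```python
import random


def divide(transcripts, rng):
    """Split a parent's transcripts between two child cells.

    transcripts: dict mapping gene name -> transcript count.
    Each transcript goes to child A with probability 0.5 (assumed),
    otherwise to child B, so totals are conserved but children differ.
    """
    child_a, child_b = {}, {}
    for gene, count in transcripts.items():
        to_a = sum(rng.random() < 0.5 for _ in range(count))
        child_a[gene] = to_a
        child_b[gene] = count - to_a
    return child_a, child_b


if __name__ == "__main__":
    rng = random.Random(1)  # fixed seed so the run is repeatable
    parent = {"short_gene": 40, "long_gene": 3}  # hypothetical counts
    print(divide(parent, rng))
```

Genes with only a few transcripts, such as long genes in a short cycle, are the ones most likely to end up unevenly split or absent from one child.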

2. Figure 2, the trends between A and B are similar but the actual distributions are not: either mean average transcript count per cell, or the extent of spread look quite different. One could also argue the dissimilarity between the two figures. Does it reveal that the model lacks necessary accuracy? This needs to be further discussed/explained.

The intent of figure 2 is only to show the overall pattern of higher expression in shorter genes than longer genes in both simulation and real data. We now make this clearer by showing one time point (in response to reviewer 3) and with data across species.

3. Figure 3A. The authors should be able to derive analytical solutions for the transcriptome diversity, directly from the model defined in Figure A. Not sure why they need to show these relations indirectly from simulation (Page 5). Perhaps it is the order/logic of the presentation.

The analytical solution for transcriptome diversity is described in the text. We include both simulation and calculations to show that they agree. We chose to focus on the simulation, because later in the paper, we consider number of cell clusters as an additional measure of transcriptome diversity and this is not straightforward to define an analytical solution for, so we choose to follow the simulation thread/explanation through the sections of the paper for consistency.

4. Figure 3C. Unclear how many cells are there under each condition (when does the simulation stop?) and how do #cell affect the clustering results.

There are 10,000 cells sampled from each simulation. Simulations were run over a single cell division event (stopping criteria), but were run 1,000,000 times to provide a good sampling of the space of solutions. All clustering results include exactly 10,000 cells, so they are comparable. We varied the number of cells (from 1000 up to 10,000) and obtained similar results. Including more cells reduced variance (we iterated n=20 for each test) and as a result we chose 10,000.

Author response image 2 shows how selecting (1000, 5000 or 10000) cells does not affect cluster number, shown for cell cycle durations 1-5 hours.

Author response image 2

Also, #cluster, as a metric is quite misleading (not comparable across conditions) without specifying the total #cell and the clustering algorithms/parameters used.

Sorry for this oversight. We have now specified the number of cells (10,000 for each clustering result in the paper) and the clustering parameters (Using Seurat 3.1.2, cells were clustered using a shared nearest neighbor (SNN)-based Louvain algorithm with reduction set as "pca". The clustering resolution was set to 1 for all data sets, and all calculated PCs were used in the downstream clustering process using the original Louvain algorithm.)

5. Page 6, the authors stated that "we expect faster developing organisms to have short cell cycles and genes whereas slower developing organisms will have longer cell cycles and genes". This is very rough. Can it be quantified and then supported by the proposed model?

Thank you for highlighting this. We have now added discussion in a new text section and improved the figure (now Figure 6) to clarify this point.

6. Figure 4. How were the 11 genomes selected? Why not select more? Are there genomes that do not follow the trend, i.e., having large genome but relatively shorter genes? Also, it is unclear in what order the organisms are listed. Particularly, Fugu and zebra-fish did not follow the monotonic trend in means. Some distributions do not look statistically significantly different.

We originally selected these species to be distant to each other to cover a broad range of species, but also are common model organisms with well annotated genomes. The organisms are listed based on their evolutionary relationship and not the size of the genome. We have now analysed over 100 genomes (n=101) and we see the same trend: an increase in longer genes correlates with an increase in development duration (see Figure 6).

7. Figure 6. E15.5 and E17.5 have a slightly reversed trend.

In general, there needs to be more real data used to back up the theoretical models.

We are only interested in interpreting the overall trend. This data is available as a time series and it is the only one where we also know the average cell cycle duration. We need both cell cycle and cell proportion information to compare data to simulation predictions. The discrepancy between E15.5 and E17.5 may be due to the difference in cell number, which is 5000 cells for E15.5 and 2000 cells for E17.5. However, the prediction is that cell cycle duration can affect proportions, and we see such an effect across the time series data.

[Editors’ note: what follows is the authors’ response to the second round of review.]

Essential revisions:

The decision has taken a long time because as you will see, the three reviewers have disagreements. Upon further discussions, we have reached an agreement. We feel that although experiments are always desired, there should be a place in science for extracting information from published datasets, despite the varying quality of datasets. We will waive the requirements for experiments, but we do request you to address comments not related to experiments, and be very careful in stressing the limitations of the datasets you used and the conclusions you drew. For example, your model suggests a mechanism, but does not exclude other mechanisms.

Thank you for arranging another useful review of our work and for waiving the requirements for experiments. We have addressed all the other comments and concerns of the reviewers. A point-by-point response to the reviewers follows to support our resubmission.

Reviewer #1:

1. Figure 4: It may be more meaningful to model stem-cell like behavior where a fast cell always gives birth to a slow and a fast cell, whereas a slow cell always gives rise to two slow cells. I believe that the cell # patterns will look more realistic under this assumption.

We have now added the “stem-cell like” behaviour of a fast cell giving rise to a slow and a fast cell as scenario 3 in Figure 4-now-5. These results are interesting. They show us how the organism can increase the number of “slow” cells in the system, at a cost of the number of fast cells in the system.
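
The division rule in this scenario can be sketched as a small simulation. The cycle durations (1 hour for fast cells, 4 hours for slow cells), the hourly time step and the function name `simulate` are hypothetical choices for illustration, not the parameters used in the manuscript.

```python
def simulate(total_hours, fast_cycle=1, slow_cycle=4):
    """Count fast and slow cells after total_hours, starting from one fast cell.

    Rule (stem-cell like): fast -> fast + slow, slow -> slow + slow.
    A cell divides once its age reaches its cycle duration (in hours).
    """
    cells = [("fast", 0)]  # (kind, hours since birth)
    for _ in range(total_hours):
        next_cells = []
        for kind, age in cells:
            age += 1
            duration = fast_cycle if kind == "fast" else slow_cycle
            if age < duration:
                next_cells.append((kind, age))  # still in its cycle
            elif kind == "fast":
                next_cells += [("fast", 0), ("slow", 0)]
            else:
                next_cells += [("slow", 0), ("slow", 0)]
        cells = next_cells
    fast = sum(1 for kind, _ in cells if kind == "fast")
    return fast, len(cells) - fast


if __name__ == "__main__":
    for hours in (4, 5, 8):
        print(hours, simulate(hours))
```

With this rule the fast population never expands beyond the founding lineage, while slow cells accumulate, which is the trade-off described above.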

2. Figure 7: I am not sold about this figure and the associated text. "Sensory" and "perception" (short genes) seem to be related to neural-development (long genes). Also, the main text said "shortest genes… enriched for genes involved in core processes (e.g…. transcription…), whereas in Figure 7, "transcription" is associated with long genes.

We apologize for the confusion, you are correct that transcription occurs for both long and short genes, so we have removed this from the visualization and focused on showing terms that have an uneven distribution of gene lengths. We have extensively revised the pathway themes and reanalysed genes based on their length. For example, we separated olfactory, eye, auditory and brain related pathways. This more careful assignment of pathways to themes helps clean up our results. For example, we find ‘olfactory’, as opposed to the other 3 neural themes, has more (~500) short genes involved vs. long genes (~17).

We also differentiated pathways involved in brain development (structure/tissue), neurons, nerves, synapse/synaptic and neurotransmitters (e.g. dopamine, norepinephrine, serotonin). All these pathways still exhibit more long genes than short genes. The neuron specific pathways increase from ~15 short genes (<1.6kb) to ~400 long genes (>244kb). Although some (~50) short genes (<1600bp) are involved in brain related pathways, there are overall more (~500) long genes (>244kbp). Spine, neurotransmitters and synapse follow the same trend with more long genes than short genes.

We have now updated the Figure 7-now-8 and Figures 8-figures supplement 2-3 and the related text to incorporate these results. We have also added supplementary files 5 and 6 to show how each of the H. sapiens pathways were grouped in themes. Furthermore, we found that when H. sapiens genes were divided into 20 groups from shortest genes to the longest genes, the top themes associated with short genes (some examples shown in blue Author response image 3, with more in Figure 8—figure supplement 3) have a decreasing moving average across all gene groups, whereas top themes associated with longest genes (some examples shown in black in author response image 3, with more in Figure 8—figure supplement 3) have an increasing moving average across all gene groups.

Author response image 3
Example gene function themes with strong trends associated with short (blue) or long (black) genes.

Reviewer #2:

I found the revised version of this manuscript improved, and the authors have adequately addressed most of the points I raised.

Notably, they now describe more in detail their methodology, and also assess how their model behaves when assuming that several RNA Pol II can be present on the same gene. I would suggest to use this as a first assumption rather than "We assume RNA polymerase II re-initiation occurs once a transcript is complete". To me the latter one is not supported by what we know about transcription that occurs in bursts that can generate large number of RNA molecules within minutes. Also see Tantale et al., Nature Communications 2016, who show evidence of RNA Pol II convoys on actively transcribed genes.

We now assume that several RNA Pol II can be present on the same gene by default. We repeated all simulations and generated new figures (3,4,5,6 3—figure supplement 1 and 6—figure supplement 1) with the assumption that several RNA Pol II can be present on the gene. This did not change our conclusions, but we’re happy that this is a more realistic model.

Reviewer #3:

In this manuscript the authors use mathematical modeling to address whether cell cycle length determines cell fate using a correlation of gene transcript length. Since a longer cell cycle time, allows transcription of longer genes, it could affect the cell fate of the progeny. If longer transcripts are needed for highly differentiated cells, there would be a need for longer cell cycle times. Since it has been shown in stem cells that lengthening of the G1 phase is correlated with increased differentiation of cells, this hypothesis could make a lot of sense.

Using mathematical modeling is a great approach to answer this question and is definitely one of the strengths of this manuscript. This manuscript is trying to address an important and fundamental question that has been on the minds of scientists for a long time.

The drawback of the manuscript is that validation of the hypothesis is only partially or poorly confirmed by the experimental data. Essentially, the data does not contradict the hypothesis of the authors. This is great but is it good enough? Should the data not univocally prove that the hypothesis is correct? One of the major issues is that the authors use publicly available data, which originates from different organisms, different developmental time points, and have been acquired using different platforms. Therefore, the underlying data may not be solid enough.

Rather than trying to find universal rules that apply to all organisms, tissues, and developmental time points, it may be more useful to stick to one organism. If the authors could prove that their hypothesis is correct even in only one specific cell types, this would be an important step. Sometimes taking a small step can be more important than making a giant leap that is not well supported by the data.

This manuscript is interesting and contains good hypotheses but for sure the authors had to use a number of simplifications. Whether this still allows to generalize the conclusions of this manuscript is up for debate.

I am not a mathematician and therefore I am not able to check the mathematical models that were used. Nevertheless, I will assess if the conclusions make sense in real biology.

My conclusion after reading this manuscript is that it is of interest but remains speculative. What I mean by this is that the mathematical predictions would need to be verified by experiments. Although the authors use a number of datasets, they are assembled from different organisms and different developmental timepoints. As the authors mention, the data does not contradict their hypotheses. This is ok but maybe not good enough? Should the data not univocally support the mathematical hypotheses in order that the readers will buy them?

We have now added more explicit statements that address the limitation of the data used. We truly hope that publishing this work can help move us to be able to perform the experiments that are needed to test the model, by raising awareness of this idea.

Here are the main reasons:

1. Line 128: "in general, cells express more short genes than longer genes over multiple developmental time points." Although there may be a trend, I am not entirely convinced of this statement. There seems to be a lot of noise (variation), which may not support this conclusion.

We have applied a statistical test and found that all of the comparisons of short gene expression to longer gene expression are highly significant (p-values < 10^-16, Kolmogorov-Smirnov test). This result has been added to the Figure 3 caption.
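
For readers unfamiliar with the test, the two-sample Kolmogorov-Smirnov statistic is the largest gap between the empirical cumulative distributions of the two samples. The sketch below computes only that statistic, in plain Python, on made-up numbers; it is not the authors' analysis code, and it does not compute a p-value.

```python
def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum absolute
    difference between the two empirical cumulative distributions."""
    a, b = sorted(sample_a), sorted(sample_b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both samples past every value <= x, then compare ECDFs.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


if __name__ == "__main__":
    # Hypothetical expression values for short and long genes.
    short_gene_expr = [5.1, 6.3, 7.0, 8.2, 9.4]
    long_gene_expr = [0.4, 1.1, 1.9, 2.5, 5.5]
    print(ks_statistic(short_gene_expr, long_gene_expr))
```

A statistic near 1 means the two distributions barely overlap; a value near 0 means they are nearly indistinguishable.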

2. Line 223: "While cell cycle duration measurements are not widely available, we instead ask if organisms with longer genes would also take longer to develop." Although this is understandable, I am not sure that this is a correct surrogate. The duration of development must not necessarily be dependent on cell cycle length. Nevertheless, I agree the cell cycle duration measurements are not widely available.

We are sorry for the confusion. We agree that the duration of development does not necessarily depend on cell cycle length, and we have changed the sentence accordingly.

3. The authors use data from different organisms and from different developmental time points. Of course, the idea is that there are universal rules that apply across species. This would be ideal but is there any proof of that? The unwanted side effect is that it becomes really confusing and the authors may compare apples to oranges.

It is correct that comparing patterns across species may be like comparing apples to oranges, though it does support the generality of the conclusion. On the other hand, comparing results within species is a more appropriate comparison, but may lack generality. We have thus included results for both complementary situations where possible. For instance, in Figure 3, we show results across species and 2—figure supplement 4 shows results within species for three different species. However, in general, we cannot say that all patterns are universal, so we have added phrasing in the discussion/future direction section about this.

4. Then there is the issue of splicing and introns. It is not surprising that larger genes contain more introns. To some degree splice isoforms could also explain the differences between stem cells and differentiated cells. Nevertheless, I feel this is a distraction. Therefore, analyzing organisms that contain few introns would be more useful. Budding yeast is such an organism.

We agree that splice isoforms can add to the differences seen among the cells. The model proposed is directed at total gene length, irrespective of whether introns are present or absent. However, we reviewed over 380 genomes in the Ensembl database and identified 3 additional species with genomes that contain few introns (>85% of the genes in their genome are intronless); these happen to all be fungi. We now include these (Ashbya gossypii, Komagataella pastoris, and Yarrowia lipolytica) in our analysis (Figure 9). These genomes do show gene length dependent effects, similar to budding yeast.

5. The pathway analysis of the short and long genes is not thorough enough. In addition, the authors should use random sets of genes (same number) from the intermediate genes, which are the majority of genes.

We now analyse all the genes in the system, including the intermediate genes. We have divided the human genome into 20 gene sets based on gene length as shown in new Figure 8—figure supplement 2. Our conclusions remain the same, that there are strong patterns of change in proportion of pathways annotated to genes depending on the gene length. We have also included additional plots that show how gene length is distributed among the top most gene-length-dependent pathways, shown in author response image 3, and new Figure 8—figure supplement 3.
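
One simple way to form such length-ordered gene sets is to sort genes by length and cut the sorted list into equally sized groups. The sketch below assumes equal-count groups; the response does not state how the 20 group boundaries were chosen, and the function name and toy gene lengths are hypothetical.

```python
def split_by_length(gene_lengths, n_groups=20):
    """Sort genes by length and split them into n_groups consecutive groups
    of (nearly) equal size, from shortest to longest.

    gene_lengths: dict mapping gene name -> length in bp.
    """
    ordered = sorted(gene_lengths.items(), key=lambda kv: kv[1])
    n = len(ordered)
    # Group k covers ordered[bounds[k]:bounds[k + 1]].
    bounds = [round(k * n / n_groups) for k in range(n_groups + 1)]
    return [
        [gene for gene, _ in ordered[bounds[k]:bounds[k + 1]]]
        for k in range(n_groups)
    ]


if __name__ == "__main__":
    # Toy genome: 100 genes with lengths 1000, 2000, ..., 100000 bp.
    toy = {f"gene{i}": 1000 * i for i in range(1, 101)}
    groups = split_by_length(toy)
    print(len(groups), groups[0], groups[-1])
```

Pathway annotations can then be tallied per group to see whether a theme becomes more or less common as gene length increases.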

6. The time it takes to transcribe a gene is not only dependent on its size and a fixed speed. This is an oversimplification.

We agree it is a simplified assumption. However, there are examples where gene length and transcription speed determine the time it takes to transcribe a gene. For instance, the human Dystrophin gene takes 16 hours to transcribe and this does not appear to vary between measurements (https://pubmed.ncbi.nlm.nih.gov/7719347/). We now include this reference in the paper. We also include a relaxation of this assumption in Figure 3—figure supplement 1, where the transcription rate is allowed to vary (following a normal distribution) and this does not affect our conclusions.
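
The relaxed assumption can be sketched by drawing the elongation rate from a normal distribution and asking how often a gene of a given length is fully transcribed within one cell cycle. The mean rate, standard deviation, gene lengths and function name below are illustrative assumptions, not the values used for the supplementary figure.

```python
import random


def fraction_completed(gene_len_kb, cycle_min, mean_rate=1.5, sd_rate=0.3,
                       trials=10_000, seed=0):
    """Fraction of trials in which one transcript of the gene finishes
    within the cell cycle, with rate ~ Normal(mean_rate, sd_rate) kb/min."""
    rng = random.Random(seed)
    done = 0
    for _ in range(trials):
        # Clamp to a tiny positive value so the rate is never zero or negative.
        rate = max(rng.gauss(mean_rate, sd_rate), 1e-6)
        if gene_len_kb / rate <= cycle_min:
            done += 1
    return done / trials


if __name__ == "__main__":
    for gene_kb in (1, 90, 1000):  # hypothetical short, borderline, long gene
        print(gene_kb, fraction_completed(gene_kb, cycle_min=60))
```

In this sketch, rate variability only matters for genes whose transcription time is close to the cycle duration; very short genes almost always finish and very long genes almost never do, so the length-dependent filter is preserved.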

[Editors’ note: what follows is the authors’ response to the third round of review.]

Essential revisions:

Please revise your writing to address Reviewer 1 and 3's critiques.

Reviewer #1:

Authors have mainly addressed my comments.

Figure 9: I wonder whether you can make further statements. For example, if immune cells have short cell cycle, then its enrichment for short genes will make more sense.

We tried this, but found it is difficult to make further statements about immune cells for three main reasons. First, they seem to have diverse cell cycle dynamics. For example, CD8+ T cells can divide as fast as 2 hours, with most averaging 6 hours with very little G1, though this changes after 5 divisions where the cell cycle slows down (PMID: 21079741). On the other hand, macrophages can have a ~19 h cell cycle (PMID: 3622537). Second, it is challenging to collect comprehensive immune system information to support general statements because most papers do not measure or report cell cycle duration directly, but instead report related values (such as mitotic rate or doubling time of cell culture) from which cell cycle duration may or may not be inferable (e.g. the reported values may be an average over a large population of different cell types). Third, most cell cycle studies examine the regenerating hematopoietic system in adults, rather than immune system development, and these likely have different cell cycle dynamics. In the regenerative case, hematopoietic stem cells are slow dividers and speed up as you go down the lineage (PMID:30084312). For instance, in (PMID:30084312), LSK (Lin-Sca-1+c-Kit+) cells, which contain adult bone marrow hematopoietic stem cells (HSCs), have a cell cycle of 47.0 +/-4.9 hrs and give rise to LSK- (Lin-Sca-1+c-Kit-), which have a cell cycle of 23.4+/-2.3 hrs. This also happens for other hematopoietic cell types (PMID:30084312). We are currently working on modeling hematopoiesis to help study the differences between development and regenerative contexts as part of a future project.

Also, might olfactory short genes be related to environmental sensing genes which in turn involve signal transduction pathways also used in fast-growing cells?

The genes contributing to this pattern are almost all olfactory receptors, which are generally very short (e.g. in human, the family of over 500 genes has an average length of around 7000bp, with many in the 1000bp range). These genes are also environmental sensing and signaling genes, though these latter terms are more general and contain many other genes and have a wider gene length distribution, as can be seen in Author response image 4. We clarified this in the manuscript.

Author response image 4

Figure 5 legend: 220 should be 219.

Thank you. We fixed it in the legend.

Reviewer #3:

The authors have invested efforts to address the issues that were raised by the reviewers. The story of this manuscript has not fundamentally changed (which probably was also not expected) and there remain shortcomings. One aspect that I wish would improve is to use more understatement rather than claiming things that the authors cannot prove.

Here are a few examples where the manuscript could be improved:

1. Line 128/129: "found that, in general, short genes have a higher expression level than longer genes within a cell." When I was reading this, I had trouble believing it but in Figure 3B, the authors show mRNA expression. This is though not mentioned in the text and the reader can be misled that this also applies to protein expression. It would be desirable that the authors are precise without using generalizations.

We have now specified that we mean transcript (or sometimes more precisely mRNA) expression and transcript counts throughout the manuscript.

2. Line 179: "second child cell" I believe these are usually referred to as "daughter cells".

It is correct that “daughter cells” is the typical term, however, we chose to use gender neutral terminology in our manuscript.

3. Line 233: "We started by asking if organisms with longer genes would also take longer to develop." I apologize but this question (or hypothesis) does not make a lot of sense to me. There are a million reasons why an organism takes a certain amount of time to develop and this may also be dependent on the environment. The length of the genes is surely only one of many reasons. In their conclusion on line 249, the authors call it a "strong relationship", which is probably an association, and we all know that associations are weak (remember the one about the amount of chocolate consumption and the chance to win the Nobel prize?).

We agree. We have now updated the text to clarify that it is an association.

4. Line 266-278: I am not sure if I get the point here "cell cycle duration and gene expression vary spatially.". Not only spatially but also dependent on age, environment, nutrition, and many more factors.

We have clarified that we do not exclude other factors, just that gene length can be one of the mechanisms which helps set up the spatial boundaries within the organism. We analyzed this factor because we had data for it. We have clarified this point in the revision.

5. In the discussion, the limitations (some of which are mentioned) should be discussed much more honestly.

We have tried to reiterate limitations throughout the manuscript and in the discussion.

Here are specific limitations we have now clarified in the discussion:

1) We did not separate the phases of the cell cycle; we only considered the interphase and M phase.

2) We only consider transcription and the theoretical time taken for it and not other aspects of gene/protein expression.

3) We do not consider the effects of gene regulatory networks, cell-cell interactions or other effects that are known to play a role in development.

4) We limited the model to two modes of division: symmetric (where the cell gives rise to identical cells, e.g. Figure 5A) and asymmetric (where the cell gives rise to a fast and slow cell, e.g. Figure 5C).
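
The four simplifications above can be illustrated with a minimal sketch. This is not the authors' implementation: the names `Cell`, `transcript_completes` and `divide`, the constant transcription rate and the `slow_factor` parameter are all assumptions made here for illustration. It treats the time available for transcription as a single interphase duration, considers only the theoretical time needed to transcribe a gene, ignores regulatory networks and cell-cell interactions, and allows only symmetric or asymmetric division.

```python
from dataclasses import dataclass
from typing import List

# Assumed constant transcription rate in kb per hour (illustrative value,
# not taken from the article).
RATE_KB_PER_H = 90.0


@dataclass
class Cell:
    interphase_h: float  # time available for transcription in one cycle (hours)


def transcript_completes(gene_kb: float, cell: Cell,
                         rate: float = RATE_KB_PER_H) -> bool:
    """Return True if a gene of this length can be fully transcribed
    within one interphase at the assumed constant rate."""
    return gene_kb / rate <= cell.interphase_h


def divide(cell: Cell, mode: str, slow_factor: float = 2.0) -> List[Cell]:
    """Return the two child cells produced by one division.

    'symmetric'  -> two children identical to the parent.
    'asymmetric' -> one child keeps the parent's cycle (fast) and one
                    gets a cycle lengthened by the hypothetical slow_factor.
    """
    if mode == "symmetric":
        return [Cell(cell.interphase_h), Cell(cell.interphase_h)]
    if mode == "asymmetric":
        return [Cell(cell.interphase_h),
                Cell(cell.interphase_h * slow_factor)]
    raise ValueError(f"unknown division mode: {mode!r}")
```

Under these assumed values, a cell with a 1 h interphase completes a 7 kb transcript but not a 900 kb one, and after an asymmetric division only the slower child can complete a 180 kb transcript. The point of the sketch is only that a longer cycle admits longer transcripts, so the numbers themselves should not be read as results.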

6. In several figures (for example Figure 6—figure supplement 2 but there are others), the authors use a representation (word clouds) that are not very helpful. The authors should find a better way to bring across the point that they are trying to make.

We tried many ways to visualize the large amount of data we considered, but each had different advantages and disadvantages, and we could not find a single best representation that also maintained a consistent analysis across all species. For example, representing the data using typical bar and pie charts has the advantage of being familiar, but these plots were unreadable when all the information was displayed. There is a trade-off between the amount of detail that can be shown and the ability to summarize many details in a large data set. We thus decided to include multiple representations in the revision. We kept word clouds to summarize information about gene annotation for thousands of genes per species, and included a pathway enrichment analysis visualization made using the Cytoscape analysis software (Figure 8 supplementary 1), a linear plot with a moving average across bins (Figure 8 supplementary 2) and a matrix plot displaying frequencies of gene annotation themes (Figure 8 supplementary 4) as alternative perspectives on the same data. Because we wanted to incorporate all Gene Ontology biological processes (pathways) with more than two genes for all species, we needed to condense a large data set (13 species, 4,700 to 26,000 genes per species and 2,000 to 12,000 annotated biological pathways per species, which amounts to millions of data points) while keeping our message accessible to readers.

https://doi.org/10.7554/eLife.64951.sa2

Article and author information

Author details

  1. Maria Abou Chakra

    The Donnelly Centre, University of Toronto, Toronto, Canada
    Contribution
    Conceptualization, Data curation, Formal analysis, Validation, Investigation, Visualization, Methodology, Writing - original draft, Writing - review and editing, Development of the cell model
    Competing interests
    No competing interests declared
    ORCID iD: 0000-0002-4895-954X
  2. Ruth Isserlin

    The Donnelly Centre, University of Toronto, Toronto, Canada
    Contribution
    Data curation, Formal analysis, Pathway analysis using GSEA, cytoscape and enrichment map
    Competing interests
    No competing interests declared
  3. Thinh N Tran

    The Donnelly Centre, University of Toronto, Toronto, Canada
    Contribution
    Formal analysis, Single cell analysis of mouse, xenopus and zebrafish data
    Competing interests
    No competing interests declared
  4. Gary D Bader

    The Donnelly Centre, University of Toronto, Toronto, Canada
    Contribution
    Conceptualization, Formal analysis, Supervision, Funding acquisition, Validation, Investigation, Visualization, Writing - original draft, Writing - review and editing
    For correspondence
    gary.bader@utoronto.ca
    Competing interests
    No competing interests declared
    ORCID iD: 0000-0003-0185-8861

Funding

Canada First Research Excellence Fund

  • Gary D Bader

University of Toronto

  • Gary D Bader

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

We thank our reviewers for insightful comments. We thank Zain Patel, Brendan Innes, Derek van der Kooy, Peter Zandstra, Nika Shakiba, Janet Rossant, Eszter Posfai, Maria Shutova, Andras Nagy, and Rudy Winklbauer for thoughtful discussions about this work. This work was funded by the University of Toronto Medicine by Design initiative and by the Canada First Research Excellence Fund.

Senior Editor

  1. Aleksandra M Walczak, École Normale Supérieure, France

Reviewing Editor

  1. Wenying Shou, University College London, United Kingdom

Reviewers

  1. Wenying Shou, University College London, United Kingdom
  2. David M Suter, Ecole Polytechnique Fédérale de Lausanne, Switzerland

Publication history

  1. Received: November 17, 2020
  2. Accepted: July 1, 2021
  3. Accepted Manuscript published: July 2, 2021 (version 1)
  4. Version of Record published: July 14, 2021 (version 2)

Copyright

© 2021, Abou Chakra et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Metrics

  • 1,005
    Page views
  • 119
    Downloads
  • 1
    Citations

Article citation count generated by polling the highest count across the following sources: PubMed Central, Crossref, Scopus.

