Topological signatures in regulatory network enable phenotypic heterogeneity in small cell lung cancer
Abstract
Phenotypic (nongenetic) heterogeneity has significant implications for the development and evolution of organs, organisms, and populations. Recent observations in multiple cancers have unraveled the role of phenotypic heterogeneity in driving metastasis and therapy recalcitrance. However, the origins of such phenotypic heterogeneity are poorly understood in most cancers. Here, we investigate a regulatory network underlying phenotypic heterogeneity in small cell lung cancer, a devastating disease with no molecular targeted therapy. Discrete and continuous dynamical simulations of this network reveal its multistable behavior that can explain coexistence of four experimentally observed phenotypes. Analysis of the network topology uncovers that multistability emerges from two teams of players that mutually inhibit each other, but members of a team activate one another, forming a ‘toggle switch’ between the two teams. Deciphering these topological signatures in cancerrelated regulatory networks can unravel their ‘latent’ design principles and offer a rational approach to characterize phenotypic heterogeneity in a tumor.
Introduction
‘Genotype controls phenotype’ has been a prevalent paradigm across multiple biological contexts (Orgogozo et al., 2015). However, past few decades have revealed in many biological organisms that a fraction of cells in a genetically identical population can behave differently from others, even under nearly identical environmental conditions. This ‘phenotypic heterogeneity’ usually refers to ‘nongenetic’ variations among individual cells in a genetically homogeneous scenario (Grote et al., 2015). In microbial populations, this heterogeneity can manifest as variation in morphologies, growth dynamics, metabolic signatures, and response to antibiotics. It can enable ‘bethedging’, thereby providing cell populations or organisms with higher fitness especially in fluctuating environments (Ackermann, 2015). Another advantage of phenotypic heterogeneity is division of labor and the possibility of cooperation among phenotypically distinct subpopulations (Armbruster et al., 2019). Phenotypic heterogeneity is an evolvable trait and can, in turn, shape evolutionary trajectories at genomic levels too, as seen in bacteria, yeast (Bódi et al., 2017; van Boxtel et al., 2017), and more recently in cancer cell populations (Salgia and Kulkarni, 2018). Therefore, decoding mechanistic underpinnings of emergence of phenotypic heterogeneity remains crucial.
Two generic mechanisms proposed for phenotypic heterogeneity are stochastic gene expression and multistability. Stochastic fluctuations in levels of various biomolecules can trigger the cells to reversibly switch their phenotypes, with important functional implications in tolerance to antibiotics (Balaban et al., 2004), in enabling the latency of HIV (Weinberger et al., 2005), and in longterm resistance to anticancer drugs (Inde and Dixon, 2018; Ramirez et al., 2016; Sharma et al., 2010), highlighting common principles of population behavior across living systems (BenJacob et al., 2012; Jolly et al., 2018). Cancer has been largely thought of as a genetic disease driven by accumulated mutations in key oncogenes and tumor suppressor genes involved in hallmarks of cancer such as increased proliferation and decreased cell death (Hanahan and Weinberg, 2011). Thus, evolution in cancer has been mostly postulated as a Darwinian process of clonal selection (Wooten and Quaranta, 2017), therefore promoting gathering of resourceheavy data through international consortiums such as The Cancer Genome Atlas (TCGA) or International Genome Cancer Consortium (ICGC) primarily at a genomic level. However, increasing evidence has shown that a cancer cell’s phenotype is not solely driven by its genotype (Brock et al., 2009; Hu et al., 2020; Sharma et al., 2019) and that nongenetic mechanisms such as stochastic gene expression and/or multistability can propel metastasis (Jolly and CeliàTerrassa, 2019; Lee et al., 2014) and resistance to various therapies (Meyer et al., 2020; Miura et al., 2018; Mohanty et al., 2020; Pisco et al., 2013; Shaffer et al., 2017; Spencer et al., 2009; Su et al., 2019) – the two major unsolved clinical challenges in cancer. Consistently, no unique mutational signature has been yet identified for metastasis; instead, phenotypic adaptability is considered to be a hallmark of metastasis (CeliàTerrassa and Kang, 2016; Welch and Hurst, 2019). Therefore, it is essential to investigate the emergent dynamics of regulatory networks that can give rise to multistability and consequently accelerate tumor aggressiveness.
Here, we elucidate the dynamics of a regulatory network underlying phenotypic heterogeneity in small cell lung cancer (SCLC) – a highly metastatic cancer that accounts for 15% of all lung cancer cases, has a 7% five year survival rate, has no molecular targeted therapy, and shows rapid relapse to a treatmentrefractory phase after initial response to chemotherapy and radiotherapy that remains to be the standard of care for SCLC for over half a century. Therefore, SCLC has been placed in the category of recalcitrant cancers (Gazdar et al., 2017; Udyavar et al., 2017). Nearly all SCLC cases show genomic inactivation of TP53 and RB1 (George et al., 2015); thus, phenotypic heterogeneity observed in SCLC cannot be explained based on the mutational status (Rudin et al., 2019). Here, we demonstrate that phenotypic heterogeneity in SCLC can be explained by simulating the dynamics of a regulatory network underlying SCLC using both continuous and discrete modeling approaches. The four predominant phenotypes that this network enables can be mapped on to recently identified molecular subtypes of SCLC – ASCL1^{high}/NEUROD1^{low}, ASCL1^{low}/NEUROD1^{high}, ASCL1^{high}/NEUROD1^{high}, and ASCL1^{low}/NEUROD1^{low}. Further analysis of SCLC network topology reveals that multistability in this complex network of 33 nodes and 357 edges emerges due to two teams of players in the network that are operating against one another – players of both teams tend to activate and be activated by members of the same team, while they inhibit and are inhibited by members of the other team, thus forming an effective ‘toggle switch’ between the two teams. Intriguingly, this topological signature trait is specific to this SCLC regulatory network; perturbing the network disrupts this signature and leads to disappearance of these four phenotypes. Furthermore, data analysis shows that ASCL1^{low}/NEUROD1^{low} subtype can be subclassified into ASCL1^{low}/NEUROD1^{low}/YAP1^{low}/POU2F3^{high} and ASCL1^{low}/NEUROD1^{low}/YAP1^{high}/POU2F3^{low} classes. Together, our results unravel the latent design principles of the complex SCLC regulatory network and offer a rational framework to decode the phenotypic heterogeneity in SCLC through the lens of network topology.
Results
Discrete and parameteragnostic continuous simulations of the SCLC network result in similar phenotypic distributions
We simulated the emergent dynamics of a SCLC master regulatory network constructed from gene expression signatures, which contained 33 nodes and 357 edges (Udyavar et al., 2017; Figure 1A). To obtain the steadystate (i.e. phenotypic) distributions corresponding to this complex network, we implemented two complementary approaches – one of them is a discrete parameterindependent Boolean modeling approach using Ising model formalism and an asynchronous update mode (FontClos et al., 2018), and the other is RACIPE (Random Circuit Perturbation) (Huang et al., 2017), a parameteragnostic approach that uses a set of coupled ordinary differential equations (ODEs) with parameters sampled over a wide biologically relevant range.
In the Ising model formalism, each state is represented as a Boolean vector of N elements, where N (=33 here) is the number of nodes in the networks. Based on randomly chosen set of initial conditions (n = 2^{20} here), multiple simulations capture the ensemble of steady states obtained. The discretetime asynchronous network dynamics considered here is simulated via
where s_{i} (t) denotes the expression levels of node i at time t. s_{i} = +1 means that the node is highly expressed (i.e. ‘ON’ state), otherwise s_{i} = −1. M depicts the interaction matrix of the network. M_{ij} = 1 indicates that node i promotes the expression of node j, M_{ij }= −1 implies that node i inhibits the expression of node j, M_{ij} = 0 implies no regulatory interaction from node i to node j.
For this network, we obtained 10 unique steady states, among which four of them (X1–X4) had a frequency of 24.3–25.5% each, while the remaining six states (X5–X10) had frequency less than 0.1% (Figure 1B). To characterize these steady states further, we calculate the frustration of each state, which, using the analogy of asymmetric spin glass on a graph, is defined as the fraction of network edges that are frustrated in that state. An edge from node i to node j is said to be frustrated in the state {s_{i}} if M_{ij}s_{i}s_{j} < 0, i.e. if the values of node i and node j do not follow the set of regulatory interactions between them. As expected, states with high frequency (~24.3–25.5% each) had low frustration (0.14) and those with lower frequency (<0.1% each) have higher frustration (0.37) (Supplementary file 1a, Tripathi et al., 2020b).
To understand the role of network topology of the SCLC network in enabling these steady states, we perturbed the network topology by picking up random pairs of edges and swapping them (Figure 1C,i,ii). This process is repeated for many iterations to ensure that the resultant network (referred to as the random network(s) henceforth), is very different from the original SCLC network (referred to as the ‘wildtype’ [WT] network henceforth). We created 1000 such random networks and simulated them using the abovementioned Ising model formalism. Importantly, a majority of these networks gave rise to a much larger number (i.e. 10^{4}–10^{6}) of steady states as compared to the WT network (Figure 1C,iii). Furthermore, none of these steadystate distributions obtained from random networks had any overlap with the steadystate distribution of WT network, suggesting that the steadystate distribution of WT network is unique to its topology.
To gain further confidence in the role of network topology in enabling the steadystate distribution of WT network, we used RACIPE that generates an ensemble of kinetic models (set of coupled ODEs) corresponding to a network topology, each with a different set of parameters sampled uniformly from a wide but biologically relevant parameter space. Each kinetic model is simulated using an ODE solver to obtain different steady states obtainable from different initial conditions. We generated 10^{6} models and simulated each of them with 1000 initial conditions, from which we obtained the ensemble of steady states and discretized them (see Materials and methods) to obtain a distribution that can be compared with the output obtained from a Boolean modeling framework. RACIPE enables a much larger number of states as compared to Boolean, which is not surprising given its continuous nature. The net frequency of the X1–X4 states obtained via RACIPE is 22%. However, all of the top 20 states obtained by RACIPE were close enough to one or more of the top four Boolean states (X1–X4); there were a maximum of 2 of 33 nodes whose values were different (0 in Boolean state and 1 in RACIPE state or vice versa). These top 20 states contribute to a total of 54% frequency of all steadystates identified by RACIPE (Supplementary file 1b). This minor difference observed in state compositions prompted us to a more quantitative comparison. We calculated the net frequency of states that have only one or two node values different than that seen in corresponding Boolean state (X1–X4). The total frequency of RACIPE state, together with its ‘closeenough’ states, was similar to that of corresponding Boolean states (Figure 1D,i), thus endorsing a consistency in gene expression programs identified by Boolean and RACIPE models. This similarity between RACIPE and Boolean outputs reinforces that network topology has a crucial role in enabling the resultant steadystate distributions of the SCLC network.
Next, we investigated the resilience of the WT SCLC network by generating two types of ‘mutant’ networks. Each ‘mutant’ network has an edge either deleted or its sign reversed (from activation to inhibition and vice versa). Thus, we had 2*357 = 714 such ‘mutant’ or ‘perturbed’ networks for which we calculated the steadystate distributions using the Ising model (2^{20} initial conditions). To quantify the degree of similarity between the steadystate distributions of WT SCLC network and that of a ‘perturbed’ network, we used an informationtheory metric known as Jensen–Shannon divergence (JSD), which ranges from 0 to 1 (Lin, 1991). JSD = 0 indicates identical distributions, and JSD = 1 indicates completely nonoverlapping ones. As a control, JSD = 1 was obtained for all 1000 random networks with respect to the WT network, given that there is no overlap of states. Intriguingly, we observed that most singleedge mutations (deletions or sign reversal) had a negligible effect on dynamics of the WT SCLC network (JSD < 0.01 for 678 of 714 perturbed networks). Also, the JSD values of mutated networks with either change correlated well (Figure 1D,ii), suggesting a consistent contribution of the edge mutated in enabling steadystate distributions of the SCLC network. Importantly, among the 36 edge perturbations that led to a higher JSD, i.e. disturbed the frequency distribution, 24 of them were incoming edges for NEUROD1 (Supplementary file 1c). This observation suggests that NEUROD1 can play a key role in maintaining the robustness of SCLC network dynamics and subsequently in enabling phenotypic heterogeneity. Put together, these results revealed that the SCLC network is quite resilient to singleedge perturbations and suggests the possibility of multiple layers of redundancy or reinforcement in the SCLC network to enable this specific steadystate distribution.
Similarity between simulation and SCLC experimental data based on node correlations
To investigate the degree of resilience or reinforcement within the SCLC network, we performed pairwise correlation between all 33 nodes in the network. Strikingly, based on the top four states obtained from Boolean simulation of the WT network, we observed that 32 of 33 nodes formed two groups of nodes, such that within a group, the nodes were positively correlated, but across the groups, each pairwise correlation was negative (all nodes except NEUROD1; Figure 2A,i). Therefore, the network seemed to form two competing ‘teams’ or groups of players. Interestingly, one of the groups (top triangle in Figure 2A,i; called as group A henceforth) contained ASCL1, INSM1, FOXA1, and FOXA2, all of which are implicated in neuroendocrine phenotype(s) in SCLC. On the other hand, the other group (lower triangle in Figure 2A,i; called as group B henceforth) contained REST, SMAD3, and ZEB1 are suggested markers/regulators of nonneuroendocrine (NE) or mesenchymal subtype(s) of SCLC (Borromeo et al., 2016; Tlemsani et al., 2020).
We defined a metric J to quantify the cumulative strength of relationship among different nodes in the network, which quantifies the degree of similarity in the two ‘teams’ and mutual competition among them.
Here, P denotes the correlation matrix, and P_{ij} denotes Pearson’s correlation coefficient between node i and node j; the key to the indices is given in table shown in Figure 2. For correlation matrix obtained for Boolean simulations for the WT network, the coefficient values are either 1 or −1, and the value of J metric is calculated to be 496. However, for the 1000 random networks, the values of J are quite small; the average value of J across 1000 networks is 11 (Figure 2A,ii), thus suggesting that this metric can be a quantitative method to distinguish between WT and random networks and that this observed feature of two ‘teams’ of players is unique to the WT topology.
Similar trends of two ‘teams’ were seen in RACIPE simulation data (Figure 2B,i), albeit the value of J was lower (=373.05), largely because the values of Pearson’s correlation coefficients for all (nondiscretized) RACIPE pairwise comparisons will lie between −1 and +1 on a spectrum, instead of discretized values of −1 and +1 seen for Boolean simulations. For a fairer comparison, we recalculated the value of J metric for 1000 random networks now using continuous values of P_{ij} (see Materials and methods). The mean value of J obtained from these 1000 networks was 14.16, a value much smaller than those corresponding to simulations of the WT network (p<0.01 for twotailed ztest) (Figure 2B,ii). Furthermore, correlation matrices obtained from two experimental datasets – CCLE (n = 52), GSE 73160 (n = 63) – also implied the existence of these two ‘teams’, an observation that was strengthened by their value of J metric being significantly larger (p<0.01 for twotailed ztest) than that of 1000 random networks considered here (Figure 2B,ii).
Put together, this excellent agreement among Boolean and RACIPE simulations and experimental data through correlation matrix underscores that the steady states obtained via simulations can be mapped on specific biological phenotypes seen in experimental data and endorses the idea about two ‘teams’ of molecular players that may inhibit each other to enable these steady states.
SCLC network topological signatures underlie the emergence of biological phenotypes
The remarkable agreement among Boolean and RACIPE simulations, and the endorsement of the correlation patterns seen in simulations with the experimental data, leads us to the hypothesis that the underlying network topology is fundamental to the existence of these distinct phenotypes. To test this hypothesis, we formulated a metric to quantify the topological influence of one node on another, named as influence matrix (Inf). A path between two nodes (A and B) in a network of length l is defined as a series of connected edges starting at node A and ending at node B. While the interaction matrix (M) only quantifies the effect of nodes on each other for a path length of one, the influence matrix considers path lengths of up to ${l}_{max}$ to calculate this effect. Each element of the influence matrix, Inf_{ij}, is a number between −1 and 1, calculated as a weighted sum of net effect of all paths of length less than ${l}_{max}$ from ith node to jth node (see Materials and methods). Inf_{ij} > 0 indicates net activation, Inf_{ij} < 0 suggests net inhibition, and Inf_{ij} = 0 implies no net effect.
The influence matrix for path length ${l}_{max}$ = 10 obtained for the WT network was very similar to the corresponding correlation matrix for RACIPE (Figure 3A,i, Figure 3—figure supplement 1) This similarity elucidates that the players in the two ‘teams’ are highly likely to effectively activate one another, but players belonging to different teams are likely to inhibit each other. To quantify this similarity, we calculated two metrics – R1 and R2 – bearing in mind that while the correlation matrix is symmetric, the influence matrix need not be. R1 is the correlation coefficient obtained from a scatter plot of regressing correlation coefficients between node i and j (P_{ij}) and corresponding influence matrix element (Inf_{ij}). R2 is defined the same as R1 but for Inf_{ji} instead of Inf_{ij}. Both R1 and R2 values are high (>0.85) and statistically significant (Figure 3A,ii) and show a saturating trend for increasing path length (Figure 3—figure supplement 2A), thus we chose ${l}_{max}$ = 10 for further analysis from influence matrix. The high values of R1 and R2, together with high coefficient of determination (Figure 3A,ii), reinforce that the influence matrix of WT SCLC network is quite similar to corresponding correlation matrix. Similar trends seen for correlation matrix from CCLE endorsed that the influence matrix can not only capture the patterns for steady states, but also can reflect biological phenotypes seen in SCLC (Figure 3—figure supplement 2B).
Next, we checked the specificity of the influence matrix to the WT network. We next regressed the elements of RACIPE correlation matrix of WT SCLC network against corresponding elements for influence matrix of one of the 1000 random networks we had simulated using asynchronous Boolean model formalism. Thus, we calculated 1000 values of R1 and R2 for all the random networks; none of them were anywhere close to R1 and R2 of the WT SCLC network; many of these correlations were not even statistically significant (Figure 3B,i). Similar results were seen in the influence matrix from CCLE (Figure 3B,ii). Thus, we concluded that this influence matrix is unique to the WT SCLC network.
Put together, the influence matrix can be a more meaningful readout of interaction matrix in terms of effect of node i on node j. The influence matrix provides further credence to the idea that the SCLC network contains two ‘teams’ of players that are mutually inhibitory and selfactivatory. To strengthen the concept of these two ‘teams’ of genes working together in a more qualitative sense, we constructed two reduced models from ‘effective’ edges calculated both from the interaction matrix and from the influence matrix (for path length = 10) (see Materials and methods) (Figure 3C,ii; Figure 3—figure supplement 2C,i). We performed asynchronous Boolean simulations for these reduced models using the Ising formalism. To compare the output of these reduced models with that of the WT SCLC network, we transformed the most frequent states of the WT SCLC network (shown in Figure 1B) into a representation using the same nodes as that included in the reduced networks. We observed that both reduced models – one generated from interaction matrix and another from influence matrix – resulted in the same four states as that of the WT (Figure 3C,ii, Figure 3—figure supplement 2C,ii). However, the network obtained using the influence matrix was closer to WT in terms of steadystate frequency distribution as compared to the one obtained using the interaction matrix. These results suggest that influence matrix is a better representation of network topology as compared to the interaction matrix.
Put together, we defined an influence matrix to decode the relative strengths of bidirectional regulatory interactions between every possible pair of nodes in the network. Analysis based on the influence matrix established the concept of two ‘teams’ of players that are effectively inhibiting each other and activating themselves, thereby forming an ‘effective’ selfactivating toggle switch (SATS). Such SATS have been shown to be multistable and tend to underlie phenotypic plasticity and heterogeneity in multiple cellfate decisions (Guantes and Poyatos, 2008; Lu et al., 2013; Sahoo et al., 2020; Zhou and Huang, 2011). This ‘teaming up’ can potentially explain (1) why singleedge perturbations in the WT SCLC network are rarely disruptive in terms of steadystate distributions (because as long as an ‘effective’ mutual inhibition between the two teams and ‘effective’ selfactivation in the teams is maintained, the phenotypes are likely to be robust attractors) and (2) why despite such dense and complicated (33 nodes, 357 edges) network, we obtain only four steady states (because the ‘latent’ network topology is fundamentally simplistic).
Classifying experimental SCLC data based on ASCL1 and NEUROD1
The first classification of SCLC was based on differences in cell morphology: a ‘classic’ type of cells that grow as spherical aggregates of floating cells and another relatively less prevalent ‘variant’ type which grew as tightly adherent monolayer or as loosely adherent aggregates. The ‘classic’ type had relatively higher levels of NE markers such as ASCL1 (Gazdar et al., 1985). Next, SCLCs were categorized into two categories based on relative levels of ASCL1 and NEUROD1, both of which are key developmental nodes for pulmonary NE cells (Poirier et al., 2013). Functional contributions of ASCL1 and/or NEUROD1 in SCLC have since been extensively explored (Borromeo et al., 2016; Ikematsu et al., 2020; Osborne et al., 2013). Further characterization of SCLC proposed three different phenotypes – ASCL1^{high}/NEUROD1^{low}, ASCL1^{low}/NEUROD1^{high}, and a ‘double negative’ ASCL1^{low}/NEUROD1^{low} (Poirier et al., 2015). In addition, a ‘double positive’ ASCL1^{high}/NEUROD1^{high} state of SCLC was recently identified (Baine et al., 2020; Simpson et al., 2020). In a recent classification (Rudin et al., 2019), ASCL1^{high}/NEUROD1^{low} was labeled as SCLCA subtype, ASCL1^{low}/NEUROD1^{low} was categorized as nonNE (and subcategorized to SCLCP and SCLCY subtypes), and ASCL1^{low} /NEUROD1^{high} was marked as SCLCN (also referred to as NEV1 Wooten et al., 2019; Figure 4A,i). Besides, the NEV2 subtype expressed ASCL1 and had relatively higher levels of NEUROD1 as that seen in ASCL1^{high}/NEUROD1^{low} subtype; thus, NEV2 can be conjectured to be ASCL1^{high} /NEUROD1^{high}.
Intriguingly, our simulations (Figure 1A,ii) are able to capture all these four phenotypes – these were the four most dominant steady states that Boolean and RACIPE results converged to. Thus, we were able to classify the Boolean steady states into four phenotypes – A+N+, A+N, AN+, and AN (where A represents ASCL1, N represents NEUROD1) (Figure 4A,ii). Hierarchical clustering of experimental data – CCLE (Figure 4B,i) and GSE73160 (Figure 4—figure supplement 1A,i) – using ASCL1 and NEUROD1 levels depict a clear segregation into four large groups at an early level. These four clusters were indeed A+N+, A+N, AN+, and AN (Figure 4B,ii, Figure 4—figure supplement 1A, ii). Alternate clustering methods such as Kmeans revealed K = 4 as the optimal number of clusters, based on the average Silhouette width analysis (Figure 4C,i, Figure 4—figure supplement 1B,i), thus strengthening the results obtained from Boolean and RACIPE analysis. These four clusters obtained by Kmeans also split into A+N+, A+N, AN+, and AN (Figure 4C,ii, Figure 4—figure supplement 1B,ii) reminiscent of observations that a classification system employing only ASCL1 and NEUROD1 may be sufficient to describe these four biologically meaningful phenotypes (Borromeo et al., 2016).
Based on the ‘groups’ we identified from influence matrix, ASCL1 was found to be a part of group A while NEUROD1 did not belong to any group (Figure 3). Therefore, we examined whether the expression levels of genes in group A were high in samples identified as A+N and A+N+. Indeed, in CCLE and GSE73160, we observed that A+N and A+N+ subgroups had high levels of group A and low levels of group B; NEUROD1 did not show any groupspecific pattern (Figure 4—figure supplement 2). This observation endorses that states identified by mechanistic modeling are recapitulated in vitro.
Given that our choice of using ASCL1 and NEUROD1 to cluster SCLC cell lines was based on experimental data, we conducted an unbiased analysis of using any 2 of 33 possible nodes for characterization (i.e. ^{33}C_{2} = 528 combinations) in their ability to define four SCLC phenotypes. Only 140 such combinations yielded four as the optimal number of clustering, and the ASCL1NEUROD1 pair showed up in the top 5 of the node pairs with maximal resolvability (measured via average Silhouette width) in defining these four phenotypes. Thus, the ASCL1 and NEUORD1 pair featured among the top 1% (in top 5 of 528 possibilities) of gene pairs in characterizing SCLC heterogeneity (Figure 4—figure supplement 3).
Extended subtype classification of SCLC into five phenotypes based on POU2F3 and YAP1
Besides ASCL1 and NEUROD1, YAP1 and POU2F3 have been reported as important regulators of SCLC, particularly for the nonNE phenotype (Baine et al., 2020; Huang et al., 2018; Ito et al., 2016; McColl et al., 2017; Song et al., 2020). Thus, we investigated the role of these players in defining SCLC phenotypes. We performed hierarchical clustering over the four genes of interest (ASCL1, NEUROD1, YAP1, POU2F3) on the CCLE dataset (Figure 5A,i) and GSE73160 (Figure 5—figure supplement 1A, i). We observed five clusters, and when projected on the ASCL1NEUROD1 axis, two of these five clusters were both present in the third quadrant (AN), suggesting that nonNE phenotype (ASCL1^{low}/NEUROD1^{low}) may be divided into two subclusters based on levels of YAP1 and POU2F3 (Figure 5A,ii, Figure 5—figure supplement 1A,ii).
To decipher the five clusters better, we first performed Kmeans clustering for K = 4 for CCLE using the levels of ASCL1, NEUROD1, YAP1, and POU2F3 and observed the four clusters: SCLCA1 (A+; n = 28), SCLCN (N+; n = 12) and two nonNE phenotypes: SCLCY (Y+; n = 8) and SCLCP (P+; n = 4) (Figure 5B,i). When Kmeans clustering for K = 5 was performed on the same dataset using the four abovementioned genes, SCLCY (Y+; n = 8) and SCLCP (P+; n = 4) had identical composition as the corresponding clusters for K = 4 (Supplementary file 2a; Figure 5B,ii). Among 12 cell lines classified as SCLCN (N+), 11 of them were still classified as SCLCN with the exception of CORL279, which was classified as A+N+. Interestingly, the SCLCA (A+; n = 28) cluster obtained for K = 4 broke into two subclusters: 14 cell lines were classified as ‘NEV2’ (A+N+) and remaining 14 as SCLCA. Similar trends were seen for hierarchical clustering, indicating the concurrence in assigning these different cell lines to SCLC phenotypes (Supplementary file 2a; Figure 5B,ii). This categorization suggests that the ‘double positive’ phenotype may be an intermediate one on the spectrum of ASCL1^{high}/NEUROD1^{low} (SCLCA) to ASCL1^{low}/NEUROD1^{high} (SCLCN) one, from the perspective of these four genes of interest. We compared the classification of the SCLC cell lines done using the levels of these four genes of interest with that done based on levels of a larger set of genes (Wooten et al., 2019) and obtained an overall good consistency, with a few (n = 5) exceptions where cell lines belonging to SCLCA1 according to our clustering were found to belong to NEv2 as per the previous analysis (Supplementary file 2a). Performing the same procedure on another cohort of SCLC cell lines (GSE73160) – that showed consistent trends as for CCLE cohort (Figure 5—figure supplement 1B) – demonstrated that for K = 4 vs. K = 5, some SCLCN cell lines were classified as the ‘double positive’ (Supplementary file 2b), endorsing that cells may switch to this hybrid phenotype from SCLCA1 and/or SCLCN, similar to observations in hybrid E/M phenotypes (Tripathi et al., 2020a).
Projecting the levels of these four genes of interest in SCLC cell lines on UMAP (Uniform Manifold Approximation and Projection) plots, we observed a neat splitting into five clusters – ASCL1^{high}/NEUROD1^{low}/YAP1^{low}/POU2F3^{low}, ASCL1^{low}/NEUROD1^{high}/YAP1^{low}/POU2F3^{low}, ASCL1^{high}/NEUROD1^{high}/YAP1^{low}/POU2F3^{low }– and two subtypes of ASCL1^{low}/NEUROD1^{low} – YAP1^{high}/POU2F3^{low} and YAP1^{low}/POU2F3^{high} (Figure 5C, Figure 5—figure supplement 1C). Because UMAP does not require the number of clusters as an a priori input (as Kmeans clustering does), the emergence of these five clusters in this fourdimensional space is an independent validation of the robustness of these signatures based on the (relative) levels of ASCL1, NEUROD1, YAP1, and POU2F3. Put together, these results indicate that measuring the levels of these four molecules can be sufficient to quantify the degree of phenotypic heterogeneity, thereby corroborating our current status of understanding of functional relevance of these four key molecules in SCLC.
Discussion
The focus on existence and dynamics of nongenetic, i.e., phenotypic, heterogeneity in cancer is a relatively recent one (Inde and Dixon, 2018; Jolly et al., 2018), given a reluctantly growing realization that extensive efforts and resources to map the genomic landscape solely has had limited success in decoding any fundamental organizing principles of cancer progression, especially metastasis (Brock and Huang, 2017). While the existence of phenotypic heterogeneity in cancer cells is, by no means, a surprise, given its ubiquity in embryonic development (Huang, 2009), the extent of our lack of understanding of organizing principles enabling it is appalling, relative to our current understanding of causes and consequences of phenotypic heterogeneity in ‘simpler’ biological organisms (Ackermann, 2015; van Boxtel et al., 2017; Varahan et al., 2019).
Here, we demonstrate that the phenotypic heterogeneity in SCLC can emerge from multistable dynamics of underlying regulatory network. Multistability in this network has been reported based on Boolean (discrete, parameterindependent) simulations earlier (Tripathi et al., 2020b; Udyavar et al., 2017). Here, RACIPE – a continuous, parameteragnostic approach – revealed that this multistability can be observed over many parametric combinations and reveal the predominance of the four states as seen in Boolean modeling, thereby elucidating the crucial role of network topology in facilitating multistability. These observations corroborate earlier work on comparative analysis of phenotypic (steadystate) distributions obtained from RACIPE and Boolean modeling for networks underlying epithelial–mesenchymal plasticity (EMP) (Hari et al., 2020).
The four phenotypes in SCLC reported here – ASCL1^{high}/NEUROD1^{low}, ASCL1^{low}/NEUROD1^{high}, ASCL1^{high}/NEUROD1^{low}, and ASCL1^{high}/NEUROD1^{high} – have been seen previously using varied Boolean modeling strategies and different networks (Tripathi et al., 2020b; Udyavar et al., 2017; Wooten et al., 2019). Our focus is not to compare comprehensive gene expression profiles that have been mapped on to these phenotypes or to use statistical methods on transcriptomic data to infer modules of coexpressing genes, as has been extensively done before for SCLC. Rather, we use a complementary approach – RACIPE – to ascertain whether the phenotypes obtained via Boolean approaches are seen in ODE models also, and consequently identify design principles of the complex SCLC network, which constrain the number of steady states obtained. While RACIPE identifies many more steady states, this feature can be attributed to its continuous mode of simulation, a difference we also noted earlier in EMP networks (Hari et al., 2020). The biological significance, if any, of these additional states needs further investigation. Furthermore, both Boolean and RACIPE approaches are implemented here only on transcriptional networks, whereas additional mechanisms such as epigenetics may alter frequency of steady states by effectively altering network topology (Somarelli et al., 2016).
Moreover, here, we quantified the role of network topology by defining an influence matrix that revealed that the SCLC network consists of two ‘teams’ of players that mutually inhibit each other and selfactivate themselves. This topological signature helps understand why despite such size, density, and complexity of the network, it led to a limited number of steady states. Interestingly, this property of the network was largely intact upon most singleedge perturbations, indicating enough redundancy and robustness in underlying network design. We found this property of the network is unique to this topology: not only was it lost upon randomization, but also none of the steady states obtained from random networks matched with that of the WT network topology. Importantly, we observed striking similarity between the influence matrix and the correlation matrix that included pairwise correlations among all gene expression values, endorsing that the influence matrix can provide an approximate readout of the patterns of steady states that can be expected in a network and that can be deciphered without having to simulate the dynamics of the network. However, the scalability or universality of influence matrix remains to be investigated cautiously.
Importantly, the patterns of steady states seen in influence and correlation matrices matched well with those seen in experimental data. This observation evoked confidence in biological insights that can be gained from the simulations of SCLC network. First, ASCL1 and NEUROD1, the two master regulators extensively reported in SCLC (Jiang et al., 2009; Osborne et al., 2013; Poirier et al., 2013), did not lie in the same ‘team’, consistent with recent reports that they regulate distinct set of genes and that ASCL1^{high} and NEUROD1^{high} cell lines can have quite distinct chromatin landscapes too (Borromeo et al., 2016). Second, clustering based results raised the possibility that some SCLC cell lines may represent a ASCL1^{high}/NEUROD1^{high} state, which can be perceived as an ‘intermediate’ state between the canonical SCLCA (ASCL1^{high}/NEUROD1^{low}) and SCLCN (ASCL1^{low}/NEUROD1^{high}) ones. Similar ‘intermediate’ states are witnessed in diverse biological contexts (Duddu et al., 2020; Jolly et al., 2015; Tripathi et al., 2020c; Yu et al., 2017). Profiling of circulating tumor cellderived explant models (CDXs) supports such possibility in SCLC (Simpson et al., 2020). Third, multistability often leads to phenotypic switching and coexistence of multiple states (heterogeneity). Mouse models of SCLC reveal phenotypically distinct cells with a NE or nonNE marker profile that cooperated to enhance their metastatic potential (Calbo et al., 2011), offering selective advantage to the tumor. Recent singlecell RNAseq measurements of CDXs and patientderived circulating tumor cells demonstrated intratumor heterogeneity (ITH). Particularly, ITH was found to be higher in drugresistant CDXs relative to drugsensitive ones and was found to have a transcriptional rather than a genomic basis (Stewart et al., 2020). Thus, it is expected that most SCLC tumors contain cells in these phenotypes with varying frequencies, i.e., one SCLC tumor can contain multiple subtypes within it (Wooten et al., 2019), as has been postulated for breast cancer (Yeo and Guan, 2017). Their relative frequency can be altered by adaptations at genetic and/or nongenetic levels (Mollaoglu et al., 2017; Udyavar et al., 2017).
A limitation of our simulations is the inability to account for the two ‘nonNE’ subtypes, SCLCY (YAP1^{high}) and SCLCP (POU2F3^{high}), because the SCLC regulatory network simulated (Udyavar et al., 2017) does not contain YAP1 or POU2F3 as nodes. Thus, we performed clustering using ASCL1, NEUROD1, YAP1, and POU2F3 for SCLC cell lines and observed YAP1 and POU2F3 expression to be largely mutually exclusive within the ‘double negative’ ASCL1^{low}/NEUROD1^{low} cluster. These results are consistent with recent analysis of SCLC cell lines (Ito et al., 2016), CDXs (Pearsall et al., 2020), and patient samples (Baine et al., 2020). Furthermore, YAP1 negatively correlate with INSM1 (McColl et al., 2017), a crucial regulator of NE differentiation in SCLC (Fujino et al., 2015), which was found to be ‘ON’ in our A+N and A+N+ states. Interestingly, AN+ (ASCL1^{low}/NEUROD1^{high}) is classified as NEvariant (NEv1) (Wooten et al., 2019). Concomitantly, NEUROD1 has been associated with ‘mesenchymal’ (i.e., nonNE) markers (Simpson et al., 2020) and features such as migration (Osborne et al., 2013). Thus, in terms of an ‘NE score’, the phenotypes can be possibly expected to lie on a spectrum: ASCL1^{high}/NEUROD1^{low }≈ ASCL1^{high}/NEUROD1^{high} > ASCL1^{low}/NEUROD1^{high} > ASCL1^{low}/NEUROD1^{low}.
Our results highlight an important step in decoding the design principles of underlying regulatory network for phenotypic plasticity and heterogeneity in aggressive cancers such as SCLC. SCLC has no targeted therapy available till date (Subbiah et al., 2020), a limitation that is expected to be overcome by recent surge in highthroughput experimental data collection for SCLC and computational approaches to identify SCLC subtypes for clinical action (Salgia et al., 2018; Stewart et al., 2020; Tlemsani et al., 2020; Udyavar et al., 2017; Wooten et al., 2019). An abysmal, if any, correlation between mutational profiles and distinct SCLC subtypes (George et al., 2015), and multistability in SCLC network, argue for nongenetic causes of phenotypic heterogeneity. We have identified the topological signatures in underlying SCLC network that can give rise to multistability – two ‘teams’ of players with mutually inhibitory and selfactivatory – and can result in a limited (n = 4) number of biologically relevant phenotypes, despite its overwhelming complexity (37 nodes, 357 edges). Such ‘motifs’ and consequent multistability appear to be the hallmarks of regulatory networks underlying cell differentiation and phenotypic plasticity (Burda et al., 2011; Shiraishi et al., 2010; Zhou and Huang, 2011). Given that network topology can hold an impressive amount of information about both the steadystate and dynamical behaviors, as demonstrated for multiple gene regulatory networks (Alon, 2007; Feng et al., 2016; Ma et al., 2009; Santolini and Barabási, 2018; Gómez Tejeda Zañudo et al., 2017), this work calls for an unprecedented approach to restrict the emergence of phenotypic heterogeneity in cancers – i.e., by ‘attacking’ (via tools such as CRISPR) the salient features of network topology.
Materials and methods
Experimental data
Request a detailed protocolGene expression profiles of 52 SCLC cell lines were downloaded from Broad Institute’s CCLE expression data. Data for GSE73160 was downloaded from NCBI website.
Normalization of experimental data
Request a detailed protocolAll data were normalized as per zscore normalization. The normalization was done across all the sample points (stable states or experimental data points) for a particular variable. The formula for calculating the zscore for jth observation for the ith variable is:
where μ is the mean, and σ is the standard deviation. For the purposes of the code, we used the preprocessing.scale function of the sklearn package in python.
Correlation analysis
Request a detailed protocolPearson correlation coefficient is a statistic that measures the linear correlation between two variables of interest. It depicts how strongly two variables are linearly related. It has a value between +1 and −1. A value of +1 is the total positive linear correlation, 0 is no linear correlation, and −1 is a total negative linear correlation. This static is applied to the experimental data to get the sense of how closely the expression levels of several nodes are linearly related. For the purposes of the code, the scipy.stats.pearsonr function of the scipy package in python was used. Along with the Pearson correlation coefficient, the function also returns twotail pvalue, which tells us whether two variables are statistically significant or not.
Uniform Manifold Approximation and Projection
Request a detailed protocolUMAP is a dimension reduction technique that can be used for visualization (similar to tSNE) and is considered an improvement over previous dimension reduction techniques. UMAP was applied on zscore normalized data. For the purposes of the code, the UMAP.fit_transform function of the umap package in python was used. The n_neighbours parameter was set to 4 (after extensive trials over a broad range of values) and n_epochs set to 1000. The rest of the parameters was set to default.
Hierarchical clustering
Request a detailed protocolHierarchical clustering was applied to the zscore normalized data. For purposes of code, the cluster.hierarchy.linkage function of the scipy package in python was used with the method set to ward. The dendrograms were plotted using the same, with the color grouping decided by setting the threshold values to desirable values.
Kmeans clustering
Request a detailed protocolKmeans algorithm was applied to the zscore normalized data. For purposes of code, the cluster.KMeans function of the sklearn package in python was used with the init parameter set to random, the n_int set to 20, and max_iter to 300.
Silhouette score analysis
Request a detailed protocolSilhouette score is a way to determine the goodness of fit of a particular kvalue in kmeans clustering. It depicts how close the points in a cluster are to other points in the same cluster versus the other clusters. A higher score shows welldifferentiated clustering. For the purposes of the code, we used metrics.silhouette_score function from the sklearn package in python, with the identified clusters being taken from fit_predict function applied to clusters obtained from the kmeans algorithm outlined above.
Influence matrix
Request a detailed protocolThe influence matrix is a formulation similar to that of the interaction matrix ($M)$), but gives a more comprehensive estimation of the effect of each on others in the network. For a given pair of nodes A and B, a path is a collection of serially connected edges originating from A and ending at B. The pathlength is then defined as the number of edges in such a path. Therefore, the interaction matrix depicts the influence of each node on the other for a pathlength of one. Consider the matrix ${M}^{2}$. The elements of the matrix can be written as:
Which is the influence of ith node on the jth node when considering all paths of length 2. Similarly, each element of the matrix ${M}^{l}$ depicts the influence of nodes over pathlengths of $l$. We then calculate the influence matrix for a pathlength of ${l}_{max}$ by combining the matrices ${M}^{l}$ for $l<{l}_{max}$ as follows:
${M}_{max}$, obtained by setting all nonzero elements of the interaction matrix to 1, represents the magnitude of the maximum possible interaction for the given network topology and hence is used as the normalizing factor. The division ($\frac{M}{{M}_{max}}$) represents elementwise division and the range of the elements of the resultant matrix is restricted between −1 and 1. The summation is divided by ${l}_{max}$ to again restrict the range of the elements between −1 and 1.
Conversion from discrete to continuous formulation of correlation matrix
Request a detailed protocolExperimental datasets gives us Pearson correlation coefficients which range from −1 to 1 (continuous values) while Boolean steadystate distribution (of both WT SCLC network and random networks) gives us correlation coefficients of either −1 or 1 (discrete values). Thus, we cannot formulate the J metric to be the same for all the cases. To make a fair comparison, we have converted the discrete values of the correlation coefficients to continuous values in the following way:
The Pearson correlation matrix of the steadystate distribution resulting from the Ising model is calculated (as mentioned in Correlation analysis section)
The values of the correlation matrix are then replaced with a random number drawn from a uniform distribution (using the numpy.random.uniform from numpy package in Python 3.6) ranging between α and one for positive values and −1 and α for negative values, where α = 0.01. The choice of α stems from the fact that all the correlation coefficients are statistically significant, p<0.05. Therefore, the replacements must be statistically significant as well.
For each Boolean network (WT SCLC network or Random network), 1000 such continuous correlation matrices are generated.
J value is calculated for all the 1000 continuous correlation matrices obtained for a given network, and a mean of those 1000 values is taken as the J value for a given network.
The value of J obtained for random networks (n = 1000) is plotted as a histogram.
Similar continuous formulation of correlation matrix was done for RACIPE correlation matrix as well, that is why the J value for RACIPE seen in Figure 2B,ii (J ~ 254) is smaller than that for RACIPE seen in Figure 2A,ii (J ~ 373), and similar to that for Boolean case as seen in Figure 2B,ii (J ~ 259).
RACIPE and Boolean simulations
Request a detailed protocolPlease see details of Boolean and RACIPE simulations in Supplementary Information.
Effective edge calculation and reduced networks
Request a detailed protocolFor a given Influence/Interaction matrix, an effective edge from a team of genes ${T}_{i}$ to another team ${T}_{j}$ is given by the metric E:
where ${T}_{i}$ and ${T}_{j}$ are teams of genes and $i$ and $j$ are indices of genes in Influence matrix/Interaction matrix belonging to teams ${T}_{i}$ and ${T}_{j}$. Here ${M}^{Max}$, obtained by setting all nonzero elements of the influence matrix /interaction matrix to 1, represents the magnitude of the maximum possible interaction for the given network topology and hence is used as the normalizing factor.
Real numbers which are close to zero are taken to zero by defining a threshold by ${f}_{Th}$ given as.,
For the Reduced models, Threshold is taken as 0.05. Effective edges are reported using Signum function $sgn$ which extracts the sign of real number given as:
Using these edges, one can make reduced models from Influence matrix of any length and for a specific threshold.
To clearly understand how the two ‘teams’ of genes (group A: 22 genes – labeled from 1 to 22 in Figure 2 and group B: 10 genes – labeled from 24 to 33 in Figure 2), subnetworks were generated consisting of genes from individual teams. NEUROD1 is not a part of either group because it did not show any correlation with either team per se in correlation matrix.
Another interesting observation in steady states observed in Figure 1B is that state 1 and state 4 (and similarly, state 2 and state 3) differ only in the levels of NEUROD1 (i.e. degeneracy in NEUROD1). Given the observations stated above, we expected the subnetworks to yield two states, all on and all off. Interestingly when genes of group A are simulated using the Ising model asynchronous update, we obtained four states as opposed to the expected two. Analysis of the states revealed that a degeneracy analogous to that of NEUROD1 in WT network was being caused by ELF3 in group A (Supplementary file 3a, S7). Therefore, we decided to include ELF3 as an independent entity for the reduced model. However, steadystate frequencies of group B did not reveal any such degeneracy caused due to ELF3 (Supplementary file 3c, S9). By calculating the ‘effective edges’ metric described above, we observed two differences in the reduced model obtained from influence matrix (model 1; Figure 3C) and that obtained from interaction matrix (model 2; Figure 3—figure supplement 2c). Model 1, but not model 2, contains selfactivation on ELF3 as well as an activatory link from team A to ELF3.
Singleedge perturbation
Request a detailed protocolWe perturbed each edge of WT SCLC network in the following two ways:
Change of sign of the edge (A → B to A ⊣ B and vice versa)
Deletion of the edge (A → B/A ⊣ B to A ϕ B) where ϕ indicates the absence of a link. Every perturbed network is then simulated using the Ising model with asynchronous update for 2^20 random initial conditions. JSD is calculated for the resulting state distributions with respect to the steadystate distribution of WT SCLC network.
Boolean framework – Ising model with asynchronous update
Request a detailed protocolIsing model formalism uses discrete variables to represent the expression level of molecular species (such as microRNA or transcription factors etc.). Therefore, the state of the regulatory network (Nnode network) can be represented by a sequence {${s}_{i}$, ${s}_{i}$∈ {−1, 1}} called a ‘Boolean Vector’ of N binary variables where ${s}_{i}$ = 1 represents highexpression level of ith node and ${s}_{i}$ = −1 represents low expression of the node. In modeling the dynamics of network via this framework, the only knowledge required is whether each regulatory relationship between network nodes is activating or inhibitory. Regulatory interactions between the molecular species are represented by an N*N matrix called an ‘Interaction Matrix (M)’ where ${M}_{ij}$ = 1 represents ‘promotion’ (or activation) of levels of ith node by jth node and ${M}_{ij}$ = −1 represents ‘inhibition’ (or repression) of levels of ith node by jth node of the Nnode network. The absence of any regulatory relationship between species i and species j is indicated by ${M}_{ij}$ = 0.
At every discretetime step, the expression level of a node ${s}_{i}$(t+1) is given as +1 if $\sum}_{i=1}^{N}{M}_{ij}*{s}_{j}\left(t\right)>0$ and −1 if $\sum}_{j=1}^{N}{M}_{ij}*{s}_{j}\left(t\right)<0$ and remains the same when $\sum}_{j=1}^{N}{M}_{ij}*{s}_{j}\left(t\right)=0$. The expression levels are updated using the asynchronous scheme in which a node from the Nnode network is picked up at random at every discretetime step and updated using the above nonlinear relation. For large discretetime dynamics, the network settles in a steady state, which means that the Boolean Vector is a fixed point of the abovegiven relation (i.e. will not change as time progresses).
RAndom CIrcuit PErturbation
Request a detailed protocolRACIPE is a tool that identifies robust dynamical properties of transcriptional regulatory networks (TRNs) by generating an ensemble of continuous network models with distinct kinetic parameters. For every continuous model of a TRN, RACIPE first generates a system of ordinary differential equations (ODEs). For a given node ‘N’ of the TRN and a set of input activating edges ${A}_{i}$ and input inhibiting edges ${I}_{j}$, the differential equation corresponding to the expression level of N is given as:
Here, $N,{A}_{i}$, and ${I}_{j}$ represent the expression levels of the species of the TRN. ${G}_{N}$ and ${k}_{N}$ denote the Production and Degradation rates, respectively. ${A}^{0}{}_{iN}$ is the threshold value of ${A}_{i}$ expression level at which the nonlinearity in the dynamics of $N$ due to ${A}_{i}$ is seen. $n$ is termed as the Hill coefficient and represents the extent of nonlinearity in the regulation. $\lambda $ represents the fold change in the target node expression level upon overexpression of regulating nodes. The functions ${H}^{S+}$ and ${H}^{S}$ are known as Shifted Hill functions (Lu et al., 2013) and represent the regulation of the target node by the regulatory node. Shifted Hill functions take the following form:
For the system of ODEs, RACIPE randomly samples the kinetic parameters from a predefined set of parameter ranges. At each parameter set, RACIPE integrates the model from multiple initial conditions and obtains steady states in the state space. For the current analysis, a sample size of 10^6 for parameters sets and 1000 for initial conditions was used. The parameters were sampled via a uniform distribution, and the ODE integration was carried out using the Runge–Kutta–Fehlberg method of numerical integration.
For the given TRN with ‘n’ nodes, the steadystate expression levels of the nodes were normalized in the following way:
For the ith node, ${E}_{in}$ is the normalized expression level of the node, ${E}_{i}$ is the steadystate expression level, ${f}_{i}$ is the normalization factor, ${g}_{i}$ and ${k}_{i}$ are the production and degradation of the ith node corresponding the current steadystate, and ${\lambda}_{ij}$ are the fold change in the expression of node i due to node j. The normalized expression levels of all steady states are then converted into zscores by scaling about their combined mean:
where $\underset{\_}{{E}_{in}}$ is the combined mean and ${\sigma}_{in}$ is the combined variance.
The zscores can then be classified into Low (zero) and High (one) expression levels based on the sign of their values. This way we have discretized the continuous steadystate levels of the network for comparison with the frequency of Boolean steady states. The way to calculate the total frequency of each discrete state is by counting the occurrence in all the parameter sets. For parameter sets with n steady states, the count of each steady state is taken as 1/n, invoking the assumption that all the states are equally stable.
Data and code availability
Request a detailed protocolAll codes are available at https://github.com/uday2607/CSBSCLC; Chauhan, 2021; copy archived at swh:1:rev:eb4c869fe572bb0a98a6a7ce7a09631ad584200e.
Data availability
Gene expression profiles of 52 SCLC cell lines were downloaded from Broad Institute's CCLE expression data. Data for GSE73160 was downloaded from NCBI website. All codes used to generate and analyze simulation data, and codes used to analyze gene expression data are available at : https://github.com/uday2607/CSBSCLC copy archived at https://archive.softwareheritage.org/swh:1:rev:eb4c869fe572bb0a98a6a7ce7a09631ad584200e/.

NCBI Gene Expression OmnibusID GSE73160. Exon expression for NCI small cell lung cancer cell line panel.
References

A functional perspective on phenotypic heterogeneity in microorganismsNature Reviews Microbiology 13:497–508.https://doi.org/10.1038/nrmicro3491

Network motifs: theory and experimental approachesNature Reviews Genetics 8:450–461.https://doi.org/10.1038/nrg2102

Bacterial persistence as a phenotypic switchScience 305:1622–1625.https://doi.org/10.1126/science.1099390

Bacterial survival strategies suggest rethinking Cancer cooperativityTrends in Microbiology 20:403–410.https://doi.org/10.1016/j.tim.2012.06.001

Phenotypic heterogeneity promotes adaptive evolutionPLOS Biology 15:e2000644.https://doi.org/10.1371/journal.pbio.2000644

Nongenetic heterogeneitya mutationindependent driving force for the somatic evolution of tumoursNature Reviews Genetics 10:336–342.https://doi.org/10.1038/nrg2556

Precision oncology: between vaguely right and precisely wrongCancer Research 77:6473–6479.https://doi.org/10.1158/00085472.CAN170448

Distinctive properties of metastasisinitiating cellsGenes & Development 30:892–908.https://doi.org/10.1101/gad.277681.116

Multistability in cellular differentiation enabled by a network of three mutually repressing master regulatorsJournal of the Royal Society Interface 17:20200631.https://doi.org/10.1098/rsif.2020.0631

Core signalling motif displaying multistability through multistate enzymesJournal of the Royal Society Interface 13:20160524.https://doi.org/10.1098/rsif.2016.0524

InsulinomaAssociated protein 1 is a crucial regulator of neuroendocrine differentiation in lung CancerThe American Journal of Pathology 185:3164–3177.https://doi.org/10.1016/j.ajpath.2015.08.018

Characterization of variant subclasses of cell lines derived from small cell lung Cancer having distinctive biochemical, morphological, and growth propertiesCancer Research 45:2924–2930.

Smallcell lung Cancer: what we know, what we need to know and the path forwardNature Reviews Cancer 17:725–737.https://doi.org/10.1038/nrc.2017.87

Phenotypic heterogeneity, a phenomenon that may explain why quorum sensing does not always result in truly homogenous cell behaviorApplied and Environmental Microbiology 81:5280–5289.https://doi.org/10.1128/AEM.0090015

Multistable decision switches for flexible control of epigenetic differentiationPLOS Computational Biology 4:e1000235.https://doi.org/10.1371/journal.pcbi.1000235

Identifying inhibitors of epithelialmesenchymal plasticity using a network topologybased approachNpj Systems Biology and Applications 6:15.https://doi.org/10.1038/s4154002001321

Interrogating the topological robustness of gene regulatory circuits by randomizationPLOS Computational Biology 13:e1005456.https://doi.org/10.1371/journal.pcbi.1005456

POU2F3 is a master regulator of a tuft celllike variant of small cell lung CancerGenes & Development 32:915–928.https://doi.org/10.1101/gad.314815.118

The impact of nongenetic heterogeneity on Cancer cell deathCritical Reviews in Biochemistry and Molecular Biology 53:99–114.https://doi.org/10.1080/10409238.2017.1412395

Loss of YAP1 defines neuroendocrine differentiation of lung tumorsCancer Science 107:1537–1538.https://doi.org/10.1111/cas.13013

Operating principles of Notch–Delta–Jagged module of cell–cell communicationNew Journal of Physics 17:055021.https://doi.org/10.1088/13672630/17/5/055021

Dynamics of phenotypic heterogeneity associated with EMT and stemness during Cancer progressionJournal of Clinical Medicine 8:1542.https://doi.org/10.3390/jcm8101542

Divergence measures based on the shannon entropyIEEE Transactions on Information Theory 37:145–151.https://doi.org/10.1109/18.61115

Tristability in cancerassociated microRNATF chimera toggle switchThe Journal of Physical Chemistry B 117:13164–13174.https://doi.org/10.1021/jp403156m

The differential view of genotypephenotype relationshipsFrontiers in Genetics 6:179.https://doi.org/10.3389/fgene.2015.00179

NonDarwinian dynamics in therapyinduced Cancer drug resistanceNature Communications 4:2467.https://doi.org/10.1038/ncomms3467

Selective tropism of Seneca Valley virus for variant subtype small cell lung CancerJNCI: Journal of the National Cancer Institute 105:1059–1065.https://doi.org/10.1093/jnci/djt130

Molecular subtypes of small cell lung Cancer: a synthesis of human and mouse model dataNature Reviews Cancer 19:289–297.https://doi.org/10.1038/s4156801901339

Emergent properties of the HNF4αPPARγ network may drive consequent phenotypic plasticity in NAFLDJournal of Clinical Medicine 9:870.https://doi.org/10.3390/jcm9030870

LargeScale analysis of network bistability for human cancersPLOS Computational Biology 6:26.https://doi.org/10.1371/journal.pcbi.1000851

MesenchymalEpithelial transition in sarcomas is controlled by the combinatorial expression of MicroRNA 200s and GRHL2Molecular and Cellular Biology 36:2503–2513.https://doi.org/10.1128/MCB.0037316

Phenotypic heterogeneity and evolution of melanoma cells associated with targeted therapy resistancePLOS Computational Biology 15:e1007034.https://doi.org/10.1371/journal.pcbi.1007034

A mechanism for epithelialmesenchymal heterogeneity in a population of Cancer cellsPLOS Computational Biology 16:e1007619.https://doi.org/10.1371/journal.pcbi.1007619

Biological networks regulating cell fate choice are minimally frustratedPhysical Review Letters 125:088101.https://doi.org/10.1103/PhysRevLett.125.088101

The physics of cellular decision making during EpithelialMesenchymal transitionAnnual Review of Biophysics 49:1–18.https://doi.org/10.1146/annurevbiophys121219081557

Taking chances and making mistakes: nongenetic phenotypic heterogeneity and its consequences for surviving in dynamic environmentsJournal of the Royal Society Interface 14:20170141.https://doi.org/10.1098/rsif.2017.0141

Defining the hallmarks of metastasisCancer Research 79:3011–3027.https://doi.org/10.1158/00085472.CAN190458

Mathematical models of cell phenotype regulation and reprogramming: make Cancer cells sensitive again!Biochimica Et Biophysica Acta (BBA)  Reviews on Cancer 1867:167–175.https://doi.org/10.1016/j.bbcan.2017.04.001

Breast Cancer: multiple subtypes within a tumor?Trends in Cancer 3:753–760.https://doi.org/10.1016/j.trecan.2017.09.001
Decision letter

Aleksandra M WalczakSenior Editor; École Normale Supérieure, France

Sandeep KrishnaReviewing Editor; National Centre for Biological Sciences‐Tata Institute of Fundamental Research, India

Jean ClairambaultReviewer

David WootenReviewer; Pennsylvania State University, United States
In the interests of transparency, eLife publishes the most substantive revision requests and the accompanying author responses.
Acceptance summary:
This paper provides a detailed analysis of the master regulatory network involved in small cell lung cancer. Using multiple mathematical tools, the authors transform a complex, highly connected network into a small, easytointerpret reduced form. The reduced network provides insight into the topological features – here the existence of two "teams" of genes that inhibit each other – that determine the nongenetic phenotypic plasticity in these cancer cells. The concordance between the reduced network and experimental knowledge from the field suggest that these methods may be useful more broadly to make sense of complex regulatory networks elsewhere.
Decision letter after peer review:
Thank you for submitting your article "Topological signatures in regulatory network enable phenotypic heterogeneity in small cell lung cancer" for consideration by eLife. Your article has been reviewed by three peer reviewers, and the evaluation has been overseen by a Reviewing Editor and Aleksandra Walczak as the Senior Editor. The following individuals involved in review of your submission have agreed to reveal their identity: Jean Clairambault (Reviewer #1); David Wooten (Reviewer #3).
The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission.
The reviewers appreciated your detailed analysis of the regulatory network involved in small cell lung cancer. However, they also raise some concerns that are expressed in the reviews appended below. In your revision, please try to address the concerns of the reviewers, in particular please expand your analysis and comparison to similar results in literature (Reviewer 2) and discuss how your methods could be generalized to apply to other networks (Reviewer 3).
We would like to draw your attention to changes in our revision policy that we have made in response to COVID19 (https://elifesciences.org/articles/57162). Specifically, we are asking editors to accept without delay manuscripts, like yours, that they judge can stand as eLife papers without additional data, even if they feel that they would make the manuscript stronger. Thus the revisions requested below only address clarity and presentation.
Reviewer #1:
Detailed study of the master regulatory network of SCLC by probabilistic methods, focusing on the two genes ASCL1 and NEUROD1. Supports by uptodate literature reports and sound gene expression analysis the nongenetic phenotypic plasticity of SCLC cells. May be a major step forward in understanding nongenetic phenotype plasticity in cancer.
Firstly, let me say that the technical aspects of this very detailed study go far beyond my field of expertise. I will evaluate it from an external eye, and according to the contribution it brings to the investigation of non genetic phenotype plasticity in cancer, here focusing on small cell lung cancer (SCLC). Plasticity in cancer cells is indeed a major field of research nowadays, resulting in metastasis and drug persistence/tolerance, and especially SCLC has escaped efficacious treatments until now.
As far as I could understand the analysis, by two different methods, one based on Boolean networks, the other on elementary (nevertheless of great dimensionality) ordinary differential equations, the authors statistically “shake” the possible parameters of the regulatory network of genes found “masterly” expressed in SCLC, according to a study of 2017 (Udyavar et al., 2017), network in which one can recognize 3 of the 4 Yamanaka genes, MYC, KLF4 and SOX2.
Rather than on the Yamanaka genes, the authors focus on the two genes ASCL1 and NEUROD1, for both of which they find marked differential expression (high/low), resulting in 4 different coupled phenotypes at the gene expression level, which, as “patterns of steady states seen in influence and correlation matrices”, correlate with experimental results collected from experimental data on SCLC. Indeed, ASCL1 and NEUROD1 had already been identified as master regulators of the salient gene network of SCLC, ASCL1 being a neuroendocrine marker and NEUROD1 being associated with mesenchymal markers susceptible of inducing metastasis, i.e., two markers of plasticity.
The design of this analysis relies on an apparently complete analysis of the scientific literature on both the subject of nongenetic phenotypic heterogeneity and the particular case go gene expression studies on SCLC. The results reported by the authors support by gene expression analyses the idea that within the same genomic status of SCLC cells, different steady states in gene expression lead to different phenotypic states (multistability) that are found experimentally. Although no clear therapeutic perspective for SCLC emerges from this study, it is an interesting step to better understand the plasticity of SCLC cells, possibly resulting in their drug tolerance and thus escaping treatments so far.
For these reasons, I propose that this study be published as it is in eLife.
Reviewer #2:
This paper will be of interest to researchers working on the origins of cellular heterogeneity in cancer. Using network and gene expression analysis, previous known findings on the potential role of topology in generating four broad cellular states in small cell lung cancer (SCLC) are confirmed here. Interestingly, it is observed that both discrete and continuous gene expression models can give rise to similar steady states. Two modules of genes inhibiting each other are found to lead to the four emergent states in SCLC. However, these results are not well compared and contrasted in the context of prior literature.
Chauhan, et al. investigate the origins of cellular heterogeneity in Small Cell Lung Cancer. Previous experimental work has classified four major states based upon the expression levels of Ascl1 and Neurod1 genes, along with other states. Previous computational work based on Boolean networks (Wooten et al., 2019 and Udyavar et al., 2017) suggested these four states may arise naturally as the steady states of the underlying geneinteraction network in SCLC.
Here the authors reanalyze the SCLC geneinteraction network based on a Boolean framework with an underlying Ising hamiltonian. They show that the four major states based on Ascl1 and Neurod1 can be recovered from dynamic simulations of the 33 node 357 edge network. While not surprising in the light of prior work, the authors successfully demonstrate that a somewhat different Boolean framework leads to similar results, suggesting the underlying topology is indeed important for generating cellular heterogeneity. However, it is not clear whether this formalism based upon the Ising Model improves upon the previously used Boolean framework (Wooten et al. and Udyavar et al) in any way.
The authors then model the dynamics of the gene network using ODEs, allowing continuous values for the expression levels of the nodes. 4 of the top 10 frequent steady states corresponded to the 4 states found earlier. However, these 4 states have much lower frequency (adding up to only about 22%) whereas in the Boolean framework the frequencies added up to almost 100%. The identity and frequencies of the remaining 6 out of 10 frequent states is not discussed.
Using a pairwise correlation strategy for all the nodes, the authors then uncover an interesting result: the nodes fall into two modules that repress each other. This also seems to be true in two publicly available gene expression datasets. Surprisingly, Neurod1 did not appear in these modules and it is unclear why, given the four main states are defined based on Neurod1 and Ascl1 expression levels. The two modules are also distinct from the higher number of modules found earlier (Wooten et al., 2019), and it is unclear why these differences arise.
Finally, using a series of clustering algorithms (hierarchical and Kmeans clustering, UMAP) on gene expression datasets, the authors show that not only the 4 states based on Neurod1 and Ascl1, but also further states based on Pou2f3 and Yap1 expression levels can be recovered. However, the authors use only a few genes of interest in performing the clustering and it is not clear why all the available genes were not used.
In my opinion, the most significant area where this manuscript needs to be strengthened is in providing critical comparisons with prior literature and results (primarily Wooten et al and Udyavar et al). Discussions on what advances have been made in this paper with respect to what was already known earlier, need to be highlighted. I found it quite hard to judge this manuscript and place it in context, since a lot of the methods and datasets used here are very similar to the previous works. Detailed suggestions along these lines as well as some possible new analyses are provided below:
1) It would be good to know why the authors chose an Ising Model – based Boolean simulation strategy as compared to the Boolean model used in Wooten at al. Is there some difference in the statistics expected from these two different formalisms? Is there some limitation of the previous work that the authors wanted to address here? Given that Wooten et al. showed that the 4 SCLC states can be recapitulated, is it surprising that the authors get the same 4 states using their Boolean method on an identical network?
2) For the ODE method, the frequency of the four states add up to only about 22%. It would be interesting to see a full list of the top ten states with their frequencies, and a discussion on why these other states appear in the ODE but not in the Boolean formalism and its biological implications.
3) Following up on point (2) above, was there a reason for using two separate axes for the same quantity (frequency) in Figure 1D i ? I found this quite confusing, because for example, at first sight it seems like the S2 steady state has similar frequencies in RACIPE vs Boolean. But the frequencies are in reality very different, right? I would therefore suggest to plot both RACIPE and Boolean results using just one axis, to avoid confusion.
4) The observation of two "modules" using pairwise correlations is interesting. However, it was unclear to me why Wooten et al. find 1718 modules, though their WGCNA method also uses a pairwise gene correlation technique. A detailed discussion on this would be very helpful for readers in my opinion.
5) Related to the pairwise correlation method, I was surprised to see that Neurod1 does not seem to be part of any module in Figure 2. In the Discussion, the authors mention that Ascl1 and Neurod1 don't fall in the same team, but it seems to me from Figure 2 that Neurod1 doesn't belong to any team! This seems to be contradictory to the rest of the results, unless I have misunderstood something here. A discussion on these lines seems warranted.
6) Given that the dynamical simulations were carried out with 33 genes, why did the authors choose to perform all the clustering analyses with only a handful of genes? This may be problematic, for example, if sets of 2 or 4 randomly chosen genes are used for clustering the expression datasets, how likely are we to find a few well separated clusters? If we find that random gene subsets also separate into clusters, how biologically meaningful is it to see clusters with Ascl1 and Neurod1?
Reviewer #3:
This work by Chauhan et al. finds order from complexity. Using multiple mathematical tools, they transform a complex, highly connected network into a small, easytointerpret reduced form. The concordance between their reduced network and experimental knowledge from the field support the insights they gained from their analysis here, and that their methods may be useful more broadly to make sense of complex networks elsewhere.
The final findings (Figures 45) on experimental data, which show that just a few genes can reproduce the fullspectrum of known SCLC heterogeneity seem to strongly support the authors' primary conclusions. Specifically, a system which is well classifiable using a few nodes closely matches my expectation for a network that can be reduced to a few tightly connected groups of nodes.
The agreement between Boolean and RACIPE greatly strengthen their results.
The J metric is interesting, and while it would be beyond the scope of this work, I would be very interested in seeing it applied to other networks to see how generalizable its interpretation is. However, the definition is tightly coupled to the 33 genes in this network, and their pattern of expression in the steady states. Discussion about how the metric could generalized would strengthen the manuscript.
I found it very interesting that not only Neurod1, but also Elf3 was pulled out as an individual node in the reduced form. In another work (Wooten et al., 2019), the authors identified ELF3 as a master regulator of one of their subtypes (NEv2). The identification of this node through topological features here also lends further credence to the influence matrix.
The authors state that "These results suggest that influence matrix is a better representation of network topology as compared to the interaction matrix.". However, since the influence matrix comes from the interaction matrix, it seems like it necessarily contains less information. The authors make this claim based on the fact that a network reduction based on influence matrix more closely represents the steady state distributions than a similar reduction based on interaction matrix. But it is not clear how much this conclusion is specific to this particular network, or reduction strategy.
The correspondence of the steady states with expression data appears quite promising! However, the fact that Neurod1 is the sole gene that distinguishes S1 from S4, or S2 from S3, makes me suspect other genes must also contribute to the difference? Are there other genes in the literature that the authors think could be included into new versions the network that could give a broader picture of the differences between S1 vs S4, or S2 vs S3? Given the other 31 nodes in the network, do their steady state values more closely match one or another cluster from Figure 4B?
When introducing the FontClos s_{i}(t+1) equation, I recommend to describe what happens if s_{i}=0, rather than just including that info in supplement.
Figure 1B should have a legend indicating dark=off, blank=on (even though it is in the caption)
I do not see what test / method was used to find the +/ % confidence intervals in Figure 1B, nor what size interval they represent (e.g., 95%?)
The reference intext to Figure 1C, i, regarding swapping random edges, seems to actually refer to both i and ii
In the text, the connection between the larger number of steady states of "random" networks to the true network's topology lacks a relevant reference to Figure 1C, iii
The text introducing the J metric should describe what the indices are, rather than requiring the reader to search the figure.
The introduction of influence matrix was very hard to follow, the grammar is confusing, and "lmax" is not clearly described in the main text, even though it is used several times.
[Editors' note: further revisions were suggested prior to acceptance, as described below.]
Thank you for resubmitting your work entitled "Topological signatures in regulatory network enable phenotypic heterogeneity in small cell lung cancer" for further consideration by eLife. Your revised article has been evaluated by Aleksandra Walczak (Senior Editor) and a Reviewing Editor.
The reviewers are happy with the revisions you have made and the paper is almost ready for acceptance, but one substantive point raised by Reviewer 2 remains (see below). Please try to perform clustering analysis for randomly chosen sets of 2 genes (other than ASCL1 and NEUROD1). If these don't result in wellseparate clusters it strengthens your result, but if they do then you can modify your statements accordingly, for instance by expanding on other biological reasons to focus on ASLC1 and NEUROD1 based clustering. Alternatively, if you feel such clustering based on randomly chosen genes is not useful, please provide some reasons why you think so.
Reviewer #2:
The authors have now satisfactorily responded to most of the comments/queries.
Their response to point (6) that I had raised is not entirely satisfactory however, since they do not seem to have addressed the randomization point that I had raised. If random sets of two (or more) genes are chosen for the clustering analysis, how often do we see well separated clusters? It seems to me an important point to analyze and understand, in order to put the ASCL1 and Neurod1 based clustering in perspective. I would strongly urge the authors to include this analysis, unless they feel this is not a sensible question to address, in which case it would be good to hear their arguments against this.
Reviewer #3:
The revised manuscript addresses my original comments, and I support its publication.
https://doi.org/10.7554/eLife.64522.sa1Author response
Reviewer #2:
In my opinion, the most significant area where this manuscript needs to be strengthened is in providing critical comparisons with prior literature and results (primarily Wooten at al and Udyavar et al). Discussions on what advances have been made in this paper with respect to what was already known earlier, need to be highlighted. I found it quite hard to judge this manuscript and place it in context, since a lot of the methods and datasets used here are very similar to the previous works. Detailed suggestions along these lines as well as some possible new analyses are provided below:
1) It would be good to know why the authors chose an Ising Model – based Boolean simulation strategy as compared to the Boolean model used in Wooten at al. Is there some difference in the statistics expected from these two different formalisms? Is there some limitation of the previous work that the authors wanted to address here? Given that Wooten et al. showed that the 4 SCLC states can be recapitulated, is it surprising that the authors get the same 4 states using their Boolean method on an identical network?
We thank the reviewer for this important question. Udyavar et al., 2017 – the precursor manuscript to Wooten et al., 2019 – had also used the Ising modelbased Boolean simulation strategy. Thus, as far as Figure 1B is concerned, these are for the same circuit (Figure 1A) and modeling strategy as in Udyavar et al.
Wooten et al. have used a different network than what was used in Udyavar et al. and also a different modeling strategy (BooleaBayes) which depends on inferring logical relationships between nodes in gene regulatory networks using gene expression data and using a “Bayeslike adjustment approach”. Therefore, the network Wooten et al. arrived at using BooleaBayes (Figure 7A) is different from the one we and Udyavar et al. have used. Hence, obtaining 4 states in this network may not necessarily be surprising as far as SCLC phenotypes are concerned. However, obtaining only four states from such a complex network (33 nodes, 357 edges) was very intriguing to us, which drove further our investigations in network topologybased analysis.
The key message in our manuscript is not that this network has only four states, but the reasons for why only four states are seen in this complex and large network (33 nodes, 357 edges). We have deciphered a “latent” design principle in SCLC network, and have offered a conceptual framework to decode similar design principles hidden in other regulatory networks.
Moreover, in addition to running Ising model based simulations for the network, we have used a parameteragnostic approach – RACIPE – and notice at least a semiquantitative agreement in terms of dominance of the four states identified via Ising model. This agreement strengthens our key point that network topology alone can contain enough information about its dynamics. We have now added a paragraph in the Discussion section highlighting these salient points.
2) For the ODE method, the frequency of the four states add up to only about 22%. It would be interesting to see a full list of the top ten states with their frequencies, and a discussion on why these other states appear in the ODE but not in the Boolean formalism and its biological implications.
We thank the reviewer for this subtle observation and raising this important question. We have now performed a detailed analysis of top 20 states obtained by RACIPE whose frequencies add up to 54%, and observed that these 20 states are very similar to the top 4 states obtained by Boolean (X1X4 in Figure 1B), with a difference noted in only one or two nodes (i.e. node value = 0 in Boolean state and 1 in RACIPE state or vice versa).
It is not surprising to see that the RACIPE output has a much larger number of states than Boolean output, given its continuous nature, an observation we made in our previous manuscript as well (Hari et al., 2020) and is now included in a new paragraph in the Discussion.
Conceptually speaking, given that the sizes of the two “groups” identified are 22 and 10 nodes, a difference in values of one or two nodes are not very likely to give rise to a completely different biological phenotype. Thus, these “closeenough” states can be thought of as “microstates” that overall constitute a biological “macrostate” or phenotype.
3) Following up on point (2) above, was there a reason for using two separate axes for the same quantity (frequency) in Figure 1D i ? I found this quite confusing, because for example, at first sight it seems like the S2 steady state has similar frequencies in RACIPE vs Boolean. But the frequencies are in reality very different, right? I would therefore suggest to plot both RACIPE and Boolean results using just one axis, to avoid confusion.
Thanks to the additional analysis performed (mentioned in response to point (2)); we have now replaced Figure 1D i to address this point.
4) The observation of two "modules" using pairwise correlations is interesting. However, it was unclear to me why Wooten et al. find 1718 modules, though their WGCNA method also uses a pairwise gene correlation technique. A detailed discussion on this would be very helpful for readers in my opinion.
The WGCNA method gives us correlation data from all the genes whose expression values are used as an input to the algorithm. It is a statistical method that works on threshold based correlation, and does not use any mechanistic information embedded in a network topology. Because network topology information is not required, Wooten et al. were able to start with a much larger set of genes and obtain a large number of modules. Not surprisingly, we found that the 33 genes considered here were spread across different modules. This is not surprising or contradictory because any of these 33 genes can still be correlated strongly with any other gene not in the network simulated here, and those genes may belong to different modules. It should be noted that two genes can show a good correlation in their gene expression values without any of them directly or indirectly affecting each other, say transcription factor activates gene B but inhibits gene C, thus, B and C are most likely to be negatively correlated. In brief, our analysis is based on network topology, not transcriptomic data that is input to WCGNA. This point is now included in a new paragraph added in the Discussion section.
5) Related to the pairwise correlation method, I was surprised to see that Neurod1 does not seem to be part of any module in Figure 2. In the Discussion, the authors mention that Ascl1 and Neurod1 don't fall in the same team, but it seems to me from Figure 2 that Neurod1 doesn't belong to any team! This seems to be contradictory to the rest of the results, unless I have misunderstood something here. A discussion on these lines seems warranted.
We apologize for a potential semantic confusion caused. The two statements – “NEUROD1 does not belong in the same team as ASCL1” and “NEUROD1 does not belong to either of the two teams” – are not contradictory to one another. We have performed further analysis based on CCLE and GSE73160 gene expression values and see that NEUROD1 levels do not align with the patterns seen in the expression values of members of two groups (Figure 4—figure supplement 2).
6) Given that the dynamical simulations were carried out with 33 genes, why did the authors choose to perform all the clustering analyses with only a handful of genes? This may be problematic, for example, if sets of 2 or 4 randomly chosen genes are used for clustering the expression datasets, how likely are we to find a few well separated clusters? If we find that random gene subsets also separate into clusters, how biologically meaningful is it to see clusters with Ascl1 and Neurod1?
Our reason to include ASCL1 and NEUROD1 based clustering was purely based on available experimental literature suggesting these two nodes as key markers and/or inducers of phenotypic heterogeneity in SCLC. In the latter half of our analysis (Figure 5), we have used YAP1 and POU2F3 in addition to ASCL1 and NEUROD1 to classify CCLE SCLC cell lines, but we do not have either of them as a part of the network in the first place. Again, their choice was made based on available literature in SCLC heterogeneity as mentioned by Wooten et al.
Reviewer #3:
[…]
The authors state that "These results suggest that influence matrix is a better representation of network topology as compared to the interaction matrix.". However, since the influence matrix comes from the interaction matrix, it seems like it necessarily contains less information. The authors make this claim based on the fact that a network reduction based on influence matrix more closely represents the steady state distributions than a similar reduction based on interaction matrix. But it is not clear how much this conclusion is specific to this particular network, or reduction strategy.
We are grateful to the reviewer for encouraging remarks and we share the excitement about agreement in Boolean and RACIPE results, as well as a broader application of influence matrix. We have already seen preliminary evidence showing the two “teams” regulating phenotypic heterogeneity in other cancerrelated signaling networks (for instance, see Figure 2 in Jia et al., 2020 – https://www.oncotarget.com/article/27651/text/). Because these additional networks are not directly related to SCLC, we are not including those results in this manuscript, but influence matrix based analysis will be the focus of our upcoming manuscript(s).
The correspondence of the steady states with expression data appears quite promising! However, the fact that Neurod1 is the sole gene that distinguishes S1 from S4, or S2 from S3, makes me suspect other genes must also contribute to the difference? Are there other genes in the literature that the authors think could be included into new versions the network that could give a broader picture of the differences between S1 vs S4, or S2 vs S3? Given the other 31 nodes in the network, do their steady state values more closely match one or another cluster from Figure 4B?
We have included additional analysis (Figure 4—figure supplement 2) highlighting the expression of other nodes besides ASCL1 and NEUROD1. As expected, the two groups identified via influence matrix largely show similar trends, i.e. as compared to AN and AN+ subgroups, members of group A (ASCL1 is one of them) have higher expression levels in A+N and A+N+ samples while members of group B have lower expression levels in these two subgroups. This analysis suggests that while other genes could play a role in defining the subtypes, their contribution might not possess the distinguishing power of ASCL1 and NEUROD1.
When introducing the FontClos s_{i}(t+1) equation, I recommend to describe what happens if s_{i}=0, rather than just including that info in supplement.
We have now included the same.
Figure 1B should have a legend indicating dark=off, blank=on (even though it is in the caption)
We have included this legend in Figure 1B.
I do not see what test / method was used to find the +/ % confidence intervals in Figure 1B, nor what size interval they represent (e.g., 95%?)
We have now clarified it in the figure legend. The +/ % represent mean and standard deviation of the frequencies obtained across three independent Isingmodel replicates.
The reference intext to Figure 1C, i, regarding swapping random edges, seems to actually refer to both i and ii
We have now referred to both i and ii.
In the text, the connection between the larger number of steady states of "random" networks to the true network's topology lacks a relevant reference to Figure 1C, iii
We thank the reviewer for pointing this out and have now included a reference to Figure 1C, iii.
The text introducing the J metric should describe what the indices are, rather than requiring the reader to search the figure.
In the main text, we have now described what the indices and the corresponding matrix is.
The introduction of influence matrix was very hard to follow, the grammar is confusing, and "lmax" is not clearly described in the main text, even though it is used several times.
We apologize for the confusion caused, and thank the reviewer for pointing it out. We have now expanded on the introduction of influence matrix both in the Materials and methods section and in the main text.
[Editors' note: further revisions were suggested prior to acceptance, as described below.]
Reviewer #2:
The authors have now satisfactorily responded to most of the comments/queries.
Their response to point (6) that I had raised is not entirely satisfactory however, since they do not seem to have addressed the randomization point that I had raised. If random sets of two (or more) genes are chosen for the clustering analysis, how often do we see well separated clusters? It seems to me an important point to analyze and understand, in order to put the ASCL1 and Neurod1 based clustering in perspective. I would strongly urge the authors to include this analysis, unless they feel this is not a sensible question to address, in which case it would be good to hear their arguments against this.
We thank the reviewer for this constructive comment, and have performed the randomization analysis as well now. We find that the ASCL1NEUROD1 gene pair is among the top 1% of all possible gene pairs (^{33}C_{2} = 528) in terms of defining four SCLC phenotypes experimentally reported (Figure 4—figure supplement 3). In other words, approximately 99% of all gene pairs considered here do not offer such wellsegregated biologically relevant phenotypic distinction.
https://doi.org/10.7554/eLife.64522.sa2Article and author information
Author details
Funding
Science and Engineering Research Board (SB/S2/RJN049/2018)
 Mohit Kumar Jolly
The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.
Acknowledgements
This work was supported by Ramanujan Fellowship awarded to MKJ by Science and Engineering Research Board (SERB), Department of Science and Technology (DST), Government of India (SB/S2/RJN049/2018).
Senior Editor
 Aleksandra M Walczak, École Normale Supérieure, France
Reviewing Editor
 Sandeep Krishna, National Centre for Biological Sciences‐Tata Institute of Fundamental Research, India
Reviewers
 Jean Clairambault
 David Wooten, Pennsylvania State University, United States
Publication history
 Received: November 1, 2020
 Accepted: March 16, 2021
 Accepted Manuscript published: March 17, 2021 (version 1)
 Version of Record published: March 31, 2021 (version 2)
Copyright
© 2021, Chauhan et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.
Metrics

 2,522
 Page views

 335
 Downloads

 17
 Citations
Article citation count generated by polling the highest count across the following sources: Crossref, PubMed Central, Scopus.
Download links
Downloads (link to download the article as PDF)
Open citations (links to open the citations from this article in various online reference manager services)
Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)
Further reading

 Computational and Systems Biology
Biological age, distinct from an individual's chronological age, has been studied extensively through predictive aging clocks. However, these clocks have limited accuracy in short timescales. Here we trained deep learning models on fundus images from the EyePACS dataset to predict individuals' chronological age. Our retinal aging clocking, 'eyeAge', predicted chronological age more accurately than other aging clocks (mean absolute error of 2.86 and 3.30 years on qualityfiltered data from EyePACS and UK Biobank, respectively). Additionally, eyeAge was independent of blood markerbased measures of biological age, maintaining an allcause mortality hazard ratio of 1.026 even when adjusted for phenotypic age. The individualspecific nature of eyeAge was reinforced via multiple GWAS hits in the UK Biobank cohort. The top GWAS locus was further validated via knockdown of the fly homolog, Alk, which slowed agerelated decline in vision in flies. This study demonstrates the potential utility of a retinal aging clock for studying aging and agerelated diseases and quantitatively measuring aging on very short timescales, opening avenues for quick and actionable evaluation of geroprotective therapeutics.

 Computational and Systems Biology
Computational models starting from large ensembles of evolutionarily related protein sequences capture a representation of protein families and learn constraints associated to protein structure and function. They thus open the possibility for generating novel sequences belonging to protein families. Protein language models trained on multiple sequence alignments, such as MSA Transformer, are highly attractive candidates to this end. We propose and test an iterative method that directly employs the masked language modeling objective to generate sequences using MSA Transformer. We demonstrate that the resulting sequences score as well as natural sequences, for homology, coevolution, and structurebased measures. For large protein families, our synthetic sequences have similar or better properties compared to sequences generated by Potts models, including experimentally validated ones. Moreover, for small protein families, our generation method based on MSA Transformer outperforms Potts models. Our method also more accurately reproduces the higherorder statistics and the distribution of sequences in sequence space of natural data than Potts models. MSA Transformer is thus a strong candidate for protein sequence generation and protein design.