Glycan processing in the Golgi as optimal information coding that constrains cisternal number and enzyme specificity

  1. Alkesh Yadav
  2. Quentin Vagne
  3. Pierre Sens
  4. Garud Iyengar  Is a corresponding author
  5. Madan Rao  Is a corresponding author
  1. Raman Research Institute, India
  2. Laboratoire Physico Chimie Curie, Institut Curie, CNRS UMR168, France
  3. Industrial Engineering and Operations Research, Columbia University, United States
  4. Simons Centre for the Study of Living Machines, National Centre for Biological Sciences, India

Abstract

Many proteins that undergo sequential enzymatic modification in the Golgi cisternae are displayed at the plasma membrane as cell identity markers. The modified proteins, called glycans, represent a molecular code. The fidelity of this glycan code is measured by how accurately the glycan synthesis machinery realizes the desired target glycan distribution for a particular cell type and niche. In this article, we construct a simplified chemical synthesis model to quantitatively analyse the trade-offs between the number of cisternae, and the number and specificity of enzymes, required to synthesize a prescribed target glycan distribution of a certain complexity to within a given fidelity. We find that to synthesize complex distributions, such as those observed in real cells, one needs to have multiple cisternae and precise enzyme partitioning in the Golgi. Additionally, for a fixed number of enzymes and cisternae, there is an optimal level of specificity (promiscuity) of enzymes that achieves the target distribution with high fidelity. The geometry of the fidelity landscape in the multidimensional space of the number and specificity of enzymes, inter-cisternal transfer rates, and number of cisternae provides a measure for robustness and identifies stiff and sloppy directions. Our results show how the complexity of the target glycan distribution and number of glycosylation enzymes places functional constraints on the Golgi cisternal number and enzyme specificity.

Editor's evaluation

This article contributes to an important and largely unexplored topic in cell biology: the understanding of glycosylation. The authors introduce a mathematical model of glycosylation in the Golgi apparatus and use the model to investigate how the complexity (diversity) and fidelity of the plasma membrane glycan distribution depend on parameters such as the number of Golgi cisternae or enzyme specificity. The article is well written and makes the effort to present a rather complex topic in an accessible way by leaving some of the details in the appendices.

https://doi.org/10.7554/eLife.76757.sa0

Introduction

A majority of the proteins synthesized in the endoplasmic reticulum (ER) are transferred to the Golgi cisternae for further chemical modification by glycosylation (Alberts, 2002), a process that sequentially and covalently attaches sugar moieties to proteins, catalyzed by a set of enzymatic reactions within the ER and the Golgi cisternae. These enzymes, called glycosyltransferases, are localized in the ER and cis-medial and trans-Golgi cisternae in a specific manner (Varki, 2009; Cummings and Pierce, 2014). Glycans, the final products of this glycosylation assembly line, are delivered to the plasma membrane (PM) conjugated with proteins, whereupon they engage in multiple cellular functions, including immune recognition, cell identity markers, cell-cell adhesion, and cell signalling (Varki, 2009; Cummings and Pierce, 2014; Varki, 2017; Drickamer and Taylor, 1998; Gagneux and Varki, 1999). This glycan code (Gabius, 2018; Dwek, 1996), representing information (Winterburn and Phelps, 1972) about the cell, is generated dynamically, following the biochemistry of sequential enzymatic reactions and the biophysics of secretory transport (Varki, 2017; Varki, 1998; Pothukuchi et al., 2019).

In this article, we will focus on the role of glycans as markers of cell identity. For the glycans to play this role, they must inevitably represent a molecular code (Gabius, 2018; Varki, 2017; Pothukuchi et al., 2019). While the functional consequences of glycan alterations have been well studied, the glycan code has remained an enigma (Gabius, 2018; Pothukuchi et al., 2019; Bard and Chia, 2016; D’Angelo et al., 2013). We study the fidelity of molecular code generation, that is, the precision and reliability with which the glycan distribution is created. While it has been recognized that fidelity of the glycan code is necessary for reliable cellular recognition (Demetriou et al., 2001), a quantitative measure of fidelity of the mechanism and the constraints that fidelity requirements put on cellular structure and organization are lacking.

There are two aspects of the cell-type-specific glycan code and the code generation mechanism that have an important bearing on quantifying fidelity. The first is that extant glycan distributions have high complexity (section ‘Complexity of glycan code’), owing to evolutionary pressures arising from (a) reliable cell-type identification amongst a large set of different cell types in a complex organism, the preservation and diversification of ‘self-recognition’ (Drickamer and Taylor, 1998), (b) pathogen-mediated selection pressures (Varki, 2009; Varki, 2017; Gagneux and Varki, 1999), and (c) herd immunity within a heterogenous population of cells of a community (Wills and Green, 1995) or within a single organism (Drickamer and Taylor, 1998). We interpret this to mean that the target distribution of glycans for a given cell type is complex; in section ‘Complexity of glycan code’, we define a quantitative measure for complexity and demonstrate its implications. The second is that the cellular machinery for the synthesis of glycans, which involves sequential chemical processing via cisternal resident enzymes and cisternal transport, is subject to variation and noise (Varki, 2017; Varki, 1998; Pothukuchi et al., 2019); the synthesized glycan distribution is, therefore, a function of cellular parameters such as the number and specificity of enzymes, inter-cisternal transfer rates, and number of cisternae. We will discuss an explicit model of the cellular synthesis machinery in section ‘Synthesis of glycans in the Golgi cisternae’.

Here, we define fidelity as the minimum achievable Kullback-Leibler (KL) divergence (Cover and Thomas, 2012; MacKay, 2003) between the synthesized distribution of glycans and the target glycan distribution as a function of given cellular parameters, such as the number and specificity of enzymes, inter-cisternal transfer rates, and number of cisternae (section ‘Optimization problem’). Using a simplified chemical synthesis model, we analyse the trade-offs between the number of cisternae and the number and specificity of enzymes in order to achieve a prescribed target glycan distribution with high fidelity (section ‘Results of optimization’). Our analysis leads to a number of interesting results, a few of which we list here:

  1. First, since an important function of the glycan spectrum is cell type/niche identification, it seems natural to relate glycan complexity to organismal complexity taken to be associated with the number of cell types in the organism (Carroll, 2001; Bonner, 1998). Here, we provide a measure of the complexity of the glycan distribution of a given cell type using mass spectrometry coupled with determination of molecular structure (MSMS) data. Using this we have analysed the MSMS data from hydra, planaria, and mammalian cells. We find that the complexity of the glycan distribution indeed correlates with the organism complexity.

  2. Constructing a high-fidelity representation of a complex target distribution, such as those observed in real cells, requires a complex Golgi machinery with multiple cisternae, precise enzyme partitioning, and control on enzyme specificity. This definition of fidelity of the glycan code allows us to provide a quantitative argument for the evolutionary requirement of multiple compartments. While it is possible to produce complex glycan distributions in one compartment using a large number of enzymes, such a design would inevitably require a more elaborate genetic cost.

  3. Within our synthesis model, an increase in the number of Golgi cisternae drives an increase in the glycan complexity, keeping everything else fixed.

  4. We explore the geometry of the fidelity landscape in the multidimensional space of the number and specificity of enzymes, inter-cisternal transfer rates, and number of cisternae. This allows us to discuss issues such as robustness to noise, and stiff and sloppy directions in this multidimensional space.

  5. For fixed number of enzymes and cisternae, there is an optimal level of specificity of enzymes that achieves the complex target distribution with high fidelity. Keeping the number of enzymes fixed, having low specificity or sloppy enzymes and larger cisternal number could give rise to a diverse repertoire of functional glycans, a strategy used in organisms such as plants and algae. Promiscuous enzymes bring in the potential for evolvability (Kirschner and Gerhart, 2008); promiscuity allows the system to be stable to random mutations in proteins or variations in the target distribution.

Thus, our results imply that the pressure to produce the target glycan code for a given cell type with high fidelity places strong constraints on the cisternal number and enzyme specificity (Sengupta and Linstedt, 2011). Taken together, our quantitative analysis of the trade-offs has deep implications for the non-equilibrium self-assembly of Golgi cisternae and suggests that the control of cisternal number must involve a coupling of non-equilibrium self-assembly of cisternae with enzymatic chemical reaction kinetics (Glick and Malhotra, 1998). This combined dynamics of chemical processing with non-equilibrium membrane dynamics involving fission, fusion, and transport (Sachdeva et al., 2016; Sens and Rao, 2013) opens up a new direction for future research.

Complexity of glycan code

Since each cell type (in a niche) is identified with a distinct glycan profile (Gabius, 2018; Varki, 2017; Pothukuchi et al., 2019), and this glycan profile is noisy because of the stochastic noise associated with the synthesis and transport (Pothukuchi et al., 2019; Bard and Chia, 2016; D’Angelo et al., 2013), a large number of different cell types can be differentiated only if the cells are able to produce a large set of glycan profiles that are distinguishable in the presence of this noise. Our task is to identify a quantitative measure for the complexity of a glycan profile such that a set of more complex glycan profiles is able to support a larger number of well-separated profiles, and therefore, a larger number of cell types, or equivalently, a more complex organism (a rigorous definition of complexity can be given in terms of the KL metric [Cover and Thomas, 2012; MacKay, 2003] between two glycan profiles. We declare that two profiles are distinguishable only if the KL distance between the profiles is more than a given tolerance. This tolerance is an increasing function of the noise. We define the complexity of a set of possible glycan profiles as the size of the largest subset such that the KL distance of any pair of profiles is larger than the tolerance). Furthermore, we would like to be able to estimate the complexity of a glycan profile from molecular structure (MSMS) measurements (Cummings and Crocker, 2020; Subramanian et al., 2018; Sahadevan et al., 2014).

In order to identify such a quantitative measure of complexity, we first need a consistent way of smoothening or coarse-graining the raw glycan profiles obtained from MSMS measurements to remove measurement and synthesis noise. Here, we denoise the glycan profile by approximating it by a Gaussian mixture model (GMM) with a specified number of components that are supported on a finite set of indices (Bacharoglou, 2010). Since the size of the set of all possible m-component Gaussian densities is an increasing function of m, we define the complexity of a mixture of Gaussians as the number of components m. Figure 1 demonstrates that the value of m at which the m-component GMM approximation of the target profile saturates is a good measure of complexity. Using this definition we see that the complexity of the glycan profiles of various organisms correlates well with the number of cell types in an organism (details of the procedure are given in Appendix 1). We will now describe a general model of the cellular machinery that is capable of synthesizing glycans of any complexity. We expect that cells need a more elaborate mechanism to produce profiles from a more complex set.

Living cells display a complex glycan distribution.

(a) 3-Gaussian mixture model (GMM) and 20-GMM approximation for the relative abundance of glycans taken from mass spectrometry coupled with determination of molecular structure (MSMS) data of planaria Schmidtea mediterranea, Hydra magnipapillata, and human neutrophils. (b) The change in the Kullback–Leibler (KL) divergence D(pTpGMM(m)) as a function of the number of GMM components m. The KL divergence for planaria saturates at m=5, for hydra at m=11, and for human cells at m=20. Thus, the number of components required to approximate the glycan profile correlates well with the complexity of the organism. Details are given in Appendix 1.

Synthesis of glycans in the Golgi cisternae

The glycan display at the cell surface is a result of proteins that flux through and undergo sequential chemical modification in the secretory pathway, comprising an array of Golgi cisternae situated between the ER and PM, as depicted in Figure 2. Glycan-binding proteins (GBPs) are delivered from the ER to the first cisterna, whereupon they are processed by the resident enzymes in a sequence of steps that constitute the N-glycosylation process (Varki, 2009). A generic enzymatic reaction in the cisterna involves the catalysis of a group transfer reaction in which the monosaccharide moiety of a simple sugar donor substrate, for example, UDP-Gal, is transferred to the acceptor substrate, by a Michaelis–Menten (MM)-type reaction (Varki, 2009)

(1) Acceptor+glycosyl donor+Enzymeωbωf[Acceptorglycosyl donorEnzyme]ωcglycosylated acceptor+nucleotide+Enzyme
Enzymatic reaction and transport network in the secretory pathway.

Represented here is the array of Golgi cisternae (blue) indexed by j=1,,NC situated between the endoplasmic reticulum (ER) and plasma membrane (PM). Glycan-binding proteins Pc1(1) are injected from the ER to cisterna-1 at rate q. Superimposed on the Golgi cisternae is the transition network of chemical reactions (column) – inter-cisternal transfer (rows), the latter with rates μ(j). Pc1(1) denotes the acceptor substrate in compartment j and the glycosyl donor c0 is chemostated in each cisterna. This results in a distribution (relative abundance) of glycans displayed at the PM (red curve), which is representative of the cell type.

From the first cisterna, the proteins with attached sugars are delivered to the second cisterna at a given inter-cisternal transfer rate, where further chemical processing catalyzed by the enzymes resident in the second cisterna occurs. This chemical processing and inter-cisternal transfer continue until the last cisterna, thereupon the fully processed glycans are displayed at the PM (Varki, 2009). The network of chemical processing and inter-cisternal transfer forms the basis of the physical model that we will describe next.

Any physical model of such a network of enzymatic reactions and cisternal transfer needs to be augmented by reaction and transfer rates and chemical abundances. To obtain the range of allowed values for the reaction rates and chemical abundances, we use the elaborate enzymatic reaction models, such as the KB2005 model (Umaña and Bailey, 1997; Krambeck et al., 2009; Krambeck and Betenbaugh, 2005) (with a network of 22,871 chemical reactions and 7565 oligosaccharide structures) that predict the N-glycan distribution based on the activities and levels of processing enzymes distributed in the Golgi cisternae of mammalian cells. For the allowed rates of cisternal transfer, we rely on the recent study by Ungar and coworkers (Fisher et al., 2019; Fisher and Ungar, 2016), whose study shows how the overall Golgi transit time and cisternal number can be tuned to engineer a homogeneous glycan distribution.

Model

Chemical reaction and transport network in cisternae

We consider an array of NC Golgi cisternae, labelled by j=1,,NC, between the ER and PM (Figure 2). GBPs, denoted as Pc1(1), are delivered from the ER to cisterna-1 at an injection rate q. It is well established that the concentration of the glycosyl donor in the jth cisterna is chemostated (Varki, 2009; Hirschberg et al., 1998; Caffaro and Hirschberg, 2006; Berninsone and Hirschberg, 2000), thus in our model we hold its concentration c0(j) constant in time for the jth cisterna. The acceptor Pc1(1) reacts with c0(1) to form the glycosylated acceptor Pc2(1), following an MM reaction (1) catalyzed by the appropriate enzyme. The acceptor Pc2(1) has the potential of being transformed into Pc3(1), and so on, provided the requisite enzymes are present in that cisterna. This leads to the sequence of enzymatic reactions Pc1(1)Pc2(1)Pck(1), where k enumerates the sequence of glycosylated acceptors using a consistent scheme (such as in Umaña and Bailey, 1997). The glycosylated GBPs are transported from cisterna-1 to cisterna-2 at an inter-cisternal transfer rate μ(1), whereupon similar enzymatic reactions proceed. The processes of intra-cisternal chemical reactions and inter-cisternal transfer continue to the other cisternae and form a network as depicted in Figure 2. Although, in this article, we focus on a sequence of reactions that form a line graph, the methodology we propose extends to tree-like reaction sequences, and more generally to reaction sequences that form a directed acyclic graph.

Let Ns-1 denote the maximum number of possible glycosylation reactions in each cisterna j, catalyzed by enzymes labelled as Eα(j), with α=1,,NE, where NE is the total number of enzyme species in each cisterna. Since many substrates can compete for the substrate-binding site on each enzyme, one expects in general that NsNE. The configuration space of the network in Figure 2 is Ns×NC . For the N-glycosylation pathway in a typical mammalian cell, Ns=2×104, NE = 10–20, and NC = 4–8 (Umaña and Bailey, 1997; Krambeck and Betenbaugh, 2005; Krambeck et al., 2009; Fisher and Ungar, 2016). We account for the fact that the enzymes have specific cisternal localization by setting their concentrations to zero in those cisternae where they are not present.

The action of enzyme Eα(j) on the substrate Pck(j) in cisterna j is given by

(2) Pck(j)+Eα(j)ωb(j,k,α)ωf(j,k,α)c0(j)[Eα(j)Pck(j)c0(j)]ωc(j,k,α)Pck+1(j)+Eα(j)

where k=1,Ns-1. In general, the forward, backward, and catalytic rates ωf, ωb, and ωc, respectively, depend on the cisternal label j, the reaction label k, and the enzyme label α, which parametrize the MM reactions (Price and Stevens, 1999). For instance, structural studies on glycosyltransferase-mediated synthesis of glycans (Moremen and Haltiwanger, 2019) would suggest that the forward rate ωf depends on the binding energy of the enzyme Eα(j) to acceptor substrate Pck(j) and a physical variable that characterizes the cisternae.

A potential candidate for such a cisternal variable is pH (Kellokumpu, 2019), whose value is maintained homeostatically in each cisterna (Casey et al., 2010); changes in pH can affect the shape of an enzyme (substrate) or its charge properties, and in general the reaction efficiency of an enzyme has a pH optimum (Price and Stevens, 1999). Another possible candidate for a cisternal variable is membrane bilayer thickness (Dmitrieff et al., 2013); indeed, both pH (Llopis et al., 1998) and membrane thickness are known to have a gradient across the Golgi cisternae. We take ωf(j,k,α)P(j)(k,α), where P(j)(k,α)(0,1) is the binding probability of enzyme Eα(j) with substrate Pck(j), and define the binding probability P(j)(k,α) using a biophysical model, similar in spirit to the Monod-Wyman-Changeux model of enzyme kinetics (Monod et al., 1965; Changeux and Edelstein, 2005) that depends on enzyme-substrate-induced fit.

Let α(j) and k denote, respectively, the optimal ‘shape’ for enzyme Eα(j) and the substrate Pck(j). We assume that the mismatch (or distortion) energy between the substrate k and enzyme Eα(j) is k-α(j), with a binding probability given by

(3) P(j)(k,α)=exp(-σα(j)k-α(j))

where is a distance metric defined on the space of α(j) (e.g. the square of the 2-norm would be related to an elastic distortion model [Savir and Tlusty, 2007]) and the vector σ[σα(j)] parametrizes enzyme specificity. This distortion model captures the above idea that the reaction between the flexible enzyme and fixed substrate is facilitated by an induced fit. A large value of σα(j) indicates a highly specific enzyme, a small value of σα(j) indicates a promiscuous enzyme. It is recognized that the degree of enzyme specificity or sloppiness is an important determinant of glycan distribution (Varki, 2009; Roseman, 2001; Hossler et al., 2007; Yang et al., 2018).

Our synthesis model is mean field, in that we ignore stochasticity in glycan synthesis that may arise from low copy numbers of substrates and enzymes, multiple substrates competing for the same enzymes, and kinetics of inter-cisternal transfer (Umaña and Bailey, 1997; Krambeck et al., 2009; Krambeck and Betenbaugh, 2005). Then the usual MM steady-state conditions for (2), which assumes that the concentration of the intermediate enzyme-substrate complex does not change with time, imply that

[Eα(j)Pck(j)c0(j)]=ωf(j,k,α)c0(j)ωb(j,k,α)+ωc(j,k,α)Eα(j)ck(j).{}

where ck(j) is the concentration of the acceptor substrate Pck(j) in compartment j.

Together with the constancy of the total enzyme concentration, [Eα(j)]tot=Eα(j)+k=1Ns [Eα(j)Pck(j)c0(j)], this immediately fixes the kinetics of product formation (not including inter-cisternal transport),

(4) dck+1(j)dt=α=1NEV(j,k,α)P(j)(k,α)ck(j)M(j,k,α)(1+k=1NsP(j)(k,α)ck(j)M(j,k,α))k=1,,NS;j=1,,NC

where

M(j,k,α)=ωb(j,k,α)+ωc(j,k,α)ωf(j,k,α)c0(j)P(j)(k,α)

and

V(j,k,α)=ωc(j,k,α)[Eα(j)]tot.

This reparametrization of the reaction rates ωf,ωb,ωc in terms of M,V is convenient since it relates to experimentally measurable parameters Vmax and MM constant KM, for each (j,k,α), which can be easily read out (see Appendix 2). As is the usual case, the maximum velocity Vmax is not an intrinsic property of the enzyme because it is dependent on the enzyme concentration [Eα(j)]tot; while KM(j,k,α)=M(j,k,α)c0(j)/P(j)(k,α) is an intrinsic parameter of the enzyme and the enzyme-substrate interaction. The enzyme catalytic efficiency, the so-called “kcat/KMP(j)(k,α), is high for perfect enzymes (Bar-Even et al., 2015) with minimum mismatch.

We now add to this chemical reaction kinetics the rates of injection (q) and inter-cisternal transport μ(j) from the cisterna j to j+1; in Appendix 3, we display the complete set of equations that describe the changes in the substrate concentrations ck(j) with time. These kinetic equations automatically obey the conservation law for the protein concentration (p). At steady state, these kinetic equations lead to a set of nonlinear recursion equations (15)-(16) that are displayed in Appendix 3, which can be solved numerically to obtain the steady-state glycan concentrations, cck(j), as a function of the independent vectors M[M(j,k,α)], V[V(j,k,α)], and L[P(j)(k,α)], the transport rates μ[μ(j)] and specificity, σ[σα(j)].

Optimization problem

Let c denote the ‘target’ concentration distribution, normalized to the distribution so that k=1Nsck*=1, for a particular cell type, that is, the goal of the sequential synthesis mechanism described in the section ‘Chemical reaction and transport network in cisternae’ is to approximate c. Let c¯ denote the normalized steady-state glycan concentration distribution displayed on the PM. Then Equation 16 implies that c¯k=μ(NC)ck(NC), k=1,,Ns. We measure the fidelity F(cc¯) between the c and c¯ by the ratio of the KL divergence D(cc) (Cover and Thomas, 2012; MacKay, 2003) to the entropy H(c)

(5) F(cc¯):=D(cc¯)H(c)=k=1Nsckln(ckc¯k)=k=1Nsckln(ckck(NC)μ(NC))k=1Nsckln(1/ck)

The reason why we divide the KL divergence by the entropy of the target distribution is to enable comparison of the fidelity of the mechanism across target distributions of different complexity. Note that high fidelity corresponds to low values of F(cc¯), vice versa.

Thus, the problem of designing a sequential synthesis mechanism that approximates c for a given enzyme specificity σ, transport rate μ, number of enzymes NE, and number of cisternae NC is given by

(6) Optimization A:D¯(σ,NE,NC,c):=minμ, M, V, L 0 F(cc¯),

where we emphasize that the optimum fidelity D¯(σ,NE,NC,c) is a function of (σ,NE,NC,c). Note that there is a separation of time scales implicit in Optimization A – the chemical kinetics of the production of glycans and their display on the PM happens over cellular time scales, while the issues of trade-offs and changes of parameters are related to evolutionary timescales.

Optimization A, though well-defined, is a hard problem since the steady-state concentrations (16) are not explicitly known in terms of the parameters (μ,M,V,L). In Appendix 4, we formulate an alternative problem Optimization B in which the steady-state concentrations are defined explicitly in terms of new parameters μ, R, and L, and prove that Optimization A and Optimization B are exactly equivalent. This is a crucial insight that allows us to obtain all the results that follow. In Appendix 5, we describe the variant of the sequential quadratic programming (SQP) (Boyd and Vandenberghe, 2004), which we use to numerically solve the optimization problem.

Results

The dimension of the optimization search space is extremely large O(Ns×NE×NC). To make the optimization search more manageable, we make the following simplifying assumptions:

  1. We ignore the k-dependence of the vectors (M,V), or alternatively of R – see Appendix 4 for details.

  2. The enzyme-substrate-binding probability P(j)(k,α) is still dependent on the substrate k. We assume that the shape function is a scalar (a length), that is, lα(j)=α(j). It further simplifies the algebra to assume that the lengths of the substrates are integer multiples of a basic unit (which we take to be 1), that is, k=k. The norm that appears in (3) is taken to be the absolute value difference |lklα(j)|. Other metrics, such as |lklα(j)|2, corresponding to the elastic distortion model (Savir and Tlusty, 2007), do not pose any computational difficulties, and we see that the results of our optimization remain qualitatively unchanged.

  3. We drop the dependence of the specificity on α and j, and take it to be a scalar σ.

These restrictions significantly reduce the dimension of the optimization search, so much so that in certain limits we can solve the problem analytically (in Appendix 6, we show that Equation 21 can be solved analytically in the limit Ns1 since the glycan index k can be approximated by a continuous variable, and the recursion relations for the steady-state glycan concentrations Equations 15–16 can be cast as a matrix differential equation. This allows us to obtain an explicit expression for the steady-state concentration in terms of the parameters (R,L)). This helps us obtain some useful heuristics (Appendix 6) on how to tune the parameters, for example, NE, NC, σ, and others, in order to generate glycan distributions c of a given complexity. These heuristics inform our more detailed optimization using ‘realistic’ target distributions.

The calculations in Appendix 6 imply, as one might expect, that the synthesis model needs to be more elaborate, that is, needs a larger number of cisternae NC or a larger number of enzymes NE, in order to produce a more complex glycan distribution. For a real cell type in a niche, the specific elaboration of the synthesis machinery would depend on a variety of control costs associated with increasing NE and NC. While an increase in the number of enzymes would involve genetic and transcriptional costs, the costs involved in increasing the number of cisternae could be rather subtle.

Notwithstanding the relative control costs of increasing NE and NC, it is clear from the special case that increasing the number of cisternae achieves the goal of obtaining an accurate representation of the target distribution. Suppose the target distribution ck*=δ(k-M) for a fixed M1, that is, ck=1 when k=M, and 0 otherwise, and that the NE enzymes that catalyze the reactions are highly specific. In this limit, Optimization A reduces to a simple enumeration exercise (Jaiman and Thattai, 2018): clearly, one needs NE=M enzymes, one for each k=1,,M reactions, in order to generate PcM. For a single Golgi cisterna with a finite cisternal residence time (finite μ), the chemical synthesis network will generate a significant steady-state concentration of lower index glycans Pck with k<M, contributing to a low fidelity. To obtain high fidelity, one needs multiple Golgi cisternae with a specific enzyme partitioning (E1,E2,,EM) with Ej enzymes in cisterna j=1,,Nc. This argument can be generalized to the case where the target distribution is a finite sum of delta functions. The more general case, where the enzymes are allowed to have variable specificity, needs a more detailed study, to which we turn to next.

Target distribution from coarse-grained MSMS

As discussed in the section ‘Complexity of glycan code’, we obtain the target glycan distribution from glycan profiles for real cells using MSMS measurements (Cummings and Crocker, 2020). The raw MSMS data, however, is not suitable as a target distribution. This is because it is very noisy, with chemical noise in the sample and Poisson noise associated with detecting discrete events being the most relevant (Du et al., 2008). This means that many of the small peaks in the raw data are not part of the signal, and one has to ‘smoothen’ the distribution to remove the impact of noise.

We use MSMS data from human T-cells (Cummings and Crocker, 2020) for our analysis. As discussed in the section ‘Complexity of glycan code’, the GMMs are often used to approximate distributions with a mixed number of modes or peaks (MacKay, 2003), or in our setting, a given fixed complexity. Here, we use a variation of the GMMs (see Appendix 1 for details) to create a hierarchy of increasingly complex distributions to approximate the MSMS raw data. Thus, the 3-GMM and 20-GMM approximations represent the low- and high-complexity benchmarks, respectively. In Appendix 1, we show that the likelihood for the glycan distribution of the human T-cell saturates at 20 peaks. Thus, statistically the human T-cell glycan distribution is accurately approximated by 20 peaks.

This hierarchy allows us to study the trade-off between the complexity of the target distribution and the complexity of the synthesis model needed to generate the distribution as follows. Let T(i) denote the i-component GMM approximation for the human T-cell MSMS data. We sample this target distribution at indices k=1,,Ns, that represent the glycan indices, and then renormalize to obtain the discrete distribution {Tk(i),k=1,,Ns}. To highlight the role of target distribution complexity, we focus on the 3-GMM T(3) (low complexity) and 20-GMM approximation T(20) (high complexity) in describing our results.

Trade-offs between number of enzymes, number of cisternae, and enzyme specificity to achieve given complexity

We summarize the main results that follow from an optimization of the parameters of the glycan synthesis machinery to a given target distribution in Figure 3 and Figure 4.

Trade-offs amongst the glycan synthesis parameters, enzyme specificity σ, cisternal number NC, and enzyme number NE to achieve a complex target distribution c.

(a, b) Normalised Kullback–Leibler distance D¯(σ,NE,NC,c) as a function of σ and NC (for fixed NE=3), (c, d) D¯(σ,NE,NC,c) as a function of σ and NE (for fixed NC=3), with the target distribution c set to the 3-Gaussian mixture model (GMM) (less complex) and 20-GMM (more complex) approximations for the human T-cell mass spectrometry coupled with determination of molecular structure (MSMS) data. D¯(σ,NE,NC,c) is a convex function of σ for each (NE,NC,c), decreasing in NC,NE for each (σ,c), increasing in the complexity of c for fixed (σ,NE,NC). The specificity σmin(c,NE,NC)=argminσ{D¯(σ,NE,NC,c)} that minimizes the error for given (NE,NC,c) is an increasing function of NC,NE and the complexity of the target distribution c.

Fidelity of glycan distribution and optimal enzyme properties to achieve a complex target distribution.

The target c is taken from 3-Gaussian mixture model (GMM) (less complex) and 20-GMM (more complex) approximations of the human T-cell mass spectrometry coupled with determination of molecular structure (MSMS) data. (a, b) Optimum fidelity minσ{D¯(σ,NC,NE,c)} as a function of (NE,NC). More complex distributions require either a larger NE or NC. The marginal impact of increasing NE and NC on the fidelity D¯ is approximately equal. (c, d) Enzyme specificity σmin that achieves minσ{D¯(σ,NC,NE,c)} as a function of (NE,NC). σmin increases with increasing NE or NC. To synthesize the more complex 20-GMM approximation with high fidelity requires enzymes with higher specificity σmin compared to those needed to synthesize the broader, less complex 3-GMM approximation.

  1. The optimal fidelity D¯(σ,NE,NC,c) is a convex function of σ for fixed values for other parameters (see Figure 3), that is, it first decreases with σ and then increases beyond a critical value of σmin.

    The lower complexity distributions can be synthesized with high fidelity with small (NE,NC), whereas higher complexity distributions require significantly larger (NE,NC) (see Figure 4a and b). For a typical mammalian cell, the number of enzymes in the N-glycosylation pathway is in the range NE=10-20 (Umaña and Bailey, 1997; Krambeck and Betenbaugh, 2005; Krambeck et al., 2009; Fisher and Ungar, 2016), Figure 4b would then suggest that the optimal cisternal number would range from NC=3-8 (Sengupta and Linstedt, 2011).

    The fidelity D¯(σ,NE,NC,c) is decreasing in NC and NE for fixed values of the other parameters, and increasing in the complexity of c for fixed (σ,NC). The marginal contribution of NC and NE in improving fidelity D¯ is approximately equal (see Figure 4a and b). We discuss the origin of this symmetry later in this section.

  2. The optimal enzyme specificity σmin(c,NC)=argminσ{D¯(σ,N¯E,NC,c)}, which minimizes the error as function of (NC,c) with NE fixed at N¯E, is an increasing function of NC and the complexity of the target distribution c (Figure 3a and b and Figure 4c and d). This is consistent with the results in Appendix 6 where we established that the width of the synthesized distribution is inversely dependent on the specificity σ: since a GMM approximation with fewer peaks has wider peaks, σmin is low, and vice versa. Similar results hold when NC is fixed at N¯C, and NE is varied (see Figure 3c and d and Figure 4c and d).

Our results are consistent with those in Fisher et al., 2019. They optimize incoming glycan ratio, transport rate, and effective reaction rates in order to synthesize a narrow target distribution centred around the desired glycan. The ability to produce specific glycans without much heterogeneity is an important goal in the pharmaceutical industry. They define heterogeneity as the total number of glycans synthesized and show that increasing the number of compartments NC decreases heterogeneity and increases the concentration of the specific glycan. They also show that the effect of compartments in reducing heterogeneity cannot be compensated by changing the transport rate. Our results are entirely consistent with theirs – we have shown that D¯ decreases as we increase NC. Thus, if the target distribution has a single sharp peak, increasing NC will reduce the heterogeneity in the distribution.

We insert an important cautionary note here. It would seem that the results in Figure 4 imply that there is an approximate NE-NC symmetry in the model, that is, increasing either NE or NC affects the fidelity, optimal enzyme specificity, and the sensitivity in approximately the same way. This would be an erroneous inference, and is a consequence of the distortion model we have used for calculating the binding probabilities of substrates with enzymes. The root cause for this apparent symmetry is that we have allowed for all enzymes to catalyze reactions in all cisternae (albeit with different efficiencies). This symmetry is violated by simply restricting the activity of the enzymes to be dependent on the cisternae. A simple realization of this in terms of the distortion model is given in Appendix 7.

Optimal partitioning of enzymes in cisternae

Having studied the optimum NE,NC,σ to attain a given target distribution with high fidelity, we ask what is the optimal partitioning of the NE enzymes in these NC cisternae? Answering this within the context of our chemical reaction model (section ‘Chemical reaction and transport network in cisternae’) requires some care since it incorporates the following enzymatic features: (a) enzymes with a finite specificity σ can catalyze several reactions, although with an efficiency that varies with both the substrate index k and cisternal index j, and (b) every enzyme appears in each cisternae; however, their reaction efficiencies depend on the enzyme levels, the enzymatic reaction rates, and the enzyme matching function L, all of which depend on the cisternal index j.

Therefore, instead of focusing on the cisternal partitioning of enzymes, we identify the chemical reactions that occur with high propensity in each cisternae. For this we define an effective reaction rate Reff(j,k) for PckPck+1 in the jth cisterna as

(7) Reff(j,k)=α=1NERα(j)P(j)(k,α).

According to our model presented in the section ‘Chemical reaction and transport network in cisternae’, the list of reactions with high effective reaction rates in each cisterna corresponds to a cisternal partitioning of the perfect enzymes. In a future study, we will consider a Boolean version of a more complex chemical model to address more clearly the optimal enzyme partitioning amongst cisternae.

Figure 5ai shows the heat map of the effective reaction rates in each cisterna for the optimal that minimizes the normalized KL distance to the 20-GMM target distribution T(20) (see Figure 5aii). The optimized glycan profile displayed in Figure 5aiii is very close to the target. An interesting observation from Figure 5ai is that the same reaction can occur in multiple cisternae.

Optimal enzyme partitioning in cisternae.

(a) Heat map of the effective reaction rates in each cisterna (representing the optimal enzyme partitioning) and the steady-state concentration in the last compartment (c(NC)) for the 20-Gaussian mixture model (GMM) target distribution. Here, NE=5, NC=7, normalized D(T(20)c(NC))/H(T(20))=0.11. (b) Effective reaction rates after swapping the optimal enzymes of the fourth and second cisternae. The displayed glycan profile is considerably altered from the original profile.

Keeping everything else fixed at the optimal value, we ask whether simply repartitioning the optimal enzymes amongst the cisternae alters the displayed glycan distribution. In Figure 5bi, we have exchanged the enzymes of the fourth and second cisterna. The glycan profile after enzyme partitioning (see Figure 5biii) is now completely altered (compare Figure 5bii with Figure 5biii). Thus, one can generate different glycan profiles by repartitioning enzymes amongst the same number of cisternae (Jaiman and Thattai, 2018).

Geometry of the fidelity landscape

Here we show that the optimum solution is not unique, rather it is highly degenerate, with several equally good optimum solutions. Thus the multidimensional fidelity landscape in R, μ, L, and σ is typically rugged. We analyse the geometry of this fitness landscape by doing a local Hessian analysis about the optimal solutions.

Degeneracies in the synthesis model

The synthesis model is highly degenerate, in the sense that many combinations of parameters give rise to the same glycan profile. This makes the optimization non-convex as there are many equally good minima. These degeneracies are both discrete and continuous. The continuous degeneracies correspond to regions in reaction rate (R)-transport rate (μ) space moving along which does not change the concentration profile. The discrete degeneracies are disconnected regions in the parameter space which correspond to the same glycan profile. The number of discrete degeneracies increases exponentially with increase in (NE,NC). We also find that the fraction of initial conditions converging to a solution close to the global minima increases on increasing (NE,NC). Technical details of these issues are discussed in Appendix 8.

Stiff and sloppy directions

We analyse the change in fidelity on small perturbations in R, μ, L, and σ around the optimal solution. This allows us to determine where the cell needs to develop a tighter control mechanism (stiff directions) and where it has more leeway around the optimal values (sloppy directions). We do this by analysing the eigenvalues and eigenvectors of the Hessian around the optimal point (details in Appendix 9). We find that small perturbations around the optimal values in σ change the glycan profile a lot more compared to perturbations in the other parameters and this stiffness in σ generally decreases on increasing NE,NC (Figure 6a–c). Small perturbations in μ and some L directions around the optimum also significantly alter the glycan profile and the stiffness increases on increasing NC,NE, eventually becoming comparable to σ. The glycan profile is robust to perturbations in most R and some L directions (Figure 6b). The total average stiffness of the optimization parameters, defined by the mean of all eigenvalues of the hessian, decreases on increasing NE,NC (Figure 6c).

Stiff and sloppy directions in the optimization parameters.

(a) Eigenvectors of the Hessian matrix 2XiXjF|Xmin for (NE,NC)=(4,4). The x-axis indexes the NC+2NENC+1=37 eigenvectors, the y-axis indexes the NC+2NENC+1 components of the eigenvectors, and the greyscale denotes the absolute value of the component in the range [0,1]. The components are grouped according to (μ,R,L,σ), and the eigenvectors are ordered according to the most dominant component in the eigenvector (μ, orange; R, blue; L, green; σ, purple). There is some mixing of the different components (R and μ or σ and μ) but this is usually small. (b) The distribution of eigenvalues λi of the Hessian matrix 2XiXjF|Xmin. Each stripe represents an eigenvalue, and the location of the stripe on the x-axis represents whether the dominant component of the associated eigenvector belongs to μ, R, L, or σ direction. (c) The average stiffness along μ, R, L, or σ directions, defined by the log of the average of eigenvalues corresponding to the eigenvectors in the respective group, as a function of NC for fixed NE=4. (d) Total average stiffness λ=log(λiNC+2NENC+1) as a function of NE,NC.

Implications for robustness to parametric noise

Since the synthesized glycan distribution displayed by the cell marks its identity, it must be robust to noise intrinsic to the synthesis machinery. The degeneracy of solutions and sloppy directions in the fidelity landscape makes the glycan distribution robust to intrinsic noise in the synthesis and cell-to-cell variations in the kinetic parameters. We find that the number of degeneracies increases on increasing (NE,NC), and the average stiffness of the optimized parameters decreases on increasing (NE,NC), making the synthesis more robust to parameter fluctuations. Further, while the parameter space is high dimensional, the dimension of controllable parameters (measured by the stiff directions) is low dimensional. We find this dimensional reduction a compelling idea which we will take up later.

Strategies to achieve high glycan diversity

So far we have studied how the complexity of the target glycan distribution places constraints on the evolution of Golgi cisternal number and enzyme specificity. We now take up another issue, namely, how the physical properties of the Golgi cisternae, namely, cisternal number and inter-cisternal transport rate, may drive the diversity of glycans (Varki, 2011; Dennis et al., 2009). There is substantial correlative evidence to support the idea that cell types that carry out extensive glycan processing employ larger numbers of Golgi cisternae. For example, the salivary Brunner’s gland cells secrete mucous that contains heavily O-glycosylated mucin as its major component (Van Halbeek et al., 1983). The Golgi complex in these specialized cells contain 9–11 cisternae per stack. Additionally, several organisms such as plants and algae secrete a rather diverse repertoire of large, complex glycosylated proteins, for a variety of functions (McFarlane et al., 2014; Koch et al., 2015; O’Neill et al., 2004; Hayashi and Kaida, 2011; Kumar et al., 2011; Gow and Hube, 2012; Atmodjo et al., 2013; Free, 2013; Pauly et al., 2013; Burton and Fincher, 2014). These organisms possess enlarged Golgi complexes with multiple cisternae per stack (Becker and Melkonian, 1996; Mironov et al., 2017; Donohoe et al., 2007; Mogelsvang et al., 2003; Ladinsky et al., 2002).

We define diversity as the total number of glycan species produced above a specified threshold abundance cth. This last condition is necessary because very small peaks will not be distinguishable in the presence of noise. In computing the diversity from our chemical synthesis model, we have chosen the threshold to be cth=1/Ns, where Ns is the total number of glycan species. We have checked that the qualitative results do not depend on this choice (see Appendix 10—figure 1).

We use the sigmoid function (1+e-x/τ)-1 as a differentiable approximation to the Heaviside function Θ(x) and define the following optimization to maximize diversity for a given set of parameter values, NE,NC,σ:

Diversity(σ,NC,NE):=maxμ,R,Li=1Ns(1+eNs(cicth))1s.t.RminRα(j)Rmax,μminμ(j)μmax,

where, as before, (μmax,μmin)=(1,0.01)/min, and (Rmax,Rmin)=(20,0.018)/min, and cth=1/Ns is the threshold. See Appendix 2 for details on the parameter estimation.

The results displayed in Figure 7a show that for a fixed specificity σ the diversity at first increases with the number of cisternae NC, and then saturates at a value that depends on σ. For very-high-specificity enzymes, one can achieve very high diversity by appropriately increasing NC. This establishes the link between glycan diversity and cisternal number. However, this link is correlational at best since there are many ways to achieve high glycan diversity – notably by increasing the number of enzymes.

Strategies for achieving high glycan diversity.

Diversity versus NC and transport rate μ at various values of specificity σ for fixed NE=3. (a) Diversity vs. NC at optimal transport rate μ. Diversity initially increases with NC, but eventually levels off. The levelling off starts at a higher NC when σ is increased. These curves are bounded by the σ=0 curve. (b) Diversity vs. cisternal residence time (μ-1) in units of the reaction time (Rmin-1) at various value of σ, for fixed NC=4 and NE=10.

On the other hand, one of the goals of glycoengineering is to produce a particular glycan profile with low heterogeneity (Fisher et al., 2019; Jaiman and Thattai, 2018). For low-specificity enzymes, the diversity remains unchanged upon increasing the cisternal residence time. For enzymes with high specificity, the diversity typically shows a non-monotonic variation with the cisternal residence time. At small cisternal residence time, the diversity decreases from the peak because of the early exit of incomplete oligomers. At large cisternal residence time, the diversity again decreases as more reactions are taken to completion. Note that the peak is generally very flat, which is consistent with the results in Fisher et al., 2019. To get a sharper peak, as advocated for instance by Jaiman and Thattai, 2018, one might need to increase the number of high-specificity enzymes NE further.

Discussion

The precision of the stereochemistry and enzymatic kinetics of these N-glycosylation reactions (Varki, 2009) has inspired a number of mathematical models (Umaña and Bailey, 1997; Krambeck et al., 2009; Krambeck and Betenbaugh, 2005) that predict the N-glycan distribution based on the activities and levels of processing enzymes distributed in the Golgi cisternae of mammalian cells and compare these predictions with N-glycan mass spectrum data. Models such as the KB2005 model (Umaña and Bailey, 1997; Krambeck and Betenbaugh, 2005; Krambeck et al., 2009) are extremely elaborate (with a network of 22,871 chemical reactions and 7565 oligosaccharide structures) and require many chemical input parameters. These models have an important practical role to play, that of being able to predict the impact of the various chemical parameters on the glycan distribution, and to evaluate appropriate metabolic strategies to recover the original glycoprofile. Additionally, a recent study by Ungar and coworkers (Fisher et al., 2019; Fisher and Ungar, 2016) shows how physical parameters, such as overall Golgi transit time and cisternal number, can be tuned to engineer a homogeneous glycan distribution. Overall, such models can help predict glycosylation patterns and direct glycoengineering projects to optimize glycoform distributions.

Our focus is different. We are interested in the role of glycans as a marker or molecular code of cell identity (Gabius, 2018; Varki, 2017; Pothukuchi et al., 2019), and in particular, understanding enzymatic and transport processes located in the secretory apparatus of the cell that ensure that this code is generated with high fidelity. To do this, we have had to develop a new formal apparatus that allows us to address these questions and discuss trade-offs between competing drives. Since our analysis draws on many diverse fields, we provide a short summary of the assumptions, methods, and results of the article before discussing the implications of our work.

  1. The glycan profile on the cell surface is a marker of cell-type identity (Varki, 2009; Gabius, 2018; Varki, 2017; Pothukuchi et al., 2019). We define the complexity of a glycan profile to be the minimum number of GMM components required to approximate the profile to within the noise floor. We show that with this definition of complexity more complex organisms correlate with higher complexity glycan profiles. We use this to analyse the complexity of the glycan profiles of planaria, hydra, and mammalian cells (Drickamer and Taylor, 1998).

  2. The glycans at the cell surface are the end product of a sequential chemical processing via a set of enzymes resident in the Golgi cisternae and transport across cisternae (Varki, 2017; Varki, 1998; Pothukuchi et al., 2019). We have proposed a general model for chemical synthesis and transport that, in principle, allows us to compute the synthesized glycan distribution at the cell surface as a function of the enzymes NE, reaction rates R, enzyme configurations L, specificity of enzymes σ, number of cisternae NC, and transport rates μ. However the large dimension of the search space makes this optimization intractable. We thus use a simplified synthesis model with fewer parameters; while our quantitative results are based on this simplified model, we believe that at a qualitative level our results have more general validity.

  3. We define the fidelity of a synthesis mechanism as the minimum normalized KL divergence (Cover and Thomas, 2012; MacKay, 2003) between synthesized glycan distribution on the cell surface and a ‘target’ profile.

  4. The results of the optimization over rates R and enzyme configurations L for a given value of (NC,NE,σ) and a target distribution c of given complexity are given in Figure 3 and Figure 4. Here, we highlight some qualitative consequences of the model:

    1. Keeping the number of enzymes fixed, a more elaborate transport mechanism (via control of NC and μ) is essential for synthesizing high-complexity target distributions to within a high fidelity, or equivalently, low error (Figure 4a and b). Fewer cisternae cannot be compensated for by optimizing the enzymatic synthesis via control of parameters R, L, and σ. An empirical verification of this would involve a coordinated analysis of the glycan profiles, ultrastructure of Golgi, and the number of glycosylation enzymes across many species.

    2. Thus, our study suggests that the requirement that a glycan code of a given complexity be synthesized with sufficiently high fidelity imposes functional control on the Golgi cisternal number. It also provides an argument for the evolutionary requirement of multiple compartments by demonstrating that the fidelity and robustness of the glycan code arising from a chemical synthesis that involves multiple cisternae are higher than the one that involves a single cisterna (keeping everything else fixed) (see Figure 4a and b and Figure 6) This feature, that with multiple cisternae and precise enzyme partitioning one may generically achieve a highly accurate representation of the target distribution, has been highlighted in an algorithmic model of glycan synthesis Jaiman and Thattai, 2018.

      Combining (a) and (b), our study quantitatively shows that constructing a high-fidelity representation of a complex target distribution, such as those observed in real cells, requires a complex Golgi machinery with multiple cisternae, precise enzyme partitioning, and control on enzyme specificity. This definition of fidelity of the glycan code allows us to provide a quantitative argument for the evolutionary requirement of multiple compartments. While it is possible to produce complex glycan distributions in one compartment using a large number of enzymes, such a design would inevitably require a more elaborate genetic cost.

    3. Organisms such as plants and algae have a diverse repertoire of glycans that are utilized in a variety of functions (McFarlane et al., 2014; Koch et al., 2015; O’Neill et al., 2004; Hayashi and Kaida, 2011; Kumar et al., 2011; Gow and Hube, 2012; Atmodjo et al., 2013; Free, 2013; Pauly et al., 2013; Burton and Fincher, 2014). Our study shows that it is optimal to use low-specificity enzymes to synthesize target distributions with high diversity (Figure 7). However, this compromises on the complexity of the glycan distribution, revealing a tension between complexity and diversity. One way of relieving this tension is to have larger NE and NC.

    4. Our study shows that for a fixed NC and NE, there is an optimal enzyme specificity that achieves the lowest distance from a given target distribution. As we see in Figure 4d, this optimal enzyme specificity can be very high for highly complex target distributions. Such high specificity can lower fitness when the environment, and hence the target glycan distribution, fluctuates rapidly, and the synthesis parameters cannot change rapidly enough to track the environment (Nam et al., 2012; Peracchi, 2018). This compromise, between robustness to a changing environment and high fidelity in synthesizing high-complexity glycan profiles, is achievable by sloppy enzymes coupled with error-correcting mechanisms (Nam et al., 2012; Peracchi, 2018). However, sloppy enzymes create ‘wrong’ glycans, and therefore, ex-post error-correcting mechanisms must be in place to correct synthesis errors to ensure high fidelity of the glycan code. A task for the future is to understand the role of intracellular transport in providing non-equilibrium proofreading mechanisms to reduce such coding errors, and its optimal adaptive strategies and plasticity in a time-varying environment.

      Combining (c) and (d), we find that keeping the number of enzymes fixed, having low specificity or sloppy enzymes, and larger cisternal number could give rise to a diverse repertoire of functional glycans. Sloppy or promiscuous enzymes bring in the potential for evolvability (Kirschner and Gerhart, 2008), and sloppiness allows the system to be stable to random mutations in proteins or variations in the target distribution.

    5. The model solution is degenerate, in the sense that there are many equally good global minimas. These degeneracies are both continuous and discrete. The continuous degeneracies correspond to regions in the reaction rate – transport rate space, moving along which will not change the concentration profile, thus ensuring robustness to internal noise. This suggests that the distribution is robust to slight cell-to-cell variations in these kinetic parameters.

    6. Our model implies that close to a local minima the inter-cisternal transport rate μ and the specificity of the enzymes σ are stiff directions, that is, the cell should exercise tighter control on μ and σ as compared to the other parameters. The reaction rates close to the local minima are sloppy directions, and moving along these directions does not change the glycan profile much.

  5. Taken together, our quantitative analysis of the trade-offs has deep implications for non-equilibrium self-assembly of the Golgi cisternae, and suggests that the non-equilibrium control of cisternal number must involve a coupling of non-equilibrium self-assembly of cisternae with enzymatic chemical reaction kinetics (Glick and Malhotra, 1998).

Admittedly the chemical network that we have considered here is much simpler than the chemical network associated with the possible protein modifications in the secretory pathway. For instance, typical N-glycosylation pathways would involve the glycosylation of a variety of GBPs. Further, apart from N-glycosylation, there are other glycoprotein, proteoglycan, and glycolipid synthesis pathways (Alberts, 2002; Varki, 2009; Pothukuchi et al., 2019). Our task has been to get at a qualitative understanding using quantitative methods and thereby to arrive at general principles. We believe our analysis is generalizable and that the qualitative results we have arrived at would still hold. To conclude, our work establishes the link between the cisternal machinery (chemical and transport) and high-fidelity synthesis of a complex glycan code. We find that the pressure to achieve the target glycan code for a given cell type places strong constraints on the cisternal number and enzyme specificity (Sengupta and Linstedt, 2011). An important implication is that a description of the non-equilibrium self-assembly of a fixed number of Golgi cisternae must combine the dynamics of chemical processing and membrane dynamics involving fission, fusion, and transport (Sengupta and Linstedt, 2011; Sachdeva et al., 2016; Sens and Rao, 2013). We believe that this is a promising direction for future research.

Appendix 1

Constructing target distributions for glycans of a given cell type

The distribution of the glycans on the cell surface is obtained via mass spectrometry. The x-axis of mass spectroscopy (MS) graphs is mass/charge of the ionized sample molecules, and the y-axis is relative intensity corresponding to each mass/charge value, taking the highest intensity as 100%. This relative intensity roughly correlates with the relative abundances of the molecules in the sample.

The raw MS data is noisy and cannot be directly used as the target distribution in our optimization problem. There are three major sources of noise in the MS data (Du et al., 2008): chemical noise in the sample, the Poisson noise associated with detecting discrete events, and the Nyquist–Johnson noise associated with any charge system. We propose a simple model that accounts for the chemical noise and the Poisson sampling noise. Using this noise model and the available MS data, we generate parametric bootstrap samples of glycan measurements and fit a GMM on this sample to approximate the glycan distribution. This GMM probability distribution is used as the target distribution in our numerical experiments.

The MS data obtained from Cummings and Crocker, 2020; Subramanian et al., 2018; Sahadevan et al., 2014 had mass ranging between 500 and 5000 Da with intensity reported at every 0.0153 Da. We first bin this MS data into 180 bins and take the maximum value within each bin as the value of intensity for that bin. Appendix 1—figure 1 plots the raw MS data and the binned distribution. Next, we describe the parametric bootstrap model that we used to generate the glycan data. Let I¯k represent the relative intensity of the kth bin in the binned MS graph. We generate a sample population of glycans using the MS data in the following way:

  1. Poisson sampling noise: The MS data does not have absolute count information. We assume an arbitrary maximum count Imax and define the intensity Ik=ImaxI¯k. The plots in Appendix 1—figure 2a show that the results are not sensitive to the specific value of Imax.

  2. Chemical noise: The sample used for MS analysis also contains small amounts of molecules that are not glycans. These appear as very small peaks in the MS data. We assume that the probability pk that the peak at index k corresponds to a glycan is given by

    pk=1-e-IkImax=1-e-I¯k

    which adequately suppresses this chemical noise.

  3. Bootstrapped glycan data: The count nk at the glycan index k is distributed according to the following distribution:

    nk={0(1-pk)n=0npke-Ik(Ik)nn!n1.

    We assume that the MS data was generated from N different cells. Thus, the total count at glycan index k is given by the sum of N i.i.d. samples distributed according to the distribution above. In Appendix 1—figure 2b, we show that results are insensitive to N. We normalize the count distribution by the total number of counts across all the bins to obtain the bootstrapped probability mass function PT.

Appendix 1—figure 1
The binned mass spectroscopy (MS) data (blue) approximates the raw MS data (red) very well.

We use this binned data for Gaussian mixture model (GMM) approximation of the MS data.

Appendix 1—figure 2
Log likelihood vs. number of components (N) in the Gaussian mixture model (GMM).

We see that the log likelihood saturates at around,N=20, thus 20-GMM is a very good representation of the mass spectroscopy (MS) data from human T-cells. The different symbols are for (a) different values of the maximum intensity Imax=50,100,200 and (b) different values of the number of i.i.d. samples,Ni=500,1000,2000 showing the insensitivity of the log likelihood to the value of Imax and.Ni

The bootstrapped distribution pT is noisy, and hence cannot be used directly as the target distribution. We use a GMM-based approach to denoise the raw data. The advantage of using a GMM-based approach is that it creates an easily interpretable hierarchy of increasingly more detailed distributions to approximate the mass spectrometry profile. We define the complexity of a mass spectrometry profile as the minimum number of components (individual Gaussians) in the GMM model required to approximate it. The details of the GMM calculations are as follows. We fix the number of components m. We want to approximate the bootstrapped probability pT by the m-component mixture of Gaussian distributions pGMM(θ)=i=1mwiNηi,Δi, where Nηi,Δi denotes the Gaussian distribution with mean ηi and variance Δi, wi0 and i=1mwi=1, the parameter vector θ=(w,η,Δ) . We compute the optimal m-component GMM approximation by minimizing the KL divergence D(pT||pGMM(θ)) as a function of parameter vector θ. Since

D(pT||pGMM(θ)):=k=1NspT(k)log(pT(k)imwiNηi,Δi(k))=k=1NspT(k)logpT(k)k=1NspT(k)log(imwiNηi,Δi(k)),

the optimization problem minθD(pT||pGMM(θ)) is equivalent to

maxθg(θ):=k=1NSpT(k)log(i=1mwiNηi,Δi(k))

This is a non-convex optimization. We use an expectation-maximization (EM)-based iterative heuristic to compute a local maximum. Let θ(t) denote the current value of the parameters. For each component i=1,,m, and index k=1,,Ns, define

zi(t)(k)=wi(t)Nηi(t),Δi(t)(k)j=1mwj(t)Nηj(t),Δj(t)(k).

Then zi(t)(k)0, and i=1mzi(t)(k)=1. We interpret zi(t)(k) as the probability that the count in bin k came from component i. Define

Q(θ,θ(t))=k=1NSi=1mpT(k)zi(t)(k)log(wiNηi,Δi(k)zi(t)(k))

Then, we have that

Q(θ^,θ^)=k=1NSi=1mpT(k)z^i(k)log(i=1mwiNηi,Δi(k))=k=1NSpT(k)log(i=1mwiNηi,Δi(k))=g(θ^),

and

g(θ)=k=1NSpT(k)log(i=1mwiNηi,Δi(k)zi(t)(k)zi(t)(k))k=1NSi=1mpT(k)zi(t)(k)log(wiNηi,Δi(k)zi(t)(k))=Q(θ,θ(t)).

Define

(8) θ(t+1)=argmaxθQ(θ,θ^)

Then, we have that

g(θ(t+1))Q(θ(t+1),θ(t))Q(θ(t),θ(t))=g(θ(t)).

Therefore, the iterative algorithm in (Equation 11) generates a sequence {θ(t):t1} with non-decreasing values of g, and the sequence converges to a local maximum. Next, we show that the optimization in (8) can be computed efficiently.

  1. w-update

    (9) w(t+1)=argmaxwk=1NSi=1mpT(k)zi(t)(k)log(wi)wi(t+1)=k=1NSzi(t)(k)pT(k)i=1mk=1NSzi(t)(k)pT(k)
  2. η-update

    (10) ηi(t+1)=argminηk=1NSpT(k)z^i(k)|kηi|2ηi(t+1)=k=1Nszi(t)(k)kk=1Nszi(t)(k).
  3. Δ-update

    (11) Δi(t+1)=argmaxΔΔcut{k=1NspT(k)zi(t)(k)|kηit+1|22Δlog(Δ)}=max(k=1NSpT(k)zi(t)(k)|kηi(t+1)|2k=1NspT(k)zi(t)(k),Δcut),

where Δcut is the minimum allowed width of the Gaussians, in our case Δcut=1 since glycan index, k{1,2,NS}, takes integer values with spacing 1.

Since this is a heuristic algorithm for a non-convex optimization, we performed several initializations of the algorithm to identify the best local maximum.

The number of components m in a GMM is a free parameter. The KL divergence between the true and GMM approximated (D(pT||pGMM)), shown in Figure 1, saturates at some value of the number of components, and adding components beyond this only increases model complexity without increasing the quality of approximation. We define the complexity of a mass spectrometry data by the number of components at which saturation is reached. We compare the complexity of glycan profiles of hydra, planaria, and humans. The numbers of cell types in hydra, planaria, and humans are around 41 (Siebert et al., 2019), 44 (Fincher et al., 2018), and 103 (Han et al., 2020), respectively, based on transcriptome analysis (these are lower bounds based on the main cell types, and especially for planaria and hydra, are subject to constant revision). Our analysis of the MSMS data of these organisms suggests that organisms with fewer cell types have less complex glycan distribution.

Appendix 2

Parameter estimation

The typical transport time of glycoproteins across the Golgi complex is estimated to be in the range 15–20 mins (Umaña and Bailey, 1997), which corresponds to the transport rate μ=0.18/min. We bound the transport rate for our optimization between 0.01/min and 1/min.

Next, we estimate the range of values for the chemical reaction rates. The injection rate q is in the range 100–1500 pmol/106 cell 24 hr (Umaña and Bailey, 1997; Krambeck et al., 2009). For our calculation, we set q=387.30 pmol/106 cells 24 hr = 0.27 pmol/106 cells min, where 387.30 is the geometric mean of 100 and 1500. We set the range for the enzymatic rate R to be

Rmin=minα{Vmax(α)/νKM(α)+1νqμ}RRmax=maxα{Vmax(α)/νKM(α)}.

where KM(α) and Vmax(α) denote the Michaelis constants and Vmax of the αth enzyme. The conversion from 1 pmol/106 cells to concentration can be obtained by taking cisternal volume (ν) to be 2.5 µm3 (Umaña and Bailey, 1997; Krambeck et al., 2009). This gives

(12) 1 pmoles/106 cells=1012moles106×2.5×1018×103litre=400μM.

In Appendix 2—table 1, we report the parameters for the eight enzymes taken from Table 3 in Umaña and Bailey, 1997. From these parameters, it follows that

Rmin=minα{Vmax(α)/νKM(α)+1νqμ}=Vmax(7)/νKM(7)+1νqμ=.16×400μM/min3400μM+149.4μM=0.018min1
Rmax=maxα{Vmax(α)/νKM(α)}=Vmax(1)/νKM(1)=5×400μM/min100μM=20min1
Appendix 2—table 1
Enzyme parameters taken from Table 3 in Umaña and Bailey, 1997 that we use to calculate the bounds on the reaction rate R.

Here, KM(α) and Vmax(α) denote the Michaelis constant and Vmax of the αth enzyme.

αKM(α)(µmol)Vmax(α)(pmol/106 cell-min)
11005
22607.5
32005
41005
51902.33
6130.16
73400.16
840009.66

Appendix 3

Kinetics of sequential chemical reactions and transport

On including the rates of injection (q) and inter-cisternal transport μ(j) from the cisterna j to j+1 into the chemical reaction kinetics, the change in substrate concentrations ck(j) with time is given by

(13) dc1(1)dt=qα=1NEV(1,1,α)P(1)(1,α)c1(1)M(1,1,α)(1+k=1NsP(1)(k,α)ck(1)M(1,k,α))μ(1)c1(1)dck(1)dt=α=1NEV(1,k1,α)P(1)(k1,α)ck1(1)M(1,k1,α)(1+k=1NsP(1)(k,α)ck(1)M(1,k,α))α=1NEV(1,k,α)P(1)(k,α)ck(1)M(1,k,α)(1+k=1NsP(1)(k,α)ck(1)M(1,k,α))μ(1)ck(1)dcNs(1)dt=α=1NEV(1,Ns1,α)P(1)(Ns1,α)cNs1(1)M(1,Ns1,α)(1+k=1NsP(1)(k,α)ck(1)M(1,k,α))μ(1)cNs(1)

for cisterna-1, and

(14) dc1(j)dt=μ(j1)c1(j1)α=1NEV(j,1,α)P(j)(1,α)c1(j)M(j,1,α)(1+k=1NsP(j)(k,α)ck(j)M(j,k,α))μ(j)c1(j)dck(j)dt=μ(j1)ck(j1)+ α=1NEV(j,k1,α)P(j)(k1,α)ck1(j)M(j,k1,α)(1+k=1NsP(j)(k,α)ck(j)M(j,k,α))α=1NEV(j,k,α)P(j)(k,α)ck(j)M(j,k,α)(1+k=1NsP(j)(k,α)ck(j)M(j,k,α))μ(j)ck(j)dcNs(j)dt=μ(j1)cNs(j1)+α=1NEV(j,Ns1,α)P(j)(Ns1,α)cNs1(j)M(j,Ns1,α)(1+k=1NsP(j)(k,α)ck(j)M(j,k,α))μ(j)cNs(j)

for cisternae j=2,3,,NC. These set of dynamical equations (13)-(14), with initial conditions, can be solved to obtain the concentration ck(j)(t) for t0. Equations (13)-(14) automatically obey the conservation law for the protein concentration (p), that is, the total protein concentration p(j)=k=1Nsck(j) in the jth cisterna automatically satisfies

dp(1)dt=qμ(1)p(1)dp(j)dt=μ(j1)p(j1)μ(j)p(j)

for j=2,3,NC.

At steady state, the left-hand side of equations (13)-(14) is set to zero, which, after rescaling the kinetic parameters in terms of the injection rate q, that is, V(j,k,α)=V(j,k,α)/q and μ(j)=μ(j)/q, gives the following recursion relations for the steady-state concentrations of the glycans in each cisterna. In the first cisterna,

(15) c1(1)=1μ(1)+α=1NEV(1,1,α)P(1)(1,α)c1(1)M(1,1,α)(1+k=1NsP(1)(k,α)ck(1)M(1,k,α))ck(1)=α=1NEV(1,k1,α)P(1)(k1,α)ck1(1)M(1,k1,α)(1+k=1NsP(1)(k,α)ck(1)M(1,k,α))μ(1)+α=1NEV(1,k,α)P(1)(k,α)ck(1)M(1,k,α)(1+k=1NsP(1)(k,α)ck(1)M(1,k,α))cNs(1)=α=1NEV(1,Ns1,α)P(1)(Ns1,α)cNs1(1)M(1,Ns1,α)(1+k=1NsP(1)(k,α)ck(1)M(1,k,α))μ(1)

and in cisternae j2,

(16) c1(j)=μ(j1)c1(j1)μ(j)+α=1NEV(j,1,α)P(j)(1,α)c1(j)M(j,1,α)(1+k=1NsP(j)(k,α)ck(j)M(j,k,α))ck(j)=μ(j1)ck(j1)+α=1NEV(j,k1,α)P(j)(k1,α)ck1(j)M(j,k1,α)(1+k=1NsP(j)(k,α)ck(j)M(j,k,α))μ(j)+α=1NEV(j,k,α)P(j)(k,α)ck(j)M(j,k,α)(1+k=1NsP(j)(k,α)ck(j)M(j,k,α))cNs(j)=μ(j1)cNs(j1)+α=1NEV(j,Ns1,α)P(j)(Ns1,α)cNs1(j)M(j,Ns1,α)(1+k=1NsP(j)(k,α)ck(j)M(j,k,α))μ(j)

Equations (15)-(16) automatically imply that the total steady-state glycan concentration in each cisterna j=1,,Nc is given by

k=1Nsck(j)=1μ(j).

Appendix 4

Reformulation of Optimization A

Define a new set of parameters,

(17) R(j,k,α)=α=1NEV(j,k,α)M(j,k,α)(1+k=1NsP(j)(k,α)ck(j)M(j,k,α))

where c denotes the steady-state glycan concentration corresponding to a specific (μ,M,V,L). Define v by the following set of linear equations:

(18) v1(1)=1μ(1)+α=1NER(1,1,α)P(1)(1,α)vk(1)=vk1(1)α=1NER(1,k1,α)P(1)(k1,α)μ(1)+α=1NER(1,k,α)P(1)(k,α)vNs(1)=vNs1(1)α=1NER(1,Ns1,α)P(1)(Ns1,α)μ(1)

for j=1, and

(19) v1(j)=v1(j1)μ(j1)μ(j)+α=1NER(j,1,α)P(j)(1,α)vk(j)=vk(j1)μ(j1)μ(j)+α=1NER(j,k,α)P(j)(k,α)+vk1(j)α=1NER(j,k1,α)P(j)(k1,α)μ(j)+α=1NER(j,k,α)P(j)(k,α)vNs(j)=vNs(j1)α=1NER(j,Ns1,α)P(j)(Ns1,α)μ(j)+vNs(j1)μ(j1)μ(j)

for j=2,,NC. Then, by the definition of R in (17), it trivially follows that the steady-state concentration c corresponding to (μ,M,V,L) is a solution for (18)–(19).

Next, we show that for v obtained from (18)–(19) for any parameter (μ,R,L), there exists parameter (μ,M,V,L) such that (15)–(16) are automatically satisfied when we set c=v, that is, v is the steady-state concentration for (μ,M,V,L), and vice versa. Let

A={[ck(j)]j,k:μ(j)0,M(j,k,α)0,V(j,k,α)0,lα(j)0,[ck(j)]jk given by (15) and (16),}

and let

B={[vk(j)]j,k:μ(j)0,R(j,k,α)0,lα(j)0[vk(j)]j,k given by (18) and (19)}.

Then, our task is to show that A=B. Suppose [ck(j)]j,kA. Let [M(j,k,α)], [V(j,k,α)], and [lα(j)] be the corresponding parameters. Define

R(j,k,α)=α=1NEV(j,k,α)M(j,k,α)(1+k=1NsP(j)(k,α)ck(j)M(j,k,α))0

Then [ck(j)]j,k.

Next, suppose [vk(j)]j,k. Let [R(j,k,α)], [lα(j)] denote the corresponding parameters. Since k=1Nsvk(j)=1/μ(j)<, it follows that k=1NsP(j)(k,α)vk(j)<1/μ(j)<. Thus, there exist parameters [M(j,k,α)], [V(j,k,α)], and [lα(j)] such that

(20) R(j,k,α)=α=1NEV(j,k,α)M(j,k,α)(1+k=1NsP(j)(k,α)vk(j)M(j,k,α))

Therefore, [vk(j)]j,k satisfy (15) and (16), that is, [vk(j)]j,kA.

Thus, the set of all concentration profiles defined by (18)–(19) as a function of all possible values of the parameters (μ,R,L) is identical to the set defined by (15)–(16) as a function of (μ,M,V,L). This is a crucial insight since it allows us to search the entire parameter space using (18)–(19), where the concentration is known explicitly in terms of (μ,R,L).

To pose the new optimization problem, it is convenient to define v¯i(j)=μ(j)vi(j), for j=1,,NC, and i=1,,Ns. Furthermore, the glycan distribution of the surface is given by μNCviNC=v¯i. Thus, it follows that Optimization B

(21) D¯(σ,NE,NC,c):=minμ,R,L 0F(cv¯)

is equivalent to (6). Since v is explicitly known as a function of (R,L), Optimization B (21) is a more tractable optimization problem than (6). Note that in this setting the function D(σ,NE,NC,c) in (21) is independent of the rates μ. See Appendix 4—figure 1 for a flow chart of the two optimization schemes.

Appendix 4—figure 1
Flow chart showing the optimization schemes for Optimization A and B.

We prove that DminA=DminB by showing the set of all c is equal to the set of all v. We additionally establish that the optimum vmin=cmin.

While 21 is easy to implement, we note that the parameters (e.g. reaction rates, specificity) are not constrained to take only physically relevant values; a legitimate concern is that the absence of such physicochemical constraints might drive this optimization to physically unrealistic solutions.

There are two possible ways to impose these parameter constraints. One is to impose constraints on the ‘microscopic’ chemical parameters, such as the rate of individual reactions R(j,k,α) and the inter-cisternal transport rate μ(j). These take into consideration constraints arising from molecular enzymatic processes. The other is to impose constraints on ‘global’ physical parameters, such as the total transport time across the Golgi cisternae and the average enzymatic reaction time. Here, we impose constraints on the microscopic reaction and transport parameters to define constrained Optimization B:

D¯(σ,NC,NE,c):=minμ,R,LF(cv¯)s.t.RminR(j,k,α)Rmax,μminμ(j)μmax1α(j)NS.

The upper and lower bounds on the rates R and μ are estimated in Appendix 2: μmax=1/min (resp. μmin=.01/min) and Rmax=20/min (resp. Rmin=.018/min).

Appendix 5

Numerical scheme for performing the non-convex optimization

We solve Optimization C using the numerical scheme detailed below. The optimization problem consists of minimizing a non-convex objective with linear box constraints. We use the MATLAB FMINCON function to solve this optimization. We use SQP, a gradient-based iterative optimization scheme for solving optimizations with non-linear differentiable objective and constraints. Since our problem is non-convex and SQP only gives local minima, we initialize the algorithm with many random initial points. We use SOBOLSET function of MATLAB to generate space filling pseudo-random numbers. We have taken 1000 initializations for each NE,NC, and σ value. We have taken 50 equally spaced points between 0 and 1 to explore the σ-space for Figure 3. Some minor fluctuations in D due to non-convexity of the objective function in the final results were smoothed out by taking the convex hull of the D vs. σ graph. The results for σmin(NE,NC) and D(σmin,NE,NC) (Figure 4) were obtained by adding σ to the optimization vector and then performing the optimization.

A similar numerical scheme was used to optimize diversity.

Appendix 6

Analytical solution when Ns1

It is possible to obtain analytical expressions for the steady-state glycan distribution in the limit Ns1 when the glycan index k can be approximated by a continuous variable. In this case, (15)–(16) can be cast as differential equations,

(22) dck(1)dkck(1)ck1(1)=(α=1NER(1,k1,α)exp(σ|k1lα(1)|)μ(1)+α=1NER(1,k,α)exp(σ|klα(1)|)1)ck1(1)(μ(1)+ddkα=1NER(1,k,α)exp(σ|klα(1)|)μ(1)+α=1NER(1,k,α)exp(σ|klα(1)|))ck(1),

and

(23) dck(j)dkck(j)ck1(j)=μ(j1)μ(j)+α=1NER(j,k,α)exp(σ|klα(j)|)ck(j1)(μ(j)+ddkα=1NER(j,k,α)exp(σ|klα(j)|)μ(j)+α=1NER(j,k,α)exp(σ|klα(j)|))ck(j)

for j=2,,NC. In equation 22 and equation 23,

(24) ddkα=1NER(j,k,α)exp(σ|klα(j)|)=α=1NER(j,k,α)σexp(σ|klα(j)|)(12I(klα))+R(j,k,α)exp(σ|klα(j)|)

where the indicator function I() is equal to 1 if the argument is true, and 0 otherwise and R(j,k,α) is the derivative of R(j,k,α) with respect to k.

Define a vector function C(k)cN of the continuous variable k by C(k)=[ck(1),ck(2),ck(NC)]. Then, (equation 22) and (equation 23) can be written as

(25) dC(k)dk=M(k)C(k)

where the matrix M(k) is given by

(26) M(k)=[A(1)(k)0000B(2)(k)A(2)(k)0000B(3)(k)A(3)(k)0000B(NC)(k)A(NC)(k)]

with

A(j)(k)=μ(j)+ddkα=1NER(j,k,α)exp(σ|klα(j)|)μ(j)+α=1NER(j,k,α)exp(σ|klα(j)|)B(j)(k)=μ(j1)μ(j)+α=1NER(j,k,α)exp(σ|klα(j)|)

The functions A(j)(k) and B(j)(k) involve absolute value and indicator functions; therefore, the differential equation has to be solved in a piecewise manner assuming continuity of solution C(k).

The general solution of (25)

(27) C(k)=C0exp(Ω(k))

is written in terms of the Magnus function Ω(k)=n=1Ω(n,k), obtained from the Baker–Campbell–Hausdorff formula (Blanes et al., 2009),

Ω(1,k)=0kM(k1)dk1Ω(2,k)=120kdk10k1dk2[M(k1),M(k2)]Ω(3,k)=160kdk10k1dk20k2dk3[M(k1),[M(k2),M(k3)]]+[M(k3),[M(k2),M(k1)]]

where is the commutator, and the higher order terms in … contain higher order nested commutators.

Here, we establish conditions under which the series n=1Ω(n,k) that defines solution C(k) to the differential equation (25) converges. We also solve (25) for some special cases.

The commutator

[M(k1),M(k2)]=[00000a210000a31a320000a42a43000an,n-2an,n-10]

where

ai,i1=A(i1)(k2)B(i)(k1)+A(i)(k1)B(i)(k2)A(i1)(k1)B(i)(k2)+A(i)(k2)B(i)(k1)ai,i2=B(i1)(k2)B(i)(k1)B(i1)(k1)B(i)(k2)

The general form of Ω(n,k) is given by Blanes et al., 2009

(28) Ω(n,k)=znn!0kdk10k1dk20kn2dkn10kn1dknlWlM(kp1l)M(kp2l)M(kpnl)

where (p1(l),p2(l)pn(l)) is a permutation of (1,2,3,n), Wl{-1,1}, and zn1,n.

Let A¯=maxk,l,m|Ml,m(k)|. Define

M¯=[A¯0000A¯A¯0000A¯A¯0000A¯A¯]

We can bound all the matrix elements of Ω(n,k) in the following way:

(29) Ωlm(n,k)znM¯l,mn0kdk10k1dk20kn1dkn=znM¯n|lmknn!

The matrix

M¯n=[a110000a21a22000a31a32a3300a41a42a43a440an1an,n-2an,n-1ann]

where alm=Slm(n)A¯n for appropriately defined polynomials Sl,m(n). Thus, it follows that ΩlmznSlm(n)(A*)nknn! and Ωl,m(k)n=1znSl,m(n)(A*)nknn!. Consequently, the series will converge if A¯k<1, that is, k1A¯. Assuming μ(j)=μj, we can bound A¯ as

(30) A¯maxj,k(μ+σα=1NER(j,k,α)exp(-σ|k-lα(j)|)μ+α=1NER(j,k,α)exp(-σ|k-lα(j)|)+α=1NER(j,k,α)exp(-σ|k-lα(j)|)μ+α=1NER(j,k,α)exp(-σ|k-lα(j)|))

Since the parameters μ,σ,R(j,k,α),lα(j), and NE are finite and positive, and R(j,k,α) is finite, A¯ has a finite upper bound, implying that k is always greater than zero, and the series has a finite radius of convergence.

While in principle we can obtain the glycan profile for any NE and NC with arbitrary accuracy, assuming R(j,k,α)=Rα(j), we provide explicit formulae for a few representative cases: (i) (NE=1,NC=1) and (ii) (NE=1,NC=2).

i. NE=1,NC=1: The solution of the differential equation is given by

(31) c(k)={c0ek(μ+Rexp(σ(lk))μ+Rexp(σl))(1/σ)1klc(l)e(kl)(μ+Rμ+Rexp(σ(kl)))(1/σ)+1k>l

A representative concentration profile is plotted in Appendix 6—figure 1. The concentration profile consists of two distinct components: an initial exponential decay, and then an exponential rise and fall concentrated around l. The relative weight of these two components is controlled by the sensitivity σ and the rate R. Such explicit formulae can be obtained for any NE>1, as long as NC=1.

Appendix 6—figure 1
Glycan concentration profile calculated from the model using (a) formula (31) for NE=NC=1 and (b) formulae (32)–(36) for NE=1,NC=2.

ii. NE=1,NC=2: The concentration profile c(2) in cisterna-2 can be obtained from the following calculation. Let l(j) denote the ‘length’ of the enzyme in cisterna j=1,2. For kmin{l(1),l(2)},

(32) c(2)(k)=c0μ(1)ek(μ(2)+R(2)exp(σ(l(2)k))μ(1)+R(1)eσl(1))(1/σ)10k(μ(1)+R(1)exp(σ(l(1)k)))(1/σ)1(μ(2)+R(2)exp(σ(l(2)k)))1/σdk+c(2)(0)ek(μ(2)+R(2)eσ(l(2)k)μ(2)+R(2)eσl(2))(1/σ)1

Next, consider the case where l(1)l(2). Then, for l(1)<kl(2)

(33) c(2)(k)=c(1)(l(1))μ(1)e(kl(1))(μ(1)+R(1))(1/σ)+1(μ(2)+R(2)exp(σ(l(2)k)))(1/σ)1l(1)k(μ(2)+R(2)exp(σ(l(2)k)))1/σ(μ(1)+R(1)exp(σ(kl(1))))(1/σ)+1dk+c(2)(l(1))e(kl(1))(μ(2)+R(2)eσ(l(2)k)μ(2)+R(2)eσ(l(2)l(1)))(1/σ)1

and for l(1)l(2)<k,

(34) c(2)(k)=c(1)(l(1))μ(1)e(kl(1))(μ(1)+R(1)μ(2)+R(2)exp(σ(kl(2))))(1/σ)+1l(2)k(μ(2)+R(2)exp(σ(kl(2))))1/σ(μ(1)+R(1)exp(σ(kl(1))))(1/σ)+1dk+c(2)(l(2))e(kl(2))(μ(2)+R(2)μ(2)+R(2)eσ(kl(2)))(1/σ)+1

Next, the case where l(1)l(2). For l(2)<kl(1),

(35) c(2)(k)=c0μ(1)ek(μ(1)+R(1)eσl(1))1(1/σ)(μ(2)+R(2)exp(σ(kl(2))))(1/σ)+1l(2)k(μ(1)+R(1)exp(σ(l(1)k)))(1/σ)1(μ(2)+R(2)exp(σ(kl(2))))1/σdk+c(2)(l(2))el(2)k(μ(2)+R(2)μ(2)+R(2)eσ(kl(2)))(1/σ)+1

For (2)l(1)<k,

(36) c(2)(k)=c(1)(l(1))μ(1)e(kl(1))(μ(1)+R(1)μ(2)+R(2)exp(σ(kl(2))))(1/σ)+1l(2)k(μ(2)+R(2)exp(σ(kl(2))))1/σ(μ(1)+R(1)exp(σ(kl(1))))(1/σ)+1dk+c(2)(l(1))e(kl(1))(μ(2)+R(2)eσ(l(1)l(2))μ(2)+R(2)eσ(kl(2)))(1/σ)+1

The integrals in (32)–(36) can be evaluated numerically. The result of the numerical computation is shown in Appendix 6—figure 1.

Appendix 6—figure 2a–d plots the glycan profile ckvs.k as one varies the enzyme specificity σ, the reaction rates R, and transport rates μ, for two different values of NE and NC. The results in the plots lead us to the following general observations:

  1. Very-low-specificity enzymes cannot generate complex glycan distributions. Keeping everything else fixed, intermediate or high-specificity enzymes can generate glycan distributions of higher complexity by increasing NE or NC (Appendix 6—figure 2a and c).

  2. Decreasing the specificity σ or increasing the rates R increases the proportion of higher index glycans. Keeping everything else fixed, changes in the rate R have a stronger impact on the relative weights of the higher index glycans to lower index glycans. The relative weight of the higher index glycans increases with increasing NE and NC (Appendix 6—figure 2b–d).

  3. Keeping everything else fixed, decreasing enzyme specificity increases the spread of the distribution around the peaks (Appendix 6—figure 2a and c).

Appendix 6—figure 2
Glycan profile {ck:k=1,,Ns} as a function of specificity σ (a, c) and reaction rates R (b, d).

(a) NE=NC=1,(R=50,μ=1,l=10). ck decreases exponentially with k for very low and very high σ; however, the decay rate is lower at low σ. For intermediate values of σ, the distribution has exactly two peaks, one of which is at k=0, and eventually decays exponentially. The width of the distribution is a decreasing function of σ. (b) NE=NC=1,(σ=0.1,μ=1,l=10). At low R, ck is concentrated at low k. The proportion of higher index glycans in an increasing function of R. (c) NE=NC=2,(R=40,μ=1,[l1(1),l2(1),l1(2),l2(2)]=[10,30,50,70]). As σ increases, the distribution becomes more complex – from a single-peaked distribution at low σ to a maximum of four-peaked distribution at high σ. The peaks gets sharper, and more well defined as σ increases. (d) NE=NC=2,(R=40,μ=1,[l1(1),l2(1),l1(2),l2(2)]=[10,30,50,70]). As in the plots in (b), increasing R shifts the peaks towards higher index glycans and the proportion of higher index glycan increases.

Appendix 7

Extended distortion model shows lack of apparent symmetry

The results in Figure 4 seem to imply that there is an approximate NE-NC symmetry in the model, that is, increasing either NE or NC affects the fidelity, optimal enzyme specificity, and the sensitivity in approximately the same way. This is a consequence of the distortion model we are using for calculating the binding probabilities of substrates with enzymes, which allows every enzyme α to in principle catalyze any reaction in any cisterna. This allowed for the ideal enzyme length lα(j) in Equation 3 to vary across the cisternae in an unconstrained manner, leading to simplification in the calculation. We now find that by changing this aspect of the model the apparent symmetry between NE-NC is lifted. A more reasonable model for the ideal enzyme length is given by

α(j)=α(0)+δα(j),δα(j)[-b(j),b(j)],

that is, the nominal length α(0) can be distorted in a cisterna by a correction δα(j) but within a specified bound b(j) that is not subject to optimization. One can render some enzymes inactive in certain cisternae by choosing appropriate values of α(0) and δα(j). For small values for the bound b, for example, lb/Ns0.2 (here, Ns-1 is the number of enzymatic reactions), the decrease in D¯ on increasing NC is small compared to increasing NE (see Appendix 7—figure 1). On the other hand, for large b, for example, lb/NS0.3, there is an approximate symmetry between NE and NC (see Appendix 7—figure 1). Here, we have taken the bounds to be compartment independent, that is, b(j)=b.

Appendix 7—figure 1
Optimum fidelity D¯KL as a function of (NE,NC) for different values of b/Ns, where b bounds the deformation in the ideal length α(0) of an enzyme α=1,,NE.

Small values of b restrict all enzymes from working in all cisternae and all substrates, where large value of b removes this constraint.

Appendix 8

Redundancies and non-convexity of the optimization

Validation of the numerical optimization scheme

In order to test whether our numerical optimization procedure is able to converge to the global minimum, we run the following test. We generate 100 random values of (μ,R,L,σ) within their respective ranges for a problem instance with (NE=2,NC=2). The sampled value for (μ,R,L,σ) is used to generate concentration profiles that are then used as the target distribution for the optimization. Since the target distribution is achievable, the optimal value of the constrained Optimization B for these sampled targets is D¯=0. We solve the constrained Optimization B using our numerical scheme. The average optimal value D¯ across all sampled values was 9.1835e-07, 30 out of 100 values were exactly zero, and the highest D¯ was 1.1761e-05. Therefore, the optimization scheme was able to recover the concentration profiles almost exactly. Next, we ask whether the optimization problem recovers the value of (μ,R,L,σ) that was used to create the particular target distribution. We were able to recover σ exactly, except in cases where the concentration profile was almost a delta function at the first glycan (see Appendix 8—figure 1). This is because σ decides the typical width of the empirical distribution, and hence the optimal σ is determined by the typical width of the target distribution, except in the pathological case of a concentration profile that is almost a delta function at the first glycan – such a concentration profile can be made produced for any value σ by simply making transport μ very fast as compared to the reaction rates.

Appendix 8—figure 1
Recovering the σ values for different target distribution.

Note that barring four data points all other optimized σ values (red dots) exactly overlap with the corresponding target σ (diamonds).

We note that the optimization in (μ,R,L) is not convex and leads to many equally good minimas corresponding to different values of (μ,R,L). The resulting redundancies in the model and their importance are discussed next.

Degeneracy in the model

Recall that in equation 7, we defined

Reff(j,k)=αRα(j)exp(σ|klα(j)|)μ(j)

In terms of these renormalized rates, the steady-state glycan concentration can be written as

(37) c¯k(j)=Reff(j,k)c¯k1(j)+c¯kj11+Reff(j,k),

that is, the concentration is only a function of Reff(k,j). Thus, any combination of (μ,R,L,σ) that maps to the same value of Reff will result in the same concentration profile and will be indistinguishable from the perspective of the objective function. Additionally, the mapping from Reff to the concentration profile c¯ also has degeneracy. We show these redundancies in the schematic below, which shows a systematic reduction in dimension to 1 (scalar), which is the quantity we optimize,

R,L,μ,σReffc¯FNC+2NENC+1NC(NS1)NS11

Since F(cT||c¯)=0 if and only if cT=c¯, it follows that the last mapping does not have redundancy. Some of the sources of degeneracies in the mapping from (μ,R,L,σ) to Reff are as follows:

  1. For fixed (σ,L), setting Rα(j)γRα(j) and μjγ-1μj leaves Reff invariant.

  2. Permutations in the α index leave Reff invariant. Thus, there are at least (NE!)NC distinct minima that map to the same value of Reff, and therefore, the same concentration c¯.

Additionally, there are degeneracies coming from the optimization which depend on the target distribution cT. Having discussed the sources of degeneracies of the optimized solution, we now discuss the distribution of the optimized solutions.

Distribution of minima

To study the behaviour of the optimization algorithm for different initial points, we numerically investigate the distribution of function values at different local minima. Since the dimension of the optimization problem is NC+2NENC+1, which is large, we divide the optimization space into a grid of I=np(NC+2NENC+1) points. We did this numerical experiment for (NE,NC){(1,1),(1,2),(2,1),(2,2)}. The value of np=3 for (NE,NC)=(2,2) and np=4 for the rest. The target distribution for all the cases is a single Gaussian with mean 20, standard deviation 5, with support on 1k20. The results of this numerical experiment are summarized in Appendix 8—figure 2 and Appendix 8—table 1, from which we deduce the following:

  1. A large fraction of the initial starting points converge to a set of degenerate minima with objective function value exactly equal to the global minimum. These minima are a result of the degeneracies of the optimization problem.

  2. There are other local minima with objective value very close to (but not equal) the global minimum. Most initial points converge to one of these two sets of minima.

  3. Finally, there are a small set of local minima with significantly higher objective values. These correspond to minima with σ=0. The fraction of initial points that converge to such minima reduces as the dimension of the optimization space increases.

Appendix 8—table 1
Distribution of local minima.
NENCminD¯KLmaxD¯KLFraction of initial conditions within D¯KL0.0228
110.02280.440.56
210.00810.440.73
120.00510.290.70
221.17e-40.290.84
Appendix 8—figure 2
D¯ for various initial conditions, sorted in increasing order for clarity.

This clearly shows the fraction of initial conditions for which the optimized D¯ is small (see Appendix 8—table 1).

Appendix 9

Robustness of optimization to small perturbations

We analyse the sensitivity to small perturbations around the optimal point by calculating the Hessian of the KL divergence,

(38) H(i,j)=2XiXjF|Xmin

Here, X=[μ,R,L,σ] denotes the entire set of optimization variables (note that the entries in X are normalized by their respective range and do not carry physical dimensions). We calculated the eigenvalues, denoted by λi, and eigenvectors, denoted by Vi, of the Hessian matrix to identify the stiff and sloppy directions (Gutenkunst et al., 2007; Machta et al., 2013) in the optimization space. The eigenvectors of the Hessian matrix can be grouped in R,L,μ, and σ directions by looking for the most dominant component in the eigenvector. We find that most of the eigenvectors have significant entries along the direction of only one of the optimization variables μ,R,L,σ , for example, in Figure 6a, the eigenvectors 21–36 have significant entries only in the L directions. There is, however, a small number of eigenvectors that have entries over more than one optimization direction, for example, the eigenvector with σ dominant direction has some μ component as well (Figure 6a).

Stiff and sloppy directions

We find that the eigenvalues of the eigenvectors dominated by σ and some μ, L directions are orders of magnitude higher than for those dominated by the R directions (see Figure 6b). This suggests that D¯ has a valley-like structure around the optimal, with R and some L being the flat or sloppy directions.

The fact that enzyme specificity σ and some of the L directions are stiff should not be surprising since the typical width and position of peaks in the synthesized distribution are primarily controlled by σ and L. We have already shown that D¯ is a sharp convex function of σ for low values of (NE,NC) (see Figure 3), which gradually flattens out as we increase (NE,NC).

The fact that transport rate μ is a stiff direction is surprising! The stiffness in μ is due to the fact that the optimal μ is always at the lower bound, and with even slight increase in μ, the transport becomes too fast for the reactions to be able to produce the intermediate products. For the (R,L)-dominated eigenvectors, there are bands of sloppy direction and stiff directions. We define the average stiffness in μ,R,L, and σ by a weighted average of eigenvalues, where the weight is given by the strength of the corresponding components of the eigenvector.

λμ=ln(iwi(μ)λi),λR=ln(iwi(R)λi),λL=ln(iwi(L)λi),λσ=ln(iwi(σ)λi)

Here, wi(μ)=jμ|Vi,j|/j|Vi,j|, wi(R)=jR|Vi,j|/j|Vi,j|, wi(L)=jL|Vi,j|/j|Vi,j| and wi(μ)=jσ|Vi,j|/j|Vi,j|.

Figure 6c shows λμ, λR, λL, and λσ as a function of NC for fixed NE=4. The average stiffness in R directions, λR, is considerably lower than the average stiffness in σ, μ and L directions. σ is the stiffest direction but the stiffness decreases on increasing the NC. Interestingly, the stiffness along L directions increases on increasing NC.

We now define the total average stiffness λ=log(λiNC+2NENC+1), that is, log of the sum of the eigenvalues divided by the dimension of the optimization problem, in the space of NE,NC . We find that the average stiffness is higher for low values of (NE,NC) as compared to higher values of (NE,NC), with a few exceptions; and eventually, the average stiffness settles to a fixed low value (Figure 6d).

Appendix 10

Insensitivity of diversity to threshold cth

Since the threshold cth used to count the number of glycan species in the definition of diversity is arbitrary (see section ‘Strategies to achieve high glycan diversity’), we here show that the qualitative results we obtain are independent of this choice (Appendix 10—figure 1).

Appendix 10—figure 1
Diversity vs. NC for different values of σ keeping NE=1 fixed, for three different values of the threshold, cth=1Ns,12Ns,14Ns.

Changing the value of the threshold cth only changes the saturation value of the diversity curve.

Appendix 11

Table of symbols used in the article (Appendix 11—table 1)

Appendix 11—table 1
Table of symbols and their definitions.
SymbolDefinition
ck(j)Concentration of kth glycan in jth compartment
μ(j)Transport rate from jth to j+1 compartment
σSpecificity of the enzymes
α(j)Ideal substrate length for αth enzyme in jth compartment
M(j,k,α)Enzyme parameter related to the Michaelis constant KM
V(j,k,α)Enzyme parameter related to Vmax
R(j,α)Reaction parameter for Optimization B
D(c*c)KL divergence between c and c
F(c||c)Fidelity, KL divergence normalized by the entropy of the target c
D¯(σ,NE,NC,c)minμ,R,LF(c||c)
Reff(j, k)Effective reaction rate of kth reaction in jth compartment

Data availability

The current manuscript is a computational study, so no data have been generated for this manuscript. The following repository on github contains the code and the data (numerical data + Mass Spec data) that are used in the paper: https://github.com/alkeshyadav/Glycosylation, (copy archived at URL swh:1:rev:a46c6eb76c5f07458d07e44267f48bbaaff6fc5a).

The following data sets were generated
    1. Yadav A
    (2022) GitHub
    ID Glycosylation. Glycan processing in the Golgi: optimal information coding and constraints on cisternal number and enzyme specificity.

References

  1. Book
    1. Alberts B
    (2002)
    Molecular Biology of the Cell
    Garland Science.
  2. Book
    1. Cover TM
    2. Thomas JA
    (2012)
    Elements of Information Theory
    John Wiley & Sons.
  3. Data
    1. Cummings RD
    2. Crocker P
    (authors) (2020) Functional Glycomics Database
    Consortium for Functional Glycomics.
  4. Book
    1. Kirschner MW
    2. Gerhart JC
    (2008)
    The Plausibility of Life
    Yale University Press.
  5. Book
    1. MacKay DJ
    (2003)
    Information Theory, Inference and Learning Algorithms
    Cambridge University Press.
  6. Book
    1. Price N.
    2. Stevens L.
    (1999)
    Fundamentals of Enzymology: The cell and molecular biology of catalytic proteins
    Oxford University Press.
    1. Roseman S
    (2001) Reflections on glycobiology
    The Journal of Biological Chemistry 276:41527–41542.
    https://doi.org/10.1074/jbc.R100053200
  7. Book
    1. Varki A
    (2009)
    Essentials of Glycobiology
    Cold Spring Harbor Laboratory Press.

Article and author information

Author details

  1. Alkesh Yadav

    Raman Research Institute, Bangalore, India
    Contribution
    Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Software, Validation, Visualization, Writing - original draft, Writing - review and editing
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-4268-8873
  2. Quentin Vagne

    Laboratoire Physico Chimie Curie, Institut Curie, CNRS UMR168, Paris, France
    Contribution
    Investigation, Methodology, Writing - review and editing
    Competing interests
    No competing interests declared
  3. Pierre Sens

    Laboratoire Physico Chimie Curie, Institut Curie, CNRS UMR168, Paris, France
    Contribution
    Funding acquisition, Project administration, Writing - review and editing
    Competing interests
    Reviewing editor, eLife
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0003-4523-3791
  4. Garud Iyengar

    Industrial Engineering and Operations Research, Columbia University, New York, United States
    Contribution
    Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Project administration, Resources, Software, Supervision, Validation, Visualization, Writing - original draft, Writing - review and editing
    For correspondence
    garud@ieor.columbia.edu
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0001-6546-4154
  5. Madan Rao

    Simons Centre for the Study of Living Machines, National Centre for Biological Sciences, Bangalore, India
    Contribution
    Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Project administration, Resources, Software, Supervision, Validation, Visualization, Writing - original draft, Writing - review and editing
    For correspondence
    madan@ncbs.res.in
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0001-6210-6386

Funding

Department of Atomic Energy, Government of India (RTI4006)

  • Madan Rao

Simons Foundation (287975)

  • Madan Rao

JC Bose Fellowship (DST-SERB)

  • Madan Rao

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

We thank M Thattai, A Jaiman, S Ramaswamy, and A Varki for discussions, and S Krishna and R Bhat for very useful suggestions on the manuscript. We thank our group members at the Simons Centre for many incisive inputs. We are very grateful to P Babu and PS Sabarinath for consultations on the MSMS data and literature. MR acknowledges support from the Department of Atomic Energy (India), under project no. RTI4006, the Simons Foundation (grant no. 287975), and a JC Bose Fellowship from DST-SERB (India). We acknowledge the computational facilities at NCBS. This work has received support under the program Investissements d’Avenir launched by the French Government and implemented by ANR with the references ANR-10-LABX-0038 and ANR-10-IDEX-0001-02 PSL. MR thanks Institut Curie for hosting a visit under the Labex program, and QV thanks the Simons Centre (NCBS) for hosting his visit.

Copyright

© 2022, Yadav et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Metrics

  • 1,010
    views
  • 246
    downloads
  • 4
    citations

Views, downloads and citations are aggregated across all versions of this paper published by eLife.

Download links

A two-part list of links to download the article, or parts of the article, in various formats.

Downloads (link to download the article as PDF)

Open citations (links to open the citations from this article in various online reference manager services)

Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)

  1. Alkesh Yadav
  2. Quentin Vagne
  3. Pierre Sens
  4. Garud Iyengar
  5. Madan Rao
(2022)
Glycan processing in the Golgi as optimal information coding that constrains cisternal number and enzyme specificity
eLife 11:e76757.
https://doi.org/10.7554/eLife.76757

Share this article

https://doi.org/10.7554/eLife.76757

Further reading

    1. Physics of Living Systems
    Tommaso Amico, Samuel Toluwanimi Dada ... Amos Maritan
    Research Article

    Many proteins have been recently shown to undergo a process of phase separation that leads to the formation of biomolecular condensates. Intriguingly, it has been observed that some of these proteins form dense droplets of sizeable dimensions already below the critical concentration, which is the concentration at which phase separation occurs. To understand this phenomenon, which is not readily compatible with classical nucleation theory, we investigated the properties of the droplet size distributions as a function of protein concentration. We found that these distributions can be described by a scale-invariant log-normal function with an average that increases progressively as the concentration approaches the critical concentration from below. The results of this scaling analysis suggest the existence of a universal behaviour independent of the sequences and structures of the proteins undergoing phase separation. While we refrain from proposing a theoretical model here, we suggest that any model of protein phase separation should predict the scaling exponents that we reported here from the fitting of experimental measurements of droplet size distributions. Furthermore, based on these observations, we show that it is possible to use the scale invariance to estimate the critical concentration for protein phase separation.

    1. Computational and Systems Biology
    2. Physics of Living Systems
    Ju Kang, Shijie Zhang ... Xin Wang
    Research Article

    Explaining biodiversity is a fundamental issue in ecology. A long-standing puzzle lies in the paradox of the plankton: many species of plankton feeding on a limited variety of resources coexist, apparently flouting the competitive exclusion principle (CEP), which holds that the number of predator (consumer) species cannot exceed that of the resources at a steady state. Here, we present a mechanistic model and demonstrate that intraspecific interference among the consumers enables a plethora of consumer species to coexist at constant population densities with only one or a handful of resource species. This facilitated biodiversity is resistant to stochasticity, either with the stochastic simulation algorithm or individual-based modeling. Our model naturally explains the classical experiments that invalidate the CEP, quantitatively illustrates the universal S-shaped pattern of the rank-abundance curves across a wide range of ecological communities, and can be broadly used to resolve the mystery of biodiversity in many natural ecosystems.