Abstract
A cornerstone of theoretical neuroscience is the circuit model: a system of equations that captures a hypothesized neural mechanism. Such models are valuable when they give rise to an experimentally observed phenomenon  whether behavioral or a pattern of neural activity  and thus can offer insights into neural computation. The operation of these circuits, like all models, critically depends on the choice of model parameters. A key step is then to identify the model parameters consistent with observed phenomena: to solve the inverse problem. In this work, we present a novel technique, emergent property inference (EPI), that brings the modern probabilistic modeling toolkit to theoretical neuroscience. When theorizing circuit models, theoreticians predominantly focus on reproducing computational properties rather than a particular dataset. Our method uses deep neural networks to learn parameter distributions with these computational properties. This methodology is introduced through a motivational example of parameter inference in the stomatogastric ganglion. EPI is then shown to allow precise control over the behavior of inferred parameters and to scale in parameter dimension better than alternative techniques. In the remainder of this work, we present novel theoretical findings in models of primary visual cortex and superior colliculus, which were gained through the examination of complex parametric structure captured by EPI. Beyond its scientific contribution, this work illustrates the variety of analyses possible once deep learning is harnessed towards solving theoretical inverse problems.
Introduction
The fundamental practice of theoretical neuroscience is to use a mathematical model to understand neural computation, whether that computation enables perception, action, or some intermediate processing. A neural circuit is systematized with a set of equations – the model – and these equations are motivated by biophysics, neurophysiology, and other conceptual considerations (Kopell and Ermentrout, 1988; Marder, 1998; Abbott, 2008; Wang, 2010; O'Leary et al., 2015). The function of this system is governed by the choice of model parameters, which when configured in a particular way, give rise to a measurable signature of a computation. The work of analyzing a model then requires solving the inverse problem: given a computation of interest, how can we reason about the distribution of parameters that give rise to it? The inverse problem is crucial for reasoning about likely parameter values, uniquenesses and degeneracies, and predictions made by the model (Gutenkunst et al., 2007; Erguler and Stumpf, 2011; Mannakee et al., 2016).
Ideally, one carefully designs a model and analytically derives how computational properties determine model parameters. Seminal examples of this gold standard include our field’s understanding of memory capacity in associative neural networks (Hopfield, 1982), chaos and autocorrelation timescales in random neural networks (Sompolinsky et al., 1988), central pattern generation (Olypher and Calabrese, 2007), the paradoxical effect (Tsodyks et al., 1997), and decision making (Wong and Wang, 2006). Unfortunately, as circuit models include more biological realism, theory via analytical derivation becomes intractable. Absent this analysis, statistical inference offers a toolkit by which to solve the inverse problem by identifying, at least approximately, the distribution of parameters that produce computations in a biologically realistic model (Foster et al., 1993; Prinz et al., 2004; Achard and De Schutter, 2006; Fisher et al., 2013; O'Leary et al., 2014; Alonso and Marder, 2019).
Statistical inference, of course, requires quantification of the sometimes vague term computation. In neuroscience, two perspectives are dominant. First, often we directly use an exemplar dataset: a collection of samples that express the computation of interest, this data being gathered either experimentally in the lab or from a computer simulation. Although a natural choice given its connection to experiment (Paninski and Cunningham, 2018), some drawbacks exist: these data are well known to have features irrelevant to the computation of interest (Niell and Stryker, 2010; Saleem et al., 2013; Musall et al., 2019), confounding inferences made on such data. Related to this point, use of a conventional dataset encourages conventional data likelihoods or loss functions, which focus on some global metric like squared error or marginal evidence, rather than the computation itself.
Alternatively, researchers often quantify an emergent property (EP): a statistic of data that directly quantifies the computation of interest, wherein the dataset is implicit. While such a choice may seem esoteric, it is not: the above ‘gold standard’ examples (Hopfield, 1982; Sompolinsky et al., 1988; Olypher and Calabrese, 2007; Tsodyks et al., 1997; Wong and Wang, 2006) all quantify and focus on some derived feature of the data, rather than the data drawn from the model. An emergent property is of course a dataset by another name, but it suggests different approach to solving the same inverse problem: here, we directly specify the desired emergent property – a statistic of data drawn from the model – and the value we wish that property to have, and we set up an optimization program to find the distribution of parameters that produce this computation. This statistical framework is not new: it is intimately connected to the literature on approximate bayesian computation (Beaumont et al., 2002; Marjoram et al., 2003; Sisson et al., 2007), parameter sensitivity analyses (Raue et al., 2009; Karlsson et al., 2012; Hines et al., 2014; Raman et al., 2017), maximum entropy modeling (Elsayed and Cunningham, 2017; Savin and Tkačik, 2017; Młynarski et al., 2020), and approximate bayesian inference (Tran et al., 2017; Gonçalves et al., 2019); we detail these connections in Section 'Related approaches'.
The parameter distributions producing a computation may be curved or multimodal along various parameter axes and combinations. It is by quantifying this complex structure that emergent property inference offers scientific insight. Traditional approximation families (e.g. meanfield or mixture of gaussians) are limited in the distributional structure they may learn. To address such restrictions on expressivity, advances in machine learning have used deep probability distributions as flexible approximating families for such complicated distributions (Rezende and Mohamed, 2015; Papamakarios et al., 2019a) (see Section 'Deep probability distributions and normalizing flows'). However, the adaptation of deep probability distributions to the problem of theoretical circuit analysis requires recent developments in deep learning for constrained optimization (LoaizaGanem et al., 2017), and architectural choices for efficient and expressive deep generative modeling (Dinh et al., 2017; Kingma and Dhariwal, 2018). We detail our method, which we call emergent property inference (EPI) in Section 'Emergent property inference via deep generative models'.
Equipped with this method, we demonstrate the capabilities of EPI and present novel theoretical findings from its analysis. First, we show EPI’s ability to handle biologically realistic circuit models using a fiveneuron model of the stomatogastric ganglion (Gutierrez et al., 2013): a neural circuit whose parametric degeneracy is closely studied (Goldman et al., 2001). Then, we show EPI’s scalability to high dimensional parameter distributions by inferring connectivities of recurrent neural networks that exhibit stable, yet amplified responses – a hallmark of neural responses throughout the brain (Murphy and Miller, 2009; Hennequin et al., 2014; Bondanelli et al., 2019). In a model of primary visual cortex (LitwinKumar et al., 2016; Palmigiano et al., 2020), EPI reveals how the recurrent processing across different neurontype populations shapes excitatory variability: a finding that we show is analytically intractable. Finally, we investigated the possible connectivities of a superior colliculus model that allow execution of different tasks on interleaved trials (Duan et al., 2021). EPI discovered a rich distribution containing two connectivity regimes with different solution classes. We queried the deep probability distribution learned by EPI to produce a mechanistic understanding of neural responses in each regime. Intriguingly, the inferred connectivities of each regime reproduced results from optogenetic inactivation experiments in markedly different ways. These theoretical insights afforded by EPI illustrate the value of deep inference for the interrogation of neural circuit models.
Results
Motivating emergent property inference of theoretical models
Consideration of the typical workflow of theoretical modeling clarifies the need for emergent property inference. First, one designs or chooses an existing circuit model that, it is hypothesized, captures the computation of interest. To ground this process in a wellknown example, consider the stomatogastric ganglion (STG) of crustaceans, a small neural circuit which generates multiple rhythmic muscle activation patterns for digestion (Marder and Thirumalai, 2002). Despite full knowledge of STG connectivity and a precise characterization of its rhythmic pattern generation, biophysical models of the STG have complicated relationships between circuit parameters and computation (Goldman et al., 2001; Prinz et al., 2004).
A subcircuit model of the STG (Gutierrez et al., 2013) is shown schematically in Figure 1A. The fast population (f1 and f2) represents the subnetwork generating the pyloric rhythm and the slow population (s1 and s2) represents the subnetwork of the gastric mill rhythm. The two fast neurons mutually inhibit one another, and spike at a greater frequency than the mutually inhibiting slow neurons. The hub neuron couples with either the fast or slow population, or both depending on modulatory conditions. The jagged connections indicate electrical coupling having electrical conductance ${g}_{\text{el}}$, smooth connections in the diagram are inhibitory synaptic projections having strength ${g}_{\text{synA}}$ onto the hub neuron, and ${g}_{\text{synB}}=5$ nS for mutual inhibitory connections. Note that the behavior of this model will be critically dependent on its parameterization – the choices of conductance parameters $\mathbf{\mathbf{z}}=[{g}_{\text{el}},{g}_{\text{synA}}]$.
Second, once the model is selected, one must specify what the model should produce. In this STG model, we are concerned with neural spiking frequency, which emerges from the dynamics of the circuit model (Figure 1B). An emergent property studied by Gutierrez et al. is the hub neuron firing at an intermediate frequency between the intrinsic spiking rates of the fast and slow populations. This emergent property (EP) is shown in Figure 1C at an average frequency of 0.55 Hz. To be precise, we define intermediate hub frequency not strictly as 0.55 Hz, but frequencies of moderate deviation from 0.55 Hz between the fast (.35Hz) and slow (.68Hz) frequencies.
Third, the model parameters producing the emergent property are inferred. By precisely quantifying the emergent property of interest as a statistical feature of the model, we use emergent property inference (EPI) to condition directly on this emergent property. Before presenting technical details (in the following section), let us understand emergent property inference schematically. EPI (Figure 1D) takes, as input, the model and the specified emergent property, and as its output, returns the parameter distribution (Figure 1E). This distribution – represented for clarity as samples from the distribution – is a parameter distribution constrained such that the circuit model produces the emergent property. Once EPI is run, the returned distribution can be used to efficiently generate additional parameter samples. Most importantly, the inferred distribution can be efficiently queried to quantify the parametric structure that it captures. By quantifying the parametric structure governing the emergent property, EPI informs the central question of this inverse problem: what aspects or combinations of model parameters have the desired emergent property?
Emergent property inference via deep generative models
EPI formalizes the threestep procedure of the previous section with deep probability distributions (Rezende and Mohamed, 2015; Papamakarios et al., 2019a). First, as is typical, we consider the model as a coupled set of noisy differential equations. In this STG example, the model activity (or state) $\mathbf{\mathbf{x}}=[{x}_{\text{f1}},{x}_{\text{f2}},{x}_{\text{hub}},{x}_{\text{s1}},{x}_{\text{s2}}]$ is the membrane potential for each neuron, which evolves according to the biophysical conductancebased equation:
where ${C}_{m}$ = 1nF, and $\mathbf{\mathbf{h}}$ is a sum of the leak, calcium, potassium, hyperpolarization, electrical, and synaptic currents, all of which have their own complicated dependence on activity $\mathbf{\mathbf{x}}$ and parameters $\mathbf{\mathbf{z}}=[{g}_{\text{el}},{g}_{\text{synA}}]$, and $d\mathbf{\mathbf{B}}$ is white gaussian noise (Gutierrez et al., 2013; see Section 'STG model' for more detail).
Second, we determine that our model should produce the emergent property of ‘intermediate hub frequency’ (Figure 1C). We stipulate that the hub neuron’s spiking frequency – denoted by statistic ${\omega}_{\text{hub}}(\mathbf{\mathbf{x}})$ – is close to a frequency of 0.55 Hz, between that of the slow and fast frequencies. Mathematically, we define this emergent property with two constraints: that the mean hub frequency is 0.55 Hz,
and that the variance of the hub frequency is moderate
In the emergent property of intermediate hub frequency, the statistic of hub neuron frequency is an expectation over the distribution of parameters $\mathbf{\mathbf{z}}$ and the distribution of the data $\mathbf{\mathbf{x}}$ that those parameters produce. We define the emergent property $\mathcal{X}$ as the collection of these two constraints. In general, an emergent property is a collection of constraints on statistical moments that together define the computation of interest.
Third, we perform emergent property inference: we find a distribution over parameter configurations $\mathbf{\mathbf{z}}$ of models that produce the emergent property; in other words, they satisfy the constraints introduced in Equations 2 and 3. This distribution will be chosen from a family of probability distributions $\mathcal{Q}=\{{q}_{\mathit{\bm{\theta}}}(\mathbf{\mathbf{z}}):\mathit{\bm{\theta}}\in \mathrm{\Theta}\}$, defined by a deep neural network (Rezende and Mohamed, 2015; Papamakarios et al., 2019a; Figure 1D, EPI box). Deep probability distributions map a simple random variable ${\mathbf{\mathbf{z}}}_{0}$ (e.g. an isotropic gaussian) through a deep neural network with weights and biases $\mathit{\bm{\theta}}$ to parameters $\mathbf{\mathbf{z}}={g}_{\mathit{\bm{\theta}}}({\mathbf{\mathbf{z}}}_{0})$ of a suitably complicated distribution (see Section 'Deep probability distributions and normalizing flows' for more details). Many distributions in $\mathcal{Q}$ will respect the emergent property constraints, so we select the most random (highest entropy) distribution, which also means this approach is equivalent to bayesian variational inference (see Section 'EPI as variational inference'). In EPI optimization, stochastic gradient steps in $\mathit{\bm{\theta}}$ are taken such that entropy is maximized, and the emergent property $\mathcal{X}$ is produced (see Section 'Emergent property inference (EPI)'). We then denote the inferred EPI distribution as ${q}_{\mathit{\bm{\theta}}}(\mathbf{\mathbf{z}}\mid \mathcal{X})$, since the structure of the learned parameter distribution is determined by weights and biases $\mathit{\bm{\theta}}$, and this distribution is conditioned upon emergent property $\mathcal{X}$.
The structure of the inferred parameter distributions of EPI can be analyzed to reveal key information about how the circuit model produces the emergent property. As probability in the EPI distribution decreases away from the mode of ${q}_{\mathit{\bm{\theta}}}(\mathbf{\mathbf{z}}\mid \mathcal{X})$ (Figure 1E yellow star), the emergent property deteriorates. Perturbing $\mathbf{\mathbf{z}}$ along a dimension in which ${q}_{\mathit{\bm{\theta}}}(\mathbf{\mathbf{z}}\mid \mathcal{X})$ changes little will not disturb the emergent property, making this parameter combination robust with respect to the emergent property. In contrast, if $\mathbf{\mathbf{z}}$ is perturbed along a dimension with strongly decreasing ${q}_{\mathit{\bm{\theta}}}(\mathbf{\mathbf{z}}\mid \mathcal{X})$, that parameter combination is deemed sensitive (Raue et al., 2009; Raman et al., 2017). By querying the secondorder derivative (Hessian) of $\mathrm{log}{q}_{\mathit{\bm{\theta}}}(\mathbf{\mathbf{z}}\mid \mathcal{X})$ at a mode, we can quantitatively identify how sensitive (or robust) each eigenvector is by its eigenvalue; the more negative, the more sensitive and the closer to zero, the more robust (see Section 'Hessian sensitivity vectors'). Indeed, samples equidistant from the mode along these dimensions of sensitivity (${\mathbf{\mathbf{v}}}_{1}$, smaller eigenvalue) and robustness (${\mathbf{\mathbf{v}}}_{2}$, greater eigenvalue) (Figure 1E, arrows) agree with error contours (Figure 1E contours) and have diminished or preserved hub frequency, respectively (Figure 1F activity traces). The directionality of ${\mathbf{\mathbf{v}}}_{2}$ suggests that changes in conductance along this parameter combination will most preserve hub neuron firing between the intrinsic rates of the pyloric and gastric mill rhythms. Importantly and unlike alternative techniques, once an EPI distribution has been learned, the modes and Hessians of the distribution can be measured with trivial computation (see Section 'Deep probability distributions and normalizing flows').
In the following sections, we demonstrate EPI on three neural circuit models across ranges of biological realism, neural system function, and network scale. First, we demonstrate the superior scalability of EPI compared to alternative techniques by inferring highdimensional distributions of recurrent neural network connectivities that exhibit amplified, yet stable responses. Next, in a model of primary visual cortex (LitwinKumar et al., 2016; Palmigiano et al., 2020), we show how EPI discovers parametric degeneracy, revealing how input variability across neuron types affects the excitatory population. Finally, in a model of superior colliculus (Duan et al., 2021), we used EPI to capture multiple parametric regimes of task switching, and queried the dimensions of parameter sensitivity to characterize each regime.
Scaling inference of recurrent neural network connectivity with EPI
To understand how EPI scales in comparison to existing techniques, we consider recurrent neural networks (RNNs). Transient amplification is a hallmark of neural activity throughout cortex and is often thought to be intrinsically generated by recurrent connectivity in the responding cortical area (Murphy and Miller, 2009; Hennequin et al., 2014; Bondanelli et al., 2019). It has been shown that to generate such amplified, yet stabilized responses, the connectivity of RNNs must be nonnormal (Goldman, 2009; Murphy and Miller, 2009), and satisfy additional constraints (Bondanelli and Ostojic, 2020). In theoretical neuroscience, RNNs are optimized and then examined to show how dynamical systems could execute a given computation (Sussillo, 2014; Barak, 2017), but such biologically realistic constraints on connectivity (Goldman, 2009; Murphy and Miller, 2009; Bondanelli and Ostojic, 2020) are ignored for simplicity or because constrained optimization is difficult. In general, access to distributions of connectivity that produce theoretical criteria like stable amplification, chaotic fluctuations (Sompolinsky et al., 1988), or low tangling (Russo et al., 2018) would add scientific value to existing research with RNNs. Here, we use EPI to learn RNN connectivities producing stable amplification, and demonstrate the superior scalability and efficiency of EPI to alternative approaches.
We consider a rank2 RNN with $N$ neurons having connectivity $W=U{V}^{\top}$ and dynamics
where $U=\left[\begin{array}{cc}\hfill {\mathbf{\mathbf{U}}}_{1}\hfill & \hfill {\mathbf{\mathbf{U}}}_{2}\hfill \end{array}\right]+g{\chi}^{(U)}$, $V=\left[\begin{array}{cc}\hfill {\mathbf{\mathbf{V}}}_{1}\hfill & \hfill {\mathbf{\mathbf{V}}}_{2}\hfill \end{array}\right]+g{\chi}^{(V)}$, ${\mathbf{\mathbf{U}}}_{1}{\mathbf{\mathbf{U}}}_{2},{\mathbf{\mathbf{V}}}_{1},{\mathbf{\mathbf{V}}}_{2}\in {[1,1]}^{N}$, and ${\chi}_{i,j}^{(U)},{\chi}_{i,j}^{(V)}\sim \mathcal{N}(0,1)$. We infer connectivity parameters $\mathbf{\mathbf{z}}=[{\mathbf{\mathbf{U}}}_{1},{\mathbf{\mathbf{U}}}_{2},{\mathbf{\mathbf{V}}}_{1},{\mathbf{\mathbf{V}}}_{2}]$ that produce stable amplification. Two conditions are necessary and sufficient for RNNs to exhibit stable amplification (Bondanelli and Ostojic, 2020): $\text{real}({\lambda}_{1})<1$ and ${\lambda}_{1}^{s}>1$, where ${\lambda}_{1}$ is the eigenvalue of $W$ with greatest real part and ${\lambda}^{s}$ is the maximum eigenvalue of ${W}^{s}=\frac{W+{W}^{\top}}{2}$. RNNs with $\text{real}({\lambda}_{1})=0.5\pm 0.5$ and ${\lambda}_{1}^{s}=1.5\pm 0.5$ will be stable with modest decay rate ($\text{real}({\lambda}_{1})$ close to its upper bound of 1) and exhibit modest amplification (${\lambda}_{1}^{s}$ close to its lower bound of 1). EPI can naturally condition on this emergent property
Variance constraints predicate that the majority of the distribution (within two standard deviations) are within the specified ranges.
For comparison, we infer the parameters $\mathbf{\mathbf{z}}$ likely to produce stable amplification using two alternative simulationbased inference approaches. Sequential Monte Carlo approximate bayesian computation (SMCABC) (Sisson et al., 2007) is a rejection sampling approach that uses SMC techniques to improve efficiency, and sequential neural posterior estimation (SNPE) (Gonçalves et al., 2019) approximates posteriors with deep probability distributions (see Section 'Related approaches'). Unlike EPI, these statistical inference techniques do not constrain the predictions of the inferred distribution, so they were run by conditioning on an exemplar dataset ${\mathbf{\mathbf{x}}}_{0}=\mathit{\bm{\mu}}$, following standard practice with these methods (Sisson et al., 2007; Gonçalves et al., 2019). To compare the efficiency of these different techniques, we measured the time and number of simulations necessary for the distance of the predictive mean to be less than 0.5 from $\mathit{\bm{\mu}}={\mathbf{\mathbf{x}}}_{0}$ (see Section 'Scaling EPI for stable amplification in RNNs').
As the number of neurons $N$ in the RNN, and thus the dimension of the parameter space $\mathbf{\mathbf{z}}\in {[1,1]}^{4N}$, is scaled, we see that EPI converges at greater speed and at greater dimension than SMCABC and SNPE (Figure 2A). It also becomes most efficient to use EPI in terms of simulation count at $N=50$ (Figure 2B). It is well known that ABC techniques struggle in parameter spaces of modest dimension (Sisson et al., 2018), yet we were careful to assess the scalability of SNPE, which is a more closely related methodology to EPI. Between EPI and SNPE, we closely controlled the number of parameters in deep probability distributions by dimensionality (Figure 2—figure supplement 1), and tested more aggressive SNPE hyperparameter choices when SNPE failed to converge (Figure 2—figure supplement 2). In this analysis, we see that deep inference techniques EPI and SNPE are far more amenable to inference of high dimensional RNN connectivities than rejection sampling techniques like SMCABC, and that EPI outperforms SNPE in both wall time (elapsed real time) and simulation count.
No matter the number of neurons, EPI always produces connectivity distributions with mean and variance of $\text{real}({\lambda}_{1})$ and ${\lambda}_{1}^{s}$ according to $\mathcal{X}$ (Figure 2C, blue). For the dimensionalities in which SMCABC is tractable, the inferred parameters are concentrated and offset from the exemplar dataset ${\mathbf{\mathbf{x}}}_{0}$ (Figure 2C, green). When using SNPE, the predictions of the inferred parameters are highly concentrated at some RNN sizes and widely varied in others (Figure 2C, orange). We see these properties reflected in simulations from the inferred distributions: EPI produces a consistent variety of stable, amplified activity norms $\mathbf{\mathbf{x}}(t)$, SMCABC produces a limited variety of responses, and the changing variety of responses from SNPE emphasizes the control of EPI on parameter predictions (Figure 2D). Even for moderate neuron counts, the predictions of the inferred distribution of SNPE are highly dependent on $N$ and $g$, while EPI maintains the emergent property across choices of RNN (see Section 'Effect of RNN parameters on EPI and SNPE inferred distributions').
To understand these differences, note that EPI outperforms SNPE in high dimensions by using gradient information (from ${\nabla}_{\mathbf{\mathbf{z}}}{[\text{real}({\lambda}_{1}),{\lambda}_{1}^{s}]}^{\top}$). This choice agrees with recent speculation that such gradient information could improve the efficiency of simulationbased inference techniques (Cranmer et al., 2020), as well as reflecting the classic tradeoff between gradientbased and samplingbased estimators (scaling and speed versus generality). Since gradients of the emergent property are necessary in EPI optimization, gradient tractability is a key criteria when determining the suitability of a simulationbased inference technique. If the emergent property gradient is efficiently calculated, EPI is a clear choice for inferring high dimensional parameter distributions. In the next two sections, we use EPI for novel scientific insight by examining the structure of inferred distributions.
EPI reveals how recurrence with multiple inhibitory subtypes governs excitatory variability in a V1 model
Dynamical models of excitatory (E) and inhibitory (I) populations with supralinear inputoutput function have succeeded in explaining a host of experimentally documented phenomena in primary visual cortex (V1). In a regime characterized by inhibitory stabilization of strong recurrent excitation, these models give rise to paradoxical responses (Tsodyks et al., 1997), selective amplification (Goldman, 2009; Murphy and Miller, 2009), surround suppression (Ozeki et al., 2009), and normalization (Rubin et al., 2015). Recent theoretical work (Hennequin et al., 2018) shows that stabilized EI models reproduce the effect of variability suppression (Churchland et al., 2010). Furthermore, experimental evidence shows that inhibition is composed of distinct elements – parvalbumin (P), somatostatin (S), VIP (V) – composing 80% of GABAergic interneurons in V1 (Markram et al., 2004; Rudy et al., 2011; Tremblay et al., 2016), and that these inhibitory cell types follow specific connectivity patterns (Figure 3A; Pfeffer et al., 2013). Here, we use EPI on a model of V1 with biologically realistic connectivity to show how the structure of input across neuron types affects the variability of the excitatory population – the population largely responsible for projecting to other brain areas (Felleman and Van Essen, 1991).
We considered response variability of a nonlinear dynamical V1 circuit model (Figure 3A) with a state comprised of each neurontype population’s rate $\mathbf{\mathbf{x}}={[{x}_{E},{x}_{P},{x}_{S},{x}_{V}]}^{\top}$. Each population receives recurrent input $W\mathbf{\mathbf{x}}$, where $W$ is the effective connectivity matrix (see Section 'Primary visual cortex') and an external input with mean $\mathbf{\mathbf{h}}$, which determines population rate via supralinear nonlinearity $\varphi (\cdot )={[\cdot ]}_{+}^{2}$. The external input has an additive noisy component $\mathit{\u03f5}$ with variance ${\mathit{\bm{\sigma}}}^{2}=[{\sigma}_{E}^{2},{\sigma}_{P}^{2},{\sigma}_{S}^{2},{\sigma}_{V}^{2}]$. This noise has a slower dynamical timescale ${\tau}_{\text{noise}}>\tau $ than the population rate, allowing fluctuations around a stimulusdependent steadystate (Figure 3B). This model is the stochastic stabilized supralinear network (SSSN) (Hennequin et al., 2018)
generalized to have multiple inhibitory neuron types. It introduces stochasticity to four neurontype models of V1 (LitwinKumar et al., 2016). Stochasticity and inhibitory multiplicity introduce substantial complexity to the mathematical treatment of this problem (see Section 'Primary visual cortex: Mathematical intuition and challenges') motivating the analysis of this model with EPI. Here, we consider fixed weights $W$ and input $\mathbf{\mathbf{h}}$ (Palmigiano et al., 2020), and study the effect of input variability $\mathbf{\mathbf{z}}={[{\sigma}_{E},{\sigma}_{P},{\sigma}_{S},{\sigma}_{V}]}^{\top}$ on excitatory variability.
We quantify levels of Epopulation variability by studying two emergent properties
where ${s}_{E}(\mathbf{x};\mathbf{z})$ is the standard deviation of the stochastic $E$population response about its steady state (Figure 3C). In the following analyses, we select 1 Hz^{2} variance such that the two emergent properties do not overlap in ${s}_{E}(\mathbf{\mathbf{z}};\mathbf{\mathbf{x}})$.
First, we ran EPI to obtain parameter distribution ${q}_{\mathit{\bm{\theta}}}(\mathbf{\mathbf{z}}\mid \mathcal{X}(5\text{Hz}))$ producing Epopulation variability around 5 Hz (Figure 3D). From the marginal distribution of ${\sigma}_{E}$ and ${\sigma}_{P}$ (Figure 3D, topleft), we can see that ${s}_{E}(\mathbf{\mathbf{x}};\mathbf{\mathbf{z}})$ is sensitive to various combinations of ${\sigma}_{E}$ and ${\sigma}_{P}$. Alternatively, both ${\sigma}_{S}$ and ${\sigma}_{V}$ are degenerate with respect to ${s}_{E}(\mathbf{\mathbf{x}};\mathbf{\mathbf{z}})$ evidenced by the unexpectedly high variability in those dimensions (Figure 3D, bottomright). Together, these observations imply a curved path with respect to ${s}_{E}(\mathbf{\mathbf{x}};\mathbf{\mathbf{z}})$ of 5 Hz, which is indicated by the modes along ${\sigma}_{P}$ (Figure 3E).
Figure 3E suggests a quadratic relationship in Epopulation fluctuations and the standard deviation of E and Ppopulation input; as the square of either ${\sigma}_{E}$ or ${\sigma}_{P}$ increases, the other compensates by decreasing to preserve the level of ${s}_{E}(\mathbf{\mathbf{x}};\mathbf{\mathbf{z}})$. This quadratic relationship is preserved at greater level of Epopulation variability $\mathcal{X}(10\text{Hz})$ (Figure 3E and Figure 3—figure supplement 1). Indeed, the sum of squares of ${\sigma}_{E}$ and ${\sigma}_{P}$ is larger in ${q}_{\mathit{\bm{\theta}}}(\mathbf{\mathbf{z}}\mid \mathcal{X}(10\text{Hz}))$ than ${q}_{\mathit{\bm{\theta}}}(\mathbf{\mathbf{z}}\mid \mathcal{X}(5\text{Hz}))$ (Figure 3F, $p<1\times {10}^{10}$), while the sum of squares of ${\sigma}_{S}$ and ${\sigma}_{V}$ are not significantly different in the two EPI distributions (Figure 3—figure supplement 3, $p=.40$), in which parameters were bounded from 0 to 0.5. The strong interaction between E and Ppopulation input variability on excitatory variability is intriguing, since this circuit exhibits a paradoxical effect in the Ppopulation (and no other inhibitory types) (Figure 3—figure supplement 4), meaning that the Epopulation is Pstabilized. Future research may uncover a link between the population of network stabilization and compensatory interactions governing excitatory variability.
EPI revealed the quadratic dependence of excitatory variability on input variability to the E and Ppopulations, as well as its independence to input from the other two inhibitory populations. In a simplified model ($\tau ={\tau}_{\text{noise}}$), it can be shown that surfaces of equal variance are ellipsoids as a function of $\mathit{\bm{\sigma}}$ (see Section 'Primary visual cortex: Mathematical intuition and challenges'). Nevertheless, the sensitive and degenerate parameters are intractable to predict mathematically, since the covariance matrix depends on the steadystate solution of the network (Hennequin et al., 2018; Gardiner, 2009), and terms in the covariance expression increase quadratically with each additional neurontype population (see also Section 'Primary visual cortex: Mathematical intuition and challenges'). By pointing out this mathematical complexity, we emphasize the value of EPI for gaining understanding about theoretical models when mathematical analysis becomes onerous or impractical.
EPI identifies two regimes of rapid task switching
It has been shown that rats can learn to switch from one behavioral task to the next on randomly interleaved trials (Duan et al., 2015), and an important question is what neural mechanisms produce this computation. In this experimental setup, rats were given an explicit task cue on each trial, either Pro or Anti. After a delay period, rats were shown a stimulus, and made a context (task) dependent response (Figure 4A). In the Pro task, rats were required to orient toward the stimulus, while in the Anti task, rats were required to orient away from the stimulus. Pharmacological inactivation of the SC impaired rat performance, and timespecific optogenetic inactivation revealed a crucial role for the SC on the cognitively demanding Anti trials (Duan et al., 2021). These results motivated a nonlinear dynamical model of the SC containing four functionally defined neurontype populations. In Duan et al., 2021, a computationally intensive procedure was used to obtain a set of 373 connectivity parameters that qualitatively reproduced these optogenetic inactivation results. To build upon the insights of this previous work, we use the probabilistic tools afforded by EPI to identify and characterize two linked, yet distinct regimes of rapid task switching connectivity.
In this SC model, there are Pro and Antipopulations in each hemisphere (left (L) and right (R)) with activity variables $\mathbf{\mathbf{x}}={[{x}_{LP},{x}_{LA},{x}_{RP},{x}_{RA}]}^{\top}$ (Duan et al., 2021). The connectivity of these populations is parameterized by self $sW$, vertical $vW$, diagonal $dW$ and horizontal $hW$ connections (Figure 4B). The input $\mathbf{\mathbf{h}}$ is comprised of a positive cuedependent signal to the Pro or Antipopulations, a positive stimulusdependent input to either the Left or Right populations, and a choiceperiod input to the entire network (see Section 'SC model'). Model responses are bounded from 0 to 1 as a function $\varphi $ of an internal variable $\mathbf{\mathbf{u}}$
The model responds to the side with greater Pro neuron activation; for example the response is left if ${x}_{LP}>{x}_{RP}$ at the end of the trial. Here, we use EPI to determine the network connectivity $\mathbf{\mathbf{z}}={[sW,vW,dW,hW]}^{\top}$ that produces rapid task switching.
Rapid task switching is formalized mathematically as an emergent property with two statistics: accuracy in the Pro task ${p}_{P}(\mathbf{\mathbf{x}};\mathbf{\mathbf{z}})$ and Anti task ${p}_{A}(\mathbf{\mathbf{x}};\mathbf{\mathbf{z}})$. We stipulate that accuracy be on average 0.75 in each task with variance ${.075}^{2}$
Seventyfive percent accuracy is a realistic level of performance in each task, and with the chosen variance, inferred models will not exhibit fully random responses (50%), nor perfect performance (100%).
The EPI inferred distribution (Figure 4C) produces Pro and Antitask accuracies (Figure 4C, bottomleft) consistent with rapid task switching (Equation 9). This parameter distribution has rich structure that is not captured well by simple linear correlations (Figure 4—figure supplement 1). Specifically, the shape of the EPI distribution is sharply bent, matching ground truth structure indicated by bruteforce sampling (Figure 4—figure supplement 5). This is most saliently observed in the marginal distribution of $sW$$hW$ (Figure 4C topright), where anticorrelation between $sW$ and $hW$ switches to correlation with decreasing $sW$. By identifying the modes of the EPI distribution ${\mathbf{\mathbf{z}}}^{*}(sW)$ at different values of $sW$ (Figure 4C red/purple dots), we can quantify this change in distributional structure with the sensitivity dimension ${\mathbf{\mathbf{v}}}_{1}(\mathbf{\mathbf{z}})$ (Figure 4C red/purple arrows). Note that the directionality of these sensitivity dimensions at ${\mathbf{\mathbf{z}}}^{*}(sW)$ changes distinctly with $sW$, and are perpendicular to the robust dimensions of the EPI distribution that preserve rapid task switching. These two directionalities of sensitivity motivate the distinction of connectivity into two regimes, which produce different types of responses in the Pro and Anti tasks (Figure 4—figure supplement 2).
When perturbing connectivity along the sensitivity dimension away from the modes
Proaccuracy monotonically increases in both regimes (Figure 4D, topleft). However, there is a stark difference between regimes in Antiaccuracy. Antiaccuracy falls in either direction of ${\mathbf{\mathbf{v}}}_{1}$ in regime 1, yet monotonically increases along with Pro accuracy in regime 2 (Figure 4D, bottomleft). The sharp change in local structure of the EPI distribution is therefore explained by distinct sensitivities: Antiaccuracy diminishes in only one or both directions of the sensitivity perturbation.
To understand the mechanisms differentiating the two regimes, we can make connectivity perturbations along dimensions that only modify a single eigenvalue of the connectivity matrix. These eigenvalues ${\lambda}_{\text{all}}$, ${\lambda}_{\text{side}}$, ${\lambda}_{\text{task}}$, and ${\lambda}_{\text{diag}}$ correspond to connectivity eigenmodes with intuitive roles in processing in this task (Figure 4—figure supplement 3A). For example, greater ${\lambda}_{\text{task}}$ will strengthen internal representations of task, while greater ${\lambda}_{\text{diag}}$ will amplify dominance of Pro and Anti pairs in opposite hemispheres (Section 'Connectivity eigendecomposition and processing modes'). Unlike the sensitivity dimension, the dimensions ${\mathbf{\mathbf{v}}}_{a}$ that perturb isolated connectivity eigenvalues ${\lambda}_{a}$ for $a\in \{\text{all},\text{side},\text{task},\text{diag}\}$ are independent of ${\mathbf{\mathbf{z}}}^{*}(sW)$ (see Section 'Connectivity eigendecomposition and processing modes'), e.g.
Connectivity perturbation analyses reveal that decreasing ${\lambda}_{\text{task}}$ has a very similar effect on Anti accuracy as perturbations along the sensitivity dimension (Figure 4D, middle). The similar effects of perturbations along the sensitivity dimension ${\mathbf{\mathbf{v}}}_{1}({\mathbf{\mathbf{z}}}^{*})$ and reduction of task eigenvalue (via perturbations along ${\mathbf{\mathbf{v}}}_{\text{task}}$) suggest that there is a carefully tuned strength of task representation in connectivity regime 1, which if disturbed results in random Antitrial responses. Finally, we recognize that increasing ${\lambda}_{\text{diag}}$ has opposite effects on Antiaccuracy in each regime (Figure 4D, right). In the next section, we build on these mechanistic characterizations of each regime by examining their resilience to optogenetic inactivation.
EPI inferred SC connectivities reproduce results from optogenetic inactivation experiments
During the delay period of this task, the circuit must prepare to execute the correct task according to the presented cue. The circuit must then maintain a representation of task throughout the delay period, which is important for correct execution of the Antitask. Duan et al. found that bilateral optogenetic inactivation of SC during the delay period consistently decreased performance in the Antitask, but had no effect on the Protask (Figure 5A; Duan et al., 2021). The distribution of connectivities inferred by EPI exhibited this same effect in simulation at high optogenetic strengths γ, which reduce the network activities $\mathbf{\mathbf{x}}(t)$ by a factor $1\gamma $ (Figure 5B) (see Section 'Modeling optogenetic silencing').
To examine how connectivity affects response to delay period inactivation, we grouped connectivities of the EPI distribution along the continuum linking regimes 1 and 2 of Section 'EPI identifies two regimes of rapid task switching'. $Z(sW)$ is the set of EPI samples for which the closest mode was ${\mathbf{\mathbf{z}}}^{*}(sW)$ (see Section 'Mode identification with EPI'). In the following analyses, we examine how error, and the influence of connectivity eigenvalue on Antierror change along this continuum of connectivities. Obtaining the parameter samples for these analysis with the learned EPI distribution was more than 20,000 times faster than a brute force approach (see Section 'Sample grouping by mode').
The mean increase in Antierror of the EPI distribution is closest to the experimentally measured value of 7% at $\gamma =0.675$ (Figure 5B, black dot). At this level of optogenetic strength, regime 1 exhibits an increase in Antierror with delay period silencing (Figure 5C, left), while regime 2 does not. In regime 1, greater ${\lambda}_{\text{task}}$ and ${\lambda}_{\text{diag}}$ decrease Antierror (Figure 5C, right). In other words, stronger task representations and diagonal amplification make the SC model more resilient to delay period silencing in the Antitask. This complements the finding from Duan et al., 2021 (Duan et al., 2021) that ${\lambda}_{\text{task}}$ and ${\lambda}_{\text{diag}}$ improve Anti accuracy.
At roughly $\gamma =0.85$ (Figure 5B, gray dot), the Antierror saturates, while Proerror remains at zero. Following delay period inactivation at this optogenetic strength, there are strong similarities in the responses of Pro and Antitrials during the choice period (Figure 5D, left). We interpreted these similarities to suggest that delay period inactivation at this saturated level flips the internal representation of task (from Anti to Pro) in the circuit model. A flipped task representation would explain why the Antierror saturates at 50%: the average Antiaccuracy in EPI inferred connectivities is 75%, but average Anti accuracy would be 25% (100%  ${\mathbb{E}}_{\mathbf{z}}\left[{p}_{P}\right]$) if the internal representation of task is flipped during the delay period.This hypothesis prescribes a model of Antiaccuracy during delay period silencing of ${p}_{A,\text{opto}}=100\%{p}_{P}$, which is fit closely across both regimes of the EPI inferred connectivities (Figure 5D, right). Similarities between Pro and Antitrial responses were not present at the experimentmatching level of $\gamma =0.675$ (Figure 5—figure supplement 2 left) and neither was anticorrelation in ${p}_{P}$ and ${p}_{A,\text{opto}}$ (Figure 5—figure supplement 2 right).
In summary, the connectivity inferred by EPI to perform rapid task switching replicated results from optogenetic silencing experiments. We found that at levels of optogenetic strength matching experimental levels of Antierror, only one regime actually exhibited the effect. This connectivity regime is less resilient to optogenetic perturbation, and perhaps more biologically realistic. Finally, we characterized the pathology in Antierror that occurs in both regimes when optogenetic strength is increased to high levels, leading to a mechanistic hypothesis that is experimentally testable. The probabilistic tools afforded by EPI yielded this insight: we identified two regimes and the continuum of connectivities between them by taking gradients of parameter probabilities in the EPI distribution, we identified sensitivity dimensions by measuring the Hessian of the EPI distribution, and we obtained many parameter samples at each step along the continuum at an efficient rate.
Discussion
In neuroscience, machine learning has primarily been used to reveal structure in neural datasets (Paninski and Cunningham, 2018). Careful inference procedures are developed for these statistical models allowing precise, quantitative reasoning, which clarifies the way data informs beliefs about the model parameters. However, these statistical models often lack resemblance to the underlying biology, making it unclear how to go from the structure revealed by these methods, to the neural mechanisms giving rise to it. In contrast, theoretical neuroscience has primarily focused on careful models of neural circuits and the production of emergent properties of computation, rather than measuring structure in neural datasets. In this work, we improve upon parameter inference techniques in theoretical neuroscience with emergent property inference, harnessing deep learning towards parameter inference in neural circuit models (see Section 'Related approaches').
Methodology for statistical inference in circuit models has evolved considerably in recent years. Early work used rejection sampling techniques (Beaumont et al., 2002; Marjoram et al., 2003; Sisson et al., 2007), but EPI and another recently developed methodology (Gonçalves et al., 2019) employ deep learning to improve efficiency and provide flexible approximations. SNPE has been used for posterior inference of parameters in circuit models conditioned upon exemplar data used to represent computation, but it does not infer parameter distributions that only produce the computation of interest like EPI (see Section 'Scaling inference of recurrent neural network connectivity with EPI'). When strict control over the predictions of the inferred parameters is necessary, EPI uses a constrained optimization technique (LoaizaGanem et al., 2017) (see Section 'Augmented lagrangian optimization') to make inference conditioned on the emergent property possible.
A key difference between EPI and SNPE, is that EPI uses gradients of the emergent property throughout optimization. In Section 'Scaling inference of recurrent neural network connectivity with EPI', we showed that such gradients confer beneficial scaling properties, but a concern remains that emergent property gradients may be too computationally intensive. Even in a case of close biophysical realism with an expensive emergent property gradient, EPI was run successfully on intermediate hub frequency in a fiveneuron subcircuit model of the STG (Section 'Motivating emergent property inference of theoretical models'). However, conditioning on the pyloric rhythm (Marder and Selverston, 1992) in a model of the pyloric subnetwork model (Prinz et al., 2004) proved to be prohibitive with EPI. The pyloric subnetwork requires many time steps for simulation and many key emergent property statistics (e.g. burst duration and phase gap) are not calculable or easily approximated with differentiable functions. In such cases, SNPE, which does not require differentiability of the emergent property, has proven useful (Gonçalves et al., 2019). In summary, choice of deep inference technique should consider emergent property complexity and differentiability, dimensionality of parameter space, and the importance of constraining the model behavior predicted by the inferred parameter distribution.
In this paper, we demonstrate the value of deep inference for parameter sensitivity analyses at both the local and global level. With these techniques, flexible deep probability distributions are optimized to capture global structure by approximating the full distribution of suitable parameters. Importantly, the local structure of this deep probability distribution can be quantified at any parameter choice, offering instant sensitivity measurements after fitting. For example, the global structure captured by EPI revealed two distinct parameter regimes, which had different local structure quantified by the deep probability distribution (see Section 'Superior colliculus'). In comparison, bayesian MCMC is considered a popular approach for capturing global parameter structure (Girolami and Calderhead, 2011), but there is no variational approximation (the deep probability distribution in EPI), so sensitivity information is not queryable and sampling remains slow after convergence. Local sensitivity analyses (e.g. Raue et al., 2009) may be performed independently at individual parameter samples, but these methods alone do not capture the full picture in nonlinear, complex distributions. In contrast, deep inference yields a probability distribution that produces a wholistic assessment of parameter sensitivity at the local and global level, which we used in this study to make novel insights into a range of theoretical models. Together, the abilities to condition upon emergent properties, the efficient inference algorithm, and the capacity for parameter sensitivity analyses make EPI a useful method for addressing inverse problems in theoretical neuroscience.
Code availability statement
All software written for this study is available at https://github.com/cunninghamlab/epi (copy archived at swh:1:rev:38febae7035ca921334a616b0f396b3767bf18d4), Bittner, 2021.
Materials and methods
Emergent property inference (EPI)
Request a detailed protocolSolving inverse problems is an important part of theoretical neuroscience, since we must understand how neural circuit models and their parameter choices produce computations. Recently, research on machine learning methodology for neuroscience has focused on finding latent structure in largescale neural datasets, while research in theoretical neuroscience generally focuses on developing precise neural circuit models that can produce computations of interest. By quantifying computation into an emergent property through statistics of the emergent activity of neural circuit models, we can adapt the modern technique of deep probabilistic inference towards solving inverse problems in theoretical neuroscience. Here, we introduce a novel method for statistical inference, which uses deep networks to learn parameter distributions constrained to produce emergent properties of computation.
Consider model parameterization $\mathbf{\mathbf{z}}$, which is a collection of scientifically meaningful variables that govern the complex simulation of data $\mathbf{\mathbf{x}}$. For example (see Section 'Motivating emergent property inference of theoretical models'), $\mathbf{\mathbf{z}}$ may be the electrical conductance parameters of an STG subcircuit, and $\mathbf{\mathbf{x}}$ the evolving membrane potentials of the five neurons. In terms of statistical modeling, this circuit model has an intractable likelihood $p(\mathbf{\mathbf{x}}\mid \mathbf{\mathbf{z}})$, which is predicated by the stochastic differential equations that define the model. From a theoretical perspective, we are less concerned about the likelihood of an exemplar dataset $\mathbf{\mathbf{x}}$, but rather the emergent property of intermediate hub frequency (which implies a consistent dataset $\mathbf{\mathbf{x}}$).
In this work, emergent properties $\mathcal{X}$ are defined through the choice of emergent property statistic $f(\mathbf{\mathbf{x}};\mathbf{\mathbf{z}})$ (which is a vector of one or more statistics), and its means $\mathit{\bm{\mu}}$, and variances ${\mathit{\bm{\sigma}}}^{2}$:
In general, an emergent property may be a collection of first, second, or higherorder moments of a group of statistics, but this study focuses on the case written in Equation 12. In the STG example, intermediate hub frequency is defined by mean and variance constraints on the statistic of hub neuron frequency ${\omega}_{\text{hub}}(\mathbf{\mathbf{x}};\mathbf{\mathbf{z}})$ (Equations 2 and 3). Precisely, the emergent property statistics $f(\mathbf{\mathbf{x}};\mathbf{\mathbf{z}})$ must have means $\mathit{\bm{\mu}}$ and variances ${\mathit{\bm{\sigma}}}^{2}$ over the EPI distribution of parameters ($\mathbf{\mathbf{z}}\sim {q}_{\mathit{\bm{\theta}}}(\mathbf{\mathbf{z}})$) and the data produced by those parameters ($\mathbf{\mathbf{x}}\sim p(\mathbf{\mathbf{x}}\mid \mathbf{\mathbf{z}})$), where the inferred parameter distribution ${q}_{\mathit{\bm{\theta}}}(\mathbf{\mathbf{z}})$ itself is parameterized by deep network weights and biases $\mathit{\bm{\theta}}$.
In EPI, a deep probability distribution ${q}_{\mathit{\bm{\theta}}}(\mathbf{\mathbf{z}})$ is optimized to approximate the parameter distribution producing the emergent property $\mathcal{X}$. In contrast to simpler classes of distributions like the gaussian or mixture of gaussians, deep probability distributions are far more flexible and capable of fitting rich structure (Rezende and Mohamed, 2015; Papamakarios et al., 2019a). In deep probability distributions, a simple random variable ${\mathbf{\mathbf{z}}}_{0}\sim {q}_{0}({\mathbf{\mathbf{z}}}_{0})$ (we choose an isotropic gaussian) is mapped deterministically via a sequence of deep neural network layers (g_{1}, . g_{l}) parameterized by weights and biases $\mathit{\bm{\theta}}$ to the support of the distribution of interest:
Such deep probability distributions embed the inferred distribution in a deep network. Once optimized, this deep network representation of a distribution has remarkably useful properties: fast sampling and probability evaluations. Importantly, fast probability evaluations confer fast gradient and Hessian calculations as well.
Given this choice of circuit model and emergent property $\mathcal{X}$, ${q}_{\mathit{\bm{\theta}}}(\mathbf{\mathbf{z}})$ is optimized via the neural network parameters $\mathit{\bm{\theta}}$ to find a maximally entropic distribution ${q}_{\mathit{\bm{\theta}}}^{*}$ within the deep variational family $\mathcal{Q}=\{{q}_{\mathit{\bm{\theta}}}(\mathbf{\mathbf{z}}):\mathit{\bm{\theta}}\in \mathrm{\Theta}\}$ that produces the emergent property $\mathcal{X}$:
where $H({q}_{\mathit{\bm{\theta}}}(\mathbf{\mathbf{z}}))={\mathbb{E}}_{\mathbf{\mathbf{z}}}\left[\mathrm{log}{q}_{\mathit{\bm{\theta}}}(z)\right]$ is entropy. By maximizing the entropy of the inferred distribution ${q}_{\mathit{\bm{\theta}}}$, we select the most random distribution in family $\mathcal{Q}$ that satisfies the constraints of the emergent property. Since entropy is maximized in Equation 14, EPI is equivalent to bayesian variational inference (see Section 'EPI as variational inference'), which is why we specify the inferred distribution of EPI as conditioned upon emergent property $\mathcal{X}$ with the notation ${q}_{\mathit{\bm{\theta}}}(\mathbf{\mathbf{z}}\mid \mathcal{X})$. To run this constrained optimization, we use an augmented lagrangian objective, which is the standard approach for constrained optimization (Bertsekas, 2014), and the approach taken to fit Maximum Entropy Flow Networks (MEFNs) (LoaizaGanem et al., 2017). This procedure is detailed in Section 'Augmented lagrangian optimization' and the pseudocode in Algorithm 'Augmented lagrangian optimization'.
In the remainder of Section 'Emergent property inference (EPI)', we will explain the finer details and motivation of the EPI method. First, we explain related approaches and what EPI introduces to this domain (Section 'Related approaches'). Second, we describe the special class of deep probability distributions used in EPI called normalizing flows (Section 'Deep probability distributions and normalizing flows'). Then, we establish the known relationship between maximum entropy distributions and exponential families (Section 'Maximum entropy distributions and exponential families'). Next, we explain the constrained optimization technique used to solve Equation 14 (Section 'Augmented lagrangian optimization'). Then, we demonstrate the details of this optimization in a toy example (Section 'Example: 2D LDS'). Finally, we explain how EPI is equivalent to variational inference (Section 'EPI as variational inference').
Related approaches
Request a detailed protocolWhen bayesian inference problems lack conjugacy, scientists use approximate inference methods like variational inference (VI) (Saul and Jordan, 1998) and Markov chain Monte Carlo (MCMC) (Metropolis et al., 1953; Hastings, 1970). After optimization, variational methods return a parameterized posterior distribution, which we can analyze. Also, the variational approximation is often chosen such that it permits fast sampling. In contrast MCMC methods only produce samples from the approximated posterior distribution. No parameterized distribution is estimated, and additional samples are always generated with the same sampling complexity. Inference in models defined by systems of differential has been demonstrated with MCMC (Girolami and Calderhead, 2011), although this approach requires tractable likelihoods. Advancements have introduced sampling (Calderhead and Girolami, 2011), likelihood approximation (Golightly and Wilkinson, 2011), and uncertainty quantification techniques (Chkrebtii et al., 2016) to make MCMC approaches more efficient and expand the class of applicable models.
Simulationbased inference (Cranmer et al., 2020) is model parameter inference in the absence of a tractable likelihood function. The most prevalent approach to simulationbased inference is approximate bayesian computation (ABC) (Beaumont et al., 2002), in which satisfactory parameter samples are kept from random prior sampling according to a rejection heuristic. The obtained set of parameters do not have a probabilities, and further insight about the model must be gained from examination of the parameter set and their generated activity. Methodological advances to ABC methods have come through the use of Markov chain Monte Carlo (MCMCABC) (Marjoram et al., 2003) and sequential Monte Carlo (SMCABC) (Sisson et al., 2007) sampling techniques. SMCABC is considered stateoftheart ABC, yet this approach still struggles to scale in dimensionality (Sisson et al., 2018; Figure 2). Still, this method has enjoyed much success in systems biology (Liepe et al., 2014). Furthermore, once a parameter set has been obtained by SMCABC from a finite set of particles, the SMCABC algorithm must be run again from scratch with a new population of initialized particles to obtain additional samples.
For scientific model analysis, we seek a parameter distribution represented by an approximating distribution as in variational inference (Saul and Jordan, 1998): a variational approximation that once optimized yields fast analytic calculations and samples. For the reasons described above, ABC and MCMC techniques are not suitable, because they only produce a set of parameter samples lacking probabilities and have unchanging sampling rate. EPI infers parameters in circuit models using the MEFN (LoaizaGanem et al., 2017) algorithm with a deep variational approximation. The deep neural network of EPI (Figure 1E) defines the parametric form (with weights and biases as variational parameters $\mathit{\bm{\theta}}$) of the variational approximation of the inferred parameter distribution ${q}_{\mathit{\bm{\theta}}}(\mathbf{\mathbf{z}}\mid \mathbf{\mathbf{x}})$. The EPI optimization is enabled using stochastic gradient techniques in the spirit of likelihoodfree variational inference (Tran et al., 2017). The analytic relationship between EPI and variational inference is explained in Section 'EPI as variational inference'.
We note that, during our preparation and early presentation of this work (Bittner et al., 2019a; Bittner et al., 2019b), another work has arisen with broadly similar goals: bringing statistical inference to mechanistic models of neural circuits (Nonnenmacher et al., 2018; Michael et al., 2019; Gonçalves et al., 2019). We are encouraged by this general problem being recognized by others in the community, and we emphasize that these works offer complementary neuroscientific contributions (different theoretical models of focus) and use different technical methodologies (ours is built on our prior work [LoaizaGanem et al., 2017], theirs similarly [Lueckmann et al., 2017]).
The method EPI differs from SNPE in some key ways. SNPE belongs to a ‘sequential’ class of recently developed simulationbased inference methods in which two neural networks are used for posterior inference. This first neural network is a deep probability distribution (normalizing flow) used to estimate the posterior $p(\mathbf{\mathbf{z}}\mid \mathbf{\mathbf{x}})$ (SNPE) or the likelihood $p(\mathbf{\mathbf{x}}\mid \mathbf{\mathbf{z}})$ (sequential neural likelihood (SNL) [Papamakarios et al., 2019b]). A recent approach uses an unconstrained neural network to estimate the likelihood ratio (sequential neural ratio estimation (SNRE) [Hermans et al., 2020]). In SNL and SNRE, MCMC sampling techniques are used to obtain samples from the approximated posterior. This contrasts with EPI and SNPE, which use deep probability distributions to model parameters, which facilitates immediate measurements of sample probability, gradient, or Hessian for system analysis. The second neural network in this sequential class of methods is the amortizer. This unconstrained deep network maps data $\mathbf{\mathbf{x}}$ (or statistics $f(\mathbf{\mathbf{x}};\mathbf{\mathbf{z}})$ or model parameters $\mathbf{\mathbf{z}}$) to the weights and biases of the first neural network. These methods are optimized on a conditional density (or ratio) estimation objective. The data used to optimize this objective are generated via an adaptive procedure, in which training data pairs (${\mathbf{\mathbf{x}}}_{i}$, ${\mathbf{\mathbf{z}}}_{i}$) become sequentially closer to the true data and posterior.
The approximating fidelity of the deep probability distribution in sequential approaches is optimized to generalize across the training distribution of the conditioning variable. This generalization property of the sequential methods can reduce the accuracy at the singular posterior of interest. Whereas in EPI, the entire expressivity of the deep probability distribution is dedicated to learning a single distribution as well as possible. The wellknown inverse mapping problem of exponential families (Wainwright and Jordan, 2008) prohibits an amortizationbased approach in EPI, since EPI learns an exponential family distribution parameterized by its mean (in contrast to its natural parameter, see Section 'Maximum entropy distributions and exponential families'). However, we have shown that the same twonetwork architecture of the sequential simulationbased inference methods can be used for amortized inference in intractable exponential family posteriors when using their natural parameterization (Bittner and Cunningham, 2019).
Finally, one important differentiating factor between EPI and sequential simulationbased inference methods is that EPI leverages gradients ${\nabla}_{\mathbf{\mathbf{z}}}f(\mathbf{\mathbf{x}};\mathbf{\mathbf{z}})$ during optimization. These gradients can improve convergence time and scalability, as we have shown on an example conditioning lowrank RNN connectivity on the property of stable amplification (see Section 'Scaling inference of recurrent neural network connectivity with EPI'). With EPI, we prove out the suggestion that a deep inference technique can improve efficiency by leveraging these emergent property gradients when they are tractable. Sequential simulationbased inference techniques may be better suited for scientific problems where ${\nabla}_{\mathbf{\mathbf{z}}}f(\mathbf{\mathbf{x}};\mathbf{\mathbf{z}})$ is intractable or unavailable, like when there is a nondifferentiable emergent property. However, the sequential simulationbased inference techniques cannot constrain the predictions of the inferred distribution in the manner of EPI.
Structural identifiability analysis involves the measurement of sensitivity and unidentifiabilities in scientific models. Around a single parameter choice, one can measure the Jacobian. One approach for this calculation that scales well is EAR (Karlsson et al., 2012). A popular efficient approach for systems of ODEs has been neural ODE adjoint (Chen et al., 2018) and its stochastic adaptation (Li et al., 2020). Casting identifiability as a statistical estimation problem, the profile likelihood works via iterated optimization while holding parameters fixed (Raue et al., 2009). An exciting recent method is capable of recovering the functional form of such unidentifiabilities away from a point by following degenerate dimensions of the fisher information matrix (Raman et al., 2017). Global structural nonidentifiabilities can be found for models with polynomial or rational dynamics equations using DAISY (Pia Saccomani et al., 2003), or through mean optimal transformations (Hengl et al., 2007). With EPI, we have all the benefits given by a statistical inference method plus the ability to query the first or secondorder gradient of the probability of the inferred distribution at any chosen parameter value. The secondorder gradient of the log probability (the Hessian), which is directly afforded by EPI distributions, produces quantified information about parametric sensitivity of the emergent property in parameter space (see Section 'Emergent property inference via deep generative models').
Deep probability distributions and normalizing flows
Request a detailed protocolDeep probability distributions are comprised of multiple layers of fully connected neural networks (Equation 13). When each neural network layer is restricted to be a bijective function, the sample density can be calculated using the change of variables formula at each layer of the network. For ${\mathbf{\mathbf{z}}}_{i}={g}_{i}({\mathbf{\mathbf{z}}}_{\mathbf{\mathbf{i}}\mathrm{\U0001d7cf}})$,
However, this computation has cubic complexity in dimensionality for fully connected layers. By restricting our layers to normalizing flows (Rezende and Mohamed, 2015; Papamakarios et al., 2019a) – bijective functions with fast log determinant Jacobian computations, which confer a fast calculation of the sample log probability. Fast log probability calculation confers efficient optimization of the maximum entropy objective (see Section 'Augmented lagrangian optimization').
We use the real NVP (Dinh et al., 2017) normalizing flow class, because its coupling architecture confers both fast sampling (forward) and fast log probability evaluation (backward). Fast probability evaluation facilitates fast gradient and Hessian evaluation of log probability throughout parameter space. Glow permutations were used in between coupling stages (Kingma and Dhariwal, 2018). This is in contrast to autoregressive architectures (Papamakarios et al., 2017; Kingma et al., 2016), in which only one of the forward or backward passes can be efficient. In this work, normalizing flows are used as flexible parameter distribution approximations ${q}_{\mathit{\bm{\theta}}}(\mathbf{\mathbf{z}})$ having weights and biases $\mathit{\bm{\theta}}$. We specify the architecture used in each application by the number of real NVP affine coupling stages, and the number of neural network layers and units per layer of the conditioning functions.
When calculating Hessians of log probabilities in deep probability distributions, it is important to consider the normalizing flow architecture. With autoregressive architectures (Kingma et al., 2016; Papamakarios et al., 2017), fast sampling and fast log probability evaluations are mutually exclusive. That makes these architectures undesirable for EPI, where efficient sampling is important for optimization, and log probability evaluation speed predicates the efficiency of gradient and Hessian calculations. With real NVP coupling architectures, we get both fast sampling and fast Hessians making both optimization and scientific analysis efficient.
Maximum entropy distributions and exponential families
Request a detailed protocolThe inferred distribution of EPI is a maximum entropy distribution, which have fundamental links to exponential family distributions. A maximum entropy distribution of form:
where $T(\mathbf{\mathbf{z}})$ is the sufficient statistics vector and ${\mathit{\bm{\mu}}}_{\text{opt}}$ a vector of their mean values, will have probability density in the exponential family:
The mappings between the mean parameterization ${\mathit{\bm{\mu}}}_{\text{opt}}$ and the natural parameterization $\mathit{\bm{\eta}}$ are formally hard to identify except in special cases (Wainwright and Jordan, 2008).
In this manuscript, emergent properties are defined by statistics $f(\mathbf{\mathbf{x}};\mathbf{\mathbf{z}})$ having a fixed mean $\mathit{\bm{\mu}}$ and variance ${\mathit{\bm{\sigma}}}^{2}$ as in Equation 12. The variance constraint is a second moment constraint on $f(\mathbf{\mathbf{x}};\mathbf{\mathbf{z}})$:
As a general maximum entropy distribution (Equation 16), the sufficient statistics vector contains both first and second order moments of $f(\mathbf{\mathbf{x}};\mathbf{\mathbf{z}})$
which are constrained to the chosen means and variances
Thus, ${\mathit{\bm{\mu}}}_{\text{opt}}$ is used to denote the mean parameter of the maximum entropy distribution defined by the emergent property (all constraints), while $\mathit{\bm{\mu}}$ is only the mean of $f(\mathbf{\mathbf{x}};\mathbf{\mathbf{z}})$. The subscript ‘opt’ of ${\mathit{\bm{\mu}}}_{\text{opt}}$ is chosen since it contains all the constraint values to which the EPI optimization algorithm must adhere.
Augmented lagrangian optimization
Request a detailed protocolTo optimize ${q}_{\mathit{\bm{\theta}}}(\mathbf{\mathbf{z}})$ in Equation 14, the constrained maximum entropy optimization is executed using the augmented lagrangian method. The following objective is minimized:
where there are average constraint violations
${\mathit{\bm{\eta}}}_{\text{opt}}\in {\mathbb{R}}^{m}$ are the lagrange multipliers where $m$ is the number of total constraints
and $c$ is the penalty coefficient. The mean parameter ${\mathit{\bm{\mu}}}_{\text{opt}}$ and sufficient statistics $T(\mathbf{\mathbf{z}})$ are determined by the means $\mathit{\bm{\mu}}$ and variances ${\mathit{\bm{\sigma}}}^{2}$ of the emergent property statistics $f(\mathbf{\mathbf{x}};\mathbf{\mathbf{z}})$ defined in Equation 14. Specifically, $T(\mathbf{\mathbf{z}})$ is a concatenation of the first and second moments (Equation 19) and ${\mathit{\bm{\mu}}}_{\text{opt}}$ is a concatenation of their constraints $\mathit{\bm{\mu}}$ and ${\mathit{\bm{\sigma}}}^{2}$ (Equation 20). (Although, note that this algorithm is written for general $T(\mathbf{\mathbf{z}})$ and ${\mathit{\bm{\mu}}}_{\text{opt}}$ to satisfy the more general class of emergent properties.) The lagrange multipliers ${\mathit{\bm{\eta}}}_{\text{opt}}$ are closely related to the natural parameters $\mathit{\bm{\eta}}$ of exponential families (see Section 'EPI as variational inference'). Weights and biases $\mathit{\bm{\theta}}$ of the deep probability distribution are optimized according to Equation 21 using the Adam optimizer with learning rate 10^{−3} (Kingma and Ba, 2015).
The gradient with respect to entropy $H({q}_{\mathit{\bm{\theta}}}(\mathbf{\mathbf{z}}))$ can be expressed using the reparameterization trick as an expectation of the negative log density of parameter samples $\mathbf{\mathbf{z}}$ over the randomness in the parameterless initial distribution ${q}_{0}({\mathbf{\mathbf{z}}}_{0}$):
Thus, the gradient of the entropy of the deep probability distribution can be estimated as an average of gradients with respect to the base distribution ${\mathbf{\mathbf{z}}}_{0}$:
The gradients of the log density of the deep probability distribution are tractable through the use of normalizing flows (see Section 'Deep probability distributions and normalizing flows').
The full EPI optimization algorithm is detailed in Algorithm 1. The lagrangian parameters ${\mathit{\bm{\eta}}}_{\text{opt}}$ are initialized to zero and adapted following each augmented lagrangian epoch, which is a period of optimization with fixed (${\mathit{\bm{\eta}}}_{\text{opt}}$, $c$) for a given number of stochastic gradient descent (SGD) iterations. A low value of $c$ is used initially, and conditionally increased after each epoch based on constraint error reduction. The penalty coefficient is updated based on the result of a hypothesis test regarding the reduction in constraint violation. The pvalue of $\mathbb{E}[R({\mathit{\bm{\theta}}}_{k+1})]>\gamma \mathbb{E}\left[R({\mathit{\bm{\theta}}}_{k})\right]$ is computed, and ${c}_{k+1}$ is updated to $\beta {c}_{k}$ with probability $1p$. The other update rule is ${\mathit{\bm{\eta}}}_{\text{opt},k+1}={\mathit{\bm{\eta}}}_{\text{opt},k}+{c}_{k}\frac{1}{n}{\sum}_{i=1}^{n}(T({\mathbf{\mathbf{z}}}^{(i)}){\mathit{\bm{\mu}}}_{\text{opt}})$ given a batch size $n$ and ${\mathbf{\mathbf{z}}}^{(i)}\sim {q}_{\mathit{\bm{\theta}}}(\mathbf{\mathbf{z}})$. Throughout the study, $\gamma =0.25$, while β was chosen to be either 2 or 4. The batch size of EPI also varied according to application.
Algorithm 1. Emergent property inference 

1 initialize $\mathit{\bm{\theta}}$ by fitting ${q}_{\mathit{\bm{\theta}}}$ to an isotropic gaussian of mean ${\mathit{\bm{\mu}}}_{\text{init}}$ and variance ${\mathit{\bm{\sigma}}}_{\text{init}}^{2}$ 2 initialize ${c}_{0}>0$ and ${\mathit{\bm{\eta}}}_{\text{opt},0}=\mathrm{}$. 3 for Augmented lagrangian epoch $k=1,\mathrm{\dots},{k}_{\text{max}}$ do 4 for SGD iteration $i=1,\mathrm{\dots},{i}_{\text{max}}$ do 5 Sample ${\mathbf{\mathbf{z}}}_{0}^{(1)},\mathrm{\dots},{\mathbf{\mathbf{z}}}_{0}^{(n)}\sim {q}_{0}$, get transformed variable ${\mathbf{\mathbf{z}}}^{(j)}={g}_{\mathit{\bm{\theta}}}({\mathbf{\mathbf{z}}}_{0}^{(j)})$, $j=1,\mathrm{\dots},n$ 6 Update $\mathit{\bm{\theta}}$ by descending its stochastic gradient (using ADAM optimizer [Kingma and Ba, 2015]). $\begin{array}{ll}{\displaystyle {\mathrm{\nabla}}_{\mathit{\theta}}L(\mathit{\theta};{\mathit{\eta}}_{\text{opt},k},c)}& {\displaystyle =\frac{1}{n}\sum _{j=1}^{n}{\mathrm{\nabla}}_{\mathit{\theta}}\mathrm{log}{q}_{\mathit{\theta}}({\mathbf{z}}^{(j)})+\frac{1}{n}\sum _{j=1}^{n}{\mathrm{\nabla}}_{\mathit{\theta}}\left(T\left({\mathbf{z}}^{(j)}\right){\mathit{\mu}}_{\text{opt}}\right){\mathit{\eta}}_{\text{opt},k}}\\ & {\displaystyle +{c}_{k}\frac{2}{n}\sum _{j=1}^{\frac{n}{2}}{\mathrm{\nabla}}_{\mathit{\theta}}\left(T\left({\mathbf{z}}^{(j)}\right){\mathit{\mu}}_{\text{opt}}\right)\cdot \frac{2}{n}\sum _{j=\frac{n}{2}+1}^{n}\left(T\left({\mathbf{z}}^{(j)}\right){\mathit{\mu}}_{\text{opt}}\right)}\end{array}$ 7 end 8 Sample ${\mathbf{\mathbf{z}}}_{0}^{(1)},\mathrm{\dots},{\mathbf{\mathbf{z}}}_{0}^{(n)}\sim {q}_{0}$, get transformed variable ${\mathbf{\mathbf{z}}}^{(j)}={g}_{\mathit{\bm{\theta}}}({\mathbf{\mathbf{z}}}_{0}^{(j)})$, $j=1,\mathrm{\dots},n$ 9 Update ${\mathit{\bm{\eta}}}_{\text{opt},k+1}={\mathit{\bm{\eta}}}_{\text{opt},k}+{c}_{k}\frac{1}{n}{\sum}_{j=1}^{n}\left(T\left({\mathbf{\mathbf{z}}}^{(j)}\right){\mathit{\bm{\mu}}}_{\text{opt}}\right)$. 10 Update ${c}_{k+1}>{c}_{k}$ (see text for detail). 11 end 
In general, $c$ and ${\mathit{\bm{\eta}}}_{\text{opt}}$ should start at values encouraging entropic growth early in optimization. With each training epoch in which the update rule for $c$ is invoked, the constraint satisfaction terms are increasingly weighted, which generally results in decreased entropy (e.g. see Figure 1—figure supplement 1C). This encourages the discovery of suitable regions of parameter space, and the subsequent refinement of the distribution to produce the emergent property. The momentum parameters of the Adam optimizer are reset at the end of each augmented lagrangian epoch, which proceeds for ${i}_{\text{max}}$ iterations. In this work, we used a maximum number of augmented lagrangian epochs ${k}_{\text{max}}\ge 5$.
Rather than starting optimization from some $\mathit{\bm{\theta}}$ drawn from a randomized distribution, we found that initializing ${q}_{\mathit{\bm{\theta}}}(\mathbf{\mathbf{z}})$ to approximate an isotropic gaussian distribution conferred more stable, consistent optimization. The parameters of the gaussian initialization were chosen on an applicationspecific basis. Throughout the study, we chose isotropic Gaussian initializations with mean ${\mathit{\bm{\mu}}}_{\text{init}}$ at the center of the support of the distribution and some variance ${\mathit{\bm{\sigma}}}_{\text{init}}^{2}$, except for one case, where an initialization informed by random search was used (see Section 'Stomatogastric ganglion'). Deep probability distributions were fit to these gaussian initializations using 10,000 iterations of stochastic gradient descent on the evidence lower bound (as in Bittner and Cunningham, 2019) with Adam optimizer and a learning rate of ${10}^{3}$.
To assess whether the EPI distribution ${q}_{\mathit{\bm{\theta}}}(\mathbf{\mathbf{z}})$ produces the emergent property, we assess whether each individual constraint on the means and variances of $f(\mathbf{\mathbf{x}};\mathbf{\mathbf{z}})$ is satisfied. We consider the EPI to have converged when a null hypothesis test of constraint violations $R{(\mathit{\bm{\theta}})}_{i}$ being zero is accepted for all constraints $i\in \{1,\mathrm{\dots},m\}$ at a significance threshold $\alpha =0.05$. This significance threshold is adjusted through Bonferroni correction according to the number of constraints $m$. The pvalues for each constraint are calculated according to a twotailed nonparametric test, where 200 estimations of the sample mean $R{(\mathit{\bm{\theta}})}^{i}$ are made using ${N}_{\text{test}}$ samples of $\mathbf{\mathbf{z}}\sim {q}_{\mathit{\bm{\theta}}}(\mathbf{\mathbf{z}})$ at the end of the augmented lagrangian epoch. Of all ${k}_{\text{max}}$ augmented lagrangian epochs, we select the EPI inferred distribution as that which satisfies the convergence criteria and has greatest entropy.
When assessing the suitability of EPI for a particular modeling question, there are some important technical considerations. First and foremost, as in any optimization problem, the defined emergent property should always be appropriately conditioned (constraints should not have wildly different units). Furthermore, if the program is underconstrained (not enough constraints), the distribution grows (in entropy) unstably unless mapped to a finite support. If overconstrained, there is no parameter set producing the emergent property, and EPI optimization will fail (appropriately).
Example: 2D LDS
Request a detailed protocolTo gain intuition for EPI, consider a twodimensional linear dynamical system (2D LDS) model (Figure 1—figure supplement 1A):
with
To run EPI with the dynamics matrix elements as the free parameters $\mathbf{\mathbf{z}}=[{a}_{1,1},{a}_{1,2},{a}_{2,1},{a}_{2,2}]$ (fixing $\tau =1$ s), the emergent property statistics $f(\mathbf{\mathbf{x}};\mathbf{\mathbf{z}})$ were chosen to contain parts of the primary eigenvalue of $A$, which predicate frequency, $\text{imag}({\lambda}_{1})$, and the growth/decay, $\text{real}({\lambda}_{1})$, of the system
${\lambda}_{1}$ is the eigenvalue of greatest real part when the imaginary component is zero, and alternatively that of positive imaginary component when the eigenvalues are complex conjugate pairs. To learn the distribution of real entries of $A$ that produce a band of oscillating systems around 1 Hz, we formalized this emergent property as $\text{real}({\lambda}_{1})$ having mean zero with variance ${0.25}^{2}$, and the oscillation frequency $\frac{\text{imag}({\lambda}_{1})}{2\pi}$ having mean 1 Hz with variance 0.1 Hz^{2}:
To write the emergent property $\mathcal{X}$ in the form required for the augmented lagrangian optimization (Section 'Augmented lagrangian optimization'), we concatenate these first and second moment constraints into a vector of sufficient statistics $T(\mathbf{\mathbf{z}})$ and constraint values ${\mathit{\bm{\mu}}}_{\text{opt}}$.
From now on in all scientific applications (Sections 'Stomatogastric ganglion', 'Scaling EPI for stable amplification in RNNs', 'Primary visual cortex', 'Superior colliculus'), we specify how the EPI optimization was setup by specifying $f(\mathbf{\mathbf{x}};\mathbf{\mathbf{z}})$, $\mathit{\bm{\mu}}$, and ${\mathit{\bm{\sigma}}}^{2}$.
Unlike the models we presented in the main text, this model admits an analytical form for the mean emergent property statistics given parameter $\mathbf{\mathbf{z}}$, since the eigenvalues can be calculated using the quadratic formula:
We study this example, because the inferred distribution is curved and multimodal, and we can compare the result of EPI to analytically derived contours of the emergent property statistics.
Despite the simple analytic form of the emergent property statistics, the EPI distribution in this example is not simply determined. Although ${\mathbb{E}}_{\mathbf{\mathbf{z}}}\left[T(\mathbf{\mathbf{z}})\right]$ is calculable directly via a closed form function, the distribution ${q}_{\mathit{\bm{\theta}}}^{*}(\mathbf{\mathbf{z}}\mid \mathcal{X})$ cannot be derived directly. This fact is due to the formally hard problem of the backward mapping: finding the natural parameters $\mathit{\bm{\eta}}$ from the mean parameters $\mathit{\bm{\mu}}$ of an exponential family distribution (Wainwright and Jordan, 2008). Instead, we used EPI to approximate this distribution (Figure 1—figure supplement 1B). We used a real NVP normalizing flow architecture three coupling layers and twolayer neural networks of 50 units per layer, mapped onto a support of ${z}_{i}\in [10,10]$. (see Section 'Deep probability distributions and normalizing flows').
Even this relatively simple system has nontrivial (although intuitively sensible) structure in the parameter distribution. To validate our method, we analytically derived the contours of the probability density from the emergent property statistics and values. In the ${a}_{1,1}$${a}_{2,2}$ plane, the black line at $\text{real}({\lambda}_{1})=\frac{{a}_{1,1}+{a}_{2,2}}{2}=0$, dashed black line at the standard deviation $\text{real}({\lambda}_{1})=\frac{{a}_{1,1}+{a}_{2,2}}{2}\pm 0.25$, and the dashed gray line at twice the standard deviation $\text{real}({\lambda}_{1})=\frac{{a}_{1,1}+{a}_{2,2}}{2}\pm 0.5$ follow the contour of probability density of the samples (Figure 1—figure supplement 2A). The distribution precisely reflects the desired statistical constraints and model degeneracy in the sum of ${a}_{1,1}$ and ${a}_{2,2}$. Intuitively, the parameters equivalent with respect to emergent property statistic $\text{real}({\lambda}_{1})$ have similar log densities.
To explain the bimodality of the EPI distribution, we examined the imaginary component of ${\lambda}_{1}$. When $\text{real}({\lambda}_{1})={a}_{1,1}+{a}_{2,2}=0$ (which is the case on average in $\mathcal{X}$), we have
In Figure 1—figure supplement 2B, we plot the contours of $\text{imag}({\lambda}_{1})$ where ${a}_{1,1}{a}_{2,2}$ is fixed to 0 at one standard deviation ($\frac{\pi}{5}$, black dashed) and two standard deviations ($\frac{2\pi}{5}$, gray dashed) from the mean of $2\pi $. This validates the curved multimodal structure of the inferred distribution learned through EPI. Subtler combinations of model and emergent property will have more complexity, further motivating the use of EPI for understanding these systems. As we expect, the distribution results in samples of twodimensional linear systems oscillating near 1 Hz (Figure 1—figure supplement 3).
EPI as variational inference
Request a detailed protocolIn variational inference, a posterior approximation ${q}_{\mathit{\bm{\theta}}}^{*}$ is chosen from within some variational family $\mathcal{Q}$ to be as close as possible to the posterior under the KL divergence criteria
This KL divergence can be written in terms of entropy of the variational approximation:
Since the marginal distribution of the data $p(\mathbf{\mathbf{x}})$ (or ‘evidence’) is independent of $\mathit{\bm{\theta}}$, variational inference is executed by optimizing the remaining expression. This is usually framed as maximizing the evidence lower bound (ELBO)
Now, we will show how the maximum entropy problem of EPI is equivalent to variational inference. In general, a maximum entropy problem (as in Equation 16) has an equivalent lagrange dual form:
with lagrange multipliers ${\mathit{\bm{\eta}}}^{*}$. By moving the lagrange multipliers within the expectation
inserting a $\mathrm{log}\mathrm{exp}(\cdot )$ within the expectation,
and finally choosing $T(\cdot )$ to be likelihood averaged statistics as in EPI
we can compare directly to the objective used in variational inference (Equation 36). We see that EPI is exactly variational inference with an exponential family likelihood defined by sufficient statistics $T(\mathbf{\mathbf{z}})={\mathbb{E}}_{\mathbf{\mathbf{x}}\sim p(\mathbf{\mathbf{x}}\mid \mathbf{\mathbf{z}})}\left[\varphi (\mathbf{\mathbf{x}};\mathbf{\mathbf{z}})\right]$, and where the natural parameter ${\mathit{\bm{\eta}}}^{*}$ is predicated by the mean parameter ${\mathit{\bm{\mu}}}_{\text{opt}}$. Equation 40 implies that EPI uses an improper (or uniform) prior, which is easily changed.
This derivation of the equivalence between EPI and variational inference emphasizes why defining a statistical inference program by its mean parameterization ${\mathit{\bm{\mu}}}_{\text{opt}}$ is so useful. With EPI, one can clearly define the emergent property $\mathcal{X}$ that the model of interest should produce through intuitive selection of ${\mathit{\bm{\mu}}}_{\text{opt}}$ for a given $T(\mathbf{\mathbf{z}})$. Alternatively, figuring out the correct natural parameters ${\mathit{\bm{\eta}}}^{*}$ for the same $T(\mathbf{\mathbf{z}})$ that produces $\mathcal{X}$ is a formally hard problem.
Stomatogastric ganglion
Request a detailed protocolIn Section 'Motivating emergent property inference of theoretical models' and 'Emergent property inference via deep generative models', we used EPI to infer conductance parameters in a model of the stomatogastric ganglion (STG) (Gutierrez et al., 2013). This fiveneuron circuit model represents two subcircuits: that generating the pyloric rhythm (fast population) and that generating the gastric mill rhythm (slow population). The additional neuron (the IC neuron of the STG) receives inhibitory synaptic input from both subcircuits, and can couple to either rhythm dependent on modulatory conditions. There is also a parametric regime in which this neuron fires at an intermediate frequency between that of the fast and slow populations (Gutierrez et al., 2013), which we infer with EPI as a motivational example. This model is not to be confused with an STG subcircuit model of the pyloric rhythm (Marder and Selverston, 1992), which has been statistically inferred in other studies (Prinz et al., 2004; Gonçalves et al., 2019).
STG model
Request a detailed protocolWe analyze how the parameters $\mathbf{\mathbf{z}}=[{g}_{\text{el}},{g}_{\text{synA}}]$ govern the emergent phenomena of intermediate hub frequency in a model of the stomatogastric ganglion (STG) (Gutierrez et al., 2013) shown in Figure 1A with activity $\mathbf{\mathbf{x}}=[{x}_{\text{f1}},{x}_{\text{f2}},{x}_{\text{hub}},{x}_{\text{s1}},{x}_{\text{s2}}]$, using the same hyperparameter choices as Gutierrez et al. Each neuron’s membrane potential ${x}_{\alpha}(t)$ for $\alpha \in \{\text{f1},\text{f2},\text{hub},\text{s1},\text{s2}\}$ is the solution of the following stochastic differential equation:
The input current of each neuron is the sum of the leak, calcium, potassium, hyperpolarization, electrical and synaptic currents. Each current component is a function of all membrane potentials and the conductance parameters $\mathbf{\mathbf{z}}$. Finally, we include gaussian noise $dB$ to the model of Gutierrez et al. so that the model stochastic, although this is not required by EPI.
The capacitance of the cell membrane was set to ${C}_{m}=1nF$. Specifically, the currents are the difference in the neuron’s membrane potential and that current type’s reversal potential multiplied by a conductance:
The reversal potentials were set to ${V}_{leak}=40mV$, ${V}_{Ca}=100mV$, ${V}_{K}=80mV$, ${V}_{hyp}=20mV$, and ${V}_{syn}=75mV$. The other conductance parameters were fixed to ${g}_{leak}=1\times {10}^{4}\mu S$. $g}_{Ca$, ${g}_{K}$, and ${g}_{hyp}$ had different values based on fast, intermediate (hub) or slow neuron. The fast conductances had values ${g}_{Ca}=1.9\times {10}^{2}$, ${g}_{K}=3.9\times {10}^{2}$, and ${g}_{hyp}=2.5\times {10}^{2}$. The intermediate conductances had values ${g}_{Ca}=1.7\times {10}^{2}$, ${g}_{K}=1.9\times {10}^{2}$, and ${g}_{hyp}=8.0\times {10}^{3}$. Finally, the slow conductances had values ${g}_{Ca}=8.5\times {10}^{3}$, ${g}_{K}=1.5\times {10}^{2}$, and ${g}_{hyp}=1.0\times {10}^{2}$.
Furthermore, the Calcium, Potassium, and hyperpolarization channels have timedependent gating dynamics dependent on steadystate gating variables ${M}_{\mathrm{\infty}}$, ${N}_{\mathrm{\infty}}$ and ${H}_{\mathrm{\infty}}$, respectively:
where we set ${v}_{1}=0mV$, ${v}_{2}=20mV$, ${v}_{3}=0mV$, ${v}_{4}=15mV$, ${v}_{5}=78.3mV$, ${v}_{6}=10.5mV$, ${v}_{7}=42.2mV$, ${v}_{8}=87.3mV$, ${v}_{9}=5mV$, and ${v}_{th}=25mV$.
Finally, there is a synaptic gating variable as well:
When the dynamic gating variables are considered, this is actually a 15dimensional nonlinear dynamical system. The gaussian noise $d\mathbf{\mathbf{B}}$ has variance ${(1\times {10}^{12})}^{2}$ A^{2}, and introduces variability in frequency at each parameterization $\mathbf{\mathbf{z}}$.
Hub frequency calculation
Request a detailed protocolIn order to measure the frequency of the hub neuron during EPI, the STG model was simulated for $T=300$ time steps of $dt=25\text{ms}$. The chosen $dt$ and $T$ were the most computationally convenient choices yielding accurate frequency measurement. We used a basis of complex exponentials with frequencies from 0.0 to 1.0 Hz at 0.01 Hz resolution to measure frequency from simulated time series
To measure spiking frequency, we processed simulated membrane potentials with a relu (spike extraction) and lowpass filter with averaging window of size 20, then took the frequency with the maximum absolute value of the complex exponential basis coefficients of the processed timeseries. The first 20 temporal samples of the simulation are ignored to account for initial transients.
To differentiate through the maximum frequency identification, we used a softargmax Let ${X}_{\alpha}\in {\mathcal{C}}^{\mathrm{\Phi}}$ be the complex exponential filter bank dot products with the signal ${x}_{\alpha}\in {\mathbb{R}}^{N}$, where $\alpha \in \{\text{f1},\text{f2},\text{hub},\text{s1},\text{s2}\}$. The softargmax is then calculated using temperature parameter ${\beta}_{\psi}=100$
where $i=[0,1,\mathrm{\dots},100]$. The frequency is then calculated as
Intermediate hub frequency, like all other emergent properties in this work, is defined by the mean and variance of the emergent property statistics. In this case, we have one statistic, hub neuron frequency, where the mean was chosen to be 0.55 Hz,(Equation 2) and variance was chosen to be 0.025^{2} Hz^{2} (Equation 3).
EPI details for the STG model
Request a detailed protocolEPI was run for the STG model using
and
(see Sections 'Maximum entropy distributions and exponential families', 'Augmented lagrangian optimization', and example in Section 'Example: 2D LDS'). Throughout optimization, the augmented lagrangian parameters η and $c$, were updated after each epoch of ${i}_{\text{max}}=5,000$ iterations (see Section 'Augmented lagrangian optimization'). The optimization converged after five epochs (Figure 1—figure supplement 4).
For EPI in Figure 1E, we used a real NVP architecture with three coupling layers and twolayer neural networks of 25 units per layer. The normalizing flow architecture mapped ${\mathbf{\mathbf{z}}}_{0}\sim \mathcal{N}(\mathrm{},I)$ to a support of $\mathbf{\mathbf{z}}=[{g}_{\text{el}},{g}_{\text{synA}}]\in [4,8]\times [0.01,4]$, initialized to a gaussian approximation of samples returned by a preliminary ABC search. We did not include ${g}_{\text{synA}}<0.01$, for numerical stability. EPI optimization was run with an augmented lagrangian coefficient of ${c}_{0}={10}^{5}$, hyperparameter $\beta =2$, a batch size $n=400$, and we simulated one ${\mathbf{\mathbf{x}}}^{(i)}$ per ${\mathbf{\mathbf{z}}}^{(i)}$. The architecture converged with criteria ${N}_{\text{test}}=100$.
Hessian sensitivity vectors
Request a detailed protocolTo quantify the secondorder structure of the EPI distribution, we evaluated the Hessian of the log probability $\frac{{\partial}^{2}\mathrm{log}q(\mathbf{\mathbf{z}}\mid \mathcal{X})}{\partial {\mathbf{\mathbf{z}\mathbf{z}}}^{\top}}$. The eigenvector of this Hessian with most negative eigenvalue is defined as the sensitivity dimension ${\mathbf{\mathbf{v}}}_{1}$, and all subsequent eigenvectors are ordered by increasing eigenvalue. These eigenvalues are quantifications of how fast the emergent property deteriorates via the parameter combination of their associated eigenvector. In Figure 1D, the sensitivity dimension v_{1} (solid) and the second eigenvector of the Hessian v_{2} (dashed) are shown evaluated at the mode of the distribution. Since the Hessian eigenvectors have sign degeneracy, the visualized directions in 2D parameter space were chosen to have positive ${g}_{\text{synA}}$. The length of the arrows is inversely proportional to the square root of the absolute value of their eigenvalues ${\lambda}_{1}=10.7$ and ${\lambda}_{2}=3.22$. For the same magnitude perturbation away from the mode, intermediate hub frequency only diminishes along the sensitivity dimension ${\mathbf{\mathbf{v}}}_{1}$ (Figure 1E–F).
Scaling EPI for stable amplification in RNNs
Request a detailed protocolRank2 RNN model
Request a detailed protocolWe examined the scaling properties of EPI by learning connectivities of RNNs of increasing size that exhibit stable amplification. Rank2 RNN connectivity was modeled as $W=U{V}^{\top}$, where $U=\left[\begin{array}{cc}\hfill {\mathbf{\mathbf{U}}}_{1}\hfill & \hfill {\mathbf{\mathbf{U}}}_{2}\hfill \end{array}\right]+g{\chi}^{(W)}$, $V=\left[\begin{array}{cc}\hfill {\mathbf{\mathbf{V}}}_{1}\hfill & \hfill {\mathbf{\mathbf{V}}}_{2}\hfill \end{array}\right]+g{\chi}^{(V)}$, and ${\chi}_{i,j}^{(W)},{\chi}_{i,j}^{(V)}\sim \mathcal{N}(0,1)$. This RNN model has dynamics
In this analysis, we inferred connectivity parameterizations $\mathbf{\mathbf{z}}={[{\mathbf{\mathbf{U}}}_{1}^{\top},{\mathbf{\mathbf{U}}}_{2}^{\top},{\mathbf{\mathbf{V}}}_{1}^{\top},{\mathbf{\mathbf{V}}}_{2}^{\top}]}^{\top}\in {[1,1]}^{(4N)}$ that produced stable amplification using EPI, SMCABC (Sisson et al., 2007), and SNPE (Gonçalves et al., 2019) (see Section Related methods).
Stable amplification
Request a detailed protocolFor this RNN model to be stable, all real eigenvalues of $W$ must be less than 1: $\text{real}({\lambda}_{1})<1$, where ${\lambda}_{1}$ denotes the greatest real eigenvalue of $W$. For a stable RNN to amplify at least one input pattern, the symmetric connectivity ${W}^{s}=\frac{W+W\top}{2}$ must have an eigenvalue greater than 1: ${\lambda}_{1}^{s}>1$, where ${\lambda}^{s}$ is the maximum eigenvalue of ${W}^{s}$. These two conditions are necessary and sufficient for stable amplification in RNNs (Bondanelli and Ostojic, 2020).
EPI details for RNNs
Request a detailed protocolWe defined the emergent property of stable amplification with means of these eigenvalues (0.5 and 1.5, respectively) that satisfy these conditions. To complete the emergent property definition, we chose variances (${0.25}^{2}$) about those means such that samples rarely violate the eigenvalue constraints. To write the emergent property of Equation 5 in terms of the EPI optimization, we have
and
(see Sections 'Maximum entropy distributions and exponential families', 'Augmented lagrangian optimization', and example in Section 'Example: 2D LDS'). Gradients of maximum eigenvalues of Hermitian matrices like ${W}^{s}$ are available with modern automatic differentiation tools. To differentiate through the $\text{real}({\lambda}_{1})$, we solved the following equation for eigenvalues of rank2 matrices using the rank reduced matrix ${W}^{r}={V}^{\top}U$
For EPI in Figure 2, we used a real NVP architecture with three coupling layers of affine transformations parameterized by twolayer neural networks of 100 units per layer. The initial distribution was a standard isotropic gaussian ${\mathbf{\mathbf{z}}}_{0}\sim \mathcal{N}(\mathrm{},I)$ mapped to the support of ${\mathbf{\mathbf{z}}}_{i}\in [1,1]$. We used an augmented lagrangian coefficient of ${c}_{0}={10}^{3}$, a batch size $n=200$, $\beta =4$, and we simulated one ${\mathbf{\mathbf{W}}}^{(i)}$ per ${\mathbf{\mathbf{z}}}^{(i)}$. We chose to use ${i}_{\text{max}}=500$ iterations per augmented lagrangian epoch and emergent property constraint convergence was evaluated at ${N}_{\text{test}}=200$ (Figure 2B blue line, and Figure 2C–D blue). It was fastest to initialize the EPI distribution on a Tesla V100 GPU, and then subsequently optimize it on a CPU with 32 cores. EPI timing measurements accounted for this initialization period.
Methodological comparison
Request a detailed protocolWe compared EPI to two alternative simulationbased inference techniques, since the likelihood of these eigenvalues given $\mathbf{\mathbf{z}}$ is not available. Approximate bayesian computation (ABC) (Beaumont et al., 2002) is a rejection sampling technique for obtaining sets of parameters $\mathbf{\mathbf{z}}$ that produce activity $\mathbf{\mathbf{x}}$ close to some observed data ${\mathbf{\mathbf{x}}}_{0}$. Sequential Monte Carlo approximate bayesian computation (SMCABC) is the stateoftheart ABC method, which leverages SMC techniques to improve sampling speed. We ran SMCABC with the pyABC package (Klinger et al., 2018) to infer RNNs with stable amplification: connectivities having eigenvalues within an $\u03f5$defined $l$−2 distance of
SMCABC was run with a uniform prior over $\mathbf{\mathbf{z}}\in {[1,1]}^{(4N)}$, a population size of 1000 particles with simulations parallelized over 32 cores, and a multivariate normal transition model.
SNPE, the next approach in our comparison, is far more similar to EPI. Like EPI, SNPE treats parameters in mechanistic models with deep probability distributions, yet the two learning algorithms are categorically different. SNPE uses a twonetwork architecture to approximate the posterior distribution of the model conditioned on observed data ${\mathbf{\mathbf{x}}}_{0}$. The amortizing network maps observations ${\mathbf{\mathbf{x}}}_{i}$ to the parameters of the deep probability distribution. The weights and biases of the parameter network are optimized by sequentially augmenting the training data with additional pairs (${\mathbf{\mathbf{z}}}_{i}$, ${\mathbf{\mathbf{x}}}_{i}$) based on the most recent posterior approximation. This sequential procedure is important to get training data ${\mathbf{\mathbf{z}}}_{i}$ to be closer to the true posterior, and ${\mathbf{\mathbf{x}}}_{i}$ to be closer to the observed data. For the deep probability distribution architecture, we chose a masked autoregressive flow with affine couplings (the default choice), three transforms, 50 hidden units, and a normalizing flow mapping to the support as in EPI. This architectural choice closely tracked the size of the architecture used by EPI (Figure 2—figure supplement 1). As in SMCABC, we ran SNPE with ${\mathbf{\mathbf{x}}}_{0}=\mu $. All SNPE optimizations were run for a limit of 1.5 days, or until two consecutive rounds resulted in a validation log probability lower than the maximum observed for that random seed. It was always faster to run SNPE on a CPU with 32 cores rather than on a Tesla V100 GPU.
To compare the efficiency of these algorithms for inferring RNN connectivity distributions producing stable amplification, we develop a convergence criteria that can be used across methods. While EPI has its own hypothesis testing convergence criteria for the emergent property, it would not make sense to use this criteria on SNPE and SMCABC which do not constrain the means and variances of their predictions. Instead, we consider EPI and SNPE to have converged after completing its most recent optimization epoch (EPI) or round (SNPE) in which the distance ${{\mathbb{E}}_{\mathbf{\mathbf{z}},\mathbf{\mathbf{x}}}\left[f(\mathbf{\mathbf{x}};\mathbf{\mathbf{z}})\right]\mathit{\bm{\mu}}}_{2}$ is less than 0.5. We consider SMCABC to have converged once the population produces samples within the $\u03f5=0.5$ ball ensuring stable amplification.
When assessing the scalability of SNPE, it is important to check that alternative hyperparamterizations could not yield better performance. Key hyperparameters of the SNPE optimization are the number of simulations per round ${n}_{\text{round}}$, the number of atoms used in the atomic proposals of the SNPEC algorithm (Greenberg, 2019), and the batch size $n$. To match EPI, we used a batch size of $n=200$ for $N\le 25$, however we found $n=1,000$ to be helpful for SNPE in higher dimensions. While ${n}_{\text{round}}=1,000$ yielded SNPE convergence for $N\le 25$, we found that a substantial increase to ${n}_{\text{round}}=25,000$ yielded more consistent convergence at $N=50$ (Figure 2—figure supplement 2A). By increasing ${n}_{\text{round}}$, we also necessarily increase the duration of each round. At $N=100$, we tried two hyperparameter modifications. As suggested in Greenberg, 2019, we increased ${n}_{\text{atom}}$ by an order of magnitude to improve gradient quality, but this had little effect on the optimization (much overlap between same random seeds) (Figure 2—figure supplement 2B). Finally, we increased ${n}_{\text{round}}$ by an order of magnitude, which yielded convergence in one case, but no others. We found no way to improve the convergence rate of SNPE without making more aggressive hyperparameter choices requiring high numbers of simulations. In Figure 2C–D, we show samples from the random seed resulting in emergent property convergence at greatest entropy (EPI), the random seed resulting in greatest validation log probability (SNPE), and the result of all converged random seeds (SMC).
Effect of RNN parameters on EPI and SNPE inferred distributions
Request a detailed protocolTo clarify the difference in objectives of EPI and SNPE, we show their results on RNN models with different numbers of neurons $N$ and random strength $g$. The parameters inferred by EPI consistently produces the same mean and variance of $\text{real}({\lambda}_{1})$ and ${\lambda}_{1}^{s}$, while those inferred by SNPE change according to the model definition (Figure 2—figure supplement 3A). For $N=2$ and $g=0.01$, the SNPE posterior has greater concentration in eigenvalues around ${\mathbf{\mathbf{x}}}_{0}$ than at $g=0.1$, where the model has greater randomness (Figure 2—figure supplement 3B top, orange). At both levels of $g$ when $N=2$, the posterior of SNPE has lower entropy than EPI at convergence (Figure 2—figure supplement 3B top). However at $N=10$, SNPE results in a predictive distribution of more widely dispersed eigenvalues (Figure 2—figure supplement 3A bottom), and an inferred posterior with greater entropy than EPI (Figure 2—figure supplement 3B bottom). We highlight these differences not to focus on an insightful trend, but to emphasize that these methods optimize different objectives with different implications.
Note that SNPE converges when it’s validation log probability has saturated after several rounds of optimization (Figure 2—figure supplement 3C), and that EPI converges after several epochs of its own optimization to enforce the emergent property constraints (Figure 2—figure supplement 3D blue). Importantly, as SNPE optimizes its posterior approximation, the predictive means change, and at convergence may be different than ${\mathbf{\mathbf{x}}}_{0}$ (Figure 2—figure supplement 3D orange, left). It is sensible to assume that predictions of a wellapproximated SNPE posterior should closely reflect the data on average (especially given a uniform prior and a low degree of stochasticity); however, this is not a given. Furthermore, no aspect of the SNPE optimization controls the variance of the predictions (Figure 2—figure supplement 3D orange, right).
Primary visual cortex
Request a detailed protocolV1 model
Request a detailed protocolEI circuit models, rely on the assumption that inhibition can be studied as an indivisible unit, despite ample experimental evidence showing that inhibition is instead composed of distinct elements (Tremblay et al., 2016). In particular three types of genetically identified inhibitory celltypes – parvalbumin (P), somatostatin (S), VIP (V) – compose 80% of GABAergic interneurons in V1 (Markram et al., 2004; Rudy et al., 2011; Tremblay et al., 2016), and follow specific connectivity patterns (Figure 3A; Pfeffer et al., 2013), which lead to celltypespecific computations (Mossing et al., 2021; Palmigiano et al., 2020). Currently, how the subdivision of inhibitory celltypes, shapes correlated variability by reconfiguring recurrent network dynamics is not understood.
In the stochastic stabilized supralinear network (Hennequin et al., 2018), population rate responses $\mathbf{\mathbf{x}}$ to mean input $\mathbf{\mathbf{h}}$, recurrent input $W\mathbf{\mathbf{x}}$ and slow noise $\mathit{\u03f5}$ are governed by
where $\varphi (\cdot )={[\cdot ]}_{+}^{2}$, and the noise is an OrnsteinUhlenbeck process $\mathit{\u03f5}\sim OU({\tau}_{\text{noise}},\mathit{\bm{\sigma}})$
with ${\tau}_{\text{noise}}=5\text{ms}>\tau =1\text{ms}$. The noisy process is parameterized as
so that $\mathit{\bm{\sigma}}$ parameterizes the variance of the noisy input in the absence of recurrent connectivity ($W=\mathrm{}$). As contrast $c\in [0,1]$ increases, input to the E and Ppopulations increases relative to a baseline input $\mathbf{\mathbf{h}}={\mathbf{\mathbf{h}}}_{b}+c{\mathbf{\mathbf{h}}}_{c}$. Connectivity (${W}_{\text{fit}}$) and input (${\mathbf{\mathbf{h}}}_{b,\text{fit}}$ and ${\mathbf{\mathbf{h}}}_{c,\text{fit}}$) parameters were fit using the deterministic V1 circuit model (Palmigiano et al., 2020)
and
To obtain rates on a realistic scale (100fold greater), we map these fitted parameters to an equivalence class
and
Circuit responses are simulated using $T=200$ time steps at $dt=0.5\text{ms}$ from an initial condition drawn from $\mathbf{\mathbf{x}}(0)\sim U[10\text{Hz},25\text{Hz}]$. Standard deviation of the Epopulation ${s}_{E}(\mathbf{\mathbf{x}};\mathbf{\mathbf{z}})$ is calculated as the square root of the temporal variance from ${t}_{ss}=75\text{ms}$ to $Tdt=100\text{ms}$
EPI details for the V1 model
Request a detailed protocolTo write the emergent properties of Equation 7 in terms of the EPI optimization, we have
(or $\mathit{\bm{\mu}}=\left[\begin{array}{c}\hfill 10\hfill \end{array}\right]$), and
(see Sections 'Maximum entropy distributions and exponential families', 'Augmented lagrangian optimization', and example in Section 'Example: 2D LDS').
For EPI in Figure 3D–E and Figure 3—figure supplement 1, we used a real NVP architecture with three coupling layers and twolayer neural networks of 50 units per layer. The normalizing flow architecture mapped ${z}_{0}\sim \mathcal{N}(\mathrm{},I)$ to a support of $\mathbf{\mathbf{z}}=[{\sigma}_{E},{\sigma}_{P},{\sigma}_{S},{\sigma}_{V}]\in {[0.0,0.5]}^{4}$. EPI optimization was run using three different random seeds for architecture initialization $\mathit{\bm{\theta}}$ with an augmented lagrangian coefficient of ${c}_{0}={10}^{1}$, $\beta =2$, a batch size $n=100$, and simulated 100 trials to calculate average ${s}_{E}(\mathbf{\mathbf{x}};\mathbf{\mathbf{z}})$ for each ${\mathbf{\mathbf{z}}}^{(i)}$. We used ${i}_{\text{max}}=2,000$ iterations per epoch. The distributions shown are those of the architectures converging with criteria ${N}_{\text{test}}=100$ at greatest entropy across three random seeds. Optimization details are shown in Figure 3—figure supplement 2. The sums of squares of each pair of parameters are shown for each EPI distribution in Figure 3—figure supplement 3. The plots are histograms of 500 samples from each EPI distribution from which the significance pvalues of Section 'EPI reveals how recurrence with multiple inhibitory subtypes governs excitatory variability in a V1 model' are determined.
Sensitivity analyses
Request a detailed protocolIn Figure 3E, we visualize the modes of ${q}_{\mathit{\bm{\theta}}}(\mathbf{\mathbf{z}}\mid \mathcal{X})$ throughout the ${\sigma}_{E}$${\sigma}_{P}$ marginal. At each local mode ${\mathbf{\mathbf{z}}}^{*}({\sigma}_{P})$, where ${\sigma}_{P}$ is fixed, we calculated the Hessian and visualized the sensitivity dimension in the direction of positive ${\sigma}_{E}$.
Testing for the paradoxical effect
Request a detailed protocolThe paradoxical effect occurs when a populations steady state rate is decreased (or increased) when an increase (decrease) in current is applied to that population (Tsodyks et al., 1997). To see which, if any, populations exhibited a paradoxical effect, we examined responses to changes in input to individual neurontype populations, where the initial condition was the steady state response to $\mathbf{h}$ (Figure 3—figure supplement 4). Input magnitudes were chosen so that the effect is salient (0.002 for E and P, but 0.02 for S and V). Only the Ppopulation exhibited the paradoxical effect at this connectivity $W$ and input $\mathbf{\mathbf{h}}$.
Primary visual cortex: Mathematical intuition and challenges
Request a detailed protocolWe write the original Equations 68 and 69 in the following way:
where in this paper we chose $\mathrm{\Sigma}}_{\u03f5$, the covariance of the noise to be
where ${\stackrel{~}{\sigma}}_{\alpha}$ is the reparameterized standard deviation of the noise for population α from Equation 70.
We are interested in computing the covariance of the activity. For that, first we define $v=\omega x+h+\u03f5$, the total input to each cell type, and the matrix $S$, the negative Jacobian $S=I\omega {f}^{\mathrm{\prime}}(v)$. Then, Equation 81 can be written as an 8dimensional system. Linearizing around the fixed point of the system without fluctuations, we find the equations that describe the fluctuations of the input to each cell type:
where $d\mathbf{\mathbf{W}}$ is a vector with the private noise of each variable. The $d\mathbf{\mathbf{W}}$ term is multiplied by a nondiagonal matrix, because the noise that the voltage receives is the exact same as the one that comes from the OU process and not another process. The covariance of the inputs ${\mathrm{\Lambda}}_{v}=\u27e8\delta \mathbf{v}\delta {\mathbf{v}}^{T}\u27e9$ can be found as the solution the following Lyapunov equation (Hennequin et al., 2018; Gardiner, 2009):
Where ${\mathrm{\Lambda}}_{c}=\u27e8\delta \mathbf{v}\delta {\u03f5}^{T}\u27e9$ can be eliminated by solving this block matrix multiplication:
The equation above is another Lyapunov Equation, now in 4 dimensions. In the simplest case in which ${\tau}_{\text{noise}}=\tau $, the voltage is directly driven by white noise, and ${\mathrm{\Lambda}}_{v}$ can be expressed in powers of $S$ and ${S}^{T}$. Because $S$ satisfies its own polynomial equation (Cayley Hamilton theorem), there will be four coefficients for the expansion of $S$ and four for ${S}^{T}$, resulting in 16 coefficients that define ${\mathrm{\Lambda}}_{v}$ for a given $S$. Due to symmetry arguments (Gardiner, 2009), in this case the diagonal elements of the covariance matrix of the voltage will have the form:
These coefficients ${g}_{i}(S)$ are complicated functions of the Jacobian of the system. Although expressions for these coefficients can be found explicitly, only numerical evaluation of those expressions determine which components of the noisy input are going to strongly influence the variability of excitatory population. Showing the generality of this dependence in more complicated noise scenarios (e.g. ${\tau}_{\text{noise}}>\tau $ as in Section 'EPI reveals how recurrence with multiple inhibitory subtypes governs excitatory variability in a V1 model'), is the focus of current research.
Superior colliculus
SC model
Request a detailed protocolThe ability to switch between two separate tasks throughout randomly interleaved trials, or ‘rapid task switching,’ has been studied in rats, and midbrain superior colliculus (SC) has been show to play an important in this computation (Duan et al., 2015). Neural recordings in SC exhibited two populations of neurons that simultaneously represented both task context (Pro or Anti) and motor response (contralateral or ipsilateral to the recorded side), which led to the distinction of two functional classes: the Pro/Contra and Anti/Ipsi neurons (Duan et al., 2021). Given this evidence, Duan et al. proposed a model with four functionallydefined neurontype populations: two in each hemisphere corresponding to the Pro/Contra and Anti/Ipsi populations. We study how the connectivity of this neural circuit governs rapid task switching ability.
The four populations of this model are denoted as left Pro (LP), left Anti (LA), right Pro (RP) and right Anti (RA). Each unit has an activity (${x}_{\alpha}$) and internal variable (${u}_{\alpha}$) related by
where $\alpha \in \{LP,LA,RA,RP\}$, $a=0.05$ and $b=0.5$ control the position and shape of the nonlinearity. We order the neural populations of $x$ and $u$ in the following manner
which evolve according to
with time constant $\tau =0.09s$, step size 24 ms and Gaussian noise $d\mathbf{\mathbf{B}}$ of variance ${0.2}^{2}$. These hyperparameter values are motivated by modeling choices and results from Duan et al., 2021.
The weight matrix has four parameters for self $sW$, vertical $vW$, horizontal $hW$, and diagonal $dW$ connections:
We study the role of parameters $\mathbf{\mathbf{z}}={[sW,vW,hW,dW]}^{\top}$ in rapid task switching.
The circuit receives four different inputs throughout each trial, which has a total length of 1.8 s.
There is a constant input to every population,
a bias to the Pro populations
rulebased input depending on the condition
a choiceperiod input
and an input to the right or leftside depending on where the light stimulus is delivered
The input parameterization was fixed to ${I}_{\text{constant}}=0.75$, ${I}_{\text{P,bias}}=0.5$, ${I}_{\text{P,rule}}=0.6$, ${I}_{\text{A,rule}}=0.6$, ${I}_{\text{choice}}=0.25$, and ${I}_{\text{light}}=0.5$.
Task accuracy calculation
Request a detailed protocolThe accuracies of the Pro and Antitasks are calculated as
and
where ${d}_{P}(\mathbf{\mathbf{x}};\mathbf{\mathbf{z}})$ and ${d}_{A}(\mathbf{\mathbf{x}};\mathbf{\mathbf{z}})$ calculate the decision made in each trial (approximately 1 for correct and 0 for incorrect choices). Specifically,
in Protrials where the stimulus is on the left side, and Θ approximates the Heaviside step function. Similarly,
in Antitrials where the stimulus was on the left side. Our accuracy calculation only considers one stimulus presentation (Left), since the model is leftright symmetric. The accuracy is averaged over 200 independent trials, and the Heaviside step function is approximated as
where ${\beta}_{\mathrm{\Theta}}=100$.
EPI details for the SC model
Request a detailed protocolTo write the emergent properties of Equation 9 in terms of the EPI optimization, we have
and
(see Sections 'Maximum entropy distributions and exponential families', 'Augmented lagrangian optimization', and example in Section 'Example: 2D LDS').
Throughout optimization, the augmented lagrangian parameters η and $c$, were updated after each epoch of ${i}_{\text{max}}=2,000$ iterations (see Section 'Augmented lagrangian optimization'). The optimization converged after ten epochs (Figure 4—figure supplement 4).
For EPI in Figure 4C, we used a real NVP architecture with three coupling layers of affine transformations parameterized by twolayer neural networks of 50 units per layer. The initial distribution was a standard isotropic gaussian ${\mathbf{\mathbf{z}}}_{0}\sim \mathcal{N}(\mathrm{},I)$ mapped to a support of ${\mathbf{\mathbf{z}}}_{i}\in [5,5]$. We used an augmented lagrangian coefficient of ${c}_{0}={10}^{2}$, a batch size $n=100$, and $\beta =2$. The distribution was the greatest EPI distribution to converge across five random seeds with criteria ${N}_{\text{test}}=25$.
The bend in the EPI distribution is not a spurious result of the EPI optimization. The structure discovered by EPI matches the shape of the set of points returned from bruteforce random sampling (Figure 4—figure supplement 5A) These connectivities were sampled from a uniform distribution over the range of each connectivity parameter, and all parameters producing accuracy in each task within the range of 60% to 90% were kept. This set of connectivities will not match the distribution of EPI exactly, since it is not conditioned on the emergent property. For example, the parameter set returned by the bruteforce search is biased toward lower accuracies (Figure 4—figure supplement 5B).
Mode identification with EPI
Request a detailed protocolWe found one mode of the EPI distribution for fixed values of $sW$ from 1 to −1 in steps of 0.25. To begin, we chose an initial parameter value from 500 parameter samples $\mathbf{\mathbf{z}}\sim {q}_{\mathit{\bm{\theta}}}(\mathbf{\mathbf{z}}\mid \mathcal{X})$ that had closest $sW$ value to 1. We then optimized this estimate of the mode (for fixed $sW$) using probability gradients of the deep probability distribution for 500 steps of gradient ascent with a learning rate of $5\times {10}^{3}$. The next mode (at $sW=0.75$) was found using the previous mode as the initialization. This and all subsequent optimizations used 200 steps of gradient ascent with a learning rate of $1\times {10}^{3}$, except at $sW=1$ where a learning rate of $5\times {10}^{4}$ was used. During all mode identification optimizations, the learning rate was reduced by half (decay = 0.5) after every 100 iterations.
Sample grouping by mode
Request a detailed protocolFor the analyses in Figure 5C and Figure 5—figure supplement 1, we obtained parameters for each step along the continuum between regimes 1 and 2 by sampling from the EPI distribution. Each sample was assigned to the closest mode ${\mathbf{\mathbf{z}}}^{*}(sW)$. Sampling continued until 500 samples were assigned to each mode, which took 2.67 s (5.34 ms/samplepermode). It took 9.59 min to obtain just five samples for each mode with brute force sampling requiring accuracies between 60% and 90% in each task (115 s/samplepermode). This corresponds to a sampling speed increase of roughly 21,500 once the EPI distribution has been learned.
Sensitivity analysis
Request a detailed protocolAt each mode, we measure the sensitivity dimension (that of most negative eigenvalue in the Hessian of the EPI distribution) ${\mathbf{\mathbf{v}}}_{1}({\mathbf{\mathbf{z}}}^{*})$. To resolve sign degeneracy in eigenvectors, we chose ${\mathbf{\mathbf{v}}}_{1}({\mathbf{\mathbf{z}}}^{*})$ to have negative element in $hW$. This tells us what parameter combination rapid task switching is most sensitive to at this parameter choice in the regime.
Connectivity eigendecomposition and processing modes
Request a detailed protocolTo understand the connectivity mechanisms governing task accuracy, we took the eigendecomposition of the connectivity matrices $W=Q\mathrm{\Lambda}{Q}^{1}$, which results in the same eigenmodes ${\mathbf{\mathbf{q}}}_{i}$ for all $W$ parameterized by $\mathbf{\mathbf{z}}$ (Figure 4—figure supplement 3A). These eigenvectors are always the same, because the connectivity matrix is symmetric and the model also assumes symmetry across hemispheres, but the eigenvalues of connectivity (or degree of eigenmode amplification) change with $\mathbf{\mathbf{z}}$. These basis vectors have intuitive roles in processing for this task, and are accordingly named the all eigenmode  all neurons cofluctuate, side eigenmode  one side dominates the other, task eigenmode  the Pro or Antipopulations dominate the other, and diag mode  Pro and Antipopulations of opposite hemispheres dominate the opposite pair. Due to the parametric structure of the connectivity matrix, the parameters $\mathbf{\mathbf{z}}$ are a linear function of the eigenvalues $\mathit{\bm{\lambda}}={[{\lambda}_{\text{all}},{\lambda}_{\text{side}},{\lambda}_{\text{task}}{\lambda}_{\text{diag}}]}^{\top}$ associated with these eigenmodes.
We are interested in the effect of raising or lowering the amplification of each eigenmode in the connectivity matrix by perturbing individual eigenvalues λ. To test this, we calculate the unit vector of changes in the connectivity $\mathbf{\mathbf{z}}$ that result from a change in the associated eigenvalues
where
and for example ${\mathbf{\mathbf{e}}}_{\text{all}}={[1,0,0,0]}^{\top}$. So ${\mathbf{\mathbf{v}}}_{\text{a}}$ is the normalized column of A corresponding to eigenmode $a$. The parameter dimension ${\mathbf{\mathbf{v}}}_{a}$ ($a\in \{\text{all},\text{side},\text{task},\text{and diag}\}$) that increases the eigenvalue of connectivity ${\lambda}_{a}$ is $\mathbf{\mathbf{z}}$invariant (Equation 109) and ${\mathbf{\mathbf{v}}}_{a}\u27c2{\mathbf{\mathbf{v}}}_{b\ne a}$. By perturbing $\mathbf{\mathbf{z}}$ along ${\mathbf{\mathbf{v}}}_{a}$, we can examine how model function changes by directly modulating the connectivity amplification of specific eigenmodes, which have interpretable roles in processing in each task.
Modeling optogenetic silencing
Request a detailed protocolWe tested whether the inferred SC model connectivities could reproduce experimental effects of optogenetic inactivation in rats (Duan et al., 2021). During periods of simulated optogenetic inactivation, activity was decreased proportional to the optogenetic strength $\gamma \in [0,1]$
Delay period inactivation was from $0.8<t<1.2$.
Data availability
All software and scripts for emergent property inference and the analyses in this manuscript can be found on the Cunningham Lab github at this link: https://github.com/cunninghamlab/epi (copy archived at https://archive.softwareheritage.org/swh:1:rev:38febae7035ca921334a616b0f396b3767bf18d4).
References

Complex parameter landscape for a complex neuron modelPLOS Computational Biology 2:e94.https://doi.org/10.1371/journal.pcbi.0020094

Recurrent neural networks as versatile tools of neuroscience researchCurrent Opinion in Neurobiology 46:1–6.https://doi.org/10.1016/j.conb.2017.06.003

ConferenceDegenerate solution networks for theoretical neuroscienceComputational and Systems Neuroscience Meeting (COSYNE), Lisbon, Portugal.

ConferenceExamining models in theoretical neuroscience with degenerate solution networksBernstein Conference 2019 Germany.

Softwareepi, version swh:1:rev:38febae7035ca921334a616b0f396b3767bf18d4 https://archive.softwareheritage.org/swh:1:rev:38febae7035ca921334a616b0f396b3767bf18d4Software Heritage.

Coding with transient trajectories in recurrent neural networksPLOS Computational Biology 16:e1007655.https://doi.org/10.1371/journal.pcbi.1007655

ConferenceNeural ordinary differential equationsAdvances in Neural Information Processing Systems. pp. 6571–6583.

Bayesian solution uncertainty quantification for differential equationsBayesian Analysis 11:1239–1267.https://doi.org/10.1214/16BA1017

Stimulus onset quenches neural variability: a widespread cortical phenomenonNature Neuroscience 13:369–378.https://doi.org/10.1038/nn.2501

ConferenceDensity estimation using real nvpProceedings of the 5th International Conference on Learning Representations.

Collicular circuits for flexible sensorimotor routingNature Neuroscience 1:1–11.https://doi.org/10.1038/s4159302100865x

Structure in neural population recordings: an expected byproduct of simpler phenomena?Nature Neuroscience 20:1310–1318.https://doi.org/10.1038/nn.4617

Significance of conductances in HodgkinHuxley modelsJournal of Neurophysiology 70:2502–2518.https://doi.org/10.1152/jn.1993.70.6.2502

Riemann manifold Langevin and hamiltonian monte carlo methodsJournal of the Royal Statistical Society: Series B 73:123–214.https://doi.org/10.1111/j.14679868.2010.00765.x

Global structure, robustness, and modulation of neuronal modelsThe Journal of Neuroscience 21:5229–5238.https://doi.org/10.1523/JNEUROSCI.211405229.2001

ConferenceMarcel Nonnenmacher, and Jakob H Macke. automatic posterior transformation for likelihoodfree inferenceInternational Conference On Machine Learning.

Universally sloppy parameter sensitivities in systems biology modelsPLOS Computational Biology 3:e189.https://doi.org/10.1371/journal.pcbi.0030189

ConferenceLikelihoodfree mcmc with amortized approximate ratio estimatorsInternational Conference on Machine Learning PMLR. pp. 4239–4248.

Determination of parameter identifiability in nonlinear biophysical models: a bayesian approachJournal of General Physiology 143:401–416.https://doi.org/10.1085/jgp.201311116

An efficient method for structural identifiability analysis of large dynamic systems*IFAC Proceedings Volumes 45:941–946.https://doi.org/10.3182/201207113BE2027.00381

ConferenceImproved variational inference with inverse autoregressive flowAdvances in Neural Information Processing Systems. pp. 4743–4751.

ConferenceAdam: a method for stochastic optimizationInternational Conference On Learning Representations.

ConferenceGlow: generative flow with invertible 1x1 convolutionsAdvances in Neural Information Processing Systems. pp. 10215–10224.

pyABC: distributed, likelihoodfree inferenceBioinformatics 34:3591–3593.https://doi.org/10.1093/bioinformatics/bty361

Coupled oscillators and the design of central pattern generatorsMathematical Biosciences 90:87–109.https://doi.org/10.1016/00255564(88)900594

Inhibitory stabilization and visual coding in cortical circuits with multiple interneuron subtypesJournal of Neurophysiology 115:1399–1409.https://doi.org/10.1152/jn.00732.2015

ConferenceMaximum entropy flow networksInternational Conference On Learning Representations.

ConferenceFlexible statistical inference for mechanistic models of neural dynamicsAdvances in Neural Information Processing Systems. pp. 1289–1299.

BookSloppiness and the geometry of parameter spaceIn: Mannakee B. K, editors. Uncertainty in Biology. Springer. pp. 271–299.https://doi.org/10.1007/9783319212968_11

From biophysics to models of network functionAnnual Review of Neuroscience 21:25–45.https://doi.org/10.1146/annurev.neuro.21.1.25

Cellular, synaptic and network effects of neuromodulationNeural Networks 15:479–493.https://doi.org/10.1016/S08936080(02)000436

Interneurons of the neocortical inhibitory systemNature Reviews Neuroscience 5:793–807.https://doi.org/10.1038/nrn1519

Equation of state calculations by fast computing machinesThe Journal of Chemical Physics 21:1087–1092.https://doi.org/10.1063/1.1699114

ConferenceStatistical inference for analyzing sloppiness in neuroscience modelsBernstein Conference 2019 Germany.

Singletrial neural dynamics are dominated by richly varied movementsNature Neuroscience 22:1677–1686.https://doi.org/10.1038/s4159301905024

ConferenceRobust statistical inference for simulationbased models in neuroscienceBernstein Conference 2018 Germany.

Computational models in the age of large datasetsCurrent Opinion in Neurobiology 32:87–94.https://doi.org/10.1016/j.conb.2015.01.006

Using constraints on neuronal activity to reveal compensatory changes in neuronal parametersJournal of Neurophysiology 98:3749–3758.https://doi.org/10.1152/jn.00842.2007

Neural data science: accelerating the experimentanalysistheory cycle in largescale neuroscienceCurrent Opinion in Neurobiology 50:232–241.https://doi.org/10.1016/j.conb.2018.04.007

ConferenceMasked autoregressive flow for density estimationAdvances in Neural Information Processing Systems. pp. 2338–2347.

ConferenceSequential neural likelihood: fast likelihoodfree inference with autoregressive flowsThe 22nd International Conference on Artificial Intelligence and Statistics. pp. 837–848.

Similar network activity from disparate circuit parametersNature Neuroscience 7:1345–1352.https://doi.org/10.1038/nn1352

Delineating parameter unidentifiabilities in complex modelsPhysical Review E 95:032314.https://doi.org/10.1103/PhysRevE.95.032314

ConferenceVariational inference with normalizing flowsInternational Conference on Machine Learning.

Three groups of interneurons account for nearly 100% of neocortical GABAergic neuronsDevelopmental Neurobiology 71:45–61.https://doi.org/10.1002/dneu.20853

Integration of visual motion and locomotion in mouse visual cortexNature Neuroscience 16:1864–1869.https://doi.org/10.1038/nn.3567

BookA mean field learning algorithm for unsupervised neural networksIn: Jordan M. I, editors. Learning in Graphical Models. Springer. pp. 541–554.https://doi.org/10.1007/9789401150149_20

Maximum entropy models as a tool for building precise neural controlsCurrent Opinion in Neurobiology 46:120–126.https://doi.org/10.1016/j.conb.2017.08.001

Chaos in random neural networksPhysical Review Letters 61:259–262.https://doi.org/10.1103/PhysRevLett.61.259

Neural circuits as computational dynamical systemsCurrent Opinion in Neurobiology 25:156–163.https://doi.org/10.1016/j.conb.2014.01.008

ConferenceHierarchical implicit models and likelihoodfree variational inferenceAdvances in Neural Information Processing Systems. pp. 5523–5533.

Paradoxical effects of external modulation of inhibitory interneuronsThe Journal of Neuroscience 17:4382–4388.https://doi.org/10.1523/JNEUROSCI.171104382.1997

Graphical models, exponential families, and variational inferenceFoundations and Trends in Machine Learning 1:1–305.https://doi.org/10.1561/2200000001

Neurophysiological and computational principles of cortical rhythms in cognitionPhysiological Reviews 90:1195–1268.https://doi.org/10.1152/physrev.00035.2008

A recurrent network mechanism of time integration in perceptual decisionsJournal of Neuroscience 26:1314–1328.https://doi.org/10.1523/JNEUROSCI.373305.2006
Decision letter

John R HuguenardSenior Editor; Stanford University School of Medicine, United States

Timothy O'LearyReviewing Editor; University of Cambridge, United Kingdom

Mark S GoldmanReviewer; University of California at Davis, United States
In the interests of transparency, eLife publishes the most substantive revision requests and the accompanying author responses.
Acceptance summary:
Progress in understanding cellular and circuit mechanisms that underpin neural function depend on us being able to reconstitute key computations and high level functions from models that consist of simpler, biologically motivated building blocks. This requires choosing parameters in a model, yet even in the simplest models there tends to be a degenerate relationship between model parameters and its overall emergent function. Bittner and colleagues present a computationally tractable method for inferring parameter distributions that are consistent with an emergent property of interest and demonstrate its use in a variety of wellknown circuit models, and show that the method compares favourably with state of the art sampling and parameter identification methods.
Decision letter after peer review:
Thank you for submitting your article "Interrogating theoretical models of neural computation with deep inference" for consideration by eLife. Your article has been reviewed by 3 peer reviewers, and the evaluation has been overseen by a Reviewing Editor and John Huguenard as the Senior Editor. The following individual involved in review of your submission has agreed to reveal their identity: Mark S. Goldman (Reviewer #3).
The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission.
The editors have judged that your manuscript is of potential interest, but as described below significant additional work is required before it can be considered for publication. Adequate revisions may require extensive work and we ordinarily avoid passing a manuscript to a second round of review on this basis. However, we have made changes in our revision policy in response to COVID19 (https://elifesciences.org/articles/57162). First, because many researchers have temporarily lost access to the labs, we will give authors as much time as they need to submit revised manuscripts. We are also offering, if you choose, to post the manuscript to bioRxiv (if it is not already there) along with this decision letter and a formal designation that the manuscript is "in revision at eLife". Please let us know if you would like to pursue this option. (If your work is more suitable for medRxiv, you will need to post the preprint yourself, as the mechanisms for us to do so are still in development.)
Summary:
Bittner and colleagues introduce a machine learning framework for maximum entropy inference of model parameter distributions that are consistent with certain emergent model properties, specified by the investigator. The approach is illustrated on several models of potential interest.
Reviewers were broadly enthusiastic about the potential usefulness of this methodology. However, the reviews and ensuing discussion revealed several points of concern about the manuscript and the approach. The full reviewer comments are included below. The main concerns are summarized as follows.
Essential revisions:
1. The methodology is not adequately explained. Both the body text and methods section present a somewhat selective description that is very hard to follow in places and should be checked and rewritten for clarity, completeness, notational consistency and correctness.
2. The computational resources required to use this method are not adequately benchmarked. For example, the cosubmission (Macke et al) reported wall clock time, required hardware and iterations required to produce results by directly comparing to existing methods (approximate bayesian computation, naive sampling, etc.) Without transparent benchmarks it is not possible to assess the advance offered by this method.
3. The extent to which this method is generally/straightforwardly applicable was in doubt. It seemed as though a significant amount of computation was required to do inference on one specified property and that the computation would need to be run afresh to query a new property. The methodology in the cosubmission (Macke) made clear that computation required for successive inferences is 'amortized' during training on random parameters. Moreover, EPI seemed less flexible than the cosubmission's approach in that it required a differentiable loss function. The complementarity and advantages of this approach as opposed to the cosubmission's approach are therefore unclear.
4. Some examples lack depth in their treatment (see reviewer comments) and in some cases the presentation is somewhat misleading. The STG example is not in fact a model of the STG. The cosubmission (Macke) uses a model close to the original Prinz et al. model, which is a model of the pyloric subnetwork. It would be instructive to benchmark against this same model, including computation time/resources required. Secondly, the subsequent example (inputresponsivity in a nonlinear sensory system) appeared to imply that EPI permits 'generation' and testing of hypotheses in a way that other methods do not. All the method really does is estimate a joint distribution of feasible parameters in a specific model which is manually inspected to propose hypotheses. Any other method (including brute force sampling) could be used in a similar way, so any claim that this is an integral advantage of EPI would be spurious. Indeed, one reviewer was confused about the origin of these hypotheses. While it is helpful to illustrate how EPI (and other methods) could be used, the writing needs to be far clearer in general and should clarify that EPI does not offer any new specific means of generating or evaluating hypotheses.
5. There is a substantial literature on parameter sensitivity analysis and inference in systems biology, applied dynamical systems and control that has been neglected in this manuscript. The manuscript needs to acknowledge, draw parallels and explain distinctions from current methods (ABC, profile likelihood, deep learning approaches, gaussian processes, etc). The underreferencing of this literature deepened concerns about whether this approach represented an advance. DOIs for a small subset of potentially relevant papers include:
https://doi.org/10.1038/nprot.2014.025
http://doi.org/10.1085/jgp.201311116
http://doi.org/10.1016/j.tcs.2008.07.005
http://doi.org/10.3182/201207113BE2027.00381
http://doi.org/10.1093/bioinformatics/btm382
http://doi.org/10.1111/j.14679868.2010.00765.x
http://doi.org/10.1098/rsfs.2011.0051
https://doi.org/10.1098/rsfs.2011.0047
http://doi.org/10.1214/16BA1017
https://doi.org/10.1039/C0MB00107D
6. One of the reviewers expressed concern that the work might have had significant input from a senior colleague during its early stages, and that it might be worth discussing with the senior colleague whether their contribution was worthy of authorship. The authors may take this concern into account in revising the manuscript.
7. Finally, please also address specific points raised by the reviewers, included below.
Reviewer #1:
General Assessment: The authors introduce a machine learning framework for maximum entropy inference of model parameters which are consistent with certain emergent properties, which they call 'emergent property inference'. I think this is an interesting and direction, and this looks like a useful step towards this program. I think the paper could be improved with a more thorough discussion both of the broad principles their black box approach seeks to optimize, as well as the details of its implementation. I also think the detailed examples should be more selfcontained. Finally I find this work to be somewhat misrepresented as a key to all of theoretical neuroscience. This approach may have some things to offer to the interesting problem of finding parameter regions of models, but this is not the entirety of, nor really a major part of theoretical neuroscience as I see it.
Other concerns:
(1) Maximizing the entropy of the distribution is not a reparameterization invariant exercise. That is, results depend on whether the model parameters contains rates or time constants, for example. I wonder if this approach is attempting to use a 'flat prior' in some sense which has the same reparameterization issue? Can the authors comment?
(2) I don't think this is a criticism of the work, but instead of the writing about it: I find the introductory paragraphs to give a rather limited overview of theory as finding parameters of models which contain the right phenomenology.
(3) I am somewhat familiar with the stomatognathic circuit model, and so that is where I think I understand what they have done best. I don't understand what I should take away from their paper with regards to this model. Are there any findings that hadn't been appreciated before? What does this method tell us about the system and or its model?
(4) I don't follow the other examples. Ideally more details should be given so that readers like myself who don't already know these systems can understand what's been done.
(5) In figure 2C, the difference between the confidence interval between linear and nonlinear predictions is huge! How much of this is due to nonlinearities, and how much is due to differences in the way these models are being evaluated?
Reviewer #2:
This is a very interesting approach to an extremely important question in theoretical neuroscience, and the mathematics and algorithms appear to be very rigorous. The complexities in using this in practice make me wonder if it will find wide application though: setting up the objective to be differentiable, tweaking hyperparameters for training, and interpreting the results; all seem to require a lot from the user. On the other hand, the authors are to be congratulated on providing high quality open source code including clear tutorials on how to use it.
1. Training deep networks is hard. Indeed the authors devote a substantial amount of the manuscript to techniques for training them, and note that different hyperparameters were necessary for each of the different studies. Can the authors be confident that they have found the network which gives maximum entropy or close to it? If so, how. If not, how does that affect the conclusions?
2. Interpreting the results still seems to require quite a lot of work. For example, from inspecting Figure 2 the authors extract four hypotheses. Why these four? Are there other hypotheses that could be extracted and if not how do we know there aren't? Could something systematic be said here?
3. Scalability. The authors state that the method should in principle be scalable, but does that apply to interpreting the results? For example, for the V1 model it seems that you need to look at 48 figures for 4 variables, and I believe this would scale as O(n^{2}) with n variables. This seems to require an unsustainable amount of manual work?
4. There are some very particular choices made in the applications and I wonder how general the conclusions are as a consequence. For example, in Equation 5 the authors choose an arbitrary amount of variance 0.01^{2} – why? In the same example, why look at y=0.1 and 0.5?
Reviewer #3:
This paper addresses a major issue in fitting neuroscience models: how to identify the often degenerate, or nearly degenerate, set of parameters that can underlie a set of experimental observations. Whereas previous techniques often depended upon brute force explorations or special parametric forms or local linearizations to generate sets of parameters consistent with measured properties, the authors take advantage of deep generative networks to address this problem. Overall, I think this paper and the complementary submission have the potential to be transformative contributions to model fitting efforts in neuroscience. That being said, since the primary contribution is the methodology, I think the paper requires more systematic comparisons to ground truth examples to demonstrate potential strengths and weaknesses, and more focus on methodology rather than applications.
1) The authors only have a single groundtruth example where they compare to a known result (a 2x2 linear dynamical system). It would be good to show how well this method compares to results from, for example, a direct brute force grid search of a system with a strongly nonelliptical (e.g. sharply bent) shaped parameter regime and a reasonably large (e.g. 5?) number of parameters corresponding to a particular property, to see how well the derived probability distribution overlaps the brute force grid search parameters (perhaps shown via several 2D projections).
2) It was not obvious whether EPI actually scales well to higher dimensions and how much computation it would take (there is one claim that it 'should scale reasonably'). While I agree that examples with a small number of parameters is nice for illustration, a major issue is how to develop techniques that can handle large numbers of parameters (brute force, while inelegant, inefficient, and not producing an explicit probability distribution can do a reasonable job for small #'s of parameters). The authors should show some example of extending to larger number of parameters and do some checks to show that it appears to work. As a methodological contribution, the authors should also give some sense of how computationally intensive the method is and some sense of how it scales with size. This seems particularly relevant to, for example, trying to infer uncertainties in a large weight matrix or a nonparametric description of spatial or temporal responses or a sensory neuron (which I'm assuming this technique is not appropriate for? See point #4 below).
3) For the STGlike example, this was done for a very simple model that was motivated by the STG but isn't based on experimental recordings. Most of the brute force models of the STG seek to fit various waveform properties of neurons and relative phases. Could the model handle these types of analyses, or would it run into problems due to either needing to specify too many properties or because properties like "number of spikes per burst" are discrete rather than continuous? This isn't fatal, but would be good to consider and/or note explicitly.
4) The discussion should be expanded to be more specific about what problems the authors think the model is, or is not, appropriate for. Comparisons to the Goncalves article would also be helpful since users will want to know the comparative advantages/disadvantages of each method. (if the authors could coordinate running their methods on a common illustrative example, that would be cool, but not required).
5) Given that the paper is heavily a (very valuable!) methods paper for a general audience, the method should be better explained both in the main text and the supplement. Some specific ones are below, but the authors should more generally send the paper to naïve readers to check what is/is not well explained.
– Figure 1 is somewhat opaque and also has notational issues (e.g. omega is the frequency but also appears to be the random input sample).
– For the general audience of eLife, panels C and D are not well described individually or well connected to each other and don't illustrate or describe all of the relevant variables (including what q0 is and what x is).
– In Equation 2 (and also in the same equation in the supplement), it was not immediately obvious what the expectation was taken over.
– The authors don't specific the distribution of w (it's referred to only as 'a simple random variable', which is not clear).
– It was also sometimes hard to quickly find in the text basic, important quantities like what z was for a given simulation.
– The augmented Lagrangian optimization was not well explained or motivated. There is a reference to m=absolute value(mu) but I didn't see m in the above equation.
– Using mu to describe a vector that includes means and variances is confusing notation since mu often denotes means.
– It would be helpful to have a pseudocode 'Algorithm' figure or section of the text.
https://doi.org/10.7554/eLife.56265.sa1Author response
Essential revisions:
1. The methodology is not adequately explained. Both the body text and methods section present a somewhat selective description that is very hard to follow in places and should be checked and rewritten for clarity, completeness, notational consistency and correctness.
Agreed. In the introduction, we present our method by focusing on the key aspect of EPI that differentiates it from other approaches to inverse problems. Specifically, we explain how current methodology infers the parameters producing computation by conditioning on exemplar datasets (real or simulated), whereas in EPI we condition directly on the emergent properties that define the computation. We also note (and have an appendix detailing) how EPI is in fact doing variational bayesian inference, but that the parameterization of the problem as we solve it is more natural given the goal of reproducing a computation. In other words, the methodology is mathematically sound and directly connected to wellworn techniques, but the specific form we solve is the appropriate choice for the motivating problem.
Lines 4452
“Statistical inference, of course, requires quantification of the sometimes vague term computation. […] Related to this point, use of a conventional dataset encourages conventional data likelihoods or loss functions, which focus on some global metric like squared error or marginal evidence, rather than the computation itself.”
We now frame the method within the more general context of parameter inference techniques in neuroscience, rather than the context of recent advancements in machine learning. We have made concentrated efforts to simplify language and to relate to all relevant existing methodology.
Lines 5363
“Alternatively, researchers often quantify an emergent property (EP): a statistic of data that directly quantifies the computation of interest, wherein the dataset is implicit. […] This statistical framework is not new: it is intimately connected to the literature on approximate bayesian computation [24, 25, 26], parameter sensitivity analyses [27, 28, 29, 30], maximum entropy modeling [31, 32, 33], and approximate bayesian inference [34, 35]; we detail these connections in Section 5.1.1.”
To improve clarity, we have overhauled and simplified the notation and presentation of emergent properties. Emergent properties are now denoted with X: rather than a vector of first and secondmoment constraints, we present the emergent property more readably (and equivalently) as mean and variance constraints on emergent property statistics:
Line 742
“$X\phantom{\rule{0.222em}{0ex}}:\phantom{\rule{0.222em}{0ex}}{E}_{z,x}[f(x;z)]=\phantom{\rule{0.222em}{0ex}}\mu ,\phantom{\rule{0.222em}{0ex}}{\mathrm{\text{Var}}}_{z,x}[f(x;z)]=\phantom{\rule{0.222em}{0ex}}{\sigma}^{2}$ (12)”
Emergent properties and EPI are introduced and explained with this revised notation in Section 'Emergent property inference via deep generative models', and we have completely redone the Methods Section 'Emergent property inference (EPI)' to improve clarity, precisely explain EPI optimization (Section 'Augmented lagrangian optimization'), and derive equivalence to established bayesian inference techniques (Section 'EPI as variational inference').
In our revised writing, we deemphasize the role of maximum entropy in the EPI algorithm, because this has largely served as a distraction in our experience. We show in Section 5.1.6 that EPI is exactly variational inference (this was mentioned in the previous draft but not as a point of connection and focus, as it should be and now is). Therefore, it makes sense to present EPI in the main text as a statistical inference technique that constrains the predictions of the inferred parameters to be an emergent property, and leave the details of the maximum entropy to the technically proficient in Section 5.1.
Lines 152154
“Many distributions in Q will respect the emergent property constraints, so we select the most random (highest entropy) distribution, which also means this approach is equivalent to bayesian variational inference (see Section 5.1.6).”
Figure 1 has been largely redone to reflect the modified presentation, and improve the pictorial representation of the method. In Section 3.3, we now show that EPI is the only simulationbased inference method that controls the predictions of its inferred distribution (Figure 2CD).
Finally, we emphasize the utility of this deep inference technique for scientific inquiry in a new paragraph at the end of Section 3.2.
Lines 159175
“The structure of the inferred parameter distributions of EPI can be analyzed to reveal key information about how the circuit model produces the emergent property.
[…] Importantly and unlike alternative techniques, once an EPI distribution has been learned, the modes and Hessians of the distribution can be measured with trivial computation (see Section 5.1.2).”
2. The computational resources required to use this method are not adequately benchmarked. For example, the cosubmission (Macke et al) reported wall clock time, required hardware and iterations required to produce results by directly comparing to existing methods (approximate bayesian computation, naive sampling, etc.) Without transparent benchmarks it is not possible to assess the advance offered by this method.
We thank the reviewers for emphasizing the importance of this methodological comparison. In Section 3.3, we provide a direct comparison of EPI to alternative simulationbased inference techniques SMCABC and SNPE by inferring RNN connectivities that exhibit stable amplification. These comparisons evaluate both wall time (Figure 2A) and simulation count (Figure 2B), and we explain how each algorithm was run in its preferred hardware setting in Section 5.3.4.
In this analysis, we demonstrate the improved scalability of deep inference techniques (EPI and SNPE) with respect to the stateoftheart approximate bayesian computation technique (SMCABC). While controlling for architecture size (Figure 2—figure supplement 1), we push the limits of SNPE through targeted hyperparameter modifications (Figure 2—figure supplement 2), and show that EPI scales to higher dimensional RNN connectivities producing stable amplification than SNPE. Furthermore, we emphasize that SNPE does not constrain the properties of the inferred parameters; many connectivities inferred by SNPE result in unstable or nonamplified models (Figure 2C, Figure 2—figure supplement 3).
3. The extent to which this method is generally/straightforwardly applicable was in doubt. It seemed as though a significant amount of computation was required to do inference on one specified property and that the computation would need to be run afresh to query a new property. The methodology in the cosubmission (Macke) made clear that computation required for successive inferences is 'amortized' during training on random parameters. Moreover, EPI seemed less flexible than the cosubmission's approach in that it required a differentiable loss function. The complementarity and advantages of this approach as opposed to the cosubmission's approach are therefore unclear.
We respectfully disagree with this characterization, but we take responsibility here: we certainly needed to improve our exposition and comparisons to clarify the general applicability of EPI. We have done so, and what results is a meaningful comparison with the cosubmission (SNPE, Macke et al., now published) and other methods. Of course, there is a tradeoff (no free lunch, as usual), and these new analyses clarify when EPI is preferable to SNPE, and vice versa.
First, unlike SNPE, EPI leverages gradients of the emergent property throughout optimization, which leads to better efficiency and scalability (Section 3.3). This is the usual tradeoff of gradientbased vs samplingbased methods. The emergent properties of many models in neuroscience are tractably differentiable, four of which we analyze in this manuscript ranging across levels of biological realism, network size, and computational function. This tradeoff is now explained in the Discussion (Section 4).
Second, of course the reviewers are right to point out EPI does not amortize across emergent properties, and it requires differentiability of the emergent properties of the model. SNPE is more suitable for inference with nondifferentiable mechanistic models and scientific problems requiring many inferred parameter distributions. However, these relative drawbacks of EPI with respect to SNPE can be considered choices made in a tradeoff between simulationbased inference approaches.
On the balance, these two methods occupy different areas of use, and both are equivalently “generally/straightforwardly applicable”; we believe the new analyses and discussion in the manuscript clarify that, as it is a key point for practitioners going forward.
Lines 435447
“A key difference between EPI and SNPE, is that EPI uses gradients of the emergent property throughout optimization. […] In summary, choice of deep inference technique should consider emergent property complexity and differentiability, dimensionality of parameter space, and the importance of constraining the model behavior predicted by the inferred parameter distribution.”
Furthermore, EPI focuses the entire expressivity of the approximating deep probability distribution on a single distribution, rather than spreading this expressivity to some uncharacterized degree across the chosen training distribution of amortized posteriors in SNPE (Section 5.1.1 Related Approaches).
Lines 832838
“The approximating fidelity of the deep probability distribution in sequential approaches is optimized to generalize across the training distribution of the conditioning variable. […] The wellknown inverse mapping problem of exponential families [85] prohibits an amortizationbased approach in EPI, since EPI learns an exponential family distribution parameterized by its mean (in contrast to its natural parameter, see Section 5.1.3).”
Finally, we emphasize that EPI does something fundamental that SNPE and other inference techniques cannot. EPI learns parameter distributions whose predictions are constrained to produce the emergent property. We show in Figure 2 and Supplementary Figure 2—figure supplement 3 that SNPE does not control the statistical properties of its predictions, resulting in the inference of many parameters that are not consistent with the desired emergent property.
Lines 228238
“No matter the number of neurons, EPI always produces connectivity distributions with mean and variance of real(λ_{1}) and ${\lambda}_{1}^{s}$ according to X (Figure 2C, blue). […] Even for moderate neuron counts, the predictions of the inferred distribution of SNPE are highly dependent on N and g, while EPI maintains the emergent property across choices of RNN (see Section 5.3.5).”
4. Some examples lack depth in their treatment (see reviewer comments) and in some cases the presentation is somewhat misleading. The STG example is not in fact a model of the STG. The cosubmission (Macke) uses a model close to the original Prinz et al. model, which is a model of the pyloric subnetwork. It would be instructive to benchmark against this same model, including computation time/resources required. Secondly, the subsequent example (inputresponsivity in a nonlinear sensory system) appeared to imply that EPI permits 'generation' and testing of hypotheses in a way that other methods do not. All the method really does is estimate a joint distribution of feasible parameters in a specific model which is manually inspected to propose hypotheses. Any other method (including brute force sampling) could be used in a similar way, so any claim that this is an integral advantage of EPI would be spurious. Indeed, one reviewer was confused about the origin of these hypotheses. While it is helpful to illustrate how EPI (and other methods) could be used, the writing needs to be far clearer in general and should clarify that EPI does not offer any new specific means of generating or evaluating hypotheses.
We thank the reviewers for explaining how they found some of the presentation misleading. We have taken serious care in this manuscript to clarify (a) what is novel, appreciable scientific insight provided by EPI, as well as (b) which scientific analyses are made possible by EPI.
(a) In the revised manuscript we clarify that novel theoretical insights are not being made into the STG subcircuit model or the recurrent neural network models. The STG subcircuit serves as a motivational example to explain how EPI works, and we use RNNs exhibiting stable amplification as a substrate for scalability analyses.
Lines 7579
“First, we show EPI’s ability to handle biologically realistic circuit models using a fiveneuron model of the stomatogastric ganglion [41]: a neural circuit whose parametric degeneracy is closely studied [42]. Then, we show EPI’s scalability to high dimensional parameter distributions by inferring connectivities of recurrent neural networks that exhibit stable, yet amplified responses – a hallmark of neural responses throughout the brain [43, 44, 45].”
We do produce strong theoretical insights into a model of primary visual cortex (Section 3.4) and superior colliculus (Section 3.5). These analyses have substantially more depth than the previous manuscript.
Lines 7987
“In a model of primary visual cortex [46, 47], EPI reveals how the recurrent processing across different neurontype populations shapes excitatory variability: a finding that we show is analytically intractable. […] Intriguingly, the inferred connectivities of each regime reproduced results from optogenetic inactivation experiments in markedly different ways.”
(b) The ability to infer a flexible approximation to a probability distribution constrained to produce an emergent property is novel in its own right (Figure 2). The deep probability distribution fit by EPI facilitates the mode identification (via gradient ascent of the parameter log probability) and sensitivity measurement (via the measurement of the eigenvector of the Hessian at a parameter value). These mode identifications and sensitivity measurements are done in Sections 3.1 (Figure 1E), 3.4 (Figure 3E), and 3.5 (Figure 4C). By using this mode identification technique along the ridges of high parameter probability in the SC model, we identify the parameters transitioning between the two regimes. Finally, the sensitivity dimensions of the SC model identified by EPI facilitated regime characterization through perturbation analyses (Figure 4D).
Importantly, we do not claim that these theoretical insights were necessarily dependent on using the techniques in (b). One could have come to these conclusions via various combinations of techniques mentioned in Section 5.1.1 Related Methods. In the case of the V1 model inference, the main point is to indicate that such insight can be afforded by EPI and its related methods, in contrast to the analytic derivations emblematic of practice in theoretical neuroscience.
Lines 297305
“EPI revealed the quadratic dependence of excitatory variability on input variability to the E and Ppopulations, as well as its independence to input from the other two inhibitory populations. […] By pointing out this mathematical complexity, we emphasize the value of EPI for gaining understanding about theoretical models when mathematical analysis becomes onerous or impractical.”
In the case of the SC model inference, random sampling would have taken prohibitively long, and it is unclear how the continuum between the two connectivity regimes would have been identified with alternative techniques:
Lines 412415
“The probabilistic tools afforded by EPI yielded this insight: we identified two regimes and the continuum of connectivities between them by taking gradients of parameter probabilities in the EPI distribution, we identified sensitivity dimensions by measuring the Hessian of the EPI distribution, and we obtained many parameter samples at each step along the continuum at an efficient rate.”
As the reviewers indicate, the STG model analyzed in our manuscript is not that of Prinz et al. 2004, and thus not the model analyzed by the cosubmission (Macke et al). We chose an alternative model, believing it important to demonstrate inference in biophysically realistic MorrisLecar models when gradients are tractable, which is the case of the 5neuron STG model we analyzed from Gutierrez et al. 2013. This 5neuron model represents the IC neuron (hub) and its coupling to the pyloric (fast) or gastric mill (slow) subcircuit rhythms. In the introductory text, we refer to this model as the “STG subcircuit” model (rather than “STG model”), and we better clarify what aspect of the STG is being modeled in Results section 3.1.
Lines 98103
“A subcircuit model of the STG [41] is shown schematically in Figure 1A. The fast population (f1 and f2) represents the subnetwork generating the pyloric rhythm and the slow population (s1 and s2) represents the subnetwork of the gastric mill rhythm. […] The hub neuron couples with either the fast or slow population, or both depending on modulatory conditions.”
The difference between this model and the STG model of the pyloric subnetwork is emphasized in Discussion:
Lines 440443
“However, conditioning on the pyloric rhythm [68] in a model of the pyloric subnetwork model [15] proved to be prohibitive with EPI. The pyloric subnetwork requires many time steps for simulation and many key emergent property statistics (e.g. burst duration and phase gap) are not calculable or easily approximated with differentiable functions.”
5. There is a substantial literature on parameter sensitivity analysis and inference in systems biology, applied dynamical systems and control that has been neglected in this manuscript. The manuscript needs to acknowledge, draw parallels and explain distinctions from current methods (ABC, profile likelihood, deep learning approaches, gaussian processes, etc). The underreferencing of this literature deepened concerns about whether this approach represented an advance. DOIs for a small subset of potentially relevant papers include:
https://doi.org/10.1038/nprot.2014.025
http://doi.org/10.1085/jgp.201311116
http://doi.org/10.1016/j.tcs.2008.07.005
http://doi.org/10.3182/201207113BE2027.00381
http://doi.org/10.1093/bioinformatics/btm382
http://doi.org/10.1111/j.14679868.2010.00765.x
http://doi.org/10.1098/rsfs.2011.0051
https://doi.org/10.1098/rsfs.2011.0047
http://doi.org/10.1214/16BA1017
https://doi.org/10.1039/C0MB00107D
Thank you for pointing us to these references on sensitivity analyses, MCMC inference, and applied dynamical systems. We have incorporated most of them into the current manuscript and explain EPI’s relation to each class of these techniques throughout Section 5.1.1 Related approaches.
6. One of the reviewers expressed concern that the work might have had significant input from a senior colleague during its early stages, and that it might be worth discussing with the senior colleague whether their contribution was worthy of authorship. The authors may take this concern into account in revising the manuscript.
Thank you for pointing this out. We take authorship and scientific contribution very seriously (it is always part of our manuscript submission process, as it was here). We have gone back and discussed every researcher who was in any way involved in this work, and to our best estimation, we think this comment references Woods Hole Course Project mentors where this work was discussed and explored in its early stages. We have had explicit followup conversations with James Fitzgerald and Dhruva Raman and noted that of course we would be happy to have them involved at any level (great colleagues!). They both reported they are happy with their acknowledgement in the paper and don’t feel that there is a justification for authorship (again we were very welcoming of this possibility and maintain positive enthusiasm for both). Furthermore, Stephen Baccus has requested that we acknowledge the summer course, which we have done. If there is any other “senior colleague” the reviewer had in mind, please let us know and we will of course gladly pursue the matter.
Reviewer #1:
General Assessment: The authors introduce a machine learning framework for maximum entropy inference of model parameters which are consistent with certain emergent properties, which they call 'emergent property inference'. I think this is an interesting and direction, and this looks like a useful step towards this program. I think the paper could be improved with a more thorough discussion both of the broad principles their black box approach seeks to optimize, as well as the details of its implementation. I also think the detailed examples should be more selfcontained. Finally I find this work to be somewhat misrepresented as a key to all of theoretical neuroscience. This approach may have some things to offer to the interesting problem of finding parameter regions of models, but this is not the entirety of, nor really a major part of theoretical neuroscience as I see it.
We thank the reviewer for their positive comments and thoughtful feedback. We have made serious effort to improve our presentation and explanation of EPI (see response to main concern #1). Furthermore, we have focused on clearly motivating and describing each neural circuit model studied in this manuscript. Sufficient mathematical detail is written in each Results section, while full details are presented in Methods. All code for training EPI on these models and their analysis are available in welldocumented scripts and notebooks in our github repository.
https://github.com/cunninghamlab/epi
We modify our writing to clarify that EPI is not a key to all of theoretical neuroscience, but rather a powerful solution to inverse problems in neural circuit modeling. Inverse problems are indeed a major part of theoretical neuroscience, and their solutions are important for the evaluation of neural circuit models. We clarify this point in the introduction:
Lines 2635
“The fundamental practice of theoretical neuroscience is to use a mathematical model to understand neural computation, whether that computation enables perception, action, or some intermediate processing. […] The inverse problem is crucial for reasoning about likely parameter values, uniquenesses and degeneracies, and predictions made by the model [6, 7, 8].”
Other concerns:
(1) Maximizing the entropy of the distribution is not a reparameterization invariant exercise. That is, results depend on whether the model parameters contains rates or time constants, for example. I wonder if this approach is attempting to use a 'flat prior' in some sense which has the same reparameterization issue? Can the authors comment?
The reviewer is correct to point out that maximum entropy solutions are not reparameterization invariant, and indeed the units matter. The reviewer’s suggestion that the method is in some sense using a flat prior is also correct. To clarify, EPI does not execute posterior inference, because there is no explicit dataset or specified prior belief in the EPI framework. However, we derive the relation of EPI to bayesian variational inference in Section 5.1.6, which shows EPI uses a uniform prior when framed as variational inference.
Lines 10271031
“We see that EPI is exactly variational inference with an exponential family likelihood defined by sufficient statistics T(z^{) =} Ex_{∼p(xz)} [φ(x;z)], and where the natural parameter η^{∗} is predicated by the mean parameter µ_{opt}. Equation 40 implies that EPI uses an improper (or uniform) prior, which is easily changed.”
In our examples, we only infer distributions of parameters with the same units, so this issue should not draw concern over the validity of our model analyses. As suggested, the EPI solution will differ according to relative scaling of parameter values under the maximum entropy selection principle. Thus, an important clarification is that sensitivity quantifications are made in the context of the chosen parameter scalings.
(2) I don't think this is a criticism of the work, but instead of the writing about it: I find the introductory paragraphs to give a rather limited overview of theory as finding parameters of models which contain the right phenomenology.
We appreciate this feedback. We have adapted the introduction to clarify that we are focusing on solving inverse problems in theoretical neuroscience.
(3) I am somewhat familiar with the stomatognathic circuit model, and so that is where I think I understand what they have done best. I don't understand what I should take away from their paper with regards to this model. Are there any findings that hadn't been appreciated before? What does this method tell us about the system and or its model?
We clarify in point 4 above that we are not claiming to produce novel, appreciable scientific insight about the STG subcircuit model, which is used as a motivation example. The takeaway is that the conductance parameters producing intermediate hub frequency belong to a complex 2D distribution, which EPI captures accurately, and that EPI can tell us the parameter changes away from the prototypical configuration that change hub frequency the most or least. For example, for increases in g_{el} and g_{synA} according to the proportions of the degenerate dimension of parameter space, intermediate hub neuron frequency will be preserved in this model. This suggests that in the real STG neural circuit, the IC neuron will remain at an intermediate frequency between the pyloric and gastric mill rhythms if parameter changes are made along such a dimension.
Lines 171 173
“The directionality of v_{2} suggests that changes in conductance along this parameter combination will most preserve hub neuron firing between the intrinsic rates of the pyloric and gastric mill rhythms.”
(4) I don't follow the other examples. Ideally more details should be given so that readers like myself who don't already know these systems can understand what's been done.
Thank you for the feedback. We have taken care to give more general context and motivation for each neural circuit model.
(5) In figure 2C, the difference between the confidence interval between linear and nonlinear predictions is huge! How much of this is due to nonlinearities, and how much is due to differences in the way these models are being evaluated?
In the current manuscript, we do not examine the difference between linear and nonlinear predictions of the V1 model.
Reviewer #2:
This is a very interesting approach to an extremely important question in theoretical neuroscience, and the mathematics and algorithms appear to be very rigorous. The complexities in using this in practice make me wonder if it will find wide application though: setting up the objective to be differentiable, tweaking hyperparameters for training, and interpreting the results; all seem to require a lot from the user. On the other hand, the authors are to be congratulated on providing high quality open source code including clear tutorials on how to use it.
1. Training deep networks is hard. Indeed the authors devote a substantial amount of the manuscript to techniques for training them, and note that different hyperparameters were necessary for each of the different studies. Can the authors be confident that they have found the network which gives maximum entropy or close to it? If so, how. If not, how does that affect the conclusions?
We agree with the reviewer that hyperparameter sensitivity and global optima are important considerations of any optimization algorithm using deep neural networks. To draw a parallel, training deep networks for visual processing used to be considered infeasible, but became easier through iterative improvements in architectural and hyperparameter choices that spread across the field. Similarly, training deep networks via EPI to learn parameter distributions became easier throughout this research project as we learned through trial and error what works well. In fact, there has been extraordinary progress in the field of deep probability distributions (specifically normalizing flows), that have allowed EPI to converge regularly while capturing complex structure (e.g. Dinh et al. 2017 and Kingma et al. 2018). This manuscript’s explanation of hyperparameter choices and the extensive set of examples in the online code will serve as valuable guidelines for future research using this method. Every figure of this paper is reproducible with the jupyter notebooks, and there are several tutorials for understanding the most consequential hyperparameters: the augmented lagrangian constant and deep probability distribution architecture.
In general, we cannot know if we have obtained the global maximum entropy distribution for a given emergent property. The reviewer is correct to point out that multiple distributions may satisfy the emergent property and have different levels of entropy. In the new manuscript, we present the method as an inference technique without focusing very greatly on maximum entropy, since it tends to distract and confuse the reader. We derive an analytic equivalence to variational inference (Section 5.1.6) showing that (a) EPI is a valid inference method, and (b) to emphasize that maximum entropy is the normative selection principle of bayesian inference methods in general. Thus, the concern of not having the globally optimal inferred distribution is the same that applies to all other statistical inference techniques.
Practically, this has scientific implications. It means that we may be missing important structure in the inferred distribution, or we may be missing additional modes in parameter space. To handle this methodologically, we run EPI with multiple random seeds, and select the distribution that has converged with the greatest entropy for scientific analysis. Throughout the manuscript, we compare to analytic, error contour, and bruteforce ground truth to ensure we are capturing the correct distribution with EPI (see response to R3 concern 1).
2. Interpreting the results still seems to require quite a lot of work. For example, from inspecting Figure 2 the authors extract four hypotheses. Why these four? Are there other hypotheses that could be extracted and if not how do we know there aren't? Could something systematic be said here?
This analysis is no longer in the manuscript.
3. Scalability. The authors state that the method should in principle be scalable, but does that apply to interpreting the results? For example, for the V1 model it seems that you need to look at 48 figures for 4 variables, and I believe this would scale as O(n^{2}) with n variables. This seems to require an unsustainable amount of manual work?
We refer the reviewer to Figure 2 and Section 3.3 for scalability analysis. The scaling analysis addresses the question of the issue of parameter discovery with EPI in highdimensional parameter spaces.
Another important question the reviewer brings up is how well one can analyze the highdimensional parameter distributions that EPI produces? Indeed, these distributions become more challenging to understand and visualize in high dimensions. This is where the sensitivity measurements appearing in sections 3.1, 3.4, and 3.5 can be particularly useful. Even in high dimensions, trained deep probability distributions offer tractable quantitative assessments of how parametric combinations affect the emergent property that was conditioned upon.
4. There are some very particular choices made in the applications and I wonder how general the conclusions are as a consequence. For example, in Equation 5 the authors choose an arbitrary amount of variance 0.01^{2} – why? In the same example, why look at y=0.1 and 0.5?
In the current manuscript, we make sure to explain all choices of the emergent property constraints Here, we show the description of each emergent property with equations omitted.
Section 3.2, Lines 136140
“We stipulate that the hub neuron’s spiking frequency – denoted by statistic ω_{hub}(x)
– is close to a frequency of 0.55Hz, between that of the slow and fast frequencies. Mathematically, we define this emergent property with two constraints: that the mean hub frequency is 0.55Hz, and that the variance of the hub frequency is moderate.”
Section 3.3, Lines 200206
“Two conditions are necessary and sufficient for RNNs to exhibit stable amplification [51]: real(λ_{1}) < 1 and ${\lambda}_{1}^{s}>1$ .[…] EPI can naturally condition on this emergent property Variance constraints predicate that the majority of the distribution (within two standard deviations) are within the specified ranges.”
Section 3.4, Lines 275278
“We quantify levels of Epopulation variability by studying two emergent properties where s_{E}(x;z) is the standard deviation of the stochastic Epopulation response about its steady state (Figure 3C). In the following analyses, we select 1Hz^{2} variance such that the two emergent properties do not overlap in s_{E}(z;x).”
Section 3.5, Lines 331334
“We stipulate that accuracy be on average.75 in each task with variance.075^{2}. 75% accuracy is a realistic level of performance in each task, and with the chosen variance, inferred models will not exhibit fully random responses (50%), nor perfect performance (100%).”
Reviewer #3:
This paper addresses a major issue in fitting neuroscience models: how to identify the often degenerate, or nearly degenerate, set of parameters that can underlie a set of experimental observations. Whereas previous techniques often depended upon brute force explorations or special parametric forms or local linearizations to generate sets of parameters consistent with measured properties, the authors take advantage of deep generative networks to address this problem. Overall, I think this paper and the complementary submission have the potential to be transformative contributions to model fitting efforts in neuroscience. That being said, since the primary contribution is the methodology, I think the paper requires more systematic comparisons to ground truth examples to demonstrate potential strengths and weaknesses, and more focus on methodology rather than applications.
1) The authors only have a single groundtruth example where they compare to a known result (a 2x2 linear dynamical system). It would be good to show how well this method compares to results from, for example, a direct brute force grid search of a system with a strongly nonelliptical (e.g. sharply bent) shaped parameter regime and a reasonably large (e.g. 5?) number of parameters corresponding to a particular property, to see how well the derived probability distribution overlaps the brute force grid search parameters (perhaps shown via several 2D projections).
We thank the reviewer for pointing out the importance of ground truth comparisons in this manuscript. In this revision, we make ground truth comparisons via analytic derivations, empirical error contours, and bruteforce sampling.
Analytic comparisons: The 2x2 linear dynamical system is chosen as a worked example because it has multimodal nonelliptical structure (Figure 1—figure supplement 1), and its contours can be derived analytically (Figure 1—figure supplement 2). Similarly, in Section 5.4.5, we derive the quadratic relationship between excitatory variability and input noise variability (in a simplified model) suggesting that the quadratic relationship uncovered by EPI (see Section 3.4) is correct.
Error contours: In the motivation example, we compare the EPI inferred distribution of STG conductances to hub frequency contours (Figure 1E), which show that the nonelliptical parametric structure captured by EPI is in agreement with these contours. This general region of parameter space was labeled following grid search analyses in a previous study (Gutierrez et al. 2013, Figure 2, parameter regime G).
Bruteforce: The EPI inferred distribution for rapid task switching in the SC model is sharply bent (Figure 4), and matches the parameter set returned from random sampling (Figure 4—figure supplement 5A). We note that the bruteforce parameter set is actually not the groundtruth solution, because it does not obey the constraints of the emergent property as the EPI distribution does (Figure 4—figure supplement 5B). This can explain the spurious samples in the bruteforce set that are not in the EPI inferred distribution.
All EPI distributions shown in this manuscript are “validated” in the sense that they pass a hypothesis testing criteria for emergent property convergence; all EPI distributions produce their emergent properties. Finally, the underlying maximum entropy flow network (MEFN) algorithm is compared to a ground truth solution (Loaiza Ganem et al. 2017, Figure 2) by deriving ground truth from the duality of maximum entropy distributions and exponential families (see Section 5.1.3).
2) It was not obvious whether EPI actually scales well to higher dimensions and how much computation it would take (there is one claim that it 'should scale reasonably'). While I agree that examples with a small number of parameters is nice for illustration, a major issue is how to develop techniques that can handle large numbers of parameters (brute force, while inelegant, inefficient, and not producing an explicit probability distribution can do a reasonable job for small #'s of parameters). The authors should show some example of extending to larger number of parameters and do some checks to show that it appears to work. As a methodological contribution, the authors should also give some sense of how computationally intensive the method is and some sense of how it scales with size. This seems particularly relevant to, for example, trying to infer uncertainties in a large weight matrix or a nonparametric description of spatial or temporal responses or a sensory neuron (which I'm assuming this technique is not appropriate for? See point #4 below).
The reviewer is right to point out the importance of a scaling analysis. Please see response to Main concern #2.
3) For the STGlike example, this was done for a very simple model that was motivated by the STG but isn't based on experimental recordings. Most of the brute force models of the STG seek to fit various waveform properties of neurons and relative phases. Could the model handle these types of analyses, or would it run into problems due to either needing to specify too many properties or because properties like "number of spikes per burst" are discrete rather than continuous? This isn't fatal, but would be good to consider and/or note explicitly.
The STG subcircuit model of Gutierrez et al. 2013 is certainly less complex than other models of the STG, yet 5 connected MorrisLecar neurons is certainly a nontrivial system. We clarify why this model is analyzed in Section 3.1 instead of more complex STG models when discussing the differences between EPI and SNPE in Discussion:
Lines 435447
“A key difference between EPI and SNPE, is that EPI uses gradients of the emergent property throughout optimization. […] In summary, choice of deep inference technique should consider emergent property complexity and differentiability, dimensionality of parameter space, and the importance of constraining the model behavior predicted by the inferred parameter distribution.”
4) The discussion should be expanded to be more specific about what problems the authors think the model is, or is not, appropriate for. Comparisons to the Goncalves article would also be helpful since users will want to know the comparative advantages/disadvantages of each method. (if the authors could coordinate running their methods on a common illustrative example, that would be cool, but not required).
Thank you for this recommendation. We now include substantial text in discussion devoted to this topic in addition to that quoted in the previous response (R3 concern #3).
Lines 427447
“Methodology for statistical inference in circuit models has evolved considerably in recent years. […] In summary, choice of deep inference technique should consider emergent property complexity and differentiability, dimensionality of parameter space, and the importance of constraining the model behavior predicted by the inferred parameter distribution.”
5) Given that the paper is heavily a (very valuable!) methods paper for a general audience, the method should be better explained both in the main text and the supplement. Some specific ones are below, but the authors should more generally send the paper to naïve readers to check what is/is not well explained.
– Figure 1 is somewhat opaque and also has notational issues (e.g. omega is the frequency but also appears to be the random input sample).
Figure 1 has been completely revised.
– For the general audience of eLife, panels C and D are not well described individually or well connected to each other and don't illustrate or describe all of the relevant variables (including what q0 is and what x is).
Agreed. In the new figure, we clearly depict z as parameters and x as circuit activity. We keep the vertical directionality of panel D exclusively for the deep generative process of the deep probability distribution (where q_{0} is replaced and shown explicitly as an isotropic gaussian input to the deep network). The horizontal directionality shared between panels D and E reflect the procedure of theoretical neuroscience described in Section 3.1.
– In Equation 2 (and also in the same equation in the supplement), it was not immediately obvious what the expectation was taken over.
We take care to always indicate the variables over which expectations are taken in the updated manuscript.
– The authors don't specific the distribution of w (it's referred to only as 'a simple random variable', which is not clear).
It is clarified in the text that this simple initial distribution transformed by the deep neural network is an isotropic gaussian.
– It was also sometimes hard to quickly find in the text basic, important quantities like what z was for a given simulation.
We have made sure to make this explicit in each section of the paper.
– The augmented Lagrangian optimization was not well explained or motivated. There is a reference to m=absolute value(mu) but I didn't see m in the above equation.
Lines 767769
“To run this constrained optimization, we use an augmented lagrangian objective, which is the standard approach for constrained optimization [70], and the approach taken to fit Maximum Entropy Flow Networks (MEFNs) [38]."
– Using mu to describe a vector that includes means and variances is confusing notation since mu often denotes means.
Agreed. Throughout the updated main text, we only describe emergent properties through explicit mean and variance constraints as in Equation 11. We reserve the mean parameterization µ_{opt} of the maximum entropy solution of the EPI optimization for technical details in methods. The difference between µ and µ_{opt} is described in Section 5.1.3.
– It would be helpful to have a pseudocode 'Algorithm' figure or section of the text.
We provide pseudocode for the EPI optimization in Algorithm 1, which mirrors the pseudocode found in the paper describing the underlying algorithm for MEFN [38].
https://doi.org/10.7554/eLife.56265.sa2Article and author information
Author details
Funding
National Science Foundation (DGE1644869)
 Sean R Bittner
NINDS (5R01NS100066)
 Sean R Bittner
 John Cunningham
McKnight Endowment Fund for Neuroscience (Faculty Award)
 John Cunningham
Gatsby Charitable Foundation (GAT3708)
 Sean R Bittner
 Agostina Palmigiano
 Kenneth D Miller
 John Cunningham
Simons Foundation (542963)
 Sean R Bittner
 John Cunningham
National Science Foundation (1707398)
 Sean R Bittner
 Agostina Palmigiano
 Kenneth D Miller
 John Cunningham
NIH (5U19NS107613)
 Agostina Palmigiano
 Kenneth D Miller
Simons Foundation (Postdoctoral Fellowship)
 Chunyu A Duan
NIH (U01NS108683)
 Agostina Palmigiano
 Kenneth D Miller
NIH (R01EY029999)
 Agostina Palmigiano
Grossman Charitable Foundation
 Sean R Bittner
 John Cunningham
Simons Foundation (553017)
 Kenneth D Miller
Howard Hughes Medical Institute
 Carlos D Brody
The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.
Acknowledgements
This work was funded by NSF Graduate Research Fellowship, DGE1644869, McKnight Endowment Fund, NIH NINDS 5R01NS100066, Simons Foundation 542963, NSF NeuroNex Award, DBI1707398, The Gatsby Charitable Foundation, Simons Collaboration on the Global Brain Postdoctoral Fellowship, Chinese Postdoctoral Science Foundation, and International Exchange Program Fellowship. We also acknowledge the Marine Biological Laboratory Methods in Computational Neuroscience Course, where this work was discussed and explored in its early stages. Helpful conversations were had with Larry Abbott, Stephen Baccus, James Fitzgerald, Gabrielle Gutierrez, Francesca Mastrogiuseppe, Srdjan Ostojic, Liam Paninski, and Dhruva Raman.
Senior Editor
 John R Huguenard, Stanford University School of Medicine, United States
Reviewing Editor
 Timothy O'Leary, University of Cambridge, United Kingdom
Reviewer
 Mark S Goldman, University of California at Davis, United States
Publication history
 Received: February 21, 2020
 Accepted: June 30, 2021
 Version of Record published: July 29, 2021 (version 1)
 Version of Record updated: August 3, 2021 (version 2)
Copyright
© 2021, Bittner et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.
Metrics

 941
 Page views

 119
 Downloads

 1
 Citations
Article citation count generated by polling the highest count across the following sources: Crossref, PubMed Central, Scopus.