Modelling the neural code in large populations of correlated neurons
Abstract
Neurons respond selectively to stimuli, and thereby define a code that associates stimuli with population response patterns. Certain correlations within population responses (noise correlations) significantly impact the information content of the code, especially in large populations. Understanding the neural code thus necessitates response models that quantify the coding properties of modelled populations, while fitting largescale neural recordings and capturing noise correlations. In this paper, we propose a class of response model based on mixture models and exponential families. We show how to fit our models with expectationmaximization, and that they capture diverse variability and covariability in recordings of macaque primary visual cortex. We also show how they facilitate accurate Bayesian decoding, provide a closedform expression for the Fisher information, and are compatible with theories of probabilistic population coding. Our framework could allow researchers to quantitatively validate the predictions of neural coding theories against both largescale neural recordings and cognitive performance.
Introduction
A foundational idea in sensory neuroscience is that the activity of neural populations constitutes a ‘neural code’ for representing stimuli (Dayan and Abbott, 2005; Doya, 2007): the activity pattern of a population in response to a sensory stimulus encodes information about that stimulus, and downstream neurons decode, process, and reencode this information in their own responses. Sequences of such neural populations implement the elementary functions that drive perception, cognition, and behaviour (Pitkow and Angelaki, 2017). Therefore, by studying the encoding and decoding of population responses, researchers may investigate how information is processed along neural circuits, and how this processing influences perception and behaviour (Wei and Stocker, 2015; Panzeri et al., 2017; Kriegeskorte and Douglas, 2018).
Given a true statistical model of how a neural population responds to (encodes information about) stimuli, Bayes’ rule can transform the encoding model into an optimal decoder of stimulus information (Zemel et al., 1998; Pillow et al., 2011). However, when validated as Bayesian decoders, statistical models of neural encoding are often outperformed by models trained to decode stimulusinformation directly, indicating that the encoding models miss key statistics of the neural code (Graf et al., 2011; Walker et al., 2020). In particular, the correlations between neurons’ responses to repeated presentations of a given stimulus (noise correlations), and how these noise correlations are modulated by stimuli, can strongly impact coding in neural circuits (Zohary et al., 1994; Abbott and Dayan, 1999; Sompolinsky et al., 2001; Ecker et al., 2016; Kohn et al., 2016; Schneidman, 2016), especially in large populations of neurons (MorenoBote et al., 2014; Montijn et al., 2019; Bartolo et al., 2020; Kafashan et al., 2021; Rumyantsev et al., 2020).
Statistically validating theories of population coding in large neural circuits thus depends on encoding models that support accurate Bayesian decoding, effectively capture noisecorrelations, and efficiently fit largescale neural recordings. There are at least two classes of neural recordings for which established models have facilitated such analyses. Firstly, for recordings of binary spikecounts, pairwisemaximum entropy models (Schneidman et al., 2006; Lyamzin et al., 2010; GranotAtedgi et al., 2013; Tkačik et al., 2013; Meshulam et al., 2017; Maoz et al., 2020) have been used to investigate the structure of the retinal code (Ganmor et al., 2015; Tkačik et al., 2015). Secondly, when modelling dynamic spiketrain recordings, generalized linear models (GLMs) have proven effective at modelling spatiotemporal features of information processing in the retina and cortex (Pillow et al., 2008; Park et al., 2014; Runyan et al., 2017; Ruda et al., 2020).
Nevertheless, many theories of neural coding are formulated in terms of unbounded spikecounts (Ma et al., 2006; Beck et al., 2011a; Ganguli and Simoncelli, 2014; Makin et al., 2015; Yerxa et al., 2020), rather than the binary spikecounts of pairwise maximum entropy models. Furthermore, neural correlations are often lowdimensional (Arieli et al., 1996; Ecker et al., 2014; Goris et al., 2014; Rabinowitz et al., 2015; Okun et al., 2015; Semedo et al., 2019), in contrast with the correlations that result from the fully connected, recurrent structure of standard GLMs. Although there are extensions of the GLM approach that capture sharedvariability (Vidne et al., 2012; Archer et al., 2014; Zhao and Park, 2017), they seem unable to support exact Bayesian decoding. Similarly, methods such as factor analysis that model unbounded spikecounts as continuous variables have proven highly effective at modelling neural correlations in largescale recordings (Yu et al., 2009; Cunningham and Yu, 2014; Ecker et al., 2014; Semedo et al., 2019), yet it is also unknown if they can support accurate Bayesian decoding.
Towards modelling spikecount responses and accurate Bayesian decoding in large populations of correlated neurons, we develop a class of encoding model based on finite mixtures of Poisson distributions. Within neuroscience, Poisson mixtures are widely applied to modelling the spikecount distributions of individual neurons (Wiener and Richmond, 2003; Shidara et al., 2005; Goris et al., 2014; Taouali et al., 2016). Outside of neuroscience, mixtures of multivariate Poisson distributions are an established model of multivariate count distributions that effectively capture correlations in count data (Karlis and Meligkotsidou, 2007; Inouye et al., 2017).
Building on the theory of exponential family distributions (Wainwright and Jordan, 2008; Macke et al., 2011b), our model extends previous mixture models of multivariate count data in two ways. Firstly, we develop a tractable extension of Poisson mixtures that captures both over and underdispersed response variability (i.e. where the response variance is larger or smaller than the mean, respectively) based on ConwayMaxwell Poisson distributions (Shmueli et al., 2005; Stevenson, 2016). Secondly, we introduce an explicit dependence of the model on a stimulus variable, which allows the model to accurately capture changes in response statistics (including noise correlations) across stimuli. Importantly, the resulting encoding model affords closedform expressions for both its Fisher information and probability density function, and thereby a rigorous quantification of the coding properties of a modelled neural population (Dayan and Abbott, 2005). Moreover, the model learns lowdimensional representations of stimulusdriven neural activity, and we show how it captures a fundamental property of population codes known as informationlimiting correlations (MorenoBote et al., 2014; Montijn et al., 2019; Bartolo et al., 2020; Kafashan et al., 2021; Rumyantsev et al., 2020).
We apply our mixture model framework to both synthetic data and recordings from macaque primary visual cortex (V1), and demonstrate that it effectively models responses of populations of hundreds of neurons, captures noise correlations, and supports accurate Bayesian decoding. Moreover, we show how our model is compatible with the theory of probabilistic population coding (Zemel et al., 1998; Pouget et al., 2013), and could thus be used to study the theoretical coding properties of neural circuits, such as their efficiency (Ganguli and Simoncelli, 2014), linearity (Ma et al., 2006), or information content (MorenoBote et al., 2014).
Results
A critical part of our theoretical approach is based on expressing models of interest in exponential family form. An exponential family distribution $p(n)$ over some data $n$ (in our case, neural responses) is defined by the proportionality relation $p(n)\propto {e}^{\mathit{\theta}\cdot \mathbf{s}(n)}b(n)$, where $\mathit{\theta}$ are the socalled natural parameters, $\mathbf{\mathbf{s}}(n)$ is a vectorvalued function of the data called the sufficient statistic, and $b(n)$ is a scalarvalued, nonnegative function called the base measure (Wainwright and Jordan, 2008). The exponential family form allows us to modify and extend existing models in a simple and flexible manner, and derive analytical results about the coding properties of our models. We demonstrate our approach with applications to both synthetic data, and experimental data recorded in V1 of anaesthetized and awake macaques viewing drifting grating stimuli at different orientations (for details see Materials and methods).
Extended Poisson mixture models capture spikecount variability and covariability
Our first goal is to define a class of models of stimulusindependent, neural population activity, that model neural activity directly as spikecounts, and that accurately capture singleneuron variability and pairwise covariability. We base our models on Poisson distributions, as they are widely applied to modelling the trialtotrial distribution of the number of spikes generated by a neuron (Dayan and Abbott, 2005; Macke et al., 2011a). We will also generalize our Poissonbased models with the theory of ConwayMaxwell (CoM) Poisson distributions (Sur et al., 2015; Stevenson, 2016; Chanialidis et al., 2018). The twoparameter CoMPoisson model contains the oneparameter Poisson model as a special case, however, whereas the Poisson model always has a Fano factor (FF; the variance divided by the mean) of 1, the CoMPoisson model can exhibit both over (FF>1) and underdispersion (FF<1), and thus capture the broader range of Fano factors observed in cortex (Stevenson, 2016).
The other key ingredient in our modelling approach are mixtures of Poisson distributions, which have been used to model complex spikecount distributions in cortex, and also allow for overdispersion (Shidara et al., 2005; Goris et al., 2014; Taouali et al., 2016; Figure 1A). In our case, we mix multiple, independent Poisson distributions in parallel, as such models can capture covariability in count data as well (see Karlis and Meligkotsidou, 2007 for a more general formulation of multivariate Poisson mixtures than what we consider here). To construct such a model, we begin with a product of independent Poisson distributions (IP distribution), one per neuron. We then mix a finite number of component IP models, to arrive at a multivariate spikecount, finite mixture model (see Materials and methods). Importantly, although each component of this mixture is an IP distribution, randomly switching between components induces correlations between the neurons (Figure 1B,C).
IP mixtures can in fact model arbitrary covariability between neurons (see Materials and methods, Equation 7); however, they are still limited because the model neurons in an IP mixture are always overdispersed. To overcome this, it is helpful to consider factor analysis (FA), which is widely applied to modelling neural population responses (Cunningham and Yu, 2014). IP mixtures are similar to FA, in that FA represents the covariance matrix of neural responses as the sum of a diagonal matrix that helps capture individual variance, and a lowrank matrix that captures covariance (see Bishop, 2006), and FA and IP mixtures can be finetuned to capture covariance arbitrarily well. However, whereas FA has distinct parameters for representing means and diagonal variances, the means and variances in an IP mixture are coupled through shared parameters (see Materials and methods, Equation 6). Our strategy will thus be to break this coupling between means and variances by granting IP mixtures an additional set of parameters based on the theory of CoMPoisson distributions.
To do so, we first show how to express an IP mixture as the marginal distribution of an exponential family distribution. Note that an IP mixture with ${d}_{K}$ components may be expressed as a latent variable model over spikecount vectors $\mathit{n}$ and latent componentindices $k$, where $1\le k\le {d}_{K}$. In this formulation we denote the $k$th component distribution by $p(\mathit{n}\mid k)$, and the probability of realizing (switching to) the $k$ th component by $p(k)$. The mixture model over spikecounts $\mathit{n}$ is then expressed as the marginal distribution $p(\mathit{n})=\sum _{k=1}^{{d}_{K}}p(\mathit{n}\mid k)p(k)=\sum _{k=1}^{{d}_{K}}p(\mathit{n},k)$, of the joint distribution $p(\mathit{n},k)$. Under mild regularity assumptions (see Materials and methods), we may reparameterize this joint distribution in exponential family form as
where the vectors $\mathit{\theta}}_{N$ and $\mathit{\theta}}_{K$, and matrix $\mathbf{\Theta}}_{NK$ are the natural parameters of $p(\mathit{n},k)$, and $\mathit{\delta}(k)=({\delta}_{2}(k),\dots ,{\delta}_{{d}_{K}}(k))$ is the Kronecker delta vector defined by ${\delta}_{j}(k)=1$ if $j=k$, and 0 otherwise.
This representation affords an intuitive interpretation. In general, the natural parameters of an IP distribution are the logarithms of the average spikecounts (firing rates), and the natural parameters of the first component distribution $p(\mathit{n}\mid k=1)$ of an IP mixture are simply $\mathit{\theta}}_{N$. The natural parameters of the $k$ th component for $k>1$ are then the sum of the ‘baseline’ parameters $\mathit{\theta}}_{N$ and column $k1$ from the matrix of parameters ${\mathbf{\mathbf{\Theta}}}_{NK}$ (Equation 13, Materials and methods). Because the dimension of the baseline parameters $\mathit{\theta}}_{N$ is much smaller than the total number of parameters in a given mixture, the baseline parameters provide a relatively lowdimensional means of affecting all the component distributions of the given mixture, as well as the probability distribution over indices $p(k)$ (Figure 1D; see Materials and methods, Equation 12 for how the indexprobabilities $p(k)$ depend on $\mathit{\theta}}_{N$).
We next extend Equation 1 with the theory of CoMPoisson distributions, and define the latent variable exponential family
where $\mathbf{\mathbf{l}\mathbf{f}}(\mathbf{\mathbf{n}})=(\mathrm{log}({n}_{1}!),\mathrm{\dots},\mathrm{log}({n}_{{d}_{N}}!))$ is the vector of logfactorials of the individual spikecounts, and $\mathit{\theta}}_{N}^{\ast$ are a set of natural parameters derived from CoMPoisson distributions (see Materials and methods). Based on this construction, each component $p(\mathbf{\mathbf{n}}\mid k)$ is a product of independent CoMPoisson distributions, and when $\mathit{\theta}}_{N}^{\ast}=\mathbf{1$, we recover an IP mixture defined by Equation 1 with parameters $\mathit{\theta}}_{N$, $\mathit{\theta}}_{K$, and ${\mathbf{\mathbf{\Theta}}}_{NK}$. The first component of this model $p(\mathbf{\mathbf{n}}\mid k=1)$ has parameters $\mathit{\theta}}_{N$ and $\mathit{\theta}}_{N}^{\ast$, and as with the IP mixture, the parameters $\mathit{\theta}}_{N$ are translated by column $k1$ of $\mathbf{\Theta}}_{NK$ when $k>1$. However, the parameters $\mathit{\theta}}_{N}^{\ast$ are never translated, and remain the same for each component distribution (Equation 16, Materials and methods, and see Equation 15 for formulae for the indexprobabilities $p(k)$). We refer to models defined by Equation 2 as CoMbased (CB) mixtures, and $\mathit{\theta}}_{N}^{\ast$ as CB parameters.
Due to the addition of the CB parameters $\mathit{\theta}}_{N}^{\ast$, a CB mixture breaks the coupling between the spikecount means and variances that is present in the simpler IP mixture (Equation 17, Materials and methods). In Figure 1D–F, we demonstrate how changing the parameters of a CB mixture can concentrate or disperse both the mixture distribution and its components, and that a CB mixture can indeed exhibit underdispersion.
To validate our mixture models, we tested if they capture variability and covariability of V1 population responses to repeated presentations of a grating stimulus with fixed orientation (${d}_{N}=43$ neurons and ${d}_{T}=355$ repetitions of 150 ms duration in one awake macaque; ${d}_{N}=70$ and ${d}_{T}=1,200$ of duration 70 ms in one anaesthetized macaque). We fit our mixtures to the complete datasets with expectationmaximization (EM, a standard choice for training finite mixture models [McLachlan et al., 2019] see Materials and methods). The CB mixture accurately captured singleneuron variability (Figure 2A–B, red symbols), including both cases of overdispersion and underdispersion. On the other hand, the simpler IP mixture (Figure 2A–B, blue symbols) cannot accommodate underdispersion due to its mathematical limits, and demonstrated limited ability to model overdispersion due to the coupling between the mean and variance (Equation 6).
To understand how the CB parameters allow the CB mixture to overcome the limits of the IP mixture, we plot a histogram of the CB parameters $\mathit{\theta}}_{N}^{\ast$ for both fits (Figure 2C–D). If the CB parameter of a given CoMPoisson distribution is $<1$, $>1$, or $=1$, then the CoMPoisson distribution is underdispersed, overdispersed, or Poissondistributed, respectively. When a CB mixture is fit to the awake data (Figure 2C), we see that it learns a range of values for the CB parameters around −1, to accommodate the variety of Fano factors observed in the awake data (Figure 2A). On the anaesthetized data, even though IP mixtures can capture overdispersion, the IP mixture underestimates the dispersion of neurons due to the coupling between the mean and variance (Figure 2B). The CB mixture thus uses the CB parameters to further disperse its model neurons (Figure 2D).
In contrast with individual variability, we found that both mixture models were flexible enough to qualitatively capture pairwise noise correlation structure in both awake and anaesthetized animals (Figure 3A–B), and that the distributions of modelled neural correlations were broadly similar when compared to the data (Figure 3C–D). In Appendix 1, we rigorously compare IP mixtures, CB mixtures, and FA on our datasets, and show that although FA is better than our mixture models at capturing secondorder statistics in training data, IP mixtures and CB mixtures achieve comparable predictive performance as FA when evaluated on heldout data.
Extended Poisson mixture models capture stimulusdependent response statistics
So far, we have introduced the exponential family theory of IP and CB mixtures, and shown how they capture response variability and covariability for a fixed stimulus. To allow us to study stimulus encoding and decoding, we further extend our mixtures by inducing a dependency of the model parameters on a stimulus. When there are a finite number of stimulus conditions and sufficient data, we may define a stimulusdependent model with a lookup table, and fit it by fitting a distinct model at each stimulus condition. However, this is inefficient when the amount of data at each stimuluscondition is limited and the stimulusdependent statistics have structure that is shared across conditions. A notable feature of the exponential family parameterizations in Equations 1 and 2 is that the baseline parameters influence both the index probabilities and all the component distributions of the model. This suggests that by restricting stimulusdependence to the baseline parameters, we might model rich stimulusdependent response structure, while bounding the complexity of the model.
In general, we refer to any finite mixture with stimulusdependent parameters as a conditional mixture (CM), and depending on whether the CM is based on Equations 1 and 2 and, we refer to it as an IP or CBCM, respectively. Although there are many ways we might induce stimulusdependence, in this paper we consider two forms of CM: (i) a maximal CM, which we implement as a lookup table, such that all the parameters in Equations 1 and 2 and depend on the stimulus, and (ii) a minimal CM, for which we restrict stimulusdependence to the baseline parameters $\mathit{\theta}}_{N$. This results in the CBCM
where $x$ is the stimulus, and ${\mathit{\theta}}_{N}(x)$ are the stimulusdependent baseline parameters, and we recover a minimal, IPCM by setting $\mathit{\theta}}_{N}^{\ast}=\mathbf{1$.
The IPCM again affords an intuitive interpretation: The first component of an IPCM $p(\mathbf{\mathbf{n}}\mid x,k=1)$ has stimulusdependent natural parameters ${\mathit{\theta}}_{N}(x)$, and thus the stimulusdependent firing rate, or tuning curve, of the $i$ th neuron given $k=1$ is $\mu}_{i1}(x)={e}^{{\theta}_{N,i}(x)$, where ${\theta}_{N,i}(x)$ is the $i$ th element of ${\mathit{\theta}}_{N}(x)$. The natural parameters of the $k$ th component for $k>1$ are then the sum of ${\mathit{\theta}}_{N}(x)$ and column $k1$ of ${\mathbf{\mathbf{\Theta}}}_{NK}$. As such, given $k>1$, the tuning curve of the $i$ th neuron ${\mu}_{ik}(x)={\gamma}_{i,(k1)}{\mu}_{i1}(x)$ is a ‘gainmodulated’ version of ${\mu}_{i1}(x)$, where the gain ${\gamma}_{i,(k1)}$ is the exponential function of element $i$ of column $k1$ of ${\mathbf{\mathbf{\Theta}}}_{NK}$ (see Equation 13, Materials and methods). For a CBCM this interpretation no longer holds exactly, but still serves as an approximate description of the behaviour of its components (see Equation 16 and the accompanying discussions).
Towards understanding the expressive power of CMs, we study a minimal, CBCM with ${d}_{N}=20$ neurons, ${d}_{K}=5$ mixture components, and randomly chosen parameters (see Materials and methods). Moreover, we assume that the stimulus is periodic (e.g. the orientation of a grating), and that the tuning curves of the component distributions $p(\mathbf{\mathbf{n}}\mid x,k)$ have a von Mises shape, which is a widely applied model of neural tuning to periodic stimuli (Herz et al., 2017). We may achieve such a shape by defining the stimulusdependent baseline parameters as ${\mathit{\theta}}_{N}(x)={\mathit{\theta}}_{N}^{0}+{\mathbf{\Theta}}_{NX}\cdot \mathbf{v}\mathbf{m}(x)$, where $\mathit{\theta}}_{N}^{0$ and ${\mathbf{\mathbf{\Theta}}}_{NX}$ are parameters, and $\mathbf{\mathbf{v}\mathbf{m}}(x)=(\mathrm{cos}2x,\mathrm{sin}2x)$. Figure 4A shows that the tuning curves of the CBCM neurons are approximately bellshaped, yet many also exhibit significant deviations.
We also study if CMs can be effectively fit to datasets comparable to those obtained in typical neurophysiology experiments. We generated 200 responses from the CBCM described above — the ground truth CBCM — to each of 10 orientations spread evenly over the halfcircle, for a total of 2000 stimulusresponse sample points. We then used this data to fit a CBCM with the same number of components. Toward this aim, we derive an approximate EM algorithm to optimize model parameters (see Materials and methods). Figure 4B shows that the tuning curves of the learned CBCM are nearly indistinguishable from those of the ground truth CBCM (Figure 4B, coefficient of determination ${r}^{2}=0.998$).
To reveal the orientationdependent latent structure of the model, in Figure 4C we plot the index probability $p(k\mid x)$ for every $k$ as a function of the orientation $x$. In Figure 4D we show that the orientationdependent index probabilities of the learned CBCM qualitatively match the true index probabilities in Figure 4C. We also note that although the learned CBCM does not correctly identify the indices themselves, this has no effect on the performance of the CBCM.
The orientationdependent indexprobabilities provide a highlevel picture of how the complexity and structure of model correlations varies with the orientation. The vertical dashed lines in Figure 4C–D denote two orientations that yield substantially different index probabilities $p(k\mid x)$. When a large number of indexprobabilities are nonzero, the correlationmatrices of the CBCM can exhibit complex correlations with both negative and positive values (Figure 4E). However, when one index dominates, the correlation structure largely disappears (Figure 4F). In Figure 4G we show that the FFs also depend on stimulus orientation. Lastly, we find that both the FF and the correlationmatrices of the learned CBCM are nearly indistinguishable from the groundtruth CBCM (Figure 4E–G).
In summary, our analyses show that minimal CBCMs can express complex, stimulusdependent response statistics, and that we can recover the structure of a ground truth CBCM from realistic amounts of synthetic data with EM. In the following sections, we rigorously evaluate the performance of CMs on our awake and anaesthetized datasets.
Conditional mixtures effectively model neural responses in macaque V1
A variety of models may be defined within the CM framework delineated by Equations 1, 2 and 3. Towards understanding how effectively CMs can model real data, we compare different variants by their crossvalidated loglikelihood on both our awake and anaesthetized datasets; this is the same data used in Figures 2 and 3 but now including all stimulusconditions. We consider both IP and CB variants of each of the following conditional mixtures: (i) maximal CMs where we learn a distinct mixture for each of ${d}_{X}$ stimulus conditions, (ii) minimal CMs with von Misestuned components, and (iii) minimal CMs with discretetuned components given by ${\mathit{\theta}}_{N}(x)={\mathit{\theta}}_{N}^{0}+{\mathbf{\Theta}}_{NX}\cdot \mathit{\delta}(x)$, where $\mathit{\bm{\delta}}$ is the Kronecker delta vector with ${d}_{X}1$ elements, and $x$ is the index of the stimulus. In contrast with the von Mises CM, the discrete CM makes no assumptions about the form of component tuning. In Table 1 we detail the number of parameters for all forms of CM.
To provide an interpretable measure of the relative performance of each CM variant, we define the ‘information gain’ as the difference between the estimated loglikelihood (base $e$) of the given CM and the loglikelihood of a von Misestuned, independent Poisson model, which is a standard model of uncorrelated neural responses to oriented stimuli (Herz et al., 2017). We then evaluate the predictive performance of our models with 10fold crossvalidation of the information gain.
Table 2 shows that the CM variants considered achieve comparable performance, and perform substantially better than the independent Poisson lower bound on both the awake and anaesthetized data. Figure 5 shows that a performance peak emerges smoothly as the model complexity (number of parameters) is increased. In all cases, the CB models outperform their IP counterparts, and typically with fewer parameters. The discrete CBCMs achieve high performance on both datasets. In contrast, von Mises CMs perform well on the anaesthetized data but more poorly on the awake data, and maximal CMs exhibit the opposite trend. Nevertheless, von Mises CMs solve a more difficult statistical problem as they also interpolate between stimulus conditions, and so may still prove relevant even where performance is limited. On the other hand, even though maximal CMs achieve high performance, they simply do so by replicating the high performance of stimulusindependent mixtures (Figures 2 and 3) at each stimulus condition, and require more parameters than minimal CMs to maximize performance.
Conditional mixtures facilitate accurate and efficient decoding of neural responses
To demonstrate that CMs model the neural code, we must show that CMs not only capture the features of neural responses, but that these features also encode stimulusinformation. Given an encoding model $p(\mathbf{\mathbf{n}}\mid x)$ and a response from the model $\mathbf{\mathbf{n}}$, we may optimally decode the information in the response about the stimulus $x$ by applying Bayes’ rule $p(x\mid \mathbf{\mathbf{n}})\propto p(\mathbf{\mathbf{n}}\mid x)p(x)$, where $p(x\mid \mathbf{\mathbf{n}})$ is the posterior distribution (the decoded information), and $p(x)$ represents our prior assumptions about the stimulus (Zemel et al., 1998). When we do not know the true encoding model, and rather fit a statistical model to stimulusresponse data, using the statistical model for Bayesian decoding and analyzing its performance can tell us how well it captures the features of the neural code.
We analyze the performance of Bayesian decoders based on CMs by quantifying their decoding performance, and comparing the results to other common approaches to decoding. We evaluate decoding performance with the 10fold crossvalidation logposterior probability $\mathrm{log}p(x\mid \mathbf{\mathbf{n}})$ (base $e$) of the true stimulus value $x$, for both our awake and anaesthetized V1 datasets. With regard to choosing the number of components ${d}_{K}$, we analyze the decoding performance of CMs that achieved the best encoding performance based as indicated in Table 2 and depicted Figure 5. We do this to demonstrate how well a single model can simultaneously perform at both encoding and decoding, instead of applying distinct procedures for selecting CMs based on decoding performance (see Materials and methods for a summary of tradeoffs when choosing ${d}_{K}$).
In our comparisons we focus on minimal, discrete CMs as overall they achieved high performance on both datasets (Figure 5). To characterize the importance of neural correlations to Bayesian decoding, we compare our CMs to the decoding performance of independent Poisson models with discrete tuning (Nonmixed IP). To characterize the optimality of our Bayesian decoders, we also evaluate the performance of linear multiclass decoders (Linear), as well nonlinear multiclass decoders defined as artificial neural networks (ANNs) with two hidden layers and a crossvalidated number of hidden units (for details on the training and model selection procedure, see Materials and methods).
Table 3 shows that on the awake data, the performance of the CMs is statistically indistinguishable from the ANN, and the CMs and the ANN significantly exceed the performance of both the Linear and Nonmixed IP models. On the anaesthetized data, the minimal CM approaches the performance of the ANN, and the minimal CMs and ANN models again exceed the performance of the Nonmixed IP and Linear models. Yet in this case, the Linear model is much more competitive, whereas the Nonmixed IP model performs very poorly, possibly because of the larger magnitude of noise correlations in this data. In Appendix 2, we also report that a Bayesian decoder based on a factor analysis (FA) encoding model performed inconsistently, and poorly relative to CMs, as it would occasionally assign numerically 0 probability to the true stimulus, and thus score an average logposterior of negative infinity. In Appendix 2, we present preliminary evidence that this is because CMs capture higher order structure that FA cannot.
On both the awake and anaesthetized data the ANN requires two orders of magnitude more parameters than the CMs to achieve its performance gains. In addition, the CBCM achieves marginally better performance with fewer parameters than the IPCM, indicating that although modelling individual variability is not essential for effective Bayesian decoding, doing so still results in a more parsimonious model of the neural code. In Appendix 3, we report a sample complexity analysis of CM encoding and decoding performance. We found that whereas our anaesthetized V1 dataset (sample size ${d}_{T}=10,800$) was large enough to saturate the performance of our models, a larger awake V1 dataset (${d}_{T}=3,168$) could yield further improvements to decoding performance.
We also consider widely used alternative measures of decoding performance, namely the Fisher information (FI), which is an upper bound on the average precision (inverse variance) of the posterior (Brunel and Nadal, 1998), as well as the linear Fisher information (LFI), which is a linear approximation of the FI (Seriès et al., 2004) corresponding to the accuracy of the optimal, unbiased linear decoder of the stimulus (Kanitscheider et al., 2015a). The FI is especially helpful when the posterior cannot be evaluated directly (such as when it is continuous), and is widely adopted in theoretical (Abbott and Dayan, 1999; Beck et al., 2011b; Ecker et al., 2014; MorenoBote et al., 2014; Kohn et al., 2016) and experimental (Ecker et al., 2011; Kafashan et al., 2021; Rumyantsev et al., 2020) studies of neural coding. As with other models based on exponential family theory (Ma et al., 2006; Beck et al., 2011b; Ecker et al., 2016), the FI of a minimal CM may be expressed in closedform, and is equal to its LFI (see Materials and methods), and therefore minimal CMs can be used to study FI analytically and obtain modelbased estimates of FI from data.
To study how well CMs capture FI, we defined 40 random subpopulations of ${d}_{N}=20$ neurons from both our V1 datasets, fit von Mises IPCMs to the responses of each subpopulation, and used these learned models as groundtruth populations. We then generated 50 responses at each of 10 evenly spaced orientations from each ground truth IPCM, for a total of ${d}_{T}=500$ responses per groundtruth model. We then fit a new IPCM to each set of 500 responses, and compared the FI of the refit CM to the FI of the groundtruth CM at 50 evenly spaced orientations. Pooled over all populations and orientations, the relative error of the estimated FI was $12.8\%\pm 18.6\%$ on the awake data and $9.1\%\pm 22.4\%$ on the anaesthetized data, suggesting that IPCMs can recover and even interpolate approximate FIs of groundtruth populations from modest amounts of data.
To summarize, CMs support accurate Bayesian decoding in awake and anaesthetized macaque V1 recordings, and are competitive with nonlinear decoders with two orders of magnitude more parameters. Moreover, CMs afford closedform expressions of FI and can interpolate good estimates of FI from modest amounts of data, and thereby support analyses of neural data based on this widely applied theoretical tool.
Constrained conditional mixtures support linear probabilistic population coding
Having shown that minimal CMs can both capture the statistics of neural encoding and facilitate accurate Bayesian decoding, we now aim to show how they relate to an influential theory of neural coding known as probabilistic population codes (PPCs), which describes how neural circuits process information in terms of encoding and Bayesian decoding (Zemel et al., 1998). In particular, linear probabilistic population codes (LPPCs) are PPCs with a restricted encoding model, that explain numerous features of neural coding in the brain (Ma et al., 2006; Beck et al., 2008; Beck et al., 2011a).
In general, an exponential family of distributions that depend on some stimulus $x$ may be expressed as $p(\mathbf{n}\mid x)={e}^{{\mathit{\theta}}_{N}(x)\cdot {\mathbf{s}}_{N}(\mathbf{n}){\psi}_{N}({\mathit{\theta}}_{N}(x))}\mu (\mathbf{n})$, where ${\mathbf{\mathbf{s}}}_{N}$ is the sufficient statistic, μ is the base measure, and ${\psi}_{N}({\mathit{\theta}}_{N}(x))$ is known as the logpartition function (in Equations 13 we used the proportionality symbol ∝ to avoid writing the logpartition functions explicitly). A PPC is an LPPC when its encoding model is in the socalled exponential family with linear sufficient statistics (EFLSS), which has the form $p(\mathbf{n}\mid x)={e}^{{\mathit{\theta}}_{N}(x)\cdot \mathbf{n}}\varphi (\mathbf{n})$ for some functions $\varphi (\mathbf{\mathbf{n}})$ and ${\mathit{\theta}}_{N}(x)$ (Beck et al., 2011a). If we equate the two expressions ${e}^{{\mathit{\theta}}_{N}(x)\cdot {\mathbf{s}}_{N}(\mathbf{n}){\psi}_{N}({\mathit{\theta}}_{N}(x))}\mu (\mathbf{n})={e}^{{\mathit{\theta}}_{N}(x)\cdot \mathbf{n}}\varphi (\mathbf{n})$ we see that an EFLSS is a stimulusdependent exponential family that satisfies two constraints: that the sufficient statistic ${\mathbf{\mathbf{s}}}_{N}(\mathbf{\mathbf{n}})=\mathbf{\mathbf{n}}$ is linear, and that the logpartition function ${\psi}_{N}({\mathit{\theta}}_{N}(x))=\alpha$ does not depend on the stimulus, so that $\varphi (\mathbf{\mathbf{n}})={e}^{\alpha}\mu (\mathbf{\mathbf{n}})$.
As presented, the EFLSS is a mathematical model that does not have fittable parameters. We wish to express CMs as a form of EFLSS in order to show how a fittable model could be compatible with LPPC theory. If we return to the general expression for a minimal CM (Equation 3) and assume that the logpartition function is given by the constant α, then we may write
where $\varphi (\mathbf{n})={e}^{{\mathit{\theta}}_{N}^{\ast}\cdot \mathbf{l}\mathbf{f}(\mathbf{n})\alpha}\sum _{k}{e}^{{\mathit{\theta}}_{K}\cdot \mathit{\delta}(k)+\mathbf{n}\cdot {\mathbf{\Theta}}_{NK}\cdot \mathit{\delta}(k)}$, such that the given CM is in the EFLSS. Observe that this equation only holds due to the specific structure of minimal CMs: if the parameters $\mathit{\theta}}_{N}^{\ast$, $\mathit{\theta}}_{K$, or ${\mathbf{\mathbf{\Theta}}}_{NK}$ would depend on the stimulus, then it would not be possible to absorb them into the function $\varphi (\mathbf{\mathbf{n}})$.
Ultimately, this equivalence between constrained CMs and EFLSSs allows LPPC theory to be applied to constrained CMs, and provides theorists working on neural coding with an effective statistical tool that can help validate their hypotheses.
Minimal conditional mixtures capture informationlimiting correlations
Our last aim is to demonstrate that CMs can approximately represent a central phenomenon in neural coding known as informationlimiting correlations, which are neural correlations that fundamentally limit stimulusinformation in neural circuits (MorenoBote et al., 2014; Montijn et al., 2019; Bartolo et al., 2020; Kafashan et al., 2021; Rumyantsev et al., 2020). To illustrate this, we generate population responses with limited information, and then fit an IPCM to these responses and study the learned latent representation. In particular, we consider a source population of 200 independent Poisson neurons $p(\mathbf{\mathbf{n}}\mid s)$ with homogeneous, von Mises tuning curves responding to a noisy stimulusorientation $s$, where the noise $p(s\mid x)$ follows a von Mises distribution centred at the true stimulusorientation $x$ (see Materials and methods). In Figure 6A we show that, as expected, the average FI in the source population about the noisy orientation $s$ grows linearly with the size of randomized subpopulations, although the FI about the true orientation $x$ is theoretically bounded by the precision (inverse variance) of the sensory noise.
Even though the neurons in the source model are uncorrelated, sensory noise ensures that the encoding model $p(\mathbf{\mathbf{n}}\mid x)=\int p(\mathbf{\mathbf{n}}\mid s)p(s\mid x)\mathit{d}s$ contains informationlimiting correlations that bound the FI about $x$(MorenoBote et al., 2014; Kanitscheider et al., 2015b). Informationlimiting correlations can be small and difficult to capture, and to understand how CMs learn in the presence of informationlimiting noise correlations, we fit a von Mises IPCM $q(\mathbf{\mathbf{n}}\mid x)$ with ${d}_{K}=20$ mixture components to ${d}_{T}=10,000$ responses from the informationlimited model $p(\mathbf{\mathbf{n}}\mid x)$. Figure 6A (purple) shows that the FI of the learned CM $q(\mathbf{\mathbf{n}}\mid x)$ appears to saturate near the precision of the sensory noise, indicating that the learned CM approximates the informationlimiting correlations present in $p(\mathbf{\mathbf{n}}\mid x)$.
To understand how the learned CM approximates these informationlimiting correlations, we study the relation between the latent structure of the model and how it generates population activity. For an IPCM, the orientationdependent indexprobabilities may be expressed as $p(k\mid x)\propto {e}^{{\mathit{\theta}}_{K}\cdot \mathit{\delta}(k)+\sum _{i=1}^{{d}_{N}}{\mu}_{ik}(x)}$, where ${\mu}_{ik}(x)$ is the tuning curve of the $i$ th neuron under component $k$. In Figure 6B, we plot the sum of the tuning curves ${\sum}_{i=1}^{{d}_{N}}{\mu}_{ik}(x)$ for each component $k$ as a function of orientation, and we see that each component concentrates the tuning of the population around a particular orientation. This encourages the probability of each component to also concentrate around a particular orientation, and in Figure 6C we see that, given the true orientation $x={90}^{\circ}$, there are three components with probabilities substantially greater than 0.
Because there are essentially three components that are relevant to the responses of the IPCM to the true orientation $x={90}^{\circ}$, generating a response from the CM approximately reduces to generating a response from one of the three possible component IP distributions. In Figure 6D–F, we depict a response to $x={90}^{\circ}$ from each of the three component IP distributions, as well as the optimal posterior based on the learned IPCM (purple lines), and a suboptimal posterior based on the source model (i.e. ignoring noise correlations; green lines). We observe that the trialtotrial variability of the learned IPCM results in random shifts of the peak neural activity away from the true orientation, thus limiting information. Furthermore, when the response of the population is concentrated at the true orientation (Figure 6E), the suboptimal posterior assigns a high probability to the true orientation, whereas when the responses are biased away from the true orientation (Figure 6D and F) the suboptimal posterior assigns nearly 0 probability to the true orientation. This is in contrast to the optimal posterior, which always assigns a significant probability to the true orientation.
In summary, CMs can effectively approximate informationlimiting correlations, and the simple latent structure of CMs could help reveal the presence of informationlimiting correlations in data.
Discussion
We introduced a latent variable exponential family formulation of Poisson mixtures. We showed how this formulation allows us to effectively extend Poisson mixtures both to capture subPoisson variability, and to incorporate stimulus dependence using conditional mixtures. Our analyses and simulations showed that these conditional mixtures (CMs) can be fit efficiently and recover ground truth models in synthetic data, capture a wide range of V1 response statistics in real data, and can be easily inverted to obtain accurate Bayesian decoding that is competitive with nonlinear decoders, while using orders of magnitude less parameters. In addition, we illustrated how the latent structure of CMs can represent a fundamental feature of the neural code, namely informationlimiting correlations.
Our framework is particularly relevant for probabilistic theories of neural coding based on the theory of exponential families (Beck et al., 2007), which include theories that address the linearity of Bayesian inference in neural circuits (Ma et al., 2006), the role of phenomena such as divisive normalization in neural computation (Beck et al., 2011a), Bayesian inference about dynamic stimuli (Makin et al., 2015; Sokoloski, 2017), and the metabolic efficiency of neural coding (Ganguli and Simoncelli, 2014; Yerxa et al., 2020). These theories have proven difficult to validate quantitatively with neural data due to a lack of statistical models which are both compatible with their exponential family formulation (see Equation 4), and can model correlated activity in recordings of large neural populations. Our work suggests that CMs can overcome these difficulties, and help connect this rich mathematical theory of neural coding with the stateoftheart in parallel recording technologies.
CMs are not limited to modelling neural responses to stimuli and can model how arbitrary experimental variables modulate neural variability and covariability. Examples of experimental variables that have measurable effects on neural covariability include the spatial and temporal context around a stimulus (Snyder et al., 2014; Snow et al., 2016; Snow et al., 2017; Festa et al., 2020), as well as taskvariables and the attentional state of the animal (Cohen and Maunsell, 2009; Mitchell et al., 2009; Ruff and Cohen, 2014; Maunsell, 2015; Rabinowitz et al., 2015; Verhoef and Maunsell, 2017; Bondy et al., 2018). Each of these variables could be incorporated into a CM by either replacing the stimulusvariable in our equations with the variable of interest, or combining it with the stimulusvariable to construct a CM with multivariate dependence. This would allow researchers to explore how the stimulus and the experimental variables mutually interact to shape variability and covariability in large populations of neurons.
To understand how this variability and covariability effects neural coding, latent variable models such as CMs are often applied to extract interpretable features of the neural code from data (Whiteway and Butts, 2019). The latent states of a CM provide a soft classification of neural activity, and we may apply CMs to model how an experimental variable modulates the class membership of population activity. In the studies on experimental variables listed above, models of neural activity yielded predictions of perceptual and behavioural performance. Because CMs support Bayesian decoding, a CM can also make predictions about how a particular class of neurons is likely to modulate perception and behaviour, and we may then test these predictions with experimental interventions on the neurons themselves (Panzeri et al., 2017). In this manner, we believe CMs could form a critical part of a rigorous, Bayesian framework for ‘cracking the neural code’ in large populations of neurons.
Outside of the framework of mixture models, there are broader possibilities for designing conditional, latentvariable models which have the minimal, exponential family structure of Equation 3, yet for which the latent variable is not a finite index. We make use of finite mixture models in this paper primarily because mixture models are analytically tractable, even when mixing Poisson distributions. In contrast, models with Gaussian latent variables are analytically tractable when the observations are also Gaussian, but not in general. Nevertheless, if the relevant formulae and computations can be effectively approximated, then many of the advantages of CMs could be preserved even when using continuous latent variables. For example, if the expectation step in our EM algorithm does not have a closedform expression, it might be possible to approximate it with contrastive divergence (Hinton, 2002).
In our applications, we considered onedimensional stimuli and implemented the stimulusdependence of the CM parameters with linearly parameterized functions. Nevertheless, this stimulus dependence can be implemented by arbitrary parametric functions of highdimensional variables such as deep neural networks, and CMs can also incorporate historydependence via recurrent connectivity (see Appendix 4). As such, CMs have the potential to integrate encoding models of higher cortical areas (Yamins et al., 2014) with models of the temporal features of the neural code (Pillow et al., 2008; Park et al., 2014; Runyan et al., 2017), towards analyzing the neural code in dynamic, correlated neural populations in higher cortex. Finally, outside of neuroscience, highdimensional count data exists in many fields such as corpus linguistics and genomics (Inouye et al., 2017), and researchers who aim to understand how this data depends on history or additional variables could benefit from our techniques.
Materials and methods
Notation
Request a detailed protocolWe use capital, bold letters (e.g. $\mathbf{\mathbf{\Theta}}$) to indicate matrices; small, bold letters (e.g. $\mathit{\theta}$) to indicate vectors; and regular letters (e.g. $\theta$) to indicate scalars. We use subscript capital letters to indicate the role of a given variable, so that, in Equation 1 for example, $\mathit{\theta}}_{K$ are the natural parameters that bias the indexprobabilities, $\mathit{\theta}}_{N$ are the baseline natural parameters of the neural firing rates, and ${\mathbf{\mathbf{\Theta}}}_{NK}$ is the matrix of parameters through which the indices and rates interact.
We denote the $i$ th element of a vector $\mathit{\theta}$ by $\theta}_{i$, or e.g. of the vector $\mathit{\theta}}_{K$ by $\theta}_{K,i$. We denote the $i$ th row or $j$ th column of $\mathbf{\mathbf{\Theta}}$ by $\mathit{\theta}}_{i$ or $\mathit{\theta}}_{j$, respectively, and always state whether we are considering a row or column of the given matrix. When referring to the $j$th element of a vector $\mathit{\theta}}_{i$ indexed by $i$, we write $\theta}_{ij$. Finally, when indexing data points from a sample, or parameters that are tied to individual data points, we use parenthesized, superscript letters, e.g. ${x}^{(i)}$, or $\mathit{\theta}}_{N}^{(i)$.
Poisson mixtures and their moments
Request a detailed protocolThe following derivations were presented in a more general form in Karlis and Meligkotsidou, 2007, but we present the simpler case here for completeness. A Poisson distribution has the form $p(\mathit{n};\mathit{\lambda})=\frac{{\lambda}^{n}{e}^{\lambda}}{n!}$, where $n$ is the count and λ is the rate (in our case, spike count and firing rate, respectively). We may use a Poisson model to define a distribution over ${d}_{N}$ spike counts $\mathit{n}=({n}_{1},\dots ,{n}_{{d}_{N}})$ by supposing that the neurons generate spikes independently of one another, leading to the independent Poisson model $p(\mathit{n};\mathit{\lambda})=\prod _{i=1}^{{d}_{N}}p({n}_{i};{\lambda}_{i})$ with firing rates $\mathit{\lambda}=({\lambda}_{1},\dots ,{\lambda}_{{d}_{N}})$. Finally, if we consider the ${d}_{K}$ rate vectors $\mathit{\lambda}}_{1},\dots ,{\mathit{\lambda}}_{{d}_{K}$, and ${d}_{K}$ weights ${w}_{1},\mathrm{\dots},{w}_{{d}_{K}}$, where $0\le {w}_{k}$ for all $k$, and ${w}_{1}=1{\sum}_{k=2}^{{d}_{K}}{w}_{k}$, we then define a mixture of Poisson distributions as a latent variable model $p(\mathit{n})=\sum _{k}p(\mathit{n}\mid k)p(k)=\sum _{k}p(n,k)$, where $p(\mathit{n}\mid k)=p(\mathit{n};{\mathit{\lambda}}_{k})$, and $p(k)={w}_{k}$.
The mean ${\mu}_{i}$ of the $i$ th neuron of a mixture of independent Poisson distributions is
The variance ${\sigma}_{i}^{2}$ of neuron $i$ is
where ${\sigma}_{ik}^{2}={\lambda}_{ik}$ is the variance of the $i$ th neuron under the $k$ th component distribution, that is the variance of $p({n}_{i}\mid k)$, and where ${\sum}_{{n}_{i}=0}^{\mathrm{\infty}}p({n}_{i}\mid k){n}_{i}^{2}={\sigma}_{ik}^{2}+{\lambda}_{ik}^{2}$, and ${\sum}_{k=1}^{{d}_{K}}{w}_{k}{\lambda}_{ik}^{2}{\mu}_{i}^{2}={\sum}_{k=1}^{{d}_{K}}{w}_{k}{({\lambda}_{ik}{\mu}_{i})}^{2}$ both follow from the fact that a distribution’s variance equals the difference between its second moment and squared first moment.
The covariance ${\sigma}_{ij}^{2}$ between spikecounts n_{i} and n_{j} for $i\ne j$ is then
Observe that if ${w}_{k}=\frac{1}{{d}_{K}1}$, then ${\sigma}_{ij}^{2}$ is simply the sample covariance between $i$ and $j$, where the sample is composed of the rate components of the $i$ th and $j$ th neurons. Equation 7 thus implies that Poisson mixtures can model arbitrary covariances. Nevertheless, Equation 6 shows that the variance of individual neurons is restricted to being larger than their means.
Exponential family mixture models
Request a detailed protocolIn this section, we show that the latent variable form for Poisson mixtures we introduced above is a member of the class of models known as exponential families. An exponential family distribution $p(x)$ over some data $x$ has the form $p(x)={e}^{\mathit{\theta}\cdot \mathbf{s}(x)\psi (\mathit{\theta}})b(x)$, where $\mathit{\theta}$ are the socalled natural parameters, $\mathbf{s}(x)$ is a vectorvalued function of the data called the sufficient statistic, $b(x)$ is a scalarvalued function called the base measure, and $\psi (\mathit{\theta})=\mathrm{log}\int {e}^{\mathit{\theta}\cdot \mathbf{s}(x)}b(x)dx$ is the logpartition function (Wainwright and Jordan, 2008). In the context of Poisson mixture models, we note that an independent Poisson model $p(\mathbf{n};\mathit{\lambda})$ is an exponential family, with natural parameters $\mathit{\theta}}_{N$ given by $\theta}_{N,i}=\mathrm{log}{\lambda}_{i$, base measure $b(\mathit{n})=\frac{1}{\prod _{i}n!}$ and sufficient statistic $\mathit{s}}_{N}(\mathit{n})=\mathit{n$, and logpartition function $\psi}_{N}({\mathit{\theta}}_{N})=\sum _{i=1}^{{d}_{N}}{e}^{{\theta}_{N,i}$. Moreover, the distribution of component indices $p(k)={w}_{k}$ (also known as a categorical distribution) also has an exponential family form, with natural parameters $\theta}_{K,k}=\mathrm{log}\frac{{w}_{k+1}}{{w}_{1}$ for $1\le k<{d}_{K}$, sufficient statistic $\mathit{\delta}(k)=({\delta}_{2}(k),\dots ,{\delta}_{{d}_{K}}(k))$, base measure $b(k)=1$, and logpartition function ${\psi}_{K}({\mathit{\theta}}_{K})=\mathrm{log}(1+\sum _{k=1}^{{d}_{K}1}{e}^{{\theta}_{K,k}})$. Note that in both cases, the exponential parameters are welldefined only if the rates and weights are strictly greater than 0 — in practice, however, this is not a significant limitation.
We claim that the joint distribution of a multivariate Poisson mixture model $p(\mathbf{\mathbf{n}},k)$ can be reparameterized in the exponential family form
where $\psi}_{NK}({\mathit{\theta}}_{N},{\mathit{\theta}}_{K},{\mathbf{\Theta}}_{NK})=\mathrm{log}\sum _{k}{e}^{{\mathit{\theta}}_{k}\cdot \mathit{\delta}(k)+{\psi}_{N}({\mathit{\theta}}_{N}+{\mathbf{\Theta}}_{NK}\cdot \mathit{\delta}(k))$ is the logpartition function of $p(\mathbf{\mathbf{n}},k)$. To show this, we show how to express the natural parameters $\mathit{\theta}}_{N},{\mathit{\theta}}_{K$, and $\mathbf{\Theta}}_{NK$ as (invertible) functions of the component rate vectors $\mathit{\lambda}}_{1},\dots ,{\mathit{\lambda}}_{{d}_{K}$, and the weights ${w}_{1},\mathrm{\dots},{w}_{{d}_{K}}$. In particular, we set
where $\mathrm{log}$ is applied elementwise. Then, for $1\le k<{d}_{K}$, we set the $k$ th row $\mathit{\theta}}_{NK,k$ of $\mathbf{\Theta}}_{NK$ to
and the $k$ th element of $\mathit{\theta}}_{K$ to
This reparameterization may then be checked by substituting Equations 9, 10, and 11 into Equation 8 to recover the joint distribution of the mixture model $p(\mathbf{n},k)=p(\mathbf{n}\mid k)p(k)={w}_{k}p(\mathbf{n};{\mathit{\lambda}}_{K})$; for a more explicit derivation see Sokoloski, 2019.
The equation for $p(\mathbf{\mathbf{n}},k)$ ensures that the indexprobabilities are given by
Consequently, the component distributions in exponential family form are given by
Observe that $p(\mathbf{\mathbf{n}}\mid k)$ is a multivariate Poisson distribution with parameters ${\mathit{\theta}}_{N}+{\mathbf{\Theta}}_{NK}\cdot \mathit{\delta}(k)$, so that for $k>1$, the parameters are the sum of $\mathit{\theta}}_{N$ and row $k1$ of $\mathbf{\Theta}}_{NK$. Because the exponential family parameters are the logarithms of the firing rates of $\mathbf{\mathbf{n}}$, each row of $\mathbf{\Theta}}_{NK$ modulates the firing rates of $\mathbf{\mathbf{n}}$ multiplicatively. When ${\mathit{\theta}}_{N}(x)$ depends on a stimulus and we consider the component distributions $p(\mathbf{\mathbf{n}}\mid x,k)$, each row of $\mathbf{\Theta}}_{NK$ then scales the tuning curves of the baseline population (i.e. $p(\mathbf{\mathbf{n}}\mid x,k)$ for $k=1$); in the neuroscience literature, such scaling factors are typically referred to as gain modulations.
The exponential family form has many advantages. However, it has a less intuitive relationship with the statistics of the model such as the mean and covariance. The most straightforward method to compute these statistics given a model in exponential family form is to first reparameterize it in terms of the weights and component rates, and then evaluate Equations 5, 6, and 7.
CoMPoisson distributions and their mixtures
Request a detailed protocolConwayMaxwell (CoM) Poisson distributions decouple the location and shape of count distributions (Shmueli et al., 2005; Stevenson, 2016; Chanialidis et al., 2018). A CoM Poisson model has the form $p(\mathit{n};\lambda ,\nu )\propto {(\frac{{\lambda}^{n}}{n!})}^{\nu}$. The floor function $\lfloor \lambda \rfloor $ of the location parameter λ is the mode of the given distribution. With regards to the shape parameter ν, $p(n;\lambda ,\nu )$ is a Poisson distribution with rate λ when $\nu =1$, and is under or overdispersed when $\nu >1$ or $\nu <1$, respectively. A CoMPoisson model $p(n;\lambda ,\nu )$ is also an exponential family, with natural parameters ${\mathit{\theta}}_{C}=(\nu \mathrm{log}\lambda ,\nu )$, sufficient statistic ${\mathbf{\mathbf{s}}}_{C}(n)=(n,\mathrm{log}n!)$, and base measure $b(n)=1$. The logpartition function does not have a closedform expression, but it can be effectively approximated by truncating the series $\sum _{n=0}^{\mathrm{\infty}}{e}^{{\mathbf{s}}_{C}(n)\cdot {\mathit{\theta}}_{C}}$ (Shmueli et al., 2005). More generally, when we consider a product of independent CoMPoisson distributions, we denote its logpartition function by $\psi}_{C}({\mathit{\theta}}_{N},{\mathit{\theta}}_{N}^{\ast})=\sum _{i=1}^{{d}_{N}}\mathrm{log}\sum _{n=0}^{\mathrm{\infty}}{e}^{n{\theta}_{N,i}+\mathrm{log}(n)!{\theta}_{N,i}^{\ast}$, where ${\mathit{\theta}}_{C,i}=({\theta}_{N,i},{\theta}_{N,i}^{\ast})$ are the parameters of the $i$ th CoMPoisson distribution. In this case we can also approximate the logpartition function ${\psi}_{C}$ by truncating the ${d}_{N}$ constituent series ${\sum}_{n=0}^{\mathrm{\infty}}{e}^{n{\theta}_{N,i}+\mathrm{log}(n)!{\theta}_{N,i}^{*}}$ in parallel.
We define a multivariate CoMbased (CB) mixture as
where $\mathbf{\mathbf{l}\mathbf{f}}(\mathbf{\mathbf{n}})=(\mathrm{log}({n}_{1}!),\mathrm{\dots},\mathrm{log}({n}_{{d}_{N}}!))$ is the vector of logfactorials of the individual spikecounts, and $\psi}_{CK}({\mathit{\theta}}_{N},{\mathit{\theta}}_{N}^{\ast},{\mathit{\theta}}_{K},{\mathbf{\Theta}}_{NK})=\mathrm{log}\sum _{k}{e}^{{\mathit{\theta}}_{k}\cdot \mathit{\delta}(k)+{\psi}_{C}({\mathit{\theta}}_{N}+{\mathbf{\Theta}}_{NK}\cdot \mathit{\delta}(k),{\mathit{\theta}}_{N}^{\ast})$ is the logpartition function. This form ensures that the indexprobabilities satisfy
and consequently that each component distribution $p(\mathbf{\mathbf{n}}\mid k)$ is a product of independent CoM Poisson distributions given by
Observe that, whereas the parameters ${\mathit{\theta}}_{N}+{\mathbf{\Theta}}_{NK}\cdot \mathit{\delta}(k)$ of $p(\mathbf{\mathbf{n}}\mid k)$ depend on the index $k$, the parameters $\mathit{\theta}}_{N}^{\ast$ of $p(\mathbf{\mathbf{n}}\mid k)$ are independent of the index and act exclusively as biases. Therefore, realizing different indices $k$ has the effect increasing or decreasing the location parameters, and thus the modes of the corresponding CoMPoisson distributions. As such, although the different components of a CB mixture are not simply rescaled versions of the first component $p(\mathbf{\mathbf{n}}\mid k=1)$, in practice they behave approximately in this manner.
The moments of a CoMPoisson distribution are not available in closedform, yet they can also be effectively approximated through truncation. We begin by computing approximate means ${\mu}_{ik}$ and variances ${\sigma}_{ik}^{2}$ of $p({n}_{i}\mid k)$ through truncation, and then the mean of n_{i} is ${\mu}_{i}={\sum}_{k=1}^{{d}_{K}}p(k){\mu}_{ik}$, and its variance is
where ${\overline{\sigma}}_{i}^{2}={\sum}_{k=1}^{{d}_{K}}p(k){\sigma}_{ik}^{2}$. Similarly to Equation 7, the covariance ${\sigma}_{ij}$ between n_{i} and n_{j} is ${\sigma}_{ij}={\sum}_{k=1}^{{d}_{K}}p(k)({\mu}_{ik}{\mu}_{i})({\mu}_{jk}{\mu}_{j})$.
By comparing Equations 6 and 17, we see that the CB mixture may address the limitations on the variances ${\sigma}_{i}^{2}$ of the IP mixture by setting the average variance ${\overline{\sigma}}_{i}^{2}$ of the components in Equation 17 to be small, while holding the value of the means ${\mu}_{i}$ fixed, and ensuring that the means of the components ${\mu}_{ik}$ cover a wide range of values to achieve the desired values of ${\sigma}_{i}^{2}$ and ${\sigma}_{ij}$. Solving the parameters of a CB mixture for a desired covariance matrix is unfortunately not possible since we lack closedform expressions for the means and variances. Nevertheless, we may justify the effectiveness of the CB strategy by considering the approximations of the components means and variances ${\mu}_{ik}\approx {\lambda}_{ik}+\frac{1}{2{\nu}_{ik}}\frac{1}{2}$ and ${\sigma}_{ik}^{2}\approx \frac{{\lambda}_{ik}}{{\nu}_{ik}}$, which hold when neither ${\lambda}_{ik}$ or ${\nu}_{ik}$ are too small (Chanialidis et al., 2018). Based on these approximations, observe that when ${\nu}_{ik}$ is large, ${\sigma}_{ik}^{2}$ is small, whereas ${\mu}_{ik}$ is more or less unaffected. Therefore, in the regime where these approximations hold, a small value for ${\overline{\sigma}}_{i}^{2}$ can be achieved by reducing the parameters ${\nu}_{ik}$, without significantly restricting the values of ${\mu}_{ik}$ or ${\mu}_{i}$.
Fisher information of a minimal CM
Request a detailed protocolThe Fisher information (FI) of an encoding model $p(\mathbf{\mathbf{n}}\mid x)$ with respect to $x$ is $I(x)={\sum}_{\mathbf{\mathbf{n}}}p(\mathbf{\mathbf{n}}\mid x){({\partial}_{x}\mathrm{log}p(\mathbf{\mathbf{n}}\mid x))}^{2}$(Cover and Thomas, 2006). With regard to the FI of a minimal CM,
where ${\mathrm{\partial}}_{x}{\psi}_{CK}({\mathit{\theta}}_{N}(x),{\mathit{\theta}}_{N}^{\ast},{\mathit{\theta}}_{K},{\mathbf{\Theta}}_{NK})={\mathit{\mu}}_{N}(x)\cdot {\mathrm{\partial}}_{x}{\mathit{\theta}}_{N}(x)$ follows from the chain rule and properties of the logpartition function (Wainwright and Jordan, 2008). Therefore
where ${\mathbf{\mathbf{\Sigma}}}_{N}(x)$ is the covariance matrix of $p(\mathbf{\mathbf{n}}\mid x)$. Moreover, because ${\mathrm{\partial}}_{x}{\mathit{\theta}}_{N}(x)={\mathbf{\Sigma}}_{N}^{1}(x)\cdot {\mathrm{\partial}}_{x}\mathit{\mu}(x)$ (Wainwright and Jordan, 2008), the FI of a minimal CM may also be expressed as $I(x)={\partial}_{x}{\mathit{\bm{\mu}}}_{N}(x)\cdot {\mathbf{\mathbf{\Sigma}}}_{N}^{1}(x)\cdot {\partial}_{x}{\mathit{\bm{\mu}}}_{N}(x)$, which is the linear Fisher information (Beck et al., 2011b).
Note that when calculating the FI or other quantities based on the covariance matrix, IPCMs have the advantage that their covariance matrices tend to have large diagonal elements and are thus inherently wellconditioned. Because decoding performance is not significantly different between IP and CBCMs (see Table 3), IPCMs may be preferable when wellconditioned covariance matrices are critical. Nevertheless, the covariance matrices of CB mixtures can be made wellconditioned by applying standard techniques.
Expectationmaximization for CMs
Request a detailed protocolExpectationmaximization (EM) is an algorithm that maximizes the likelihood of a latent variable model given data by iterating two steps: generating modelbased expectations of the latent variables, and maximizing the complete loglikelihood of the model given the data and latent expectations. Although the maximization step optimizes the complete loglikelihood, each iteration of EM is guaranteed to not decrease the data loglikelihood as well (Neal and Hinton, 1998).
EM is arguably the most widely applied algorithm for fitting finite mixture models (McLachlan et al., 2019). As a form of latent variable exponential family, the expectation step for a finite mixture model reduces to computing average sufficient statistics, and the maximization step is a convex optimization problem (Wainwright and Jordan, 2008). In general, the average sufficient statistics, or mean parameters, correspond to (are dual to) the natural parameters of an exponential family, and where we denote natural parameters with θ, we denote their corresponding mean parameters with η.
Suppose we are given a dataset $({\mathbf{\mathbf{n}}}^{(1)},\mathrm{\dots},{\mathbf{\mathbf{n}}}^{({d}_{T})})$ of neural spikecounts, and a CB mixture with natural parameters $\mathit{\theta}}_{N$, $\mathit{\theta}}_{N}^{\ast$, $\mathit{\theta}}_{K$, and ${\mathbf{\mathbf{\Theta}}}_{NK}$ (see Equation 14). The expectation step for this model reduces to computing the datadependent mean parameters ${\mathit{\bm{\eta}}}_{K}^{(i)}$ given by
for all $0<i\le {d}_{T}$. The mean parameters ${\mathit{\bm{\eta}}}_{K}^{(i)}$ are the averages of the sufficient statistic ${\mathit{\bm{\delta}}}_{k}(k)$ under the distribution $p(k\mid {\mathbf{\mathbf{n}}}^{(i)})$, and are what we use to complete the loglikelihood since we do not observe $k$.
Given ${\mathit{\bm{\eta}}}_{K}^{(i)}$, the maximization step of a CB mixture thus reduces to maximizing the complete loglikelihood $\sum _{i=1}^{{d}_{T}}\mathcal{\mathcal{L}}({\mathit{\theta}}_{K},{\mathit{\theta}}_{N},{\mathit{\theta}}_{N}^{\ast},{\mathbf{\Theta}}_{NK},{\mathit{\eta}}_{K}^{(i)},{\mathbf{n}}^{(i)})$, where we substitute $\mathit{\eta}}_{K}^{(i)$ into the place of $\mathit{\delta}(k)$ in Equation 14, such that
This objective may be maximized in closedform for an IP mixture (Karlis and Meligkotsidou, 2007), but this is not the case when the model has CoMPoisson shape parameters or depends on the stimulus. Nevertheless, solving the resulting maximization step is still a convex optimization problem (Wainwright and Jordan, 2008), and may be approximately solved with gradient ascent. Doing so requires that we first compute the mean parameters $\mathit{\eta}}_{N$, ${\mathit{\bm{\eta}}}_{N}^{*}$, ${\mathit{\bm{\eta}}}_{K}$, and ${\mathbf{\mathbf{H}}}_{NK}$ that are dual to $\mathit{\theta}}_{N$, $\mathit{\theta}}_{N}^{\ast$, $\mathit{\theta}}_{K$, and ${\mathbf{\mathbf{\Theta}}}_{NK}$, respectively.
We compute the mean parameters by evaluating
where ${\eta}_{K,k}$ is the $k$ th element of ${\mathit{\bm{\eta}}}_{K}$, $\mathit{\eta}}_{N,j$ is the $j$ th element of ${\mathit{\bm{\eta}}}_{N}$, ${\eta}_{N,j}^{*}$ is the $j$ th element of ${\mathit{\bm{\eta}}}_{N}^{*}$, and ${\eta}_{NK,jk}$ is the $j$ th element of the $k$ th column of ${\mathbf{\mathbf{H}}}_{NK}$. Note as well that we truncate the series ${\sum}_{{n}_{j}}{n}_{j}p({n}_{j}\mid k)$ and ${\sum}_{{n}_{j}}\mathrm{log}{n}_{j}!p({n}_{j}\mid k)$ to approximate ${\mu}_{jk}$ and ${\eta}_{N,j}^{*}$. Given these mean parameters, we may then express the gradients of ${\mathcal{\mathcal{L}}}^{(i)}=\mathcal{\mathcal{L}}({\mathit{\theta}}_{K},{\mathit{\theta}}_{N},{\mathit{\theta}}_{N}^{\ast},{\mathbf{\Theta}}_{NK},{\mathit{\eta}}_{K,i},{\mathbf{n}}^{(i)})$ as
where ⊗ is the outer product operator, and where the second term in each equation follows from the fact that the derivative of ${\psi}_{CK}$ with respect to $\mathit{\theta}}_{N$, $\mathit{\theta}}_{N}^{\ast$, $\mathit{\theta}}_{K$, or $\mathbf{\Theta}}_{NK$ yields the dual parameters $\mathit{\eta}}_{N$, ${\mathit{\bm{\eta}}}_{N}^{*}$, $\mathit{\eta}}_{K$, and ${\mathbf{\mathbf{H}}}_{NK}$, respectively. By ascending the gradients of ${\sum}_{i=1}^{{d}_{T}}{\mathcal{L}}^{(i)}$ until convergence, we approximate a single iteration of the EM algorithm for a CB mixture.
Finally, if our dataset $(({\mathbf{\mathbf{n}}}^{(1)},{x}^{(1)}),\mathrm{\dots},({\mathbf{\mathbf{n}}}^{({d}_{T})},{x}^{({d}_{T})}))$ includes stimuli $x$, and the parameters $\mathit{\theta}}_{N$ depend on the stimulus, then the gradients of the parameters of ${\mathit{\bm{\theta}}}_{N}$ must also be computed. For a von Mises CM where ${\mathit{\theta}}_{N}(x)={\mathit{\theta}}_{N}^{0}+{\mathbf{\Theta}}_{NX}\cdot \mathbf{v}\mathbf{m}(x)$, the gradients are given by
where ${\mathit{\theta}}_{N}^{(i)}={\mathit{\theta}}_{N}({x}^{(i)})$ is the output of $\mathit{\theta}}_{N$ at ${x}^{(i)}$. Although in this paper we restrict our applications to Von Mises or discrete tuning curves for onedimensional stimuli, this formalism can be readily extended to the case where the baseline parameters ${\mathit{\theta}}_{N}(x)$ are a generic nonlinear function of the stimulus, represented by a deep neural network. Then, the gradients of the parameters of $\mathit{\theta}}_{N$ can be computed through backpropagation, and $\mathrm{\partial}}_{{\mathit{\theta}}_{N}^{(i)}}{\mathcal{\mathcal{L}}}^{(i)$ is the error that must be backpropagated through the network to compute the gradients.
If we ignore stimulus dependence, the single most computationally intensive operation in each gradient ascent step is the computation of the outer product when evaluating ${\partial}_{{\mathbf{\mathbf{\Theta}}}_{NK}}{\mathcal{L}}^{(i)}$, which has a time complexity of $\mathcal{O}({d}_{K}{d}_{N})$. As such, the training algorithm scales linearly in the number of neurons, and CMs could realistically be applied to populations of tens to hundreds of thousands of neurons. That being said, larger values of ${d}_{K}$ will typically be required to maximize performance in larger populations, and fitting the model to larger populations typically requires larger datasets and more EM iterations.
CM initialization and training procedures
Request a detailed protocolTo fit a CM to a dataset $(({\mathbf{\mathbf{n}}}^{(1)},{x}^{(1)}),\mathrm{\dots},({\mathbf{\mathbf{n}}}^{({d}_{T})},{x}^{({d}_{T})}))$, we first initialize the CM and then optimize its parameters with our previously described EM algorithm. Naturally, initialization depends on exactly which form of CM we consider, but in general we first initialize the baseline parameters $\mathit{\theta}}_{N$, then add the categorical parameters $\mathit{\theta}}_{K$ and mixture component parameters ${\mathbf{\mathbf{\Theta}}}_{NK}$. When training CBCMs we always first train an IPCM, and so the initialization procedure remains the same for IP and CB models.
To initialize a von Mises CM with ${d}_{N}$ neurons, we first fit ${d}_{N}$ independent, von Misestuned neurons by maximizing the loglikelihood $\sum _{i=1}^{{d}_{T}}\mathrm{log}p({\mathbf{n}}^{(i)}\mid {x}^{(i)})$ of ${\mathit{\theta}}_{N}(x)={\mathit{\theta}}_{N}^{0}+{\mathbf{\Theta}}_{NX}\cdot \mathbf{v}\mathbf{m}(x)$. This is a convex optimization problem and so can be easily solved by gradient ascent, in particular by following the gradients
to convergence. For both discrete and maximal CMs, where there are ${d}_{X}$ distinct stimuli, we initialize ${\mathit{\theta}}_{N}(x)={\mathit{\theta}}_{N}^{0}+{\mathbf{\Theta}}_{NX}\cdot \mathit{\delta}(x)$ by computing the average rate vector at each stimuluscondition and creating a lookup table for these rate vectors. Formally, where x_{l} is the $l$ th stimulus value for $0<l\le {d}_{X}$, we may express the $l$ th rate vector as ${\mathit{\bm{\lambda}}}_{l}=\frac{1}{{\sum}_{i=1}^{{d}_{T}}\delta ({x}_{l},{x}^{(i)})}{\sum}_{i=1}^{{d}_{T}}\delta ({x}_{l},{x}^{(i)}){\mathbf{\mathbf{n}}}^{(i)}$, where $\delta ({x}_{l},{x}^{(i)})$ is one when ${x}_{l}={x}^{(i)}$, and 0 otherwise. We then construct a lookup table for these rate vectors in exponential family form by setting $\mathit{\theta}}_{N}^{0}=\mathrm{log}{\mathit{\lambda}}_{1$, and by setting the $l$ th row $\mathit{\theta}}_{NX,l$ of $\mathbf{\Theta}}_{NX$ to $\mathit{\theta}}_{NX,l}=\mathrm{log}{\mathit{\lambda}}_{l+1}\mathrm{log}{\mathit{\lambda}}_{1$.
In general, we initialize the parameters $\mathit{\theta}}_{K$ by sampling the weights ${w}_{1},\mathrm{\dots},{w}_{{d}_{K}}$ of a categorical distribution from a Dirichlet distribution with a constant concentration of 2, and converting the weights into the natural parameters of a categorical distribution $\mathit{\theta}}_{K$. For discrete and maximal CMs, we initialize the modulations $\mathbf{\Theta}}_{NK$ by generating each element of ${\mathbf{\mathbf{\Theta}}}_{NK}$ from a uniform distribution over the range $[0.0001,0.0001]$. For von Mises CMs we initialize each row $\mathit{\theta}}_{NK,k$ of ${\mathbf{\mathbf{\Theta}}}_{NK}$ as shifted sinusoidal functions of the preferred stimuli of the independent von Mises neurons. That is, given $\mathit{\theta}}_{N}^{0$ and ${\mathbf{\mathbf{\Theta}}}_{NX}$, we compute the preferred stimulus of the $i$ th neuron given by ${\rho}_{i}=\mathrm{a}\mathrm{t}\mathrm{a}\mathrm{n}2({\mathit{\theta}}_{N}^{0}+{\mathit{\theta}}_{NX,i})$, where $\mathit{\theta}}_{NX,i$ is the $i$ th row of ${\mathbf{\mathbf{\Theta}}}_{NX}$. We then set the $i$ th element $\theta}_{NK,k,i$ of $\mathit{\theta}}_{NK,k$ to ${\theta}_{NK,k,i}=0.2\mathrm{sin}({\rho}_{i}+{\frac{k}{360}}^{\circ})$. Initializing von Mises CMs in this way ensures that each modulation has a unique peak as a function of preferred stimuli, which helps differentiate the modulations from each other, and in our experience improves training speed.
With regard to training, the expectation step in our EM algorithm may be computed directly, and so the only challenge is solving the maximization step. Although the optimal solution strategy depends on the details of the model and data in question, in the context of this paper we settled on a strategy that is sufficient for all simulations we perform. For each model we perform a total of ${d}_{I}=500$ EM iterations, and for each maximization step we take ${d}_{S}=100$ gradient ascent steps with the Adam gradient ascent algorithm (Kingma and Ba, 2014) with the default momentum parameters (see Kingma and Ba, 2014). We restart the Adam algorithm at each iteration of EM and gradually reduce the learning rate. Where ${\u03f5}^{+}=0.002$ and ${\u03f5}^{}=0.0005$ are the initial and final learning rates, we set the learning rate ${\u03f5}_{t}$ at EM iteration $t$ to
where we assume $t$ starts at 0 and ends at ${d}_{I}1$.
Because we must evaluate large numbers of truncated series when working with CBCMs, training times are typically one to two orders of magnitude greater. To minimize training time of CBCMs over the ${d}_{I}$ EM iterations, we therefore first train a IPCM for $0.8{d}_{I}$ iterations. We then equate the parameters $\mathit{\theta}}_{N$, $\mathit{\theta}}_{K$, and ${\mathbf{\mathbf{\Theta}}}_{NK}$ of the IPCM (see Equation 8) with a CBCM (see Equation 14) and set $\mathit{\theta}}_{N}^{\ast}=\mathbf{1$, which ensures that resulting CB model has the same density function $p(\mathbf{\mathbf{n}},k\mid x)$ as the original IP model. We then train the CBCM for $0.2{d}_{I}$ iterations. We found this strategy results in practically no performance loss, while greatly reducing training time.
Strategies for choosing the CM form and latent structure
Request a detailed protocolThere are a few choices with regards to the form of the model than one must make when applying a CM: The form of the dependence, whether or not to use the CoMbased (CB) extension, and the number of components ${d}_{K}$. The form of the dependence is very openended, yet should be fairly clear from the problem context: one should use a minimal model if one wishes to make use of its mathematical features, and otherwise a maximal model may provide better performance. If one wishes to interpolate between stimulus conditions, or the number of stimulusconditions in the data is high, then a continuous stimulusdependence model (e.g. von Mises tuning curves) should be used, otherwise discrete tuning curves may provide better performance. Finally, if one wishes to model correlations in a complex neural circuit, one may use for example a deep neural network, and induce correlations in the output layer with the theory of CMs.
Similarly, CBCMs have clear advantages for modelling individual variability, and as we show in Appendix 2, this includes higherorder variability. Nevertheless, from the perspective of decoding performance, IPCMs and CBCMs perform moreorless equally well, and training CBCMs is more computationally intensive. As such, IPCMs may often be the better choice.
The number of components ${d}_{K}$ provides a finegrained method of adjusting model performance. If the goal is to maximize predictive encoding performance, then the standard way to do this is to choose a ${d}_{K}$ that maximizes the crossvalidated loglikelihood, as we demonstrated in Figure 5. Nevertheless, one may rather aim to maximize decoding performance, in which case maximizing the crossvalidated logposterior may be a more appropriate objective. In both cases, for very large populations of neurons, choosing a ${d}_{K}$ solely to maximize performance may be prohibitively, computationally expensive. As demonstrated in Figure 5 and Appendix 3, a small ${d}_{K}$ can achieve a large fraction of the performance gain of the optimal ${d}_{K}$, and choosing a modest ${d}_{K}$ that achieves qualitatively acceptable performance may prove to be the most productive strategy.
CM parameter selection for simulations
Request a detailed protocolIn the section Extended Poisson mixture models capture stimulusdependent response statistics and the section Conditional mixtures facilitate accurate and efficient decoding of neural responses we considered minimal CBCMs with randomized parameters ${\mathit{\theta}}_{N}(x)$, $\mathit{\theta}}_{N}^{\ast$, $\mathit{\theta}}_{K$, and ${\mathbf{\mathbf{\Theta}}}_{NK}$, which for simplicity we refer to as models 1 and 2, respectively. We construct randomized CMs piece by piece, in a similar fashion to our initialization procedure.
Firstly, where ${d}_{N}$ is the number of neurons, we tile their preferred stimuli ${\rho}_{i}$ over the circle such that ${\rho}_{i}=\frac{i}{{d}_{N}}{360}^{\circ}$. We then generate the concentration ${\kappa}_{i}$ and gain ${\gamma}_{i}$ of the $i$ th neuron by sampling from normal distributions in logspace, such that $\mathrm{log}{\kappa}_{i}\sim \mathrm{N}(0.1,0.2)$, and $\mathrm{log}{\gamma}_{i}\sim \mathrm{N}(0.2,0.1)$. Finally, for von Mises baseline parameters ${\mathit{\theta}}_{N}(x)={\mathit{\theta}}_{N}^{0}+{\mathbf{\Theta}}_{NX}\cdot \mathbf{v}\mathbf{m}(x)$, we set each row $\mathit{\theta}}_{NX,i$ of ${\mathbf{\mathbf{\Theta}}}_{NX}$ to ${\mathit{\theta}}_{NX,i}=({\kappa}_{i}\mathrm{cos}{\rho}_{i},{\kappa}_{i}\mathrm{sin}{\rho}_{i})$, and each element $\theta}_{N,i}^{0$ of $\mathit{\theta}}_{N}^{0$ to ${\theta}_{N,i}^{0}=\mathrm{log}{\gamma}_{i}{\psi}_{X}({\mathit{\theta}}_{NX,i})$, where ${\psi}_{X}$ is the logarithm of the modified Bessel function of order 0, which is the logpartition function of the von Mises distribution.
We then set $\mathit{\theta}}_{K}=\mathbf{0$, and generated each element $\theta}_{NK,i,k$ of the modulation matrix $\mathit{\theta}}_{NK$ in the same matter as the gains, such that ${\theta}_{NK,i,k}\sim \mathrm{N}(0.2,0.1)$. Finally, to generate random CB parameters we generate each element $\theta}_{N,i}^{\ast$ of $\mathit{\theta}}_{N}^{\ast$ from a uniform distribution, such that ${\mathit{\theta}}_{N,i}^{\ast}\sim \mathrm{U}(1.5,0.8)$.
Model two entails two more steps. Firstly, when sampling from larger populations of neurons, single modulations often dominate the model activity around certain stimulus values. To suppress this we consider the natural parameters ${\mathit{\theta}}_{K}^{0}(x)$ of $p(k\mid x)$ (see Equation 15), and compute the maximum value of these natural parameters over the range of stimuli ${\theta}_{K,k}^{+}={\mathrm{max}}_{x}\{{\theta}_{K,k}^{0}(x)\}$. We then set each element ${\theta}_{K,k}$ of the parameters $\mathit{\theta}}_{K$ of the CM to ${\theta}_{K,k}={\overline{\theta}}_{K}^{+}{\theta}_{K,k}^{+}$, where ${\overline{\theta}}_{K}^{+}={\sum}_{i=1}^{{d}_{K}}\frac{{\theta}_{K,k}}{{d}_{K}}$, which helps ensure that multiple modulations are active at any given $x$. Finally, since model two is a discrete CM, we replace the von Mises baseline parameters with discrete baseline parameters, by evaluating ${\mathit{\theta}}_{N}^{0}+{\mathbf{\Theta}}_{NX}\cdot \mathbf{v}\mathbf{m}(x)$ at each of the ${d}_{X}$ valid stimulusconditions, and assemble the resulting collection of natural parameters into a lookup table in the manner we described in our initialization procedures.
Decoding models
Request a detailed protocolWhen constructing a Bayesian decoder for discrete stimuli, we first estimate the prior $p(x)$ by computing the relative frequency of stimulus presentations in the training data. For the given encoding model, we then evaluate $p(\mathbf{\mathbf{n}}\mid x)$ at each stimulus condition, and then compute the posterior $p(x\mid \mathbf{\mathbf{n}})\propto p(\mathbf{\mathbf{n}}\mid x)p(x)$ by bruteforce normalization of $p(\mathbf{\mathbf{n}}\mid x)p(x)$. When training the encoding model used for our Bayesian encoders, we only trained them to maximize encoding performance as previously described, and not to maximize decoding performance.
We considered two decoding models, namely the linear network and the artificial neural network (ANN) with sigmoid activation functions. In both cases, the input of the network was a neural response vector, and the output the natural parameters $\mathit{\theta}}_{X$ of a categorical distribution. The form of the linear network was $\mathit{\theta}}_{X}(\mathbf{n})={\mathit{\theta}}_{X}+{\mathbf{\Theta}}_{XN}\cdot \mathbf{n$, and is otherwise fully determined by the structure of the data. For the ANN on the other hand, we had to choose both the number of hidden layers, and the number of neurons per hidden layer. We crossvalidated the performance of both 1 and 2 hidden layer models, over a range of sizes from 100 to 2000 neurons. We found the performance of the networks with two hidden layers generally exceeded that of those with one hidden layer, and that 700 and 600 hidden neurons was optimal for the awake and anaesthetized networks, respectively.
Given a dataset $(({\mathbf{\mathbf{n}}}^{(1)},{x}^{(1)}),\mathrm{\dots},({\mathbf{\mathbf{n}}}^{({d}_{T})},{x}^{({d}_{T})}))$, we optimized the linear network and the ANN by maximizing ${\sum}_{i=1}^{{d}_{T}}\mathrm{log}p({x}^{(i)}\mid {\mathbf{\mathbf{n}}}^{(i)})$ via stochastic gradient ascent. We again used the Adam optimizer with default momentum parameters, and used a fixed learning rate of 0.0003, and randomly divided the dataset into minibatches of 500 data points. We also used early stopping, where for each fold of our 10fold crossvalidation simulation, we partitioned the dataset into 80% training data, 10% test data, and 10% validation data, and stopped the simulation when performance on the test data declined from epoch to epoch.
Experimental design
Request a detailed protocolThroughout this paper, we demonstrate our methods on two sets of parallel response recordings in macaque primary visual cortex (V1). The stimuli were drifting full contrast gratings at nine distinct orientations spread evenly over the halfcircle from 0° to 180° (2° diameter, two cycles per degree, 2.5 Hz drift rate). Stimuli were generated with custom software (EXPO by P. Lennie) and displayed on a cathode ray tube monitor (Hewlett Packard p1230; 1024 × 768 pixels, with ~ 40 cd/m^{2} mean luminance and 100 Hz frame rate) viewed at a distance of 110 cm (for anaesthetized dataset) or 60 cm (for awake dataset). Grating orientations were randomly interleaved, each presented for 70 ms (for anaesthetized dataset) or 150 ms (for awake dataset), separated by a uniform gray screen (blank stimulus) for the same duration. Stimuli were centered in the aggregate spatial receptive field of the recorded units.
Neural activity from the superficial layers of V1 was recorded from a 96 channel microelectrode array (400 μm spacing, 1 mm length). (400 μm spacing, 1 mm length). Standard methods for waveform extraction and preprocessing were applied (see Aschner et al., 2018). We computed spike counts in a fixed window with length equal to the stimulus duration, shifted by 50 ms after stimulus onset to account for onset latency. We excluded from further analyses all neurons that were not driven by any stimulus above baseline + 3std.
In the first dataset, the monkey was awake and performed a fixation task. Methods and protocols are as described in Festa et al., 2020. There were ${d}_{T}=3,168$ trials of the responses of ${d}_{N}=43$ neurons in the dataset. We refer to this dataset as the awake V1 dataset.
In the second dataset, the monkey was anaesthetized and there were ${d}_{T}=10,800$ trials of the responses of ${d}_{N}=70$ neurons; we refer to this dataset as the anaesthetized V1 dataset. The protocol and general methods employed for the anaesthetized experiment have been described previously (Smith and Kohn, 2008).
All procedures were approved by the Institutional Animal Care and Use Committee of the Albert Einstein College of Medicine, and were in compliance with the guidelines set forth in the National Institutes of Health Guide for the Care and Use of Laboratory Animals under protocols 20180308 and 20180309 for the awake and anaesthetized macaque recordings, respectively.
Code
Request a detailed protocolAll code used to run the simulations and generate the figures, as well as the awake and anaesthetized datasets, are available at the Git repository https://gitlab.com/sachasokoloski/neuralmixtures (Sokoloski, 2021; copy archived at swh:1:rev:8e82799f8934c47961ea02c5b7c25bd952abb961). Instructions are provided for installation, and scripts are provided that may be run on alternative datasets with a similar structure to what we have considered in this manuscript without modifying the code.
Appendix 1
Comparing conditional mixtures with factor analysis
In Conditional mixtures effectively model neural responses in macaque V1, we assess encoding performance with the crossvalidated, average loglikelihood of the given conditional mixture (CM) on the given dataset. However, in some cases, one might only be concerned with how well a model captures particular statistics of a dataset. In particular, response models based on Gaussian distributions treat spikecounts as continuous values, and assign positive probability to both negative and noninteger values. Although their loglikelihood performance consequently tends to suffer relative to spikecount models, they can still prove highly effective at capturing the mean and covariance of data.
Here, we compare CMs with factor analysis (FA), which is widely applied to modelling neural responses (Santhanam et al., 2009; Cowley et al., 2016; Semedo et al., 2019). FAs model data as Gaussian distributed, and have a latent structure that facilitates both interpretability and predictive performance. The easiest way to design an encoding model based on FA is with a simple lookuptable, so that we fit an independent FA model at each stimulus condition. This is also how we define maximal CMs, and so to keep our comparison straightforward we compare FA encoding models with maximal CMs. In particular, we compare FA to both independent Poisson (IP) and CoMBased (CB) maximal CMs on how well they capture response statistics on the two datasets from the article (anaesthetized and awake macaque V1). In general, we trained the CMs with expectationmaximization (EM) as described in Materials and methods, and the FA model with standard EM.
In Appendix 1—figure 1A and B, we depict scatter plots that compare the data noise correlations from our awake and anaesthetized datasets at the stimulus orientation $x={40}^{\circ}$, to the noise correlations learned by CB and IP mixtures, and FA trained on the complete datasets. Each model was defined with ${d}_{K}=5$ latent states/dimensions. We also state the coefficient of determination ${r}^{2}$ for each model, and see that although all models perform comparably on the anaesthetized data, FA has a clear advantage over the mixture models on the awake data. To see if this performance advantage holds on heldout data, in Appendix 1—figure 1C and D we depict the results of 10fold crossvalidation of the coefficient of determination ${r}^{2}$ between the data noise correlations and the various model noise correlations over all nine stimulus orientations, as a function of the number of latant states/dimensions ${d}_{K}$. We see that the predictive performance advantage of FA on the awake data is small, and that CBCMs exceed the performance of FA on anaesthetized data. At the same time, FA achieves peak performance on both datasets with a smaller number of parameters. Nevertheless, FA is designed precisely to capture secondorder correlations, and that our mixture models achieve comparable performance speaks favourably to the overall strengths of the mixture model approach.
In Appendix 1—figure 2, we depict scatter plots between the data Fano factors (FFs) and learned FFs of our models at stimulus orientation $x={40}^{\circ}$, and find that both the CB mixture and FA almost perfectly capture the data FFs. In Appendix 1—figure 2C–D, we see that the CB mixture and FA also achieve good crossvalidated $r}^{2$ scores on FFs. Unsurprisingly, however, the IP mixture struggles to effectively capture FFs.
Appendix 2
Higher order moments and neural decoding
FAbased encoding models are highly effective at capturing the first and secondorder statistics of neural responses, yet in our simulations we found that Bayesian decoders based on FA encoding models perform poorly when compared to the other decoding models considered in Conditional mixtures facilitate accurate and efficient decoding of neural responses. There we evaluate decoding performance by fitting a candidate model to training data, and computing the mean and standard error of the logposterior at the true stimulus on heldout data. On the awake data FA scores $0.246\pm 0.066$, which is comparable to an optimal linear decoder, yet still significantly worse than a nonlinear decoder, or a Bayesian decoder based on a CM. On the anaesthetized data FA scored $\mathrm{\infty}$, as it would occasionally assign numerically 0 posterior probability to the true stimulus; when we filtered out $\mathrm{\infty}$ values from the average, the FA encoder still only achieved performance of $2.21\pm 0.31$.
Normal distributions — and the FA observable distribution by extension — are essentially defined by their first and secondorder statistics, which suggests that there are higherorder statistics that are important for decoding that FA cannot capture. The third and fourth order statistics known as the skewness and excess kurtosis measure the asymmetry and heavytailedness of a given distribution, respectively. Normal distributions have a skewness and excess kurtosis of 0. Here, we study how well this normality assumption is reflected in our neural recordings, and how well our mixture models capture these higher order statistics.
In Figure 1, we present scatter plots of the empirical skewness and kurtosis of all the neurons in our datasets at orientation $x={40}^{\circ}$, and the model skewness and kurtosis learned by our mixture models. Exactly quantifying the nonnormality of higher order moments in multivariate distributions is a complicated and evolving subject (Mardia and Elatoum, 1976; Cain et al., 2017), nevertheless in Figure 1 the empirical skewness and kurtosis of the recorded neurons appear to qualitatively deviate from zero. On the awake data, both the CB and IP mixture achieve high ${r}^{2}$ when compared with the data skewness (Figure 1A) and kurtosis (Figure 1B), although the CB mixtures achieves notably better performance. On the anaesthetized data (Figure 1C and D), the CB mixture continues to achieve a high ${r}^{2}$, but the IP mixture performs extremely poorly; although the disparity in ${r}^{2}$ is not immediately apparent in the scatter plots, this is because some of the model skewness of the IP mixtures are outside the plot bounds.
When fit to the complete datasets, and averaged over all stimulus conditions, the CB mixtures achieved a skewness ${r}^{2}$ average and standard error of ${r}^{2}=0.87\pm 0.08$ and ${r}^{2}=0.94\pm 0.03$, and a kurtosis ${r}^{2}$ average and standard error of ${r}^{2}=0.61\pm 0.20$ and ${r}^{2}=0.82\pm 0.13$ on the awake and anaesthetized data, respectively; in contrast, the presence of outliers caused the average scores for the IP mixture to be dramatically negative in all cases. These results suggest that the CoMbased parameters of the CB mixture provide important degrees of freedom for capturing individual variability. That being said, when we crossvalidated the ${r}^{2}$ performance on the higherorder moments, the results were not significantly higher than 0 for the CB mixture, and as such, accurately estimating higher order moments requires larger datasets than what we have considered here.
Nevertheless, in spite of the inability to capture predictive performance on these moments, we speculate that when combined across higher order moments and crossmoments, the ability of mixture models to capture higher order structure in the data is necessary for maximizing decoding performance, and that these moments might play an important role in neural coding. As the complexity of neural datasets increases, a careful study of such higherorder statistics would become both feasible and warranted, and our mixture model approach could prove to be a useful tool in such work.
Appendix 3
Sample complexity of conditional mixtures
To develop a sense of the sample complexity of CMs, we repeated the crossvalidation simulations with discrete CMs on subsets of our two datasets (see Conditional mixtures effectively model neural responses in macaque V1 and Conditional mixtures facilitate accurate and efficient decoding of neural responses). In particular, we ran a 10fold crossvalidation simulation on a single subsample of 25%, 50%, 75%, and 100%, of each of our datasets. On our anaesthetized dataset this occasionally resulted in some neurons recording 0 spikes in a giving training set, which tends to cause our training algorithm to diverge, and so we filtered out neurons with low firing rates, leaving 50 neurons in our anaesthetized dataset.
We present the results of our simulations in Figure 1. In the left column (Figure 1A–D), we present the crossvalidated loglikelihood of the vanilla and CoMmodels, on the awake and anaesthetized data, respectively, and we see that, as we would expect, models with fewer components maximize their performance on smaller datasets. Even with large amounts of data, however, the performance difference between models with more than 10 components is nearly statistically indistinguishable. In the right column (Figure 1E–H), we present the crossvalidated logposterior performance of the models, and the results mirror those of the loglikelihood simulations, except the benefits of larger models becomes even more marginal.
Appendix 4
Conditional mixtures and generalized linear models
CMs are closely related to generalized linear models (GLMs), which are widely applied in neuroscience. The application of GLMs to modelling stimulusdriven spiketrains was pioneered in Pillow et al., 2008, in which the authors develop a Poisson encoding model $p({n}_{t}\mid \mathbf{\mathbf{x}},{\mathbf{\mathbf{m}}}_{1},\mathrm{\dots},{\mathbf{\mathbf{m}}}_{{d}_{N}})\propto {e}^{{\theta}_{N}(\mathbf{\mathbf{x}},{\mathbf{\mathbf{m}}}_{1},\mathrm{\dots},{\mathbf{\mathbf{m}}}_{{d}_{N}}){n}_{t}}$, where ${d}_{N}$ is the number of recorded neurons, n_{t} is the spikecount of the modelled neuron in timebin $t$, $\mathbf{\mathbf{x}}$ is the stimulus (here the stimulus is an image and represented as a vector), and where each ${\mathbf{\mathbf{m}}}_{i}$ is the spikecount history of the $i$ th recorded neuron up to time $t1$. The lograte $\theta}_{N$ of the modelled neuron at time $t$ depends linearly on the stimulus and the spikehistory, and is given by
where $\mathbf{\mathbf{k}}$ and ${\mathbf{\mathbf{h}}}_{i}$ are vectors; in Pillow et al., 2008 both $\mathbf{\mathbf{k}}$ and ${\mathbf{\mathbf{h}}}_{i}$ are represented by basis functions with a manageable number of fittable parameters.
This model may be trivially combined with a CM in order to extend the GLM formulation with a latent source of sharedvariability that affords analytic expressions for various quantities of interest. The definition of a CBCM is $p(\mathbf{n},k\mid x)\propto {e}^{{\mathit{\theta}}_{N}(x)\cdot \mathbf{n}+{\mathit{\theta}}_{N}^{\ast}\cdot \mathbf{l}\mathbf{f}(\mathbf{n})+{\mathit{\theta}}_{K}\cdot \mathit{\delta}(k)+\mathbf{n}\cdot {\mathbf{\Theta}}_{NK}\cdot \mathit{\delta}(k)}$, and we may simply replace the variable $x$ with all the independent variables in the GLM formulation, namely $\mathbf{\mathbf{x}}$ and ${\mathbf{\mathbf{m}}}_{1},\mathrm{\dots},{\mathbf{\mathbf{m}}}_{{d}_{N}}$, and define the baseline logfiring rates ${\mathit{\theta}}_{N}(\mathbf{x},{\mathbf{m}}_{1},\dots ,{\mathbf{m}}_{{d}_{N}})$ as ${d}_{N}$ copies of the function defined by Equation 18, each with its own independent parameters. In principle, the expectationmaximization framework we have presented for training CMs can be directly applied to a model with this structure. That being said, choosing the right parameterization of $\mathbf{\mathbf{k}}$ and ${\mathbf{\mathbf{h}}}_{1},\mathrm{\dots},{\mathbf{\mathbf{h}}}_{{d}_{N}}$ would pose a unique challenge in a combined GLMCM model, and whether such a model would be practically useful is an empirical question that is beyond the scope of this paper.
Here, we also clarify that although a CM is closely related to both mixture models and GLMs, it should not be confused with the models known as ‘mixtures of GLMs’ (Grün and Leisch, 2008). A mixture of GLMs has the form $p(y\mid x)=\sum _{k=1}^{{d}_{K}}{w}_{k}p(y\mid x;{\mathit{\theta}}_{k})$, where ${d}_{K}$ is the number of GLMs to be mixed, w_{k} are the weight parameters, and $p(y\mid x;{\mathit{\theta}}_{k})$ is a GLM with parameters $\mathit{\theta}}_{k$. This model differs from a CM in many subtle ways, and the easiest to note is that the weights w_{k} do not depend on the stimulus $x$ as they do in a CM, which, as shown in Figures 4 and 6 of the main paper, is critical to how CMs represent data.
Data availability
All data used in this study is available at the Git repository (https://gitlab.com/sachasokoloski/neuralmixtures; copy archived at https://archive.softwareheritage.org/swh:1:rev:8e82799f8934c47961ea02c5b7c25bd952abb961). This includes experimental data used for model validation, and code for running analyses and simulations.
References

BookLowDimensional Models of Neural Population Activity in Sensory Cortical CircuitsIn: Ghahramani Z, Welling M, Cortes C, Lawrence N. D, Weinberger K. Q, editors. Advances in Neural Information Processing Systems. Curran Associates, Inc. pp. 343–351.

Temporal contingencies determine whether adaptation strengthens or weakens normalizationThe Journal of Neuroscience 38:10129–10142.https://doi.org/10.1523/JNEUROSCI.113118.2018

InformationLimiting correlations in large neural populationsThe Journal of Neuroscience 40:1668–1678.https://doi.org/10.1523/JNEUROSCI.207219.2019

BookComputational Neuroscience: Theoretical Insights into Brain FunctionIn: Cisek P, Drew T, Kalaska J. F, editors. Progress in Brain Research. Elsevier. pp. 509–519.

Marginalization in neural circuits with divisive normalizationJournal of Neuroscience 31:15310–15319.https://doi.org/10.1523/JNEUROSCI.170611.2011

Mutual information, Fisher information, and population codingNeural Computation 10:1731–1757.https://doi.org/10.1162/089976698300017115

Efficient bayesian inference for COMPoisson regression modelsStatistics and Computing 28:595–608.https://doi.org/10.1007/s112220179750x

Attention improves performance primarily by reducing interneuronal correlationsNature Neuroscience 12:1594–1600.https://doi.org/10.1038/nn.2439

StimulusDriven population activity patterns in macaque primary visual cortexPLOS Computational Biology 12:e1005185.https://doi.org/10.1371/journal.pcbi.1005185

Dimensionality reduction for largescale neural recordingsNature Neuroscience 17:1500–1509.https://doi.org/10.1038/nn.3776

BookTheoretical Neuroscience: Computational and Mathematical Modeling of Neural SystemsMassachusetts Institute of Technology Press.

The effect of noise correlations in populations of diversely tuned neuronsJournal of Neuroscience 31:14272–14283.https://doi.org/10.1523/JNEUROSCI.253911.2011

On the structure of neuronal population activity under fluctuations in attentional stateThe Journal of Neuroscience 36:1775–1789.https://doi.org/10.1523/JNEUROSCI.204415.2016

Efficient sensory encoding and bayesian inference with heterogeneous neural populationsNeural Computation 26:2103–2134.https://doi.org/10.1162/NECO_a_00638

Decoding the activity of neuronal populations in macaque primary visual cortexNature Neuroscience 14:239–245.https://doi.org/10.1038/nn.2733

Stimulusdependent maximum entropy models of neural population codesPLOS Computational Biology 9:e1002922.https://doi.org/10.1371/journal.pcbi.1002922

BookFinite Mixtures of Generalized Linear Regression ModelsIn: Shalabh HC, editors. Recent Advances in Linear Models and Related Areas: Essays in Honour of Helge Toutenburg Heidelberg. Berlin, Germany: Springer. pp. 205–230.https://doi.org/10.1007/9783790820645

Training products of experts by minimizing contrastive divergenceNeural Computation 14:1771–1800.https://doi.org/10.1162/089976602760128018

A review of multivariate distributions for count data derived from the poisson distributionWIREs Computational Statistics 9:e1398.https://doi.org/10.1002/wics.1398

Measuring Fisher information accurately in correlated neural populationsPLOS Computational Biology 11:e1004218.https://doi.org/10.1371/journal.pcbi.1004218

Finite mixtures of multivariate poisson distributions with applicationJournal of Statistical Planning and Inference 137:1942–1960.https://doi.org/10.1016/j.jspi.2006.07.001

Correlations and neuronal population informationAnnual Review of Neuroscience 39:237–256.https://doi.org/10.1146/annurevneuro070815013851

Cognitive computational neuroscienceNature Neuroscience 21:1148–1160.https://doi.org/10.1038/s4159301802105

Bayesian inference with probabilistic population codesNature Neuroscience 9:1432–1438.https://doi.org/10.1038/nn1790

BookEmpirical Models of Spiking in Neural PopulationsIn: ShaweTaylor J, Zemel R. S, Bartlett P. L, Pereira F, Weinberger K. Q, editors. Advances in Neural Information Processing Systems. Curran Associates, Inc. pp. 1350–1358.

BookHow Biased Are Maximum Entropy Models?In: ShaweTaylor J, Zemel R. S, Bartlett P. L, Pereira F, Weinberger K. Q, editors. Advances in Neural Information Processing Systems. Curran Associates, Inc. pp. 2034–2042.

Learning to estimate dynamical state with probabilistic population codesPLOS Computational Biology 11:e1004554.https://doi.org/10.1371/journal.pcbi.1004554

Neuronal mechanisms of visual attentionAnnual Review of Vision Science 1:373–391.https://doi.org/10.1146/annurevvision082114035431

Finite mixture modelsAnnual Review of Statistics and Its Application 6:355–378.https://doi.org/10.1146/annurevstatistics031017100325

Encoding and decoding in parietal cortex during sensorimotor decisionmakingNature Neuroscience 17:1395–1403.https://doi.org/10.1038/nn.3800

Probabilistic brains: knowns and unknownsNature Neuroscience 16:1170–1178.https://doi.org/10.1038/nn.3495

Ignoring correlated activity causes a failure of retinal population codesNature Communications 11:4605.https://doi.org/10.1038/s41467020184362

Attention can either increase or decrease spike count correlations in visual cortexNature Neuroscience 17:1591–1597.https://doi.org/10.1038/nn.3835

Factoranalysis methods for higherperformance neural prosthesesJournal of Neurophysiology 102:1315–1330.https://doi.org/10.1152/jn.00097.2009

Towards the design principles of neural population codesCurrent Opinion in Neurobiology 37:133–140.https://doi.org/10.1016/j.conb.2016.03.001

A useful distribution for fitting discrete data: revival of the Conway–Maxwell–Poisson distributionJournal of the Royal Statistical Society: Series C 54:127–142.https://doi.org/10.1111/j.14679876.2005.00474.x

Spatial and temporal scales of neuronal correlation in primary visual cortexJournal of Neuroscience 28:12591–12603.https://doi.org/10.1523/JNEUROSCI.292908.2008

Correlations in V1 are reduced by stimulation outside the receptive fieldJournal of Neuroscience 34:11222–11227.https://doi.org/10.1523/JNEUROSCI.076214.2014

Implementing a Bayes filter in a neural circuit: the case of unknown stimulus dynamicsNeural Computation 29:2450–2490.https://doi.org/10.1162/neco_a_00991

BookImplementing Bayesian Inference with Neural NetworksDissertation: University of Leipzig.

Population coding in neuronal systems with correlated noisePhysical Review E 64:051904.https://doi.org/10.1103/PhysRevE.64.051904

Flexible models for spike count data with both over and under dispersionJournal of Computational Neuroscience 41:29–43.https://doi.org/10.1007/s108270160603y

Modeling bimodal discrete data using ConwayMaxwellPoisson mixture modelsJournal of Business & Economic Statistics 33:352–365.https://doi.org/10.1080/07350015.2014.949343

Testing the odds of inherent vs. observed overdispersion in neural spike countsJournal of Neurophysiology 115:434–444.https://doi.org/10.1152/jn.00194.2015

The simplest maximum entropy model for collective behavior in a neural networkJournal of Statistical Mechanics: Theory and Experiment 2013:P03011.https://doi.org/10.1088/17425468/2013/03/P03011

Modeling the impact of common noise inputs on the network activity of retinal ganglion cellsJournal of Computational Neuroscience 33:97–121.https://doi.org/10.1007/s1082701103762

Graphical models, exponential families, and variational inferenceFoundations and Trends 1:1–305.https://doi.org/10.1561/2200000001

A neural basis of probabilistic computation in visual cortexNature Neuroscience 23:122–129.https://doi.org/10.1038/s4159301905545

A bayesian observer model constrained by efficient coding can explain 'antiBayesian' perceptsNature Neuroscience 18:1509–1517.https://doi.org/10.1038/nn.4105

The quest for interpretable models of neural population activityCurrent Opinion in Neurobiology 58:86–93.https://doi.org/10.1016/j.conb.2019.07.004

Decoding spike trains instant by instant using order statistics and the mixtureofPoissons modelThe Journal of Neuroscience 23:2394–2406.https://doi.org/10.1523/JNEUROSCI.230602394.2003

Efficient sensory coding of multidimensional stimuliPLOS Computational Biology 16:e1008146.https://doi.org/10.1371/journal.pcbi.1008146

GaussianProcess factor analysis for LowDimensional SingleTrial analysis of neural population activityJournal of Neurophysiology 102:614–635.https://doi.org/10.1152/jn.90941.2008

Probabilistic interpretation of population codesNeural Computation 10:403–430.https://doi.org/10.1162/089976698300017818
Decision letter

Jonathan W PillowReviewing Editor; Princeton University, United States

Joshua I GoldSenior Editor; University of Pennsylvania, United States

Kenneth D HarrisReviewer; University College London, United Kingdom
Our editorial process produces two outputs: i) public reviews designed to be posted alongside the preprint for the benefit of readers; ii) feedback on the manuscript for the authors, including requests for revisions, shown below. We also include an acceptance summary that explains what the editors found interesting or important about the work.
Acceptance summary:
This paper introduces a new framework for modeling correlated neural population spike responses using multivariate mixtures of Poisson or ConwayMaxwellPoisson distributions. It describes an algorithm for fitting the model to data using Expectation Maximization (EM), a formula for Fisher information, and a Bayesian decoder that is competitive with other more computationally demanding decoding methods such as artificial neural networks. The authors apply this model to V1 data from awake and anesthetized monkeys, and show that it captures the variability (eg., Fano Factor) and covariability of population responses better than Poisson models. Finally, the paper shows how the latent variables of the model can provide insight into the structure of population codes. The resulting framework represents a powerful advance for modeling the correlated variability in neural population responses, and promises to be a useful new tool for analyzing largescale neural recordings. The paper will be of interest to computational neuroscientists studying neural coding, and to system neuroscientists who use descriptive models to characterize the stimulus tuning of correlated spiking activity recorded from large neural populations.
Decision letter after peer review:
Thank you for submitting your article "Modelling the neural code in large populations of correlated neurons" for consideration by eLife. Your article has been reviewed by 3 peer reviewers, and the evaluation has been overseen by a Reviewing Editor and Joshua Gold as the Senior Editor. The following individual involved in review of your submission has agreed to reveal their identity: Kenneth D Harris (Reviewer #3).
The reviewers have discussed their reviews with one another, and the Reviewing Editor has drafted this to help you prepare a revised submission.
1) Clarify abstract (and optionally title): see comments from R1.
2) Address R1 comments about claims "C1" and "C2".
3) Compare performance of the proposed models to existing models that capture stimulus encoding by large populations of correlated neurons (see comments from R2).
4) Clarify how users should decide how many mixture components to add. (R3 comment 4).
5) Provide publicly released code.
Reviewer #1 (Recommendations for the authors):
1 – A main suggestion is to rewrite the title and abstract as I outlined in Weakness #1 in the public comments.
2 – Address 2.12.3 (under weaknesses), in the public comments or eliminate/tone down the claims C1C3, in the abstract, intro, discussion or elsewhere.
3 – I found the discussion around Figure 5 hard to follow. To improve this, I suggest that authors provide (already in the Results section) simplified/readable formulas (valid for the "vanilla" Poisson case which is used in this section) for the stimulus dependence of index probabilities (IP) which depends on the mean population response according to
p(kx) ∝expθ_{K}(k) + r(xk)
where r(xk) is the average total population spike count in the mixture component k, which I believe is given by the partition function ψ_{N}.
4 – As they point out, in their minimal model, the tuning curves (mean response) of neurons are scaled ("gain modulated") versions of the "baseline"component's tuning curves. Again, I think it would help to write a formula for this in the Results section, and connect the scale factor and the baseline tuning curve with the θ_{NK} and θ_{N}(x) components, respectively.
5 – I don't think the simple scaling (see previous comment) relationship actually holds in the nonVanilla CoM case, but they do claim that. If so, the text in lines 189192 (especially "scaled" in line 191) should be corrected.
6 – As they derive in the Methods part, the population covariance matrix of the CPM has the same form as the covariance matrix in factor analysis: a diagonal "private noise" matrix + lowrank "sharednoise" matrix. I think it would be valuable to point this out and write the corresponding formula in the Results section e.g. around Figure 2. Also point out what happens to the diagonal term in the vanilla vs. CoM cases.
7 – The ground truth exercise of Figure 3 is valuable, but I think more valuable than showing how the model fits one example would be to give an idea of the "sample complexity": give an idea of goodness of fit vs. number of trials in the dataset. (At least clarify in the caption how many trials per stimulus conditions were used.)
8 – Not sure what the exercise described in lines 310321 shows. Given that the gound truth model is within the fit model family, isn't it given (by classical asymptotic statistics results) that for large enough data the likelihood and therefore the posterior should converge the true posterior?
So is the result really surprising given that the dataset seems pretty large (d_T=10000)?
Again the more relevant thing would be: what is the minimal amount of data needed to find a good posterior approximation… or as a simpler version: how would it do for typical neural dataset sizes (# trials). (c.f. the previous comment).
9 – lines 359371 – especially line 3656: The reasoning here (that the shown results establish the informationlimiting nature of the noise) are not really complete. Technically, "Informationlimiting" means that the Fisher info is not extensive, i.e. does not scale linearly with population size. So they have to argue that the "random shifts" (discussed in line 365) will not go to zero as. Population size goes to infinity.
Reviewer #3 (Recommendations for the authors):
Line 103: Some more introduction to CoM Poisson distributions would be nice. Why are these better than the negative binomial, which is analytically more tractable? Presumably because they can handle underdispersion? Neural data is usually overdispersed, but does the extra dispersion introduced by a mixture model mean one needs to use underdispersed components for the mixture components?
Line 119: "express multivariate Poisson mixtures in an exponential family form". This is misleading: it sounds like you have expressed the marginal distribution of the mixture model in exponential form, which I believe is impossible. In fact, you are expressing each component distribution in exponential form.
Line 150: "vanilla mixtures". Why not call them Poisson mixtures? That's what they are.
Line 159: "optimized model parameters as described in Materials and methods". You mean you used the EM algorithm derived in Materials and methods? Say so explicitly.
Figure 2. Is this crossvalidated? From the text it seems not, so no wonder the CoM model, with more parameters, fits better. Also, why does the vanilla model ever produce FFs that are too low? Can't it just add more mixture components to increase dispersion?
Line 180: the "CPM" sounds like a mixture of generalized linear models. If so, "mixture of GLMs" would be more familiar terminology for most readers.
Line 193227: it is not clear what we really learn from this. If it just is a validation that the EM algorithm can work on simulated ground truth, then shouldn't that go first, before the application to real data? Also comparing to a less sophisticated model would help show the benefits of this one.
Table 1: please state how many cells are in both data sets.
Line 238: "log likelihood". Please specify if this is to base 2 or base e; also give a unit in table 1 (e.g. bits/trial).
Figure 5: it would be nice to see this applied to real data.
Line 466: do you mean ψN = ∑i θ_{N,i}? The log partition functions should add, right?
Equation 12: is there a denominator of ∏_{i} n_{i}! missing?
Line 573: how much time does the gradient ascent take? Is it going to be a problem for recordings with large numbers of neurons?
https://doi.org/10.7554/eLife.64615.sa1Author response
The reviewers have discussed their reviews with one another, and the Reviewing Editor has drafted this to help you prepare a revised submission.
1) Clarify abstract (and optionally title): see comments from R1.
We rewrote the abstract to be more concrete, and state more clearly the nature of the model, and our exact contributions.
2) Address R1 comments about claims "C1" and "C2".
C1 concerned the exact relationship between our work and the theory of Probabilistic Population Codes. We clarified this relationship with a mathematical result and a new section in the manuscript (Constrained conditional mixtures support linear probabilistic population coding).
C2 concerned how we (over) described the “interpretability" of the latent variables in our application of conditional mixtures to synthetic data with informationlimiting correlations. We reformulated relevant text to emphasize that the model in this case is descriptive and not mechanistic.
3) Compare performance of the proposed models to existing models that capture stimulus encoding by large populations of correlated neurons (see comments from R2).
We've expanded the article with an appendix, where we provided a thorough comparison of our model with factor analysis, and highlight these results where relevant in the main article. We focused our comparison on factor analysis since it has been applied to modelling unbounded, trialtotrial spikecounts, and in our introduction we attempted to better explain this is the form of data that our model should be applied to. In the Introduction we have also included references to other models indicated by R2 (lines 44{51), and clarified how they differ from ours.
4) Clarify how users should decide how many mixture components to add. (R3 comment 4).
We added a subsection in the methods section (Strategies for choosing the CM form and latent structure) which presents strategies and suggestions for choosing the latent variable structure/number of mixture components when applying conditional mixtures.
5) Provide publicly released code.
Our code is now available at the Git repository: https://gitlab.com/sachasokoloski/
neuralmixtures, and at that page we have provided an overview of the library and instructions for running the code (we point to this page at lines 796800). The code provides several scripts with which we generated the various figures in the manuscript, and we would be willing to write e.g. Python bindings if interested parties would desire more finegrained control of the code and libraries.
Reviewer #1 (Recommendations for the authors):
1 – A main suggestion is to rewrite the title and abstract as I outlined in Weakness #1 in the public comments.
See online response to public review.
2 – Address 2.12.3 (under weaknesses), in the public comments or eliminate/tone down the claims C1C3, in the abstract, intro, discussion or elsewhere.
See online response to public review.
3 – I found the discussion around Figure 5 hard to follow. To improve this, I suggest that authors provide (already in the Results section) simplified/readable formulas (valid for the "vanilla" Poisson case which is used in this section) for the stimulus dependence of index probabilities (IP) which depends on the mean population response according to
p(kx) ∝expθ_{K}(k) + r(xk)
where r(xk) is the average total population spike count in the mixture component k, which I believe is given by the partition function ψ_{N}.
We've added this formula (lines 415416) and reworked this section (see the paragraphs from lines 413434).
4 – As they point out, in their minimal model, the tuning curves (mean response) of neurons are scaled ("gain modulated") versions of the "baseline"component's tuning curves. Again, I think it would help to write a formula for this in the Results section, and connect the scale factor and the baseline tuning curve with the θ_{NK} and θ_{N}(x)components, respectively.
We now provide more explicit formulae that connect gain modulation with the exponential family parameters (lines 221227).
5 – I don't think the simple scaling (see previous comment) relationship actually holds in the nonVanilla CoM case, but they do claim that. If so, the text in lines 189192 (especially "scaled" in line 191) should be corrected.
You are correct, and we've clarified the language in the corresponding section (lines 227229).
6 – As they derive in the Methods part, the population covariance matrix of the CPM has the same form as the covariance matrix in factor analysis: a diagonal "private noise" matrix + lowrank "sharednoise" matrix. I think it would be valuable to point this out and write the corresponding formula in the Results section e.g. around Figure 2. Also point out what happens to the diagonal term in the vanilla vs. CoM cases.
This is very insightful, thank you. We've incorporated this comparison into the development and introduction of mixture our models (see lines 126137, and 170172). We've resisted adding the explicit formulas for the variances/covariances of the various models, as we found that it required introducing too much notation, however we hope we conveyed your insight effectively in words.
7 – The ground truth exercise of Figure 3 is valuable, but I think more valuable than showing how the model fits one example would be to give an idea of the "sample complexity": give an idea of goodness of fit vs. number of trials in the dataset. (At least clarify in the caption how many trials per stimulus conditions were used.)
We agree that this is important to establish. In Appendix 3 we repeated our crossvalidated simulations for estimating the predictive loglikelihood and logposterior on subsets of our two datasets, i.e. by subsampling different fractions of trials, to better understand the sample complexity of these two quantities.
8 – Not sure what the exercise described in lines 310321 shows. Given that the gound truth model is within the fit model family, isn't it given (by classical asymptotic statistics results) that for large enough data the likelihood and therefore the posterior should converge the true posterior?
So is the result really surprising given that the dataset seems pretty large (d_T=10000)?
Again the more relevant thing would be: what is the minimal amount of data needed to find a good posterior approximation… or as a simpler version: how would it do for typical neural dataset sizes (# trials). (c.f. the previous comment).
Again we agree. We have removed this short paragraph and instead provide a sample complexity analysis in Appendix 3.
9 – lines 359371 – especially line 3656: The reasoning here (that the shown results establish the informationlimiting nature of the noise) are not really complete. Technically, "Informationlimiting" means that the Fisher info is not extensive, i.e. does not scale linearly with population size. So they have to argue that the "random shifts" (discussed in line 365) will not go to zero as. Population size goes to infinity.
This is a fair point, and we've reworded parts of this section to emphasize that our model learns to approximate informationlimiting correlations, which we hope clarifies the difference. (See again the section starting at line 393, and in particular lines 392393, and 435436).
Reviewer #3 (Recommendations for the authors):
Line 103: Some more introduction to CoM Poisson distributions would be nice. Why are these better than the negative binomial, which is analytically more tractable? Presumably because they can handle underdispersion? Neural data is usually overdispersed, but does the extra dispersion introduced by a mixture model mean one needs to use underdispersed components for the mixture components?
To address these questions, first, we added a bit more introduction to the CoMPoisson distributions (lines 108114), and indeed the reason we use them is to allow our components do exhibit underdispersion, as described in Stevenson et al., 2016, Journal of Computational Neuroscience.
Secondly, as you say, it is indeed helpful to to use CoMPoisson distributions even when the data is primarily overdispersed, because mixture models can make the statistics “too" overdispersed. A specific example of this effect is illustrated in our applications to anaesthetized data (Figure 2B). We explained this more concretely when we show the improvements that result from our CoMbased model, and show how the CoMBased parameters facilitate this in Figures 2CD.
Line 119: "express multivariate Poisson mixtures in an exponential family form". This is misleading: it sounds like you have expressed the marginal distribution of the mixture model in exponential form, which I believe is impossible. In fact, you are expressing each component distribution in exponential form.
Indeed, thanks for pointing this out. We rewrote the text to properly explain that mixtures are marginals of an exponential family joint distribution (lines 138{139).
Line 150: "vanilla mixtures". Why not call them Poisson mixtures? That's what they are.
Indeed, we realize now that our naming conventions could lead to confusion, and we have retooled the names and acronyms in the paper. We now refer to Independent Poisson (IP) and CoMBased (CB) models, and for Conditional Mixtures we write CM, so we have IP and CB mixtures, and then IPCMs and CBCMs.
Line 159: "optimized model parameters as described in Materials and methods". You mean you used the EM algorithm derived in Materials and methods? Say so explicitly.
Changed (lines 178179).
Figure 2. Is this crossvalidated? From the text it seems not, so no wonder the CoM model, with more parameters, fits better. Also, why does the vanilla model ever produce FFs that are too low? Can't it just add more mixture components to increase dispersion?
The analyses around Figure 2 are not crossvalidated. We highlight that better in the revised text (lines 178179).
The reviewer is correct that it's not a surprise that the CoM model, with more parameters, fits better. Our goal with this exercise instead is to illustrate how it fits better. Indeed, intuitively, one would expect that the vanilla mixture should capture anaesthetized data well, since it is primarily overdispersed, yet we find that this is not necessarily the case.
In the revised manuscript we have emphasized and explained this counterintuitive result with the histograms of the CoM parameters around Figure 2 (which has now been split into two figures) and related text. In short, the means and variances of a Poisson mixture are coupled through shared parameters, and the CoMbased parameters break this coupling, and in particular allow the model to capture overdispersion directly without comprising its estimate of the mean. This correction highlights the strength and importance of incorporating CoMPoisson distributions into our models.
Line 180: the "CPM" sounds like a mixture of generalized linear models. If so, "mixture of GLMs" would be more familiar terminology for most readers.
This is a logical conclusion but in fact mixtures of GLMs are different from our models. In Appendix 4 material we include an explanation of the relationship between these two models.
Line 193227: it is not clear what we really learn from this. If it just is a validation that the EM algorithm can work on simulated ground truth, then shouldn't that go first, before the application to real data? Also comparing to a less sophisticated model would help show the benefits of this one.
This is a fair point. In some sense this is just a validation of the EM algorithm, however, in the first set of results around figures 23, before Line 193227 (now Lines 230263), we aren't considering stimuli, and the EM algorithm for stimulusdependence, requires a further extension.
The paper is centred around two results  incorporating CoMbased parameters, and adding stimulusdependence  and in some sense it's arbitrary which we start with. We went with the former, and we hope it's okay to leave the structure as is.
As for the comparison with less sophisticated models, in our crossvalidation of the loglikelihood and logposterior under various models, we do consider an independent Poisson model and a Linear decoder. In the text we have attempted to better explain that this is a “sanity check" and not a rigorous performance analysis of the model, which comes in later sections (lines 263264).
Table 1: please state how many cells are in both data sets.
Done (Now Table 2).
Line 238: "log likelihood". Please specify if this is to base 2 or base e; also give a unit in table 1 (e.g. bits/trial).
Added and added (see Table 2, and line 276).
Figure 5: it would be nice to see this applied to real data.
We agree, but, as for the application on other datasets, we believe that a proper analysis of the latent structure of informationlimiting correlations in real data deserves a dedicated and extensive study.
Line 466: do you mean ψN = ∑i θ_{N,i}? The log partition functions should add, right?
Fixed, thank you (Line 542).
Equation 12: is there a denominator of ∏_{i} n_{i}! missing?
Fixed, thank you (Now Equation 13).
Line 573: how much time does the gradient ascent take? Is it going to be a problem for recordings with large numbers of neurons?
The model scales quite well. The largest computation in the descent step is an outer product computation which is O(k * n), where k is the number of mixture components, n is the number of neurons. Nevertheless, there are subtleties (in particular around minibatch sizes and number of stimuli). We believe the best place for this information is in our gitlab repository, and so we have included it there.
For reference, we have successfully applied the model in synthetic experiments to populations of thousands of neurons, with training times on the order of hours. For tens of thousands there might be a need for refinement of our code, perhaps including GPU computation.
https://doi.org/10.7554/eLife.64615.sa2Article and author information
Author details
Funding
National Institutes of Health (EY030578)
 Ruben CoenCagli
National Institutes of Health (EY02826)
 Sacha Sokoloski
 Amir Aschner
National Institutes of Health (EY016774)
 Amir Aschner
The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.
Acknowledgements
We thank all the members of the labs of Ruben CoenCagli and Adam Kohn for their regular feedback and support.
Ethics
Animal experimentation: All procedures were approved by the Institutional Animal Care and Use Committee of the Albert Einstein College of Medicine, and were in compliance with the guidelines set forth in the National Institutes of Health Guide for the Care and Use of Laboratory Animals under protocols 20180308 and 20180309 for the awake and anaesthetized macaque recordings, respectively.
Senior Editor
 Joshua I Gold, University of Pennsylvania, United States
Reviewing Editor
 Jonathan W Pillow, Princeton University, United States
Reviewer
 Kenneth D Harris, University College London, United Kingdom
Publication history
 Received: November 5, 2020
 Accepted: October 1, 2021
 Accepted Manuscript published: October 5, 2021 (version 1)
 Version of Record published: November 9, 2021 (version 2)
 Version of Record updated: November 12, 2021 (version 3)
Copyright
© 2021, Sokoloski et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.
Metrics

 1,492
 Page views

 230
 Downloads

 0
 Citations
Article citation count generated by polling the highest count across the following sources: Crossref, PubMed Central, Scopus.