Unsupervised Bayesian Ising Approximation for decoding neural activity and other biological dictionaries
Abstract
The problem of deciphering how lowlevel patterns (action potentials in the brain, amino acids in a protein, etc.) drive highlevel biological features (sensorimotor behavior, enzymatic function) represents the central challenge of quantitative biology. The lack of general methods for doing so from the size of datasets that can be collected experimentally severely limits our understanding of the biological world. For example, in neuroscience, some sensory and motor codes have been shown to consist of precisely timed multispike patterns. However, the combinatorial complexity of such pattern codes have precluded development of methods for their comprehensive analysis. Thus, just as it is hard to predict a protein’s function based on its sequence, we still do not understand how to accurately predict an organism’s behavior based on neural activity. Here, we introduce the unsupervised Bayesian Ising Approximation (uBIA) for solving this class of problems. We demonstrate its utility in an application to neural data, detecting precisely timed spike patterns that code for specific motor behaviors in a songbird vocal system. In data recorded during singing from neurons in a vocal control region, our method detects such codewords with an arbitrary number of spikes, does so from small data sets, and accounts for dependencies in occurrences of codewords. Detecting such comprehensive motor control dictionaries can improve our understanding of skilled motor control and the neural bases of sensorimotor learning in animals. To further illustrate the utility of uBIA, we used it to identify the distinct sets of activity patterns that encode vocal motor exploration versus typical song production. Crucially, our method can be used not only for analysis of neural systems, but also for understanding the structure of correlations in other biological and nonbiological datasets.
Editor's evaluation
This work introduces a new unsupervised Bayesian method for identifying important patterns in neural population responses. The method offers improvements relative methods that do not assume correlations in neural responses, and is likely to also outperform methods that take into account pairwise correlations in neural responses.
https://doi.org/10.7554/eLife.68192.sa0Introduction
One of the goals of modern highthroughput biology is to generate predictive models of interaction networks, from interactions among individual biological molecules (Marks et al., 2011) to the encoding of information by networks of neurons in the brain (Schneidman et al., 2006). To make predictions about activity across networks requires one to accurately approximate—or build a model of—their joint probability distribution, such as the distribution of joint firing patterns in neural populations or the distribution of cooccurring mutations in proteins of the same family. To successfully generalize and to improve interpretability, models should contain as few as possible terms. Thus constructing a model requires detecting relevant features in the data: namely, the smallest possible set of spike patterns or nucleotide sequences that capture the most correlations among the network components. By analogy with human languages, where words are strongly correlated cooccurring combinations of letters, we refer to the problem of detecting features that succinctly describe correlations in a data set as the problem of dictionary reconstruction, as schematized in Figure 1. It is the first step towards building a model of the underlying data, but it is substantially different (and potentially simpler) than the latter: detecting which features are relevant is not the same as quantifying how they matter in detail.
In recent years, the problem of dictionary reconstruction has been addressed under different names for a variety of biological contexts (Natale et al., 2018) including gene expression networks (Margolin et al., 2006; Lezon et al., 2006), protein structure, proteinprotein interactions (Marks et al., 2011; Morcos et al., 2011; Bitbol et al., 2016; Halabi et al., 2009), the structure of regulatory DNA (Otwinowski and Nemenman, 2013), distribution of antibodies and pathogenic sequences (Mora et al., 2010; Ferguson et al., 2013), species abundance (Tikhonov et al., 2015), and collective behaviors (Bialek et al., 2012; Couzin and Krause, 2003; Lukeman et al., 2010; Kelley and Ouellette, 2013; PérezEscudero and de Polavieja, 2011). The efforts to identify interactions in neural activity have been particularly plentiful (Stevens and Zador, 1995; Schneidman et al., 2006; Pillow et al., 2008; Bassett and Sporns, 2017; Williams, 2019). The diversity of biological applications notwithstanding, most of these attempts have relied on similar mathematical constructs, and most have suffered from the same limitations. First, unlike in classical statistics and traditional quantitative model building, where the number of observations, $M$, usually vastly exceeds the number of unknowns to be estimated, $K$, $K/M\ll 1$, modern biological data often has $M\gg 1$, but also $K/M\gg 1$. Indeed, because of network features such as protein allostery, recurrent connections within neural populations, and coupling to global stimuli, biological systems are rarely limited to local interactions only (Schwab et al., 2014; Merchan and Nemenman, 2016; Nemenman, 2017), so that the number of pairwise interactions among $N$ variables is $K\sim {N}^{2}$, and the number of all higher order interactions among them is (that is, interactions that involve more than two network variables at the same time) is $K\sim {2}^{N}$. Put differently, words in biological dictionaries can be of an arbitrary length, and spelling rules may involve many letters simultaneously, some of which are far away from each other. Because of this, reconstruction of biological dictionaries from data sets of realistic sizes requires assumptions and simplifications about the structure of possible biological correlations, and will not be possible by brute force. The second problem is that, as in human languages, biological dictionaries have redundancies: there are synonyms and words that share roots. For example, a set of gene expressions may be correlated not because the genes interact directly, but because they are related to some other genes that do. Similarly, a certain pattern of neural activity may be statistically over or underrepresented (relative to a null model) not on its own, but because it is a subset or a superset of another, more important, pattern. Identifying irreducible words—the root forms of biological dictionaries—is therefore harder than detecting all correlations while also being crucial to fully understanding biological systems. Together, these complications make it impossible to use standard methods for reconstructing combinatorially complex dictionaries from datasets of realistic sizes.
In this work, we propose a new method for reconstructing complex biological dictionaries from relatively small datasets, as few as $M\sim {10}^{2}\mathrm{\dots}{10}^{3}$ samples of the joint activity and test it on neural data from songbirds. We focus on the regime $M\gg N$, which means $N$ of a few tens for our datasets. While small compared to some of the largest high throughput biological datasets, this regime is relevant in many biological contexts, and especially in studies of motor systems, where recording from multiple single motor units is hard. Crucially, the method imposes no limitation on the structure of the words that can enter the dictionary — neither their length nor their rules of spelling — beyond the obvious limitation that (i) words that do not happen in the data cannot be detected, and (ii) that data contain few samples of many words, rather than of just a few that repeat many times. Additionally, we address the problem of irreducibility, making the inferred dictionaries compact, nonredundant, and easier to comprehend. The main realization that allows this progress is that instead of building an explicit model of the entire joint probability distribution of a system’s states and hence answering how specific significant words matter, we can focus on a more restricted, and thus possibly simpler, question: which specific words contribute to the dictionary. In other words, unlike many other methods, we do not build an explicit model of the underlying probability distribution, which would allow us to ‘decode’ the meaning of the data, but only detect features that can be used in such models later. We do this using the language of Bayesian inference and statistical mechanics by developing an unsupervised version of the Bayesian Ising Approximation (Fisher and Mehta, 2015) and by merging it with the reliable interactions model (Ganmor et al., 2011).
We believe that the approach we develop is fully general and will allow analysis of diverse datasets with realistic size requirements, and with few assumptions. However, to illustrate the capabilities of the approach, we present it here in the context of a specific biological system: recordings from single neurons in brain area RA (the robust nucleus of the arcopallium) in the Bengalese finch, a songbird. Neurons communicate with each other using patterns of action potentials (spikes), which encode sensory information and motor commands, and hence behavior. Reconstructing the neural dictionary, and specifically detecting irreducible patterns of neural activity that correlate with (or ‘encode’) sensory stimuli or motor behaviors — which we hereafter call codewords — has been a key problem in computational neuroscience for decades (Stevens and Zador, 1995). It is now known that in both sensory (Berry et al., 1997; Strong et al., 1998; Reinagel and Reid, 2000; Arabzadeh et al., 2006; Rokem et al., 2006; Nemenman et al., 2008; Lawhern et al., 2011; Fairhall et al., 2012) and motor systems (Tang et al., 2014; Srivastava et al., 2017; Sober et al., 2018; Putney et al., 2019) the timing of neural action potentials (spikes) in multispike patterns, down to millisecond resolution, can contribute to the encoding of sensory or motor information. Such dictionaries that involve long sequences of neural activities (or incorporate multiple neurons) at high temporal resolution are both more complex and more likely to be severely undersampled. Specifically, even though the songbird datasets considered here are large by neurophysiological standards, they are too small to build their statistical models, which is the goal of most existing analysis approaches. This motivates the general inference problem we address here.
The power of our approach is illustrated by discoveries in the analysis of the songbird vocal motor code. Specifically, while it is known that various features of the complex vocal behaviors are encoded by millisecondscale firing patterns (Tang et al., 2014), here we identify which specific patterns most strongly predict behavioral variations. Further, we show that dictionaries of individual neurons are rather large and quite variable, so that neurons speak different languages, which nonetheless share some universal features. Intriguingly, we detect that codewords that predict large, exploratory deviations in vocal acoustics are statistically different from those that predict typical behaviors. Collectively, these findings pave the way for development of future theories of the structure of these dictionaries, of how they are formed during development, how they adapt to different contexts, and how motor biophysics translates them into specific movements. More importantly, the development of this method and its successful application to neural and behavioral data from songbirds suggests its utility in other biological domains, where reconstruction of feature dictionaries is equally important, and where new discoveries are waiting to be made.
Results
The dictionary reconstruction problem
We formalize the dictionary reconstruction problem as follows. An experimental dataset consists of $M$ samples of a vector of binary variables of length $N$, $\overrightarrow{\sigma}={\{{\sigma}_{i}\}}_{0}^{N}$, which we call letters (or spins, by analogy with statistical physics). These samples are assumed to be taken from a stationary joint probability distribution $P(\overrightarrow{\sigma})$, but the distribution is unknown. From the frequency of occurrence of various combinations of σs in the dataset, we need to detect words in the model defined by $P(\overrightarrow{\sigma})$, namely those patterns of σs that are significantly over or underrepresented (statistically anomalous) compared to some null model. While different null models are possible, a common choice is the product of marginal distributions ${P}_{\mathrm{null}}={\prod}_{i=1}^{N}P({\sigma}_{i})$. In this case, words are defined as correlated patterns of binary letters. Additionally, to be a part of the dictionary, a word must be irreducible. That is, it must be statistically anomalous not only with respect to the null model, but also with respect to its own various parts. For example, a word ${\sigma}_{1}{\sigma}_{2}{\sigma}_{3}$ all equal to 1 is a word in a dictionary only if it is statistically anomalous with respect to its frequency expected from frequencies of ${\sigma}_{1}{\sigma}_{2}$, ${\sigma}_{1}{\sigma}_{3}$, ${\sigma}_{2}{\sigma}_{3}$, and also from frequencies of ${\sigma}_{1}{\sigma}_{2}{\sigma}_{3}{\sigma}_{4}$, ${\sigma}_{1}{\sigma}_{2}{\sigma}_{3}{\sigma}_{5}$, ${\sigma}_{1}{\sigma}_{2}{\sigma}_{3}{\sigma}_{4}{\sigma}_{5}$, and so on, eventually accounting for all other words that include any combination of letters ${\sigma}_{1}$, ${\sigma}_{2}$, and ${\sigma}_{3}$. In principle, such statistical over or underrepresentation of a word has a precise mathematical definition, based on comparing the entropy of the maximum entropy distribution constrained by frequencies of all other words in the distribution to that constrained additionally by the word itself (Margolin et al., 2010). For example, three anomalous overrepresented words are shown in the top right panel of Figure 1. Similarly, an example of anomalous underrepresentaion is shown in light blue in Figure 1, right middle panel, in which the word 1.1 is underrepresented relative to the frequency expected from the (marginal) frequencies of 1·· and ··1. In practice, performing these many comparisons is impossible for all but the simplest cases. Even approximate analyses, aiming to prove that a method results in irreducible dictionaries under some specific assumptions, have not been very successful to date. As a result, typical methods for dictionary reconstruction are assessed in their ability to build irreducible dictionaries based on heuristics, such as having a low probability of including overlapping words. We will follow the same route here.
Songbird neural motor code as a dictionary reconstruction problem
Owing to their complex and tightly regulated vocal behavior and experimentally accessible nervous system, songbirds provide an ideal model system for investigating the neural dictionaries underlying complex motor behaviors (Doupe and Kuhl, 1999; Kuebrich and Sober, 2015). We recorded from individual neurons in the motor cortical area RA of Bengalese finches during spontaneous singing and quantified the acoustic ‘features’ of song, specifically the fundamental frequency (which we will refer to as ‘pitch’), amplitude, and spectral entropy of individual vocal gestures, or ‘syllables’, as described previously (Sober et al., 2008; Tang et al., 2014; Wohlgemuth et al., 2010). The data sets are sufficiently large to be used as examples of dictionary reconstruction, allowing us to illustrate the type of biological insight that our approach can gain: we have 49 data sets — spanning 4 birds, 30 neurons and sometimes multiple song syllables for each neuron — for which we observed at least 200 instances of the behavior and the associated neural activity, which we estimate below to be the lower threshold for a sufficient statistical power.
To represent analysis of this motor code as a dictionary reconstruction problem, we binarize the recorded spiking time series so that ${\sigma}_{t}=(0,1)$ indicates the absence or presence of a spike in a time slice of $[(t1)\mathrm{\Delta}t,t\mathrm{\Delta}t]$, see Figure 2. Thus each time interval is represented by a binary variable, and interactions among these patterns are described by overrepresented or underrepresented sequences of zeros and ones in the data. Using a complementary informationtheoretic analysis, Tang et al., 2014 showed that the mutual information between the neural spike train and various features of song acoustics peaks at $\mathrm{\Delta}t=1\mathrm{\dots}2$ ms. Thus, studying precise timing pattern codes means that we focus on $\mathrm{\Delta}t=2$ ms (our datasets are not large enough to explore smaller $\mathrm{\Delta}t$) as discussed previously in Tang et al., 2014. Detection of statistically significant codewords at this temporal resolution would both reconfirm that this neural motor code is timing based, consistent with previous analyses (Tang et al., 2014), as well as for the first time reveal the specific patterns that most strongly predict behavior. We focus on neural time series of length $T=40$ ms duration preceding a certain acoustic syllable, which includes the approximate premotor delay with which neurons and muscles influence behavior (Tang et al., 2014). Thus the index $t$ runs between 1 and $T/\mathrm{\Delta}t=20$.
Since we are interested in words that relate neural activity and behavior, we similarly binarize the motor output (Tang et al., 2014), denoting by 0 or 1 different binary characteristics of the behavior, such as pitch being either below or above its median value, or outside or inside a certain range (Figure 2). We treat the behavioral variable as the 0’th component of the neurobehavioral activity, which then has $N=21$ binary variables, $\overrightarrow{\sigma}={\{{\sigma}_{i}\}}_{0}^{N1}$. Building the neuralbehavioral dictionary is then equivalent to detecting significantly over or underrepresented patterns in the probability distribution $P(\overrightarrow{\sigma})$. Focusing specifically on the statistically anomalous words that include the behavioral bit results in detection of codewords, for which the neural activity is correlated with (or is predictive of) the behavioral bit. Note that ${2}^{N}={2}^{21}\approx 2\cdot {10}^{6}$, which is much greater than $M\sim 100\mathrm{\dots}1000$ observations of the activity that we can record, illustrating the complexity of the problem. In fact, similar to the famous birthday problem (one gets coinciding birthdays with a lot fewer people than the number of days in a year), one expects a substantial number of repeating samples of the activity of the full length $N$ — and hence the ability to detect statistically over and underrepresented binary words – only when $M\sim \sqrt{{2}^{N}}$, which is what limits the statistical power of any dictionary reconstruction method. Crucially, the approach presented here works by analyzing all patterns, not just patterns of the full length, allowing us to detect anomalous subwords even in more severely undersampled regimes.
The unsupervised BIA method (UBIA) for dictionary reconstruction
To reconstruct dictionaries in the neural motor code dataset and others with similar properties, we have developed the unsupervised Bayesian Ising Approximation (uBIA) method based on the Bayesian Ising Approximation for detection of significant features in linear Gaussian regression problems (Fisher and Mehta, 2015). Specifically, we extend BIA to detect significant interaction terms in probability distributions, rather than in linear regression models. For this, we write the probability distribution $P(\overrightarrow{\sigma})$ without any loss of generality as
where $Z$ is the normalization coefficient (Amari, 2001). We use the notation such that ${V}_{\mu}$ is a nonempty subset of indexes $i\in [0,N]$, and $\mu =[1,{2}^{N+1}1]$ is the subset number. Then $\{{\theta}_{\mu}\}=\overrightarrow{\theta}$ are coefficients in the loglinear model in front of the corresponding product of binary σs. In other words, ${V}_{\mu}$ denotes a specific combination of the behavior and / or times when the neuron is active. If ${\theta}_{\mu}$ is nonzero for a term ${\prod}_{i\in {V}_{\mu}}{\sigma}_{i}$, where $i=0$ (the response variable) is in ${V}_{\mu}$, then this specific spike word is correlated with the motor output, and is a significant codeword in the neural code, see Figure 1. Finding nonzero $\theta}_{\mu s$ is then equivalent to identifying which codewords matter and should be included in the dictionary in Figure 1, and inferring the exact values of ${\theta}_{\mu}$ tells how they matter. Notice that Equation 1 makes precise the definition of the order of an interaction, which corresponds to the number of variables ${\sigma}_{i}$ in the interaction term.
A common alternative model of probability distributions uses $x=2\sigma 1=\pm 1$ instead of $\sigma =(0,1)$. A third order term coupling, for example, ${\sigma}_{i}{\sigma}_{j}{\sigma}_{k}$ represents a combination of first, second, and third order terms in the corresponding xs, and vice versa. Thus which words are codewords may depend on the parameterization used, but the longest codewords and nonoverlapping groups of codewords remain the same in both parameterizations. Our choice of σ vs $x$ is for a practical reason: a codeword in the σ basis does not contribute to $P$ unless all of its constituent bins are nonzero. Thus since spikes are rare, we do not need to consider contributions of very long words to the code.
We would like to investigate the neural dictionary systematically and without arbitrarily truncating Equation 1 at some order of interactions or making other strong assumptions about the structure of the words in the dictionary. In fact, this is possibly the biggest distinction of our approach from others in the literature (Bialek et al., 1991; Pillow et al., 2008; Schneidman et al., 2006), which usually start with strong a priori assumptions. However, as discussed above, some assumptions must be made to solve the problem for typical data set sizes, and we would like to be very explicit about those we make. To achieve all of this, we define indicator variables ${s}_{\mu}=(0,1)$, $\mu =1,\mathrm{\dots},{2}^{N+1}1$, which denote if a particular sequence of ${\sigma}_{i}=1$, $i\in {V}_{\mu}$, and ${\sigma}_{i}=0$, $i\notin {V}_{\mu}$, ‘matters’ (is a putative word in the dictionary), that is, it is either statistically significantly over or underrepresented in the data set compared to a null model (which we define later). In other words, we rewrite $P(\overrightarrow{\sigma}\overrightarrow{\theta})$ without any assumptions as:
We then choose a prior on ${\theta}_{\mu}$ and on ${s}_{\mu}$. We choose to focus on problems where there are many weak words in the dictionary; in other words, typically ${\theta}_{\mu}\ll 1$. We make this choice for two reasons. First, detecting words that are strongly anomalously represented is easy, and does not require a complex statistical apparatus. Second, having many contributing small effects is more realistic biologically. Specifically, for songbird vocal motor control, since many neurons control the muscles and hence the behavioral output, individual spikes in single neuron can only have a very weak effect on the motor behavior. We thus work in the strong regularization limit and impose priors
Note that the prior distribution $P({\theta}_{\mu}{s}_{\mu}=0)$ is irrelevant since, for ${s}_{\mu}=0$, ${\theta}_{\mu}$ does not contribute to $P(\overrightarrow{\sigma}\overrightarrow{\theta},\overrightarrow{s})$.
At this point, we need to choose the null model for the occurrence of letter patterns. We do this by choosing ${\overline{\theta}}_{\mu}$ in a way such that the a priori averages (calculated within the prior only) and the empirical averages (frequencies, observed in the data) of individual ${\sigma}_{i}s$ are equal, $\u27e8{\sigma}_{i}\u27e9=\overline{{\sigma}_{i}}$ (we always use $\u27e8\mathrm{\dots}\u27e9$ and $\overline{\mathrm{\dots}}$ to denote a priori and a posteriori expectations, respectively). This is equivalent to saying that the null model reproduces the firing rate of neurons and the frequency of the behavior, ${P}_{\mathrm{null}}={\prod}_{i=1}^{N}P({\sigma}_{i})$. This is possible to do since, in typical problems, marginal probabilities $P({\sigma}_{i})$ are, indeed, wellknown, and it is the higher order interaction terms, the words in the dictionary, that make the reconstruction hard. Finally, we choose the least informative prior $P({s}_{\mu}=1)=P({s}_{\mu}=0)=0.5$, so that a priori a particular word has a fiftyfifty chance of being included in the neural dictionary. If we have reasons to suspect that some words are more or less likely to be included a priori, this probability can be changed.
Since we are only interested in whether a word is anomalous enough to be in the dictionary, but not in the full model of the joint probability distribution, we integrate out all ${\theta}_{\mu}$, after having observed $M$ samples of the $N$ dimensional vector $\overrightarrow{\sigma}$. To perform this calculation, we start with the Bayes formula (notice that for the whole set of $M$ samples of the vector $\overrightarrow{\sigma}$ we use the notation $\overrightarrow{\mathit{\sigma}}$)
Now we make two approximations. First, we evaluate the integral in Equation 4 using the saddle point approximation around the peak of the prior, ${\overrightarrow{\theta}}^{*}$. This is a low signaltonoise limit, and it is different from most high signaltonoise approaches that analyze the saddle around the peak of the posterior. This leads to
where $\mathbf{H}$ and $\mathbf{b}$ have size $S\times S$ and $S$ respectively, being $S={\sum}_{\mu}{s}_{\mu}$ being the total number of active variables. Their elements corresponds to
where $\mathcal{L}=\mathrm{log}P(\overrightarrow{\mathit{\sigma}}\overrightarrow{\theta})$ is the loglikelihood. Second, we do all calculations as a Taylor series in the small parameter $\u03f5$ (see below on the choice of $\u03f5$). Both approximations are facets of the same strong regularization assumption, which insists that most coupling constants ${\theta}_{\mu}$ are small. Again, the logic here is that we may not have enough information to know what θ is a posteriori, but we should have enough to know if it is nonzero. Following Fisher and Mehta, 2015, we obtain
Finally, after explicitly reintroducing We now explicitly reintroduce the indicator variables and by taking into account the both $\mathbf{H}$ and $\mathbf{b}$ are restricted to the dimensions where ${s}_{\mu}=1$. That is, for example, the term ${\mathbf{b}}^{\u22ba}\mathbf{b}$ corresponds to ${\sum}_{\mu}{b}_{\mu}^{2}{s}_{\mu}$. Finally, adding the normalization, we get
where the magnetic fields (biases) ${h}_{\mu}$ and the exchange interactions ${J}_{\mu \nu}$ are
see ‘Geometric interpretation of uBIA Field’ in ‘Materials and Methods’ for a geometric interpretation of the field.
Notice that Equation 8 has a familiar pairwise Ising form (Thompson, 2015), with datadependent magnetic fields ${h}_{\mu}$ and the couplings ${J}_{\mu \nu}$. This Ising model has ${2}^{N}$ spins, replacing the Ising model with $N$ spins, but with higher order interactions in Equation 1. Naively, we created a harder problem, with many more variables! However, since most of the ${2}^{N}$ words do not appear in the actual data, and because of the ${\u03f5}^{2}$ in front of the pairwise coupling term, evaluating posterior expectations $\u27e8{s}_{\mu}\u27e9$ for all word that actually occur is relatively easy, as we show now. Indeed, plugging in the model of the probability distribution, Equation 1, we get for the fields and the exchange interactions
Here, to simplify the notation, we defined ${\mathit{\sigma}}_{\mu}\equiv {\prod}_{i\in {V}_{\mu}}{\sigma}_{i}$, and we remind the reader that $M$ represents the number of samples. Further, angular brackets, $\mathrm{cov}$, and $\mathrm{var}$ denote the a priori expectations, covariances, and variances of frequencies of words in the null model, which matches frequency of occurrence of each individual ${\sigma}_{i}$ (probability of firing in every time bin for the songbird data). Similarly, overlines denote the empirical counts or correlations between cooccurrences of words in the observed data. Specifically, denoting by ${n}_{\mu}$ the marginal frequencies of the word ${V}_{\mu}$ in the data, these expectations and frequencies are defined as follows:
To derive these equations, note that ${\sigma}_{i}^{2}={\sigma}_{i}$. Note also that $\mathrm{cov}({\mathit{\sigma}}_{\mu},{\mathit{\sigma}}_{\nu})=0$ if the intersection of ${V}_{\mu}$ and ${V}_{\nu}$ is empty.
Equation 10 has a straightforward interpretation. Specifically, if the difference between the a priori expected frequency and the empirical frequency of a word is statistically significantly nonzero (compared to the a priori standard error), then the corresponding word is anomalously represented. It does not matter whether the word is over or underrepresented: in either case, if the frequency deviates from the expectation, then the field ${h}_{\mu}$ is positive, biasing the indicator ${s}_{\mu}$ toward 1, and hence toward inclusion of the word in the dictionary. If the frequency is as expected, then the field is negative, and the indicator is biased towards 0, excluding the word from the dictionary. Note that as $M$ increases, the standard error goes down, and the field generally increases, allowing us to consider more words. The sign of ${\theta}_{\mu}$ would reflect whether the word is over or underrepresented. However, estimating the exact value of ${\theta}_{\mu}$ from small datasets is often impossible and is not our goal, even though, in Figure 2, we denote words as under or overrepresented by whether their empirical frequency is smaller or larger than the a priori expectation. Thus in some aspects, our approach is similar to the previous work (Schnitzer and Meister, 2003), where multineuronal patterns are found by comparing empirical firing probabilities to expectations. However, we do this comprehensively for all patterns that occur in data. Crucially, in addition, the exchange interactions ${J}_{\mu \nu}$ also allow us to account for reducibility of the dictionaries.
To see this, recall that correlations among words create a problem since a word can occur too frequently not in its own right, but either (a) because its subwords are common, or (b) it is a subword of a larger common word, as illustrated in Figure 1. In other approaches, resolving these overlaps requires imposing sparsity or other additional constraints. In contrast, the couplings ${J}_{\mu \nu}$ address this problem for uBIA naturally and computationally efficiently. Notice that because of the factor of 2 in the negative term in Equation 11, the exchange interactions are predominantly negative if one expects the two studied words to be correlated, and if they cooccur in the empirical data as much as they are expected to cooccur in the null model because of the overlaps in their composition, ${V}_{\mu}$ and ${V}_{\nu}$. Negative ${J}_{\mu \nu}s$ implement a mechanism, where statistical anomalies in data that can be explained, in principle, by many different ${\theta}_{\mu}s$ are attributed predominantly to one such ${\theta}_{\mu}$ that explains them the best, bringing the dictionary closer to the irreducible form. On the other hand, the exchange interactions are positive if one expects correlations between the words a priori, but does not observe them. Thus, in principle, a word can be included in the dictionary even at zero field ${h}_{\mu}$. Crucially, every word affects the probability of every other word to be included in the dictionary by means of their corresponding ${J}_{\mu \nu}$. In this way, while uBIA is not equivalent to the full maximum entropy definition of irreducibility (Margolin et al., 2010), it comes close.
Knowing the coefficients ${h}_{\mu}$ and ${J}_{\mu \nu}$, one can numerically estimate $\u27e8{s}_{\mu}\u27e9$, the posterior expectation for including a word ${V}_{\mu}$ in the dictionary. Generally, finding such marginal expectations from the joint distribution in disordered systems is a hard problem. However, here ${h}_{\mu}\propto \u03f5$ and ${J}_{\mu \nu}\propto {\u03f5}^{2}$, so that the fields and the interactions create small perturbations around the ‘total ignorance’ solution, $\u27e8{s}_{\mu}\u27e9=1/2$ (this is a manifestation of our general assumption that none of the words is very easy to detect). Therefore, we calculate the marginal expectation using fast mean field techniques (Opper and Saad, 2001). We use the naive mean field approximation, which is given by selfconsistent equations for the posterior expectations in terms of the magnetizations ${m}_{\mu}=2\u27e8{s}_{\mu}\u27e91$,
so that interactions among spins are encapsulated in an effective field $\u03f5{h}_{\mu}^{\text{eff}}$. We solve Equation 16 iteratively (Fisher and Mehta, 2015), by increasing $\u03f5$ from 0 —that is, from the total ignorance $\u27e8{s}_{\mu}\u27e9=1/2$ or ${m}_{\mu}=0$ — and up to the limiting value ${\u03f5}_{\text{max}}$ in steps of $\delta \u03f5={M}^{1}/20$. This limiting value ${\u03f5}_{\mathrm{max}}$ is determined by the two approximations involved in the strong regularization assumption. First, the saddle point approximation around the peak of the prior in Equation 4 implies that the characteristic width of the prior should be smaller than that of the likelihood, $\u03f5\le {\u03f5}_{1}=1/M$. Second, the Taylor series up to second order in $\u03f5$ for the posterior of the indicator variables implies that the quadratic corrections should not be larger than the linear terms. Within the mean field approximation, this means that ${\u27e8{h}_{\mu}\u27e9}_{\mu}\ge {\u27e8\u03f5{h}_{\mu}^{\text{eff}}(\u03f5)\u27e9}_{\mu}$, which is saturated at some ${\u03f5}_{2}$ (notice that, in contrast to our usual notation, the averages here are over the indices, and not the data). Thus, overall we take ${\u03f5}_{\mathrm{max}}=\text{min}\{{\u03f5}_{1},{\u03f5}_{2}\}$. In other words, we use the largest $\u03f5$ (the weakest possible regularization), which is still consistent with the strong regularization limit. Additionally we have used the TAP equations (Opper and Saad, 2001), instead of Equation 16 to calculate magnetizations. These are more accurate since they account for how a spin affects itself through its couplings with the other spins. However, corrections due to this more complicated method were observed to be negligible in our strong regularized regime, since they were of higher order in $\u03f5\ll 1$. Thus, all results that we report here, and the software implementation of uBIA on GitHub, (copy archived at swh:1:rev:a374c0d478958aaf38415c7b616bbdebe83c6219) are based on the mean field estimation.
Note that the analysis above, and our GitHub code, only focused on words that appear in the data. However, most of the ${2}^{N}$ possible words must be absent from any realistic dataset. In the Supplementary Materials, we show that neglecting these words when calculating posterior probabilities for word inclusion does not lead to significant errors.
Testing and finetuning uBIA on synthetic data
To verify that uBIA can, indeed, recover dictionaries and to set various adjustable parameters involved in the method, we tested the approach on synthetic data that are statistically similar to those that we expect to encounter in realworld applications, such as our neural recordings. We used the loglinear model, Equation 1, as a generative model for binary correlated observables $\overrightarrow{\sigma}$ with $N=20$. While somewhat small compared to many state of the art experimental datasets, this choice of $N$ is highly relevant to the motor control studies, which are our primary target in this work. We chose the individual biases in the generative model from a normal distribution, ${\theta}_{i}\sim N(0.7,{0.1}^{2})$, which matched the observed probability of a spike in a bin in the bird data. That is, $p({\sigma}_{i}=1)\simeq {[1+\mathrm{exp}(2{\theta}_{i})]}^{1}\sim q\sim 0.2$. Then we selected which binary variables interacted. We allowed interactions of 2nd, 3rd, and 4th order, with an equal number of interactions per order. For different tests, we chose the interaction strengths from (a) the sum of two Gaussian distributions, one with a positive mean and the other with a negative one, $\mathrm{mean}({\theta}_{\mu})=\pm 0.5$, $\mathrm{std}({\theta}_{\mu})=0.1$, and (b) from one Gaussian distribution centered at zero with $\mathrm{std}({\theta}_{\mu})=0.5$. Both choices reflect our strong regularization assumption, so that effects of individual variables on each other are weak, and a state of one variable does not determine the state of the others, and hence does not ‘freeze’ the system. We were specifically interested in performance of the algorithm in the case where $\overrightarrow{\theta}$ are distributed as the sum of Gaussians. On the one hand, this tested how the algorithm deals with data that are atypical within its own assumptions. On the other hand, this choice ensured that there were fewer values of $\overrightarrow{\theta}$ that were statistically indistinguishable from zero, making it easier to quantify findings of the algorithm as either true or false. We have additionally tested other distributions of $\overrightarrow{\theta}$, but no new qualitatively different behaviors were observed. Finally, for both types of distributions of $\overrightarrow{\theta}$, we also varied the density of interactions α (number of interactions per spin), from $\alpha =2$ to $\alpha =4$, which spans the interaction densities of treelike and 2D latticelike networks. We generated $M$ samples from these random probability distribution and we applied our pipeline to reconstruct the dictionary. We tested on 400 distributions from each family. As the first step, we discarded highorder words absent in the data using a threshold on the expected number of occurrences $\u27e8{\sigma}_{\mu}\u27e9M=\u27e8{n}_{\mu}\u27e9<0.02$. Next, we selected ${N}_{\text{max}}$ words that have the highest (absolute) values in magnetic field ${h}_{\mu}$ (we have tested ${N}_{\text{max}}=200,\mathrm{\hspace{0.17em}500},\mathrm{\hspace{0.17em}2000},\mathrm{\hspace{0.17em}5000}$, and finally used 500 after not observing substantial differences). To decide which of these highfield words are to be included in the dictionary, we built the Ising model on the indicator variables, Equation 8, with its corresponding magnetizations ${m}_{\mu}$ given by the mean field equations. We started from an inverse regularization strength of $\u03f5=0$ and then decreased the regularization strength by increasing $\u03f5$ in steps of $\delta \u03f5=1/(20M)$, up to ${\u03f5}_{\mathrm{max}}=\text{min}\{{\u03f5}_{1},{\u03f5}_{2}\}$, as detailed above.
Next we needed to identify the significance threshold for the magnetization ${m}_{\mu}$, or, equivalently, for the posterior probability of including a word into the dictionary $\u27e8{s}_{\mu}\u27e9=({m}_{\mu}+1)/2$. As is often the case, this is affected by two contradictory criteria. Setting the threshold high would miss a lot of weakly significant words (high false negatives), but the words remaining in the dictionary would be likelier to be true (low false positives). In contrast, setting the threshold low would admit weakly significant words (lowering false negatives) at the cost of also admitting words by chance (increasing false positives). To measure false positives and negatives, we used two metrics: precision and recall. The precision, ξ, is the fraction of the words included in the dictionary that are true, that is, have a nonzero ${\theta}_{\mu}$ in the generated true model. The recall, η, is the fraction of the words in the generated model with ${\theta}_{\mu}\ne 0$ that were included in the dictionary. Fundamentally, there are no first principles arguments for the choice of the magnetization inclusion threshold, and thus we explored many different values of $m$ and infer the functions $\xi (m)$ and $\eta (m)$ for every data set explored. In Figure 3, we plot the precision vs. recall curves parametrically as a function of $m$. We see that, as the amount of data increases, the recall generally increases, though it remains small. However, since data set sizes are relatively small, we do not expect to detect all words, especially in the case where ${\theta}_{\mu}$ are allowed to be close to 0 in the generative model (Gaussian distributed). Thus we emphasize precision over recall in setting parameters of the algorithm: we are willing to not include words in a dictionary, but those words that we include should have a high probability of being true words in the underlying model. It is thus encouraging that, in all tested cases, there was an underlying magnetization threshold that allowed for a high (e.g. 80%) precision to ensure that almost all of the words that we detected can be trusted to be true. Crucially, we see that the precisionrecall curves are remarkably stable with the changing density of interactions. As a final point for interpreting these figures, we point out that η is smaller when interactions coefficients are taken from a Gaussian centered at zero. However, one could argue that missing words with very small ${\theta}_{\mu}$ should not be considered a mistake: they are not significant words in the studied dictionary.
An additional way of measuring the accuracy of our approach is by exploring the full false discovery rate n_{false} — the total number of dictionary words that are false positives, averaged over our usual 400 realizations of the training samples — produced by our algorithm on fully reshuffled data, where every identified word is false, by definition. We did this with reshuffling that kept the observed frequency of individual variables n_{i} constant. We mapped out computationally the relation ${n}_{\mathrm{false}}(m)$, which, together with $\xi (m)$ explored above, allowed to explore the dependence between ξ and $m$. Figure 3 shows that $\xi =80\%$ corresponds to ${n}_{\mathrm{f}\mathrm{a}\mathrm{l}\mathrm{s}\mathrm{e}}<1$ for every data set that we have tried. Specifically, by keeping n_{false} below 0.5 (only about half a word detected falsely, on average, in shuffled data), we can reach a precision as high as 80%, with the recall of 20%  30% of the codewords depending on the number of samples, the distribution of $\overrightarrow{\theta}$, and α. This shows that the findings of our approach are robust to variation of many parameters.
For the rest of this work, we set the magnetization threshold as a function of the false discovery rate, and we will admit words to the dictionary only when they have their marginal magnetizations ${m}_{\mu}>m({n}_{\mathrm{f}\mathrm{a}\mathrm{l}\mathrm{s}\mathrm{e}}=0.5)$. With that, we are confident that our method produces dictionaries, in which a substantial fraction of words correspond to true words in the data, though the details of how many may depend on various (often unknown) properties of the data themselves, including with respect to patterns that are possible yet were never observed empirically.
To quantify the effect of interactions in among words in shaping the final dictionary, we check how many words with a field larger than the smallest field of a putative codeword corresponding to the kept codewords were discarded. Crucially, of such words with large magnetic field were We observe that $\sim 40\%$ of all such largefield words are removed from the final dictionary due to the wordword exchange interactions. This signifies that uBIA works as designed in identifying multiple words that can explain a statistical pattern and choosing the smallest subset of words able to explain it.
We finish this section with the following observation about the performance of uBIA in regimes that are even more severely undersampled than considered above, so that most of relatively long words only happen once or never in the data. In this regime, uBIA has two major strengths. First, uBIA analyzes putative codewords of arbitrary length, so that then it will detect short subpatterns as codewords – and, in any reasonable dataset, at least some short subpatterns will coincide – for example, there are only $N$ first order words formed by $N$ interacting units, and typically $M\gg N$. Second, uBIA detects not just overrepresentation of patterns, but also their underrepresentation. Thus, if a complex spike word happens once or does not happen at all, it may enter the uBIA dictionary as well precisely because it happens so infrequently.
Reconstructing songbird neural motor codes
Having finetuned the method on synthetic data, we can now test it on a biological dataset. We applied uBIA to the songbird data, confirming the precise timing pattern nature of the code in this dataset. We then demonstrate the generality of the algorithm by applying it to different parameterizations of the data, which allows us to make surprising observations about control of exploratory vs typical renditions of the song by the birds. Notice that, for all applications below, we have to binarize the behavior (pitch). This inevitably results in the loss of resolution and the corresponding loss of codewords. However, such binarization is meaningful in the context of songbirds (Tang et al., 2014), and, crucially, it cannot lead to emergence of keywords where they would not exist otherwise.
Statistical properties of the motor codes
We start with Figure 4, which explores the occurence of just two specific codewords found by uBIA that encode highpitch renditions of syllables. Note that these codewords are, indeed, overrepresented together with the high pitch vocalizations. Analyzing if a particular word is correlated with an acoustic feature is, of course, not hard. However, detecting words that should be tested, without a multiple hypothesis testing significance penalty is nontrivial. Thus the power of uBIA comes from being able to systematically analyze abundances of combinatorially many such spike patterns, and further to identify which of them are irreducibly over or underrepresented. Figure 5 illustrates statistical properties of entire neuralbehavior dictionaries discovered by uBIA for different songbird premotor neurons and for three features of the acoustic behavior. While we reconstruct the dictionaries that include all irreducible words, including those that have only anomalous firing patterns but a statistically normal behavioral bit, here we primarily focus on codewords, which, recall, are defined as statistically anomalous relations between the behavior and the neural activity. We do the analysis twice, first for behavior binarized as ${\sigma}_{0}=1$ for the abovemedian acoustic features, and then for the belowmedian acoustic features. This way we detect words that predict either a behavior or its opposite. We do this because the same pattern of spikes should not be anomalous in the same way simultaneously when studying both the above and the below median codes, since the pattern cannot code for two mutually exclusive features. Detecting such patterns thus serves as a consistency check on our method. There were 0.7 such codewords on average per dictionary. This is consistent with the expected false discovery rate of lt_{1} codewords per neuron for data sets of our size and statistical properties, further increasing our confidence in the method.
The most salient observation is that the inferred codewords consist of present or absent spikes in specific 2ms time bins, (Figure 5A). This is consistent with our previous analysis (Tang et al., 2014), which identified the same timescale for this dataset by analyzing the dependence of the mutual information between the activity and the behavior on the temporal resolution, but was unable to detect the specific words that carry the information. The second crucial observation is that most of codewords are composed of multiple spikes, (Figure 5C) representing an orthographically complex pattern timing code (Sober et al., 2018), in contrast to single spike timing codes, such as in Bialek et al., 1991. Large number of codewords of 2 or more spikes (and thus 3 or more features, including the behavior itself) suggests that analyzing these dictionaries with the widely used lower order MaxEnt or GLM methods that typically focus on lowerorder statistics (see Online Methods) would miss their significant components. Our third crucial observation is that very few subwords / superwords pairs occur in the dictionaries (Figure 5A; e.g. the second codeword coding for entropy in neuron 2 in the panel A is a subword of the others). Finally, similarly to our synthetic data analysis, 30% of words with large magnetic field were removed from the final dictionary due to wordword interactions. This indicates that uBIA fulfills its goal of rejecting multiple correlated explanations for the same data.
We quantify these observations as follows. In the 49 different datasets, the average size of a dictionary within one dataset is 14 words. Of these words, on average 5.6 include the behavioral feature and hence are codewords (Figure 5D). That there are so many specific temporally precise codewords suggests that the behaviorallyrelevant spike timing patterns are the rule, rather than the exception, in this dataset. We found that 66% of codewords are unique to one of the three analyzed acoustic features. This further quantifies the observation that some neurons in RA are selective for specific acoustic features, as noted previously (Sober et al., 2008). Across all neurons and all acoustic features, only 15% of codewords consist of a single spike (or absence of spike), while 58%, 23%, and 4% consist of two, three, and four spikes respectively, (Figure 5C) (we are likely missing many long codewords, especially with small θ’s due to undersampling). This observation is consistent across all neurons and acoustic features, again indicating that coding by temporally precise spike patterns is a rule and not an exception.
At the same time, the observed dictionaries are quite variable across neurons and the production of particular song syllables. Codewords are built by stitching together multiple spikes or spike absences, and individual spikes occur at certain time points in the (40,0) ms window with different probabilities in different neurons and syllables (i.e. the firing rate is both time and neuron dependent, Figure 5A, grey lines). Codewords are likely to occur where the probability of seeing a spike in a bin is ∼50%, since these are the times that have more capacity to transmit information. Thus variability in firing rates as a function of time across neurons necessarily creates variability in the dictionaries across these neurons. Beyond this, we observe additional variability among the dictionaries that is not explained by the rate fluctuations. For example, we can differentiate one of the four birds from two of the others just by looking at the proportions of highorder codewords (an average of 0.21 bits in JensenShannon divergence between the target bird and the rest, which means that we need around five independent samples/codewords to distinguish this bird from the others). This is further corroborated by the fact that the mean interspike interval (ISI) for codewords is different from that of other words in the dictionaries, and this difference is also birddependent, see Figure 5E.
Verification validation of the inferred dictionaries
To show that the dictionaries we decoded are biologically (and not just statistically) significant, we verify whether Statistical significance is not a substitute for biological significance. The only way to interpret findings of any statistical method, including ours, is through perturbation experiments. For example, one could try to see if a stimulation of a neuron with a specific codewordlike patterns would cause (rather than merely correlate with) a certain behavior (Srivastava et al., 2017). Unable to do this, we do a weaker validation and check if the codewords can, in fact, be used to predict the behavioral features. For this, we built two logistic regression models that relate the neural activity to behavior. The first one uses the presence / absence of spikes in individual time bins and the second the presence / absence of the uBIA detected codewords as predictor variables (see Online Methods). Note that the individual spikes model is still a precisetiming model, which has 20 predictors (20 time bins, each 2 ms long), and hence one may expect it to predict better than the codewords model, which typically has many fewer predictors. To account for the possibility of overfitting, in all comparisons we test the predictive power of models using crossvalidation. We emphasize that we do not expect either of the two models to capture an especially large fraction of the behavioral variation. Indeed, Tang et al., 2014 have shown that, at 2ms resolution, on average, there is only about 0.12 bits of information between the activity of an individual neuron and the behavioral features, and the assumption behind our entire approach is that none of individual predictors have strong effects. Further, a specific model, such as logistic regression, will likely recover even less predictive power from the data. With this, Figure 9 compares prediction between the two models, obtaining a significantly higher accuracy and a lower mean crossentropy between the model and the data for the models that use codewords as predictors. In other words, the higher order, but smaller, dictionaries decoded by uBIA outperform larger, nonspecific dictionaries in predicting behavior. This is especially significant since uBIA codewords are detected for their statistical anomaly and irreducibility, and not directly for how accurately they predict behavior.
Dictionaries for exploratory vs. typical behaviors
Bengalese finches retain the ability to learn through their lifetimes, updating their vocalizations based on sensorimotor feedback (Kuebrich and Sober, 2015; Kelly and Sober, 2014; Sober and Brainard, 2009; Saravanan et al., 2019). A key element of this lifelong learning capacity is the precise regulation of vocal variability, which songbirds use to explore the space of possible motor outputs, (Figure 6A and B). For example, male songbirds minimize variability when singing to females during courtship, but dramatically increase the range of variability in acoustic features such as pitch when singing alone (Hessler and Doupe, 1999; Woolley et al., 2014). The variability is controlled by the activity of nucleus LMAN. Silencing or lesioning LMAN reduces the acoustic variance of undirected song (Figure 6A) to a level approximately equal to that of femaledirected song (Kao et al., 2005; Olveczky et al., 2005). Using uBIA, we can ask for the first time whether the statistics of codewords controlling the exploratory vs. the baseline range of motor variability are different. To do this, we analyze the statistics of codewords representing different parts of the pitch distribution. First, we define the output as ${\sigma}_{0}=1$ if the behavior belongs to a specific 20percentile interval ([0 20], [10  30], …, [80 100] ) and compare the dictionaries that code for behavior in each of the intervals. We find that there are significantly more codewords for exploratory behaviors (percentile intervals farthest from the median, Figure 6C). This holds true for different features of the vocal output, although the results are only statistically significant if pooled over all features. To improve statistical power by increasing the number of trails in each acoustic interval, we also consider a division of the output into three equal intervals: low, medium, and high. In this case, there are still more codewords for the high exploratory pitch, and the dictionaries for each of the intervals are still multispike (Figure 6D). We further observe that the codewords themselves are different for the three percentile groups: the mean ISI of high pitch, amplitude, and spectral entropy codewords is higher, with the largest effect coming from the pitch and the spectral entropy (Figure 6E). Examples of typical and exploratory dictionaries are illustrated in Figure 6F. Note that this analysis partially addresses the concern about losing resolution due to discretization of behavior by exploring effects of different discretizations on the reconstructed dictionaries.
These findings challenge common accounts of motor variability, in songbirds and other systems, that motor exploration is induced by adding random spiking variations to a baseline motor program. Rather, the overabundance of codewords in the exploratory flanks of the acoustic distributions indicates that the mapping between the neural activity and the behavior is more reliable than in the bulk of the behavioral activity: multiple renditions of the same neural command result in the same behaviors more frequently, making it easier to detect the codewords. One possibility is that the motor system is less biomechanically noisy for large behavioral deviations. This is unlikely due to the tremendous variation in the acoustic structure (pitch, etc.) of different song syllables within and across animals (Sober and Brainard, 2009; Elemans et al., 2015), which indicates that songbirds can produce a wide range of sounds and that particular pitches (i.e. those at at one syllable’s exploratory tail) are not intrinsically different or harder for an animal to produce. Similarly, songbirds can dramatically modify syllable pitch in response to manipulations of auditory feedback (Sober and Brainard, 2009; Kuebrich and Sober, 2015). A more likely explanation for the greater prevalence of codewords in the exploratory tails is that the nervous system drives motor exploration by selectively introducing particular patterns into motor commands that are specifically chosen for their reliable neuraltomotor mapping. This would result in a more accurate deliberate exploration and evaluation of the sensory feedback signal, which, in turn, is likely to be useful during sensorimotor learning (Zhou et al., 2018).
Finally, although dissecting the role of different neural structures to generating code words would require additional (perturbation) experiments, we can speculate about the contributions of LMAN inputs and local RA circuitry to shaping the statistics of activity in RA. One possibility is that the greater prevalence of code words in exploratory behaviors reflects the interaction of unstructured variability (from LMAN) with the dynamics determined by local circuitry within RA. Alternatively, inputs patterns from LMAN during exploration might be highly structured across LMAN neurons such that tightlyregulated multispike patterns in LMAN, rather than the interaction of uncoordinated LMAN activity with intrinsic RA dynamics, is responsible for generating exploratory deviations in behavior. Future studies, including both perturbations of neural activity and recording ensembles of LMAN neurons, will shed light on these questions.
Discussion
In this work, we developed the unsupervised Bayesian Ising Approximation as a new method for reconstructing biological dictionaries — the sets of anomalously represented joint activities of multiple components of biological systems. Inferring these dictionaries directly from data is a key problem in many fields of modern datarich biological and complex systems research including systems biology, immunology, collective animal behavior, and population genetics. Our approach addresses crucial shortcomings that so far have limited the applicability of other methods. First, uBIA does not limit the possible dictionaries, either by considering words of only limited length or of a predefined structure, instead performing a systematic analysis through all possible words that occur in the data sample. Second, it promotes construction of irreducible dictionaries, deemphasizing related, cooccurring words. Finally, uBIA does not make assumptions about the linear structure of dependencies unlike various linear methods.
To illustrate capabilities of the method, we applied it first to simulated data sets that are similar to those we expect in experiments. The method was able to reconstruct the dictionaries with a very low false discovery rate for a wide range of parameters and statistical properties of the data (Figure 3), which made us hopeful that uBIA’s findings will be similarly meaningful in reallife applications. In this analysis, we explored the range $M\sim {10}^{2}\mathrm{\dots}{10}^{3}$, and $N\sim 20$, which is highly relevant to the neurobiological data we focus on here. Crucially, this $N$ is smaller than in many modern highthroughput experiments. Indeed, there is a necessary tradeoff among the system size, the amount of data, and the ability to explore the interactions systematically, to all orders. Since there are many methods able to analyze data at much larger $N$, but with making assumptions about, in particular, pairwise structure of words in the dictionary (see Overview of prior related methods in the literature in Online Methods), we decided to focus uBIA on systematic exploration of somewhat smaller systems, $N\sim 20$.
To show that the methoduBIA, indeed, can work with real data, we applied it to analysis of motor activity in cortical area RA in a songbird. We were able to infer statistically significant codewords from largedimensional probability distributions (${2}^{21}=2,097,152$ possible different words) with relatively small data sets ($\sim {10}^{2}\mathrm{\dots}{10}^{3}$ samples). We verified that the codewords are meaningful, in the sense that they predict behavioral features better than alternative approaches. Importantly, most of words in hundreds of dictionaries that we reconstructed were more complex than is usually considered, involving multiple spikes in precisely timed patterns. The multispike, precisely timed nature of the codes was universal across individuals, neurons, and acoustic features, while details of the codes (e.g. specific codewords and their number) showed tremendous variability.
Further, we identified codewords that correlate with three different acoustic features of the behavior (pitch, amplitude, and spectral entropy), and different percentile ranges for each of these acoustic features. Across many of these analyses, various statistics of codewords predicting exploratory vs. typical behaviors were different. Specifically, the exploratory dictionaries contained more codewords than the dictionaries for typical behavior, suggesting that the exploratory spiking patterns are more consistently able to evoke particular behaviors. This is surprising since the exploratory behavior is usually viewed as being noisier than the typical one. Crucially, exploration is a fundamental aspect of sensorimotor learning (Tumer and Brainard, 2007; Kuebrich and Sober, 2015; Kelly and Sober, 2014), and it has been argued that large deviations in behaviors are crucial to explaining the observed learning phenomenology (Zhou et al., 2018). However, the neural basis for controlling exploration vs. typical performance is not well understood. Intriguingly, vocal motor exploration in songbirds is driven by the output of a corticalbasal gangliathalamocortical circuit, and lesions of the output nucleus of this circuit (area LMAN) abolishes the exploratory (larger) pitch deviations (Kao et al., 2005; Olveczky et al., 2005). Our findings therefore suggest that the careful selection of the spike patterns most consistently able to drive behavior may be a key function of basal ganglia circuits.
While the identified codewords are statistically significant, and, for the songbird data, we show that they can predict the behavior better than larger, but nonspecific features of the neural activity, a crucial future test of our findings will be in establishing biological significance of uBIA findings. In the context of the neural motor control, biological significance may be in establishing the causal rather than merely correlative nature of the codewords, which can be done by stimulating neurons with patterns of pulses mimicking the codewords (Srivastava et al., 2017). Such verification will be facilitated by the speed of our method, which can reconstruct dictionaries in real time on a laptop computer. The speed and the relatively modest data requirements by uBIA will also allow us to explore how populationlevel dictionaries are built from the activity of individual neurons, how control of complex behaviors differs from control of their constituent features, how the dictionaries develop and are modified in development, and whether the structure of dictionaries as a whole can be predicted from various biomechanical and informationtheoretic optimization principles.
In building uBIA, we have made a lot of simplifying assumptions. For example, in applications to neural populations, uBIA would explore only symmetric statistical correlations, while the physical neural connectivity is certainly asymmetric. Similarly, in the analysis of activity of a neuron over time bins in the current work, we did not account for causality. Further, we assumed that all data are stationary for the duration of the experiment, which will break down for longer experiments. Making useful biological interpretation of uBIA findings and designing better perturbation experiments may depend on our ability to lift some of these restrictions. The interpretation may be aided by extending uBIA to different null models. The current null model assumes independence between the units—a common assumption in the field. Detecting words that are represented anomalously compared to more complex null models, such as those preserving pairwise correlations, may focus the analysis on the most surprising, and hence maybe easier to interpret, codewords. This is a feasible future extension of uBIA. However, one would have to be careful, since, for datasets with $M\sim {10}^{2}\mathrm{\dots}{10}^{3}$, pairwise correlations are not known well themselves, which may introduce additional uncertainty into the interpretation.
We tested uBIA on a small number of biological and synthetic datasets. However, for our method to be broadly applicable, users will need to adjust our the hyperparameters, for example, the significance threshold $m$ and the inverse regularization strength $\u03f5$, depending on the statistical structure of their data. Different values may be required depending on the dataset size and the prevalence and strength of higherorder interactions. Although it is hard to say a priori what these adjustments might be, we offer a few suggestions. First, we note that false negatives (failure to identify a significant pattern) should not be considered a failure point major problem in the undersampled regime, since it is manifestly impossible to observe all patterns with relatively small datasetspossible coding patterns with little data. More worrying would be false positive errors, in which statistically insignificant patterns are identified as code wordscodewords. However, our algorithms algorithm offers a selfconsistency check: a single pattern should not be identified as predicting both the presence and the absence of a behavior. One can search for such errors by relabeling the presence as 0 or 1 and repeating the analysis, see Figure 5A. Hyperparameters should therefore be adjusted to keep the number of such cases, and with them the rate of all false positives, below an acceptable threshold More generally, one can do similar tests by relabeling individuals variables 0→1 —words and their partial negations should not be both anomalously represented. The fraction of such incorrectly identified words is a good measure of the false positives rate. Then one should adjust the detection threshold $m$ to the smallest value that keeps the false discovery rate acceptable to the user.
Another important parameter is the inverse regularization strength $\u03f5$. We would like to keep it as large as possible, so that the regularization is the weakest. At the same time, our perturbative analysis depends crucially on $\u03f5$ being small. This suggests a tradeoff for choosing $\u03f5$: make it as large as possible, but such that the perturbative constraints, discussed after Equation 17 are satisfied. This value will also depend on a specific dataset and cannot be predicted a priori.
Finally, additional evidence of biological significance of the method will need to come from its application to other types of data, such as, in particular, molecular sequences, or species abundances in ecology. Crucial for this will be the match between the assumptions of our method (e.g. no dominant words) and the actual data in specific applications: there is no way to say a priori when this will happen, and one will simply need to try. Crucially, in all cases, if the method works, we expect it to be fast and to work well even for problems with large $N$. In part, this is because the accuracy of the method does not collapse for undersampled problems (large $N$ and not too large $M$, Figure 3), and its computational complexity is limited not by $N$, but by the number of distinct words that occur in data.
Materials and methods
Overview of prior related methods in the lterature
Request a detailed protocolAs we pointed out in the main text, a number of different methods have been developed for reconstructing various biological dictionaries, or for the related problem of building the model of the underlying probability distribution. It is important to compare and contrast uBIA to these methods in order to highlight when it should be used. Since these prior methods have been especially common in neuroscience, and since our main biological application throughout this article is also in neuroscience, this is where we will focus our comparisons.
For many different experimental systems, it has been possible to measure the information content of spike trains (Fairhall et al., 2012; Tang et al., 2014; Srivastava et al., 2017), but the question of decoding – which spike patterns carry this information and how? – has turned out to be a much harder one. Most of the effort has been expended on decoding per se: building the model of the activity distribution, rather than deciding which specific spike patterns should belong to the model. Multiple approaches have been used, whether in the context of sensory or motor systems, starting with linear decoding methods (Bialek et al., 1991). All have fallen a bit short, especially in the context of motor codes in natural settings, where an animal is free to perform any one of many behaviors it wishes, and hence statistics are usually poor, with only a few samples per behavior. A leading method is Generalized Linear Models (GLMs) (Paninski, 2004; Pillow et al., 2008; Gerwinn et al., 2010), which encode the rate of spike generation from a certain neuron at a certain time as a nonlinear function of a linear combination of past stimuli (sensory systems) or of future motor behavior (motor systems) and the past spiking activity of a neuron and its presynaptic partners. GLM approaches can detect the importance of the timing of individual spikes and sometimes interspike intervals for information encoding, but generalizations to detect importance of higher order spiking patters are not yet well established. Another common approach is based on maximum entropy (MaxEnt) models (Schneidman et al., 2006; GranotAtedgi et al., 2013; Savin and Tkačik, 2017). These replace the true distribution of the data with the least constrained (i. e., maximum entropy) approximation consistent with loworder, wellsampled correlation functions of the distribution. The In some versions of the method, one then searches for the constraints that affect the distributions the most, and only focuses on those, thus avoiding overfitting (BarratCharlaix et al., 2021). The MaxEnt approach is computationally intensive, especially when higher order correlations are constrained by data. As a result, almost all of the applications focus on, at most, constraining pairwise activities in the data. At the same time, to approximate empirical distributions well, a large number of such constraints is constraints—even just pairwise ones—is often required. This requires very large datasets, especially if one is interested in relating the neural activity to the external (behavioral or sensory) signals. Such large datasets are hard to obtain in the motor control setting. More recently, feedforward and recurrent artificial neural network approaches have been used to decode largescale neural activity (Pandarinath et al., 2018; Glaser et al., 2017), but these have focused primarily on neural firing rates over large (tens of milliseconds) temporal windows, and typically require larger datasets than considered here.
As a result of the large data set size requirement and of the focus on building the model of the neural activity rather than finding statistically anomalous features in it, to date, there have not been successful attempts to reconstruct neural dictionaries from data. A success method must (i) resolve spike timing in words of the dictionary to a high temporal resolution, (ii) be comprehensive and orthographically complex, not limiting the words to just single spikes or pairs of spikes, and (iii) discount correlations among spiking words to produce irreducible dictionaries that only detect those codewords that cannot be explained away by correlations with other words in the dictionary.
There are methods (Ganmor et al., 2015; Prentice et al., 2016) that are more closely related to our approach, and which can be used within the same pipeline as uBIA, potentially for even better results —specifically in an undersampled regime. When modeling highdimensional data using too few samples, all such methods tend to decrease complexity of their fitted models (MacKay, 1992), such as using fewer parameters or sparser dictionaries. In this regime, such methods can be improved if there is some a priori information regarding which elements of the model must be fitted first. Here, uBIA could be invaluable: it can be used first to choose the features to be included in models, and then a complementary algorithm can be used to actually construct a model on these feature, not unlike we did with the logistic regression in this work. For example, Ganmor et al., 2015 proposed to understand ganglion cell population activity in relation to the stimuli they encode, and they used a pairwise MaxEnt approach to model the (undersampled) probability distributions of neural activities conditional on specific stimuli. Alternatively, uBIA can be used first to identify conditional neural dictionaries (which will include only a subset of neural pairs, but also potentially higher order combinations of neurons). Then the conditional MaxEnt models can be build based on such detected conditional dictionaries, potentially alleviating the undersampling problem. Crucially, the same approach can applied to other models, not just the loglinear model, Equation 1, or the MaxEnt model. Indeed, as long as the loglikelihood function is specified within a model, and the derivatives of the loglikelihood with respect to the model parameters can be evaluated, similar to Equation 9, then we can use uBIA to calculate the fields and the exchange interactions, and eventually the probability of inclusion of any term in the model.
On the other hand, as the number of samples increases, we might want to increase the complexity of the involved models by adding additional explanatory terms. Choosing which terms to add is a combinatorially complex problem, and one generally wants to add terms that are nonredundant. Here, uBIA can be useful as well by providing a broad picture of how various terms in the more complex models interact, and hence biasing us towards growing the model complexity in complementary, rather than competing directions. For example Prentice et al., 2016 considered a Hidden Markov model to describe ganglion cell activity, where different hidden modes activated at different time points. For a fixed mode, the probability distribution of the neural population activity was modeled as belonging to an exponential family. Before adding a new mode, one can calculate the uBIA exchange interactions between it and the already existing modes, as in Equation 9. If the exchange interactions are negative, the new mode is (partially) redundant with the existing one, and should probably be skipped in favor of another, more independent mode. Again, as long as the likelihood function can be written analytically and is differentiable with respect to the model parameters, uBIA can be applied in this way.
Direct application of MaxEnt methods to synthetic and experimental data
Request a detailed protocolTo illustrate that traditional MaxEnt methods for creating generative models do not work in our undersampled data regime, we apply the methods described in Ganmor et al., 2015 to both our synthetic dataset and to our data collected from songbirds. First, note that the Ganmor method assumes that activity is dominated by a single (silent) state and then detects words in a hierarchical fashion. Specifically, higher order patterns (i.e. deviations from the silent state) cannot be detected unless all constituent lower order patterns have already been shown to be statistically significant. This requires datasets much larger than our approach, which can identify higherorder patterns directly, as illustrated in Figure 7.
Geometric interpretation of uBIA field
Request a detailed protocolThe geometric interpretation of the ${h}_{\mu}$ fields in Equation 9 is illustrated in Figure 8, and it showcases a part of how uBIA weights the addition of new parameters in terms of the improvement of the fitting, but without the need of building a explicit model. In relation to a null model located at ${\theta}^{*}$, the inclusion of a new parameter ${\theta}_{\mu}$ in general will improve the value of the loglikelihood $\mathcal{L}$. This improvement would have an approximated value of $\mathrm{\Delta}{\stackrel{~}{\mathcal{L}}}_{\mu}$ and it would require to move from ${\theta}^{*}$ a distance $\mathrm{\Delta}{\stackrel{~}{\theta}}_{\mu}$. The sign of the field ${h}_{\mu}$, which indicates the presence or absence of the parameter ${\theta}_{\mu}$, is determined by how big is the improvement, and the magnitude of the field by how far you need to move to fit such parameter. Then a rather small parameter that provides an acceptable improvement to the loglikelihood will have a high field.
Effect of absent words
Request a detailed protocolTo calculate posterior expectations of inclusion of words, we focus only on words that appear in a specific dataset. There are many more words that do not, and the effect of these absent words on uBIA results must be analyzed.
We start with noticing that, of the exponentially many possible words, majority do not happen in a realistic data set. In particular, this includes most of long words. At the same time, a priori expectations for the frequency of such words, Equation 13, decrease exponentially fast with the word length. Thus the fields, Equation 10, for the words that do not occur are small, and the posterior expectation for including these words in the dictionary is $\u27e8{s}_{\mu}\u27e9\approx 1/2$, so that we do not need to analyze them explicitly. A bit more complicated is the fact that all words affect each other’s probability to be included in the dictionary through the exchange couplings ${J}_{\mu \nu}$, so that, in principle, the sum in the mean field equations, Equation 17, is over exponentially many terms. Thus it is possible for the absent words collectively to have a significant effect on the probability of inclusion of more common words into the dictionary. Here, we show that this collective effect on the interaction terms is exponentially small in $N$, as long as the empirical averages ${n}_{i}/M\ll 1$.
To illustrate this, we start with the probabilities $p({\sigma}_{i}=1)={p}_{i}$ of a single variable $i$ being active. We then define the average such probability $q={N}^{1}{\sum}_{i}{p}_{i}$. Without the loss of generality, we assume $q<1/2$, and otherwise we rename ${\sigma}_{i}\to 1{\sigma}_{i}$. Denoting a long word of a high order $k$ that does not occur in the data as ${\mathit{\sigma}}_{\omega}$, we have ${n}_{\omega}=0$. Then the corresponding field is
Here, we consider as high order words those, for which ${q}^{k}M\ll 1$ (in general, $\u27e8{\sigma}_{\omega}\u27e9M=\u27e8{n}_{\omega}\u27e9\ll 1$, which happens for $k\sim 4\mathrm{\dots}5$ for our datasets). Then the magnetization is
This illustrates our first assertion that none of these nonoccurring words will be included in the dictionary. However, as a group, they may still have an effect on words of lower orders. To estimate this effect, for a word ${\mathit{\sigma}}_{\mu}$ of a low order k_{0}, we calculate the effective field ${\stackrel{~}{h}}_{\mu}^{\text{eff}}$, which all of the nonoccurring words ${\mathit{\sigma}}_{\omega}$ have on it. First we notice that, if ${V}_{\mu}$ and ${V}_{\omega}$ do not overlap, then their covariance is zero, and ${J}_{\mu \omega}=0$. That is, only highorder words that overlap with ${V}_{\mu}$ can contribute to ${\stackrel{~}{h}}_{\mu}^{\text{eff}}$. Since $\mathrm{cov}({\mathit{\sigma}}_{\mu},{\mathit{\sigma}}_{\omega})\sim {q}^{k}(1{q}^{{k}_{0}})$, the couplings are
Using Equation 16, this gives for the typical effective field that absent words have on the word μ
where the number of words of order $k$ that overlap with ${\mathit{\sigma}}_{\mu}$ and can affect it is given by the combinatorial coefficient $N(k)\simeq \left(\genfrac{}{}{0pt}{}{N{k}_{0}}{k{k}_{0}}\right)$. This has a very sharp peak at $k=(N+{k}_{0})/2$, where $N\simeq {2}^{N{k}_{0}}$. We can approximate the sum in Equation 22 as the argument of the sum evaluated at this peak $k=(N+{k}_{0})/2$, obtaining an effective field coming from high order words
In other words, even the combined effect of all higher order absent words is small if the average frequency of individual letters is smaller than 1/2. We thus can disregard all nonoccurring words in the mean field equations.
We stress that, for this to hold, the average of the binary variables ${\sigma}_{i}$ must be small, $q={N}^{1}\sum _{i}p({\sigma}_{i}=1)<1/2$. In our songbird dataset, this condition was fulfilled with $q\sim 0.2$. However, in 4% of cases the probability to have a spike in a certain time bin was ${p}_{i}>1/2$. Thus to stay on the safe side, we performed additional analyses by redefining variables as ${\sigma}_{i}\to 1{\sigma}_{i}$ if the presence of a spike in a bin was >50 %. In other words, in such cases, we defined the absence of the spike as 1 and the presence as 0. For our datasets, the findings did not change with this redefinition.
This previous analysis does not imply that absent words of high order are irrelevant — it only says that they cannot be detected with the available datasets. In the numerical implementation of the method, we filter out long absent words ω such that $\u27e8{\sigma}_{\omega}\u27e9M=\u27e8{n}_{\omega}\u27e9<0.02$, with this cutoff determined by Equation 1820, so that, for these words, ${h}_{\omega}\ll 1$. These words get assigned 1/2 as the posterior probability of inclusion in the dictionary, and their contribution to the mean field equations is neglected. In contrast, if a word ω is absent but $\u27e8{n}_{\omega}\u27e9\ge 0.02$, we include them in the analysis, Equation 8. Such words may turn out to be relevant code words, especially if they happen a lot less frequently than expected a priori.
Testing the predictive power of the uBIA dictionaries
Request a detailed protocolIn this section, we test whether the codewords found in data from songbird premotor neurons can be used to predict the subsequent behavior. We compare two logistic regression models: one that uses the activity in the 20 time bins to predict the behavior and another that only uses as features the activity of the few relevant codewords, usually far fewer than 20. The features corresponding to the codewords are binary, and they are only active when all the time bins of such words are active. This means that the model using the time bins is more complex, as it already has all the information that the codewords model has and more, though it does not account for combinatorial effects of combining spikes into patterns. In order to properly test the predictive power between these two models with different complexity we perform twofold crossvalidation, using a loglikelihood loss function. As is common in these cases, an L2 penalty is included to help the convergence to a solution (the models were implemented with the Classify function from Mathematica, whose optimization is done by the LBFGS algorithm). As shown by Tang et al., 2014, not all neurons in our dataset are timing neurons, or even code for the behavior at all. Thus we restrict the comparison to those cases that have at least 4 codewords (27 case in total, with 10 codewords on average). Both of the logistic regression models have the following structure
where $y$ corresponds to the behavior, and the features correspond to the time bins in one case (${z}_{i}={x}_{i}$) and to the codewords in the other (${z}_{i}={\prod}_{j\in {V}_{i}}{x}_{j}$), while ${\beta}_{i}$ are the coefficients of the model. The loss function used is the loglikelihood with the L2 penalty,
where $M$ is the number of samples, and λ is the regularization strength. In our analysis, as different datasets have different number of samples, we show the results for the mean crossentropy over the test data, which correspond to the normalized loglikelihood.
Tang et al., 2014 showed that individual neurons on average carry around 0.12 bits at a 2ms scale. So for both models, we expect the prediction accuracy to be barely above chance, especially since we are focusing on a particular prediction model (a logistic regression), and may be missing predictive features not easily incorporated in it. Figure 9a shows the scatter plot of accuracy in the 27 analyzed datasets, plotting the prediction using the time bins activity on the horizontal axis versus prediction using only the codewords activity on the vertical one. We observe that the models based on codewords are consistently better than the ones using all the 20 time bins, and the difference is significant (inset). We additionally evaluate the quality of prediction using the mean crossentropy between the model and the data. Figure 9b shows that the models with the codewords have lower mean crossentropies and thus generalize better (see Inset).
Data availability
The software implementation of uBIA is available from GitHub (copy archived at swh:1:rev:a374c0d478958aaf38415c7b616bbdebe83c6219). The data used in this work is available from figshare.
References

Information geometry on hierarchy of probability distributionsInformation Theory, IEEE Transactions On 47:1701–1711.https://doi.org/10.1109/18.930911

Deciphering the spike train of a sensory neuron: counts and temporal patterns in the rat whisker pathwayThe Journal of Neuroscience 26:9216–9226.https://doi.org/10.1523/JNEUROSCI.149106.2006

Reading a neural codeScience (New York, N.Y.) 252:1854–1857.https://doi.org/10.1126/science.2063199

Birdsong and human speech: common themes and mechanismsAnnual Review of Neuroscience 22:567–631.https://doi.org/10.1146/annurev.neuro.22.1.567

Universal mechanisms of sound production and control in birds and mammalsNature Communications 6:8978.https://doi.org/10.1038/ncomms9978

Information theoretic approaches to understanding circuit functionCurrent Opinion in Neurobiology 22:653–659.https://doi.org/10.1016/j.conb.2012.06.005

Bayesian feature selection for highdimensional linear regression via the Ising approximation with applications to genomicsBioinformatics (Oxford, England) 31:1754–1761.https://doi.org/10.1093/bioinformatics/btv037

Bayesian inference for generalized linear models for spiking neuronsFrontiers in Computational Neuroscience 4:12.https://doi.org/10.3389/fncom.2010.00012

Stimulusdependent maximum entropy models of neural population codesPLOS Computational Biology 9:e1002922.https://doi.org/10.1371/journal.pcbi.1002922

Social context modulates singingrelated neural activity in the songbird forebrainNature Neuroscience 2:209–211.https://doi.org/10.1038/6306

Emergent dynamics of laboratory insect swarmsScientific Reports 3:1073.https://doi.org/10.1038/srep01073

A simple computational principle predicts vocal adaptation dynamics across age and error sizeFrontiers in Integrative Neuroscience 8:75.https://doi.org/10.3389/fnint.2014.00075

Spike rate and spike timing contributions to coding taste quality information in rat peripheryFrontiers in Integrative Neuroscience 5:18.https://doi.org/10.3389/fnint.2011.00018

Multivariate dependence and genetic networks inferenceIET Systems Biology 4:428–440.https://doi.org/10.1049/ietsyb.2010.0009

On the Sufficiency of Pairwise Interactions in Maximum Entropy Models of NetworksJournal of Statistical Physics 162:1294–1308.https://doi.org/10.1007/s1095501614565

BookReverseengineering biological networks from large data setsIn: Munsky B, Hlavacek WS, Tsimring L, editors. Quantitative Biology: Theory, Computational Methods, and Models. Cambridge, MA: MIT Press. pp. 213–246.

Neural coding of natural stimuli: information at submillisecond resolutionPLOS Computational Biology 4:e1000025.https://doi.org/10.1371/journal.pcbi.1000025

Renormalizing complex models: It is hard without landauJournal Club Condensed Matter Physics 941:868–899.https://doi.org/10.1016/j.nuclphysb.2018.07.004

Maximum likelihood estimation of cascade pointprocess neural encoding modelsNetwork (Bristol, England) 15:243–262.

Collective animal behavior from Bayesian estimation and probability matchingPLOS Computational Biology 7:e1002282.https://doi.org/10.1371/journal.pcbi.1002282

ErrorRobust Modes of the Retinal Population CodePLOS Computational Biology 12:e1005148.https://doi.org/10.1371/journal.pcbi.1005148

Temporal coding of visual information in the thalamusThe Journal of Neuroscience 20:5392–5400.

Spiketiming precision underlies the coding efficiency of auditory receptor neuronsJournal of Neurophysiology 95:2541–2552.https://doi.org/10.1152/jn.00891.2005

Maximum entropy models as a tool for building precise neural controlsCurrent Opinion in Neurobiology 46:120–126.https://doi.org/10.1016/j.conb.2017.08.001

Zipf’s law and criticality in multivariate data without finetuningPhysical Review Letters 113:068102.https://doi.org/10.1103/PhysRevLett.113.068102

Central contributions to acoustic variation in birdsongThe Journal of Neuroscience 28:10370–10379.https://doi.org/10.1523/JNEUROSCI.244808.2008

Adult birdsong is actively maintained by error correctionNature Neuroscience 12:927–931.https://doi.org/10.1038/nn.2336

Millisecond Spike Timing Codes for Motor ControlTrends in Neurosciences 41:644–648.https://doi.org/10.1016/j.tins.2018.08.010

Neural coding: The enigma of the brainCurrent Biology 5:1370–1371.https://doi.org/10.1016/s09609822(95)002739

Entropy and Information in Neural Spike TrainsPhysical Review Letters 80:197–200.https://doi.org/10.1103/PhysRevLett.80.197

Linked control of syllable sequence and phonology in birdsongThe Journal of Neuroscience 30:12936–12949.https://doi.org/10.1523/JNEUROSCI.269010.2010
Decision letter

Tatyana O SharpeeReviewing Editor; Salk Institute for Biological Studies, United States

Aleksandra M WalczakSenior Editor; CNRS LPENS, France
Our editorial process produces two outputs: i) public reviews designed to be posted alongside the preprint for the benefit of readers; ii) feedback on the manuscript for the authors, including requests for revisions, shown below. We also include an acceptance summary that explains what the editors found interesting or important about the work.
Decision letter after peer review:
[Editors’ note: the authors submitted for reconsideration following the decision after peer review. What follows is the decision letter after the first round of review.]
Thank you for submitting your work entitled "Unsupervised Bayesian Ising Approximation for revealing the neural dictionary in songbirds" for consideration by eLife. Your article has been reviewed by 2 peer reviewers, and the evaluation has been overseen by a Reviewing Editor and a Senior Editor. The reviewers have opted to remain anonymous.
Our decision has been reached after consultation between the reviewers and your reviewing editor. It took an unusually long time to reach this decision due to both personal and global health emergencies. We are sorry for the delay. Based on these discussions and the individual reviews below, we regret to inform you that this manuscript will not be considered further for publication in eLife as a Research Article.
However, your reviewing editor and reviewers have agreed that they would be willing to consider a revised version of the paper, reformatted as a "Tools and Resources" paper, if you choose to resubmit it as such to eLife. That revised paper should address fully the methodological and technical questions raised by the reviewers and your reviewing editor in this letter. As it stands, the reviewing team did not feel that they had enough information to properly judge the merits of the method, and that would be reassessed if a new and revised Tools and Resources manuscript is submitted.
The individual reviewer comments are appended below and can be used either as directives for that new submission, or as helpful pointers for a submission elsewhere. Summary thoughts are included here, with the detailed reviews from the 2 reviewers included at the end.
The paper details a new approach to learning the input/output relationships in complex biological systems that drive things like motor behavior or (perhaps) protein folding and other processes. The work focuses on birdsong as an example of a complex motor task potentially modulated by precise temporal patterns of spiking in the bird motor cortex (RA). The method claims to be able to use very few jointly recorded samples of the input and output to learn both a model for the distribution of the spiking patterns observed as well as how they drive behavior. The method uses strong priors in a Bayesian Ising Approximation to identify the neural code words that are expressed at unusually high or low frequency with respect to the behavioral output. The authors find a large number of spike patterns correlated with behavioral variations in syllable pitch. The method has potential applications beyond birdsong; the ability to fit input/output relationships with very few samples is appealing.
The reviewing team, however, had some major reservations about the results and felt that the current draft did not give them enough details to properly evaluate the method. A summary of those major points is given here:
1) The method:
The reviewers felt that there wasn't enough detail presented to clearly analyze the results of the model. Much of the detail was in the SI and Methods, where the exposition wasn't clear or detailed enough for the reader to properly assess the results.
Questions raised included:
– What's the null model?
– How does this method scale to larger neural populations?
– Is the clustering used here justified? Can it be compared to other, simpler clustering methods?
– How do you justify the choice of prior and how are the unknowns fit from the data?
– Other extant methods for finding reliable interactions in exponential models were not referenced or compared.
– Artificial data used in the SI are based on properties of the recorded neurons, and as the authors explain the cells seem to be quite weakly coupled, so the relevance of the model for other data is not clear.
2) The behavioral results:
The reviewers were not convinced by the pitch results, primarily because of the coarsegraining of pitch into just two categories. Again, much of the explanation was relegated to the Methods part of the paper and could have been presented more clearly. If this method works, this is clearly interesting, but the results presented seemed rather weak.
3) The format of the paper:
The paper read a bit like 3 manuscripts glued together, each of which has merits, but none of which seemed to satisfactorily stand on its own. The paper starts with a grand introduction that talks about solving a major challenge in quantitative biology. Then it quickly dives into something that reads more like a methods paper, but with insufficient exposition and clarity. Finally, the paper shifts to the application of the method to behavior and neural data, the results of which seemed somewhat weak and preliminary.
Taking these critiques together, the reviewing team felt that the paper could not be reasonably revised for eLife as a Research Article. However, the team did find the central method potentially interesting if more details are provided that allow for proper assessment of it. The team felt that it could be a good "Tools and Resources" advance, if the method holds up to this more detailed scrutiny.
Reviewer 2:
The paper is a potentially interesting application of a recent statistical method, the Bayesian Ising Approximation (Fisher and Mehta), aimed at addressing the problem of estimating a complex distribution, in particular the relationship between a code and an output, from very few samples. This is accomplished by heavily weighting by the prior. The authors have applied this method to support the idea that there are precisely timed spike patterns that are especially significant in the generation of behavior, in this case in the output of syllables of birdsong. They use the method to identify "code words", patterns that are claimed to appear with notably high or low frequency in correspondence with a behavioral variable. They apply this method to find a larger number of precise spike patterns correlated with behavioral variations in the tails of the behavioral distribution. They suggest an interpretation that perhaps larger exploratory motor gestures require a more precise spiking code; although it might also be argued that this shows that there are more ways to achieve these larger fluctuations? While the claim is counterintuitive, it rests on the power of the method, whose exposition left a number of gaps.
1. It is unclear to what extent this goal is meaningful in the case that the paper analyses: premotor drive of muscles in the birdsong pathway. It seems evident that timing of RA spikes necessarily sculpt motor output; and the authors have previously analyzed this in detail to reveal timing precision (Tang et al.). It was not clear, however, that specific patterns meaningfully emerge from a sea of similar spike trains to encode exact outputs. The authors ask whether precise patterns in 2ms bins have a significant relationship with binarized song characteristics in 40ms binsa very low fidelity relationship on the "behavior" side compared with a high level of precision on the neural side. This seems a dramatic discrepancy between the dimensionality of neural response vs behavior that deserves more justification. How general a space of coding relationships (nonmonotonic, etc) does this method explore given this binarization?
2. Further, it was unclear whether a temporally structured firing rate would be equally convincing as a representation of behavioral variation. While the concept of "irreducible dictionaries" is appealing, does this method fairly compare a timing code to a rate code? For example, doesn't the competition component of the model suppress the identification of firing ratelike contributions? Looking at Figure 1, the cartoon of eliminating overlapping but similar code words seem equivalent to ignoring a less temporally precise method of coding for behavior. That is to say, if you are forcing your model to choose one of several similar code words in a winnertake all inclusion choice, isn't the timing precision you see partially a factor of not looking at the other similar code words which didn't win out? The comparison of this method to GLM methods of relating the spiking to behavior is unclear. One alternative model was analyzed in the online methods, but is not clearly explained and such comparisons deserve to be treated in the main text. It should also be easy to report the simple 0th order analysis of how well you can predict the binary version of behavior simply with a spike count within the 40 ms window, or with some intermediate level of temporal variation. Seeing these comparisons as a control of the assertion that precise timing matters in this system would enhance intuition for the necessity of this method and better justify the claim "that they predict behavioral features better than other approaches" (line 256).
3. In general, the exposition of the method leaves many questions and could be improved considerably in clarity. I have tried to summarize the logic of the paper's core argument below and underscored throughout where more help could be given to understand the meaning and generality of assumptions made.
Logic of the calculation:
The paper sets up the problem of writing down a model for the joint probability of a behavioral event s0 and the firing of a spike in each, i, of N bins, σi. The authors write this in terms of an expansion over every possible combination of σi,
Log P(σθμ) = log Z + Σ θμ Π _{i in Vμ} σi.
Q3.1: please expand a little on the exponential form. In maximum entropy approaches, this form of distribution arises from the maximum entropy solution, but if we are avoiding MaxEnt approaches, can you briefly justify it? In this approach, a firing pattern with spikes in bins (1,2,3,4) is considered to be independent of the events (1,2) and (3,4); whereas in maxent, if one is only expanding up to order 2, one hypothesizes that the probability of the word (1,2,3,4) will be absorbed into the pairwise probabilities, and one can ask how different the data is from that guess. It seems that here one is parametrizing with vastly more parametersone for every possible spike/behavior combination without a systematic method of cutting them down (or at least one that was clear to me; it is addressed at the end of the Online Methods calculations once we have moved into the s variables but it would be helpful to get some intuition from the start).
One then defines "indicator variables" sμ which tell us which patterns appear significantly less or more often than one would expect under some null model. It would be good to specify this null model right away, and in straightforward terms; I don't see it defined in the main text at all.
Q3.2: Please explain at this point the null model for the frequency of appearance of any possible pattern of spikes and its correlation with behavior? Should it not depend on the specific behavior?
It is unclear why one is now allowed to write that the original definition of the model is EQUAL to the model now written including only terms defined by a criterion defined by our subjective interest in them:
Log P(σθ) = log Z + Σ θμ sμ Π _{i in Vμ} σi
Q3.3: Why does one not here need to define a new and different probability distribution, something that makes some claim about belonging to the codeword dictionary, in which these terms are the only ones that appear?
Now the authors turn to the task of estimating the 2N parameters θμ. Instead of doing this estimation directly from the data (i.e. the a posteriori expectation), they use a prior on these values: they choose to assume that each of the values are narrowly distributed about some mean θ∗μ:
P(θ∗μ : sμ=1) ~ exp((θμ −θ∗μ)2).
This distribution is taken only to hold if the word is indeed a code word, ie statistically associated with a behavior. If it is not a code word, the prior is not defined.
Q3.4: Please justify this choice in more detail. Why are the epsilons not dependent on μ?
Q3.5: The argument for where these 2N values of θ∗μ come from is very unclear. Please spell out more clearly how constraining the means of the individual σis provides sufficient constraints for all 2N values, or why this is not needed.
Q3.6: Why does a code word have an a priori likelihood of 0.5 of being in the dictionary?
Now rather than estimating or computing the values of the θμs, the next step is to flip the variables in the distribution we seek, to P(stheta), and then integrate out the thetas, based (it seems?) only on their prior distributions, Equation (5).
Q3.7: Now that s can again take value of either 0 or 1, why did we not need the distribution over θs in the case that s = 0? How can a distribution over s on the LHS in Equation (5) limit itself to the case of s = 1 on the RHS?
The calculations in the online methods may resolve some of these issues? But these all appear to start from Equation (5) and at least for me at this stage, the route to this equation is unclear.
Reviewer #3:
The manuscript suggests a new approach to learning coding dictionaries. The method is applied to neural population data recorded in singing birds. While the idea of the approach is interesting, it was hard to understand the details or how well it works. It seems that too much has been pushed to the SI, and more controls and examples over synthetic data sets would be useful. Moreover, and importantly – it was hard to interpret the results. Also, the current approach was not compared against other methods presented in recent years for learning metrics on neural population patterns, or even more simple clustering approaches.
Briefly, the authors use a loglinear model for the population words of N neurons and two behavioral states – using 2^(N+1) terms that span all potential orders of interactions among cells (theta's). To avoid arbitrary cutoff on the order of interactions that they consider, and use the important ones, they use a constructed prior distribution over theta's to learn a posterior model over the population words after integrating out the theta's. This gives an Ising model over 2^N interacting 'spins' where each sping now stands for a word that may or may not have appeared in the sampled data. The local fields acting on each of these spins is related to the prob of appearance of the words the spins stand for, and the pairwise interaction terms stand for an overlap between the words. These interactions now imply a 'competition' between words in terms of the explanatory power they carry (and are therefore typically negative)
Thus, if I understand this correctly, the constructed model gives a form of compression of the population 'codebook' by picking words that are dominant to be included in the dictionary and omitting similar words with less explanatory power and frequency.
Critically, it was unclear to me how one should evaluate the results. It is hard to interpret the structure of the dictionary in Figures45, and in particular over the pooling of conditions in Figure 5. In addition, there have been different approaches towards learning neural dictionaries in recent years, which are not mentioned here or compared against the current approach. Maybe this is too naïve, but it seems natural to ask how well would naïve clustering or Editdistance based methods work here? Then, how does the current approach compare to more structured metrics such as those presented in Ganmor et al. eLife 2015, Prentice et al. Plos Comp Biol 2016, Rubin et al. Nature Comm, 2019, or Chaudhuri et al. Nature Neuro, 2019 ?
Finally, how would approach work for larger populations, where all observed patterns would appear just once?
It was not clear how should one interpret the nature of the organization of patterns in Figure 4B (the gap between lines, their very linear dependency?)
[Editors’ note: further revisions were suggested prior to acceptance, as described below.]
Thank you for submitting your article "Unsupervised Bayesian Ising Approximation for decoding neural activity and other biological dictionaries" for consideration by eLife. Your article has been reviewed by 2 peer reviewers, and the evaluation has been overseen by a Reviewing Editor and Aleksandra Walczak as the Senior Editor. The reviewers have opted to remain anonymous.
The reviewers have discussed their reviews with one another, and the Reviewing Editor has drafted this to help you prepare a revised submission.
Essential revisions:
1) The numerical comparison to other existing methods is minimal and should be expanded. The analysis of the irreducibility of the codewords has insufficient support based on the numerical simulations. Moreover, the generality of the tool and comparison to other methods are discussed in almost entirely theoretical terms, which makes the claim on immediate utility for other datasets less convincing, especially outside the neuroscience community. In particular, the identified codewords are relatively short (3rd, 4th order stats) and N=20 is easy to fit an Ising model; so one would think that it would be relatively straightforward to fit the Ganmor model jointly. At least on artificial data one should be able to explore a larger N regime where there is a more immediate gain relative to existing models.
2) Although the paper is written as a methods paper, emphasizing the technical contributions and promising wide applicability to a range of different types of datasets, the numerical validation of the method is very much restricted to the statistical regime of the songbird dataset. From the perspective of a potential future user of the tool it's less clear how the method would behave on different datasets, and what needs to happen in practice for adopting the tool to data with different statistics. Based on current content, it would be better to focus the abstract and introduction on applications that are actually studied here.
3) The songbird analysis already reveals some challenges with respect to interpretability: in particular it is not clear how much information about the underlying neural processes can be revealed by summary statistics generated by the method, such as the number of codewords and their length distribution.
4) Limiting the analysis to N=20 patterns with somewhat random code structure makes it hard for a potential user of the tool to work out what needs to change when switching to a novel dataset (with respect to N, M, codeword complexity, sparsity of neural responses). Some work could be invested in spelling out the considerations of applicability with respect to quantities that an experimentalist would know about the data.
5) Is it possible to include multiple binary quantifications of behavior, similarly to how words are constructed from neural spike trains? For example, one can envision describing a particular song segment with respect to multiple binary features simultaneously.
Reviewer #1 (Recommendations for the authors):
Line 168: I think the notation here should P(s_mu=0) not s_mu=1.
Reviewer #2 (Recommendations for the authors):
I am overall very excited about the framework and I see significant technical potential in the approach. Nonetheless, there are several missed opportunities in the numerical analysis, and things that could and should probably be improved to maximize the impact of the work on the broader community.
– Limiting the analysis to N=20 patterns with somewhat random code structure makes it hard for a potential user of the tool to work out what needs to change when switching to a novel dataset (with respect to N, M, codeword complexity, sparsity of neural responses). Some work could be invested in spelling out the considerations of applicability with respect to quantities that an experimentalist would know about the data.
– The codeword irreducibility could be analyzed in more detail, again in terms of experimentally relevant quantities.
– The comparison to other methods is minimal. The identified codewords are relatively short (third, fourth order stats) and N=20 is easy to fit an Ising model; so I'd think that it would be relatively straightforward to fit the Ganmor model jointly. At least on artificial data one should be able to explore a larger N regime where there is a more immediate gain relative to existing models.
[Editors’ note: further revisions were suggested prior to acceptance, as described below.]
Thank you for resubmitting your work entitled "Unsupervised Bayesian Ising Approximation for decoding neural activity and other biological dictionaries" for further consideration by eLife. Your revised article has been evaluated by Aleksandra Walczak (Senior Editor) and a Reviewing Editor.
The manuscript has been improved but there are some remaining issues that need to be addressed, as outlined below:
Essential revisions:
Thank you for the revised manuscript. The reviewers found the manuscript much improved and close to acceptance. Nevertheless, reviewers are suggesting a few additional analyses to demonstrate the advantages of the method and increase its impact.
1) Please consider the null model that includes pairwise correlations.
2) Please test the method to a larger dataset.
Reviewer #2 (Recommendations for the authors):
I generally find that the revisions have improved the overall clarity of the text, and addressed most of my questions.
Nonetheless, I found the answer to two of my concerns underwhelming:
– Discussing the practical considerations about translating the approach to other datasets (new discussion paragraph) is quite vague, essentially saying that the hyperparameters will more or less have to be figured out de novo. I was expecting a quantitative description of how different statistical properties of the data affect the process, which would translate into something closer to a concrete recipe for hyperparameters adjustment. The fact that this is not straightforward to do makes a broad adoption of the tool less likely.
– The concern about the overall interpretability of the extracted statistics remains unaddressed. I don't agree with the position is that "it only matters to estimate these statistics well, interpretation can come later." Being able to estimate a quantity does not necessarily make it useful to do. In the context of songbird specifically, the paper make the case for the results being difficult to interpret even when the statistics are well estimated. This seems like a pretty big problem, at least big enough to warrant a minimum amount of thought and a couple of sentences in the discussion.
Reviewer #3 (Recommendations for the authors):
The authors already made a great effort to improve their manuscript. However, from my side, I provide the following concerns about the significance and broad impact of the current revised manuscript.
Strength of the manuscript
The traditional model of maximum entropy type requires a large amount of data to predict effective interactions among elements (e.g., neurons), and fails to identify statistically over or underrepresented (relative to a null model) codewords, and thus the neuralbehavior prediction remains elusive. This manuscript provides a novel approach that can systematically search for the significant codewords, yet requiring fewer data samples (compared to the number of higherorder model parameters).
The method first writes all orders of interactions in a probability distribution form; finding nonzero higher order interaction parameters could tell which codewords should be included in the neural dictionary and how they matter. To make the method algorithmically solvable, the authors turn the optimization into an Ising model of interacting indicator variables. This indicator variable is responsible for identification of key codewords. The inferred value of the external fields signals the importance of the codeword, while the sign of the highorder interaction parameter reflects whether the word is over or underrepresented. To explain the neuralbehavior correlates, the authors also include the behavior variable into the components of the activity. This framework works consistently in the songbird motor experiments.
Weakness of the manuscript
To relate the neural activity and behavior, the authors test their method in both synthetic data and songbird real data. There are two weakness, considering the impact of the method in a broad context. First, the variable size is limited to N~20, which is severely below the number of neurons the neural data scientists get interested in (e.g., for modeling retinal ganglion cells, and even cortical cells). This may be a tradeoff between all order of interactions to be considered and the number of model parameters. Second, a null model is chosen to be an independent model that neglects correlations in the data, which may affect the characterization of the redundancy of the uBIA model. Even if an irreducible ensemble of codewords is obtained, what is the predictive power of this ensemble with respect to control the macroscopic behavior (e.g., skilled motor control or sensorimotor learning), rather than showing statistics of neuralbehavior activity, remains obscure in the current manuscript.
Considering the strength and weakness of the manuscript, I recommend the following issues for the authors to improve their idea.
1. The proposed model is limited to symmetric couplings and i.i.d data samples. First, in a neural circuit, the asymmetric coupling is common among neurons; Second, the temporal order of each spiking activity may play a key role in encoding information. For example, a paper "Blindfold learning of an accurate neural metric" (PNAS, 2018) takes the temporal information into account. When putting in the context of neuralbehavior relationship, these two factors would become important. But they seem neglected in the current manuscript.
2. To achieve a higherorderinteractionincluded and nonredundant model is really important problem in the field. For example, the reliable interaction model proposed in ref. 26 is not fully compared in the current manuscript, i.e., what is the new advance? This is better to clearly demonstrated in a plot in the main text. Second, a recent paper (Phys Rev E, 104, 024407, 2021) used an informationbased criterion to identify statistically significant couplings, which may be applicable to the context the authors consider here. In this eLife submission, the method relies on the magnetization inclusion threshold, which may be expensive for real data analysis.
3. On the equation 1, the authors seem to use the local model parameter to detect the global significance of the collective codeword, which I could not understand very well, because a codeword is determined by all order of interactions considered in the loglinear model.
4. Technically, the logic going from Equation 7 to Equation 8 is broken, since the sdependence in Equation 8 does not naturally arise from Equation7.
5. I could not understand well how the sign of reflects whether the word is over or underrepresented, and even, how the parameter account for the reducibility of the dictionaries. This part is better to be expanded in the manuscript, although these parameters have highly nonlinear function relationship.
https://doi.org/10.7554/eLife.68192.sa1Author response
[Editors’ note: the authors resubmitted a revised version of the paper for consideration. What follows is the authors’ response to the first round of review.]
The individual reviewer comments are appended below and can be used either as directives for that new submission, or as helpful pointers for a submission elsewhere. Summary thoughts are included here, with the detailed reviews from the 2 reviewers included at the end.
The paper details a new approach to learning the input/output relationships in complex biological systems that drive things like motor behavior or (perhaps) protein folding and other processes. The work focuses on birdsong as an example of a complex motor task potentially modulated by precise temporal patterns of spiking in the bird motor cortex (RA). The method claims to be able to use very few jointly recorded samples of the input and output to learn both a model for the distribution of the spiking patterns observed as well as how they drive behavior. The method uses strong priors in a Bayesian Ising Approximation to identify the neural code words that are expressed at unusually high or low frequency with respect to the behavioral output. The authors find a large number of spike patterns correlated with behavioral variations in syllable pitch. The method has potential applications beyond birdsong; the ability to fit input/output relationships with very few samples is appealing.
This is an excellent overall summary of the paper. However, there is one crucial aspect of our approach which is summarized incorrectly and which we clarify briey here as well as in detail in the revised manuscript: uBIA does not fit or reconstruct the input/output relationship from data. Rather, it detects which keywords contribute statistically significantly to the input/output dictionary, but unlike many other methods, it does not build a model of the input/output relationship. This is crucial, since it is one of the reasons why comparing uBIA to other approaches is hard, and it is also one of the main reasons why uBIA can work with fewer assumptions and smaller datasets compared to other methods. We have emphasized that uBIA does not build an explicit model of the data in the new revision.
The reviewing team, however, had some major reservations about the results and felt that the current draft did not give them enough details to properly evaluate the method. A summary of those major points is given here:
1) The method:
The reviewers felt that there wasn't enough detail presented to clearly analyze the results of the model. Much of the detail was in the SI and Methods, where the exposition wasn't clear or detailed enough for the reader to properly assess the results.
Following these concerns, we have completely restructured the paper, moved some of the calculations from the Methods into the main text, deemphasized biological applications and reemphasized and explained in more detail the algorithmic details, as described below.
Questions raised included:
– What's the null model?
In the revised manuscript, we now explicitly define the null model after Equation 3.
– How does this method scale to larger neural populations?
It is hard to be precise answering this. The size of the population is not the most crucial variable since, as is, the method analyzes 2^{N} possible terms in a model with just N variables. What is more important is the quality and quantity of sampling, the strength of the contribution of terms to the likelihood, and so on. We discussed this explicitly in the Discussion section, but also throughout the manuscript, wherever relevant, such as in the first paragraph of Page 4 and in the last paragraph of the “Testing and finetuning uBIA on synthetic data” section.
– Is the clustering used here justified? Can it be compared to other, simpler clustering methods?
We apologize, but we do not understand the question here, since there is no clustering used in our method. We hope that the other revisions and clarifications described here are sufficient to clarify our overall approach.
– How do you justify the choice of prior and how are the unknowns fit from the data?
We have completely reworked the presentation of the method, and we hope that the new section “The unsupervised BIA method (uBIA) for dictionary reconstruction” answers these questions. Briefly, as explained in this section, we choose the priors we work with because they are standard (e.g., Gaussian), and allow for the strongly regularized regime, which is a key for our work. And then we chose the weakest of such strongly regularizing priors (largest), as detailed in the manuscript.
– Other extant methods for finding reliable interactions in exponential models were not referenced or compared.
We now introduce the necessary references and comparisons, largely in the first section of the supplementary materials. Briefly, as we discussed in the preamble to this response letter, the goal of our method is very different from many others. Thus, as explained in the revised Supplemental Information, our method is complimentary to, rather than a competitor of, the other methods we review, which complicates the task of comparing them.
– Artificial data used in the SI are based on properties of the recorded neurons, and as the authors explain the cells seem to be quite weakly coupled, so the relevance of the model for other data is not clear.
Just to make sure there is no confusion: the data is from single cell recordings, and the interactions are between time points of neural activity, and not within a population of neurons. However, our response to the substance of the comment is that there is no way a priori to know if a method would work for other data, without trying it. We explain this in the revised Discussion section, where we also comment on what the determinants of the success will be. We only have access to the neural dataset, and analyzing it was a lot of work – we simply cannot do more in one article. To check if the method works in other fields, we must first publish it, so that other researchers can use it on their data. This is precisely why we are submitting the article to eLife.
2) The behavioral results:
The reviewers were not convinced by the pitch results, primarily because of the coarsegraining of pitch into just two categories. Again, much of the explanation was relegated to the Methods part of the paper and could have been presented more clearly. If this method works, this is clearly interesting, but the results presented seemed rather weak.
We thank the reviewers for directing our attention to this. As the reviewers rightly point out, uBIA uses discretized measures of motor output, and in some cases such discretization will discard relevant behavioral data, as when a (continuous) measure of vocal pitch is discretized into two categories. However, as explained below this coarsegraining does not limit the usefulness of the uBIA approach, either in the specific case of the songbird vocal motor system or in its usefulness to biologists more generally. In the case of the songbird data, Dr. Nemenman and Sober’s prior work (Tang et al., 2014), which analyzes the same dataset, has demonstrated significant mutual information between behavior and millisecondscale spike timing in neurons of the songbird motor cortex, even when the continuous pitch variable was discretized into just two categories. Moreover, that same paper demonstrated similar statistical dependencies when pitch was discretized into 3, 5, or 8 groups, showing that information between millisecondscale (cortical) spike patterns and a coarsegrained representation of behavior does not depend critically on the details of the discretization.
More generally, the loss of resolution due to the (artificial) discretization is inevitable. However, it would be hard for the discretization to create new codewords – by discretizing, we largely lose codewords, for which the behavior is mapped into a single bin. We believe that this is acceptable – the method does not claim to discover all codewords (this would require infinitely large datasets!), and so, as long as we keep making the same error again and again (deliberately missing some possible keywords), and still come out with relatively large dictionaries reconstructed, the approach remains useful. In addition, one can partially address this concern by trying different binarizations. Specifically, in the last part of the paper, we did just that – binarizing the data into different groups, and hence exploring the explorationexploitation tradeoff. We now emphasize this in the text, in the header of the “Reconstructing songbird neural motor codes” section and at the end of the “Dictionaries for exploratory vs. typical behaviors” section.
3) The format of the paper:
The paper read a bit like 3 manuscripts glued together, each of which has merits, but none of which seemed to satisfactorily stand on its own. The paper starts with a grand introduction that talks about solving a major challenge in quantitative biology. Then it quickly dives into something that reads more like a methods paper, but with insufficient exposition and clarity. Finally, the paper shifts to the application of the method to behavior and neural data, the results of which seemed somewhat weak and preliminary.
We agree with these concerns, and we took them seriously. To respond to them, we have completely changed the paper structure, and the paper is now a methods paper. We think the exposition has improved because of the change, and we hope that the reviewers agree.
Taking these critiques together, the reviewing team felt that the paper could not be reasonably revised for eLife as a Research Article. However, the team did find the central method potentially interesting if more details are provided that allow for proper assessment of it. The team felt that it could be a good "Tools and Resources" advance, if the method holds up to this more detailed scrutiny.
Thank you. This is a great suggestion, and we are resubmitting the paper as a “Tools and Resources” article here.
Finally, we also change Figure 4 to make it easier to read.
Reviewer 2:
The paper is a potentially interesting application of a recent statistical method, the Bayesian Ising Approximation (Fisher and Mehta), aimed at addressing the problem of estimating a complex distribution, in particular the relationship between a code and an output, from very few samples. This is accomplished by heavily weighting by the prior. The authors have applied this method to support the idea that there are precisely timed spike patterns that are especially significant in the generation of behavior, in this case in the output of syllables of birdsong. They use the method to identify "code words", patterns that are claimed to appear with notably high or low frequency in correspondence with a behavioral variable. They apply this method to find a larger number of precise spike patterns correlated with behavioral variations in the tails of the behavioral distribution. They suggest an interpretation that perhaps larger exploratory motor gestures require a more precise spiking code; although it might also be argued that this shows that there are more ways to achieve these larger fluctuations? While the claim is counterintuitive, it rests on the power of the method, whose exposition left a number of gaps.
1. It is unclear to what extent this goal is meaningful in the case that the paper analyses: premotor drive of muscles in the birdsong pathway. It seems evident that timing of RA spikes necessarily sculpt motor output; and the authors have previously analyzed this in detail to reveal timing precision (Tang et al.). It was not clear, however, that specific patterns meaningfully emerge from a sea of similar spike trains to encode exact outputs. The authors ask whether precise patterns in 2ms bins have a significant relationship with binarized song characteristics in 40ms binsa very low fidelity relationship on the "behavior" side compared with a high level of precision on the neural side. This seems a dramatic discrepancy between the dimensionality of neural response vs behavior that deserves more justification. How general a space of coding relationships (nonmonotonic, etc) does this method explore given this binarization?
We answered this question in detail above (see “(2) The behavioral results” in our reply to the Editors above). Briefly, by discretizing, we certainly lose a lot of possible coding relationships – but we also lose many of them simply due to the finite size of the data sets. These are the errors that everyone working in the field has to agree to make: it is ok to lose codewords, as long as we do not introduce many spurious ones. Finally, exploring different discretizations certainly increases the diversity of the codes we can detect – and we do, indeed, detect nonmonotonic codes in the section on exploratory vs. typical behavior.
2. Further, it was unclear whether a temporally structured firing rate would be equally convincing as a representation of behavioral variation. While the concept of "irreducible dictionaries" is appealing, does this method fairly compare a timing code to a rate code? For example, doesn't the competition component of the model suppress the identification of firing ratelike contributions? Looking at Figure 1, the cartoon of eliminating overlapping but similar code words seem equivalent to ignoring a less temporally precise method of coding for behavior. That is to say, if you are forcing your model to choose one of several similar code words in a winnertake all inclusion choice, isn't the timing precision you see partially a factor of not looking at the other similar code words which didn't win out?
Thank you for this important question. Parenthetically, our method actually allows for partially overlapping words, as the dictionaries shown by tic marks in Figure 5 illustrate. However, our main response to this comment is that the framing of the timerate debate by the reviewer is inconsistent with our view of the matter. Of course, both the rate and the precise timing contribute to encoding information about the ensuing behavior – the question is how much is contributed by each, and how. As we tried to elucidate in our recent review paper (Sober et al., 2018), we view neural codes on a twodimensional continuum, where temporal precision is on one axis, and the order of the code is on the other (see Figure 2d of Sober et al., 2018 for a reproduction of the argument from the above mentioned review). While allowing the rate code to change the rate over orders of magnitude on a millisecond scale would introduce the millisecondlevel precision, it would still be a first order code, not requiring to know the neural activity at two instances to predict the behavior. To the extent that most of the codewords that we observe involve multiple spikes, cf Figure 5c, the codes are not simple first order rate codes – they are pattern codes, with temporal resolution of a few ms. We have checked and correct the entire manuscript so that nowhere in the manuscript we contrast timing to rate, but instead we focus on the difference between rate and timed patterns code.
The comparison of this method to GLM methods of relating the spiking to behavior is unclear. One alternative model was analyzed in the online methods, but is not clearly explained and such comparisons deserve to be treated in the main text. It should also be easy to report the simple 0th order analysis of how well you can predict the binary version of behavior simply with a spike count within the 40 ms window, or with some intermediate level of temporal variation. Seeing these comparisons as a control of the assertion that precise timing matters in this system would enhance intuition for the necessity of this method and better justify the claim "that they predict behavioral features better than other approaches" (line 256).
Thank you for this important question. As we showed in Tang et al. (and mentioned in the manuscript), there is essentially no information between the spike count at 20+ ms resolution and the pitch, which makes making such comparisons not illustrative. Crucially, as we now stress throughout the manuscript, and in particular in “Overview of prior related methods in the literature” (see also our response to the introduction of the Editorial Decision letter), uBIA does not build a generative model of the underlying distribution; it merely detects which features need to be part of the model, and after that one can build the model using any number of other methods, including GLMs. In other words, uBIA doesn’t compete with (and should not be compared to) other methods, and should be viewed as a complementary technique.
3. In general, the exposition of the method leaves many questions and could be improved considerably in clarity. I have tried to summarize the logic of the paper's core argument below and underscored throughout where more help could be given to understand the meaning and generality of assumptions made. Logic of the calculation:
We agree that the original manuscript has left a lot to be desired, and we hope that the extensive changes we have made to the manuscript have solved these problems.
The paper sets up the problem of writing down a model for the joint probability of a behavioral event s0 and the firing of a spike in each, i, of N bins, σi. The authors write this in terms of an expansion over every possible combination of σi, Log P(σθμ) = log Z + Σ θμ Π _{i in Vμ} σi.
Q3.1: Please expand a little on the exponential form. In maximum entropy approaches, this form of distribution arises from the maximum entropy solution, but if we are avoiding MaxEnt approaches, can you briefly justify it?
This is the most general form of the underlying probability distribution, and it does not lead to any loss of generality. We now emphasized this in the text before Equation (1).
In this approach, a firing pattern with spikes in bins (1,2,3,4) is considered to be independent of the events (1,2) and (3,4); whereas in maxent, if one is only expanding up to order 2, one hypothesizes that the probability of the word (1,2,3,4) will be absorbed into the pairwise probabilities, and one can ask how different the data is from that guess. It seems that here one is parametrizing with vastly more parametersone for every possible spike/behavior combination without a systematic method of cutting them down (or at least one that was clear to me; it is addressed at the end of the Online Methods calculations once we have moved into the s variables but it would be helpful to get some intuition from the start).
This is not quite correct – the whole point of the method is that one should (and does) consider all possible combinations, but then one systematically evaluates the evidence for needing to include a certain combination into the probability distribution. We hope that the current revision, which has moved the appendix into the main text, has resolved this ambiguity. Specifically, the paragraph before Equation (2) has been revised to address this question as well.
One then defines "indicator variables" sμ which tell us which patterns appear significantly less or more often than one would expect under some null model. It would be good to specify this null model right away, and in straightforward terms; I don't see it defined in the main text at all.
Q3.2: Please explain at this point the null model for the frequency of appearance of any possible pattern of spikes and its correlation with behavior? Should it not depend on the specific behavior?
In the revised manuscript, we now explicitly define the null model after Equation 3. Briefly, the null model matches the firing rates in all bins to experimental observations.
It is unclear why one is now allowed to write that the original definition of the model is EQUAL to the model now written including only terms defined by a criterion defined by our subjective interest in them: Log P(σθ) = log Z + Σ θμ sμ Π _{i in Vμ} σi
Q3.3: Why does one not here need to define a new and different probability distribution, something that makes some claim about belonging to the codeword dictionary, in which these terms are the only ones that appear?
Thank you very much for this critique. Indeed, what’s written in Equation (2) is the same distribution as in Equation (1), following a series of algebraic transformations. We now stressed it in the text before Equation (2) that no new quantities are being defined.
Now the authors turn to the task of estimating the 2^N parameters θμ. Instead of doing this estimation directly from the data (i.e. the a posteriori expectation), they use a prior on these values: they choose to assume that each of the values are narrowly distributed about some mean θ*μ: P(θ*μ : sμ=1) ~ exp((θμ −θ*μ)2). This distribution is taken only to hold if the word is indeed a code word, ie statistically associated with a behavior. If it is not a code word, the prior is not defined.
Q3.4: Please justify this choice in more detail. Why are the epsilons not dependent on μ?
It would be impossible to obtain a simple a posteriori expectation here – with 2^{N} equal to about 2,000,000 in our example, and with only a few hundred samples, the uncertainties would be too large. This is where Bayesian methods are useful and needed. We explained in the revision why we made this specific choice of the prior. We would like to point out here that we do not define the prior for noncodewords simply because it doesn’t matter – those words won’t contribute to the distribution at all.
Q3.5: The argument for where these 2^N values of θ*μ come from is very unclear. Please spell out more clearly how constraining the means of the individual σis provides sufficient constraints for all 2^N values, or why this is not needed.
There are 2^{N} values of θs because there are 2^{N} different inclusion/ exclusion combinations of N spins. The constraining on the rates (means) does not provide sufficient information for all 2^{N}. This is, in fact, the point of the algorithm: there’s no way to determine 2^{N} values from small data sets, and we thus need to do something else. Specifically, we only ask a question of which θs are statistically significantly nonzero, and hence should be included as codewords, but we never try to determine their values directly. We hope that the current revision makes this point clear.
Q3.6: Why does a code word have an a priori likelihood of 0.5 of being in the dictionary?
As in any Bayesian analysis, we need to define a prior, and we choose the least informative 50/50 prior in our case – a priori we do not know whether a particular codeword is used or not used by the animal. If we had a reason to believe a priori that a certain word should or should not be included in the dictionary with a different probability, then that different prior could be used. We now explain this in the text in the paragraph following Equation (3).
Now rather than estimating or computing the values of the θμs, the next step is to flip the variables in the distribution we seek, to P(stheta), and then integrate out the thetas, based (it seems?) only on their prior distributions, Equation (5).
This is not entirely correct. We are interested in the posterior distribution of s given the observed data, and integrating out θ. We choose to integrate θ out since, as we argued before this equation, the posterior expectation of θ cannot be determined with a reasonable accuracy from datasets of realistic sizes. Here, crucially, the a priori expectations and data get combined, and the integration is not controlled just by the prior distributions.
Q3.7: Now that s can again take value of either 0 or 1, why did we not need the distribution over θs in the case that s = 0? How can a distribution over s on the LHS in Equation (5) limit itself to the case of s = 1 on the RHS?
It’s not that we do not need the distribution of θs for s = 0, it’s that the distribution does not matter: if s_{µ} = 0, the corresponding θ_{µ} has no effect on the distributions of data. We hope that this has become clearer in the current revision.
The calculations in the online methods may resolve some of these issues? But these all appear to start from Equation (5) and at least for me at this stage, the route to this equation is unclear.
We hope that this has become clearer in the current revision, which included a major restructuring of the paper.
Reviewer #3:
The manuscript suggests a new approach to learning coding dictionaries. The method is applied to neural population data recorded in singing birds. While the idea of the approach is interesting, it was hard to understand the details or how well it works. It seems that too much has been pushed to the SI, and more controls and examples over synthetic data sets would be useful. Moreover, and importantly – it was hard to interpret the results. Also, the current approach was not compared against other methods presented in recent years for learning metrics on neural population patterns, or even more simple clustering approaches.
We have substantially revised this manuscript to address this concern, and discuss this in the “Overview of prior related methods in the literature” section, as well as in the reply to the editorial decision above. In particular, we added a description emphasizing that our method does not compete with most others in the literature, and cannot really be compared to them, as the goals are different. Indeed, uBIA learns dictionaries – in other words, it detects which words are codewords. Most other methods, in contrast, take a predefined set of putative codewords, and then build a generative model based on them. While usually the set of putative codewords is defined as words with some number of spikes, nothing prevents to use these other methods in conjunction with uBIA, so that uBIA first detects the codewords, and other methods are then used to build a model based on them.
Briefly, the authors use a loglinear model for the population words of N neurons and two behavioral states – using 2^(N+1) terms that span all potential orders of interactions among cells (theta's). To avoid arbitrary cutoff on the order of interactions that they consider, and use the important ones, they use a constructed prior distribution over theta's to learn a posterior model over the population words after integrating out the theta's. This gives an Ising model over 2^N interacting 'spins' where each sping now stands for a word that may or may not have appeared in the sampled data. The local fields acting on each of these spins is related to the prob of appearance of the words the spins stand for, and the pairwise interaction terms stand for an overlap between the words. These interactions now imply a 'competition' between words in terms of the explanatory power they carry (and are therefore typically negative).
Thus, if I understand this correctly, the constructed model gives a form of compression of the population 'codebook' by picking words that are dominant to be included in the dictionary and omitting similar words with less explanatory power and frequency.
This is, indeed, a reasonable summary of our paper. We would like to point out, however, that compression is not the primary goal here, and it emerges naturally, not at the expense of the quality of the fit to the data. In other words, if there is sufficient evidence for even overlapping words to be included in the dictionary, then both will be. Finally, some words – especially nonoverlapping ones – may have positive interactions, promoting rather than inhibiting each other.
Critically, it was unclear to me how one should evaluate the results. It is hard to interpret the structure of the dictionary in Figures45, and in particular over the pooling of conditions in Figure 5.
We thank Reviewer for these comments. Some of them were answered above in our response to the editorial summary questions, and we refer the Reviewer there, to our response to “The behavioral results” editorial comments. In addition, here we point out a few more things. First, consistent with our resubmitting this paper as a “Tools and Resources” contribution rather than a research article, we have deemphasized the interpretation of the birdsong data and focused on the mechanics the method itself. Second, it is simply not possible to do give a full interpretation requested by the Reviewer, without pooling, since we lack tools to collect the amounts and the types of data we would need. In particular, pooling is done, in part, to overcome the statistical power problems caused by small datasets. Further, the ultimate test of any method for dictionary reconstruction must be causal stimulation: one needs to build a model based on the detected codewords, predict responses to possible experimental perturbations, and then apply the perturbation and compare the results to the predictions. We do not do this here, but neither do the absolute majority of other publications in the field (our previous publication on bird breathing is one of the few exceptions, and the goal is to do something similar in the current system as well, but this requires development of new experimental methods). We already did what the best publications in the field do – we build the dictionaries, and then used them, in the “Verification of the inferred dictionaries”, section to build a simple statistical model of the code, and we showed that this model outperforms at least some competitors. Beyond this, we do not know how to provide a more useful biological interpretation. However, again, the same concern is applicable to essentially any other publication in the field.
As argued in the previous paragraph, pooling of conditions in Figure 5 (now Figure 6), indeed, adds an additional complexity to interpretation of individual dictionaries, as using a pooled dictionary to predicted pooled data would certainly be pointless (this would be like trying to translate from a corpus of books written in different languages using a mixture of dictionaries that translate between different pairs of languages). However, what would make more sense in such pooled books analogy – and is precisely what we do for the birdsong data – is to ask questions of the type: how are Romance languages different from Germanic ones? Are the words of similar length? Are roots similar? etc. We hope that this analogy is illustrative for the Reviewer.
In addition, there have been different approaches towards learning neural dictionaries in recent years, which are not mentioned here or compared against the current approach. Maybe this is too naïve, but it seems natural to ask how well would naïve clustering or Editdistance based methods work here? Then, how does the current approach compare to more structured metrics such as those presented in Ganmor et al. eLife 2015, Prentice et al. Plos Comp Biol 2016, Rubin et al. Nature Comm, 2019, or Chaudhuri et al. Nature Neuro, 2019?
We now review the relation of all of the methods mentioned by the reviewer to uBIA in “Overview of prior related methods in the literature”. Here we also add that, as we now emphasize in the manuscript, our method is complementary to the ones mentioned by the reviewer, and does not compete with them. These methods can be used to reconstruct models of the neural code after uBIA detects which keywords should enter the model.
Finally, how would approach work for larger populations, where all observed patterns would appear just once?
Performance of uBIA will certainly degrade in this situation. However, no method will work well then, so that this cannot be viewed as a something particularly bad about uBIA. That said, uBIA has two major strengths in this regime, which are now emphasized in the manuscript. First, uBIA analyzes not just patterns, but all subpatterns. In other words, as long as there are even partial coincidences among the spiking patterns, they will be detected, and these coinciding subpatterns will become keywords in the dictionary. It is clear that at least some short subpatterns must coincide in any reasonable dataset – e.g., there are only N first order words formed by N interacting units – and any realistic dataset will have, at least, Μ»Ν, ensuring that, at least, such simple onespike (and maybe even many twospikes) words have good statistics behind them. Second, uBIA detects not just overrepresentation of patterns, but also their underrepresentation. Thus, if a complex spike word happens once (or does not happen at all), while it should happen many times based on its composition from small subpatterns, such word could enter the uBIA dictionary as well, with its absence potentially encoding the behavior.
It was not clear how should one interpret the nature of the organization of patterns in Figure 4B (the gap between lines, their very linear dependency?)
In Figure 4B (now 5B), the reason that there are no points near x = y (in other words, there is a gap there) is because x = y means that the observed frequency of a spike codeword is roughly the same as its frequency expected from the observation of its various subparts. Then, by definition, it isn’t significantly under/overrepresented. The width of the gap around the diagonal is determined by various significance threshold, which we set as explained in the text.
[Editors’ note: what follows is the authors’ response to the second round of review.]
Essential revisions:
1) The numerical comparison to other existing methods is minimal and should be expanded.
We have argued in our previous submission that there really are no other methods to compare to, designed to work in the regime similar to uBIA. It seemed to us that it would be unfair to run other methods on our datasets, see them not work well (as expected – because they make assumptions that are invalid in our regime), and then claim success. However, since the concern has been raised again, we really have to address it. To do this, we added a section in the Online Methods “Direct application of MaxEnt methods to synthetic and experimental data”, in which we compare uBIA to the relevant interactions model of Ganmor et al., with which uBIA has the highest similarity. The results are as expected – a method not designed for our data regime fails. We emphasize here again that the relative superiority of uBIA on these data should not be taken as a slight directed at other methods, but rather as an indication than, to cover different data regimes, multiple methods should be combined. We emphasized this in the “Overview of prior related methods in the literature” supplemental section.
The analysis of the irreducibility of the codewords has insufficient support based on the numerical simulations. Moreover, the generality of the tool and comparison to other methods are discussed in almost entirely theoretical terms, which makes the claim on immediate utility for other datasets less convincing, especially outside the neuroscience community. In particular, the identified codewords are relatively short (3rd, 4th order stats) and N=20 is easy to fit an Ising model; so one would think that it would be relatively straightforward to fit the Ganmor model jointly.
We hope that the addition of the new comparison figure (described above) partially alleviates these concerns. Additionally, we point out that 3rd and 4th order words are long, as most others deal with just pairs, as illustrated in the new Figure 7. Indeed, it is not easy to fit an N = 20 Ising model with 4th order terms, because there are 20 ∗ 19 ∗ 18 ∗ 17/(4 ∗ 3 ∗ 2 ∗ 1) = 4845 terms in this model, which cannot be fit from just a few hundred samples, which is precisely why the Ganmor model fails in this case (Figure 7).
At least on artificial data one should be able to explore a larger N regime where there is a more immediate gain relative to existing models.
Indeed, in the newlyadded Figure 7 we explored dataset sizes up to M = 10^{5}, showing that uBIA begins detecting interactions with datasets orders of magnitude smaller than those required by alternative methods.
2) Although the paper is written as a methods paper, emphasizing the technical contributions and promising wide applicability to a range of different types of datasets, the numerical validation of the method is very much restricted to the statistical regime of the songbird dataset. From the perspective of a potential future user of the tool it's less clear how the method would behave on different datasets, and what needs to happen in practice for adopting the tool to data with different statistics. Based on current content, it would be better to focus the abstract and introduction on applications that are actually studied here.
We have edited second half of the abstract and a few sentences in the Introduction (see latexdiff file) to make it clear that our main applications to date have been to songbird data.
3) The songbird analysis already reveals some challenges with respect to interpretability: in particular it is not clear how much information about the underlying neural processes can be revealed by summary statistics generated by the method, such as the number of codewords and their length distribution.
The reviewer is correct that our analysis of the songbird data raises a number of important questions for future studies. Although these remain to be answered, we emphasize that before the biological interpretation of over/underrepresented neural patterns can be attempted, such patterns must first be identified. uBIA therefore represents a crucial advance in our ability to address these questions.
4) Limiting the analysis to N=20 patterns with somewhat random code structure makes it hard for a potential user of the tool to work out what needs to change when switching to a novel dataset (with respect to N, M, codeword complexity, sparsity of neural responses). Some work could be invested in spelling out the considerations of applicability with respect to quantities that an experimentalist would know about the data.
We agree with the reviewer and have added a discussion of how to set hyperparameters (secondtolast paragraph of the revised Discussion).
5) Is it possible to include multiple binary quantifications of behavior, similarly to how words are constructed from neural spike trains? For example, one can envision describing a particular song segment with respect to multiple binary features simultaneously.
We explicitly examine this question in “Dictionaries for exploratory vs. typical behaviors” and the corresponding Figure 6, which repeats our analysis for different binary discretizations of our behavioral data.
Reviewer #1 (Recommendations for the authors):
Line 168: I think the notation here should P(s_mu=0) not s_mu=1.
Thank you, this has been corrected
Reviewer #2 (Recommendations for the authors):
I am overall very excited about the framework and I see significant technical potential in the approach. Nonetheless, there are several missed opportunities in the numerical analysis, and things that could and should probably be improved to maximize the impact of the work on the broader community.
We thank the Reviewer for their generally positive comments.
– Limiting the analysis to N=20 patterns with somewhat random code structure makes it hard for a potential user of the tool to work out what needs to change when switching to a novel dataset (with respect to N, M, codeword complexity, sparsity of neural responses). Some work could be invested in spelling out the considerations of applicability with respect to quantities that an experimentalist would know about the data.
Please see our response to this issue above under “Essential revisions.”
– The codeword irreducibility could be analyzed in more detail, again in terms of experimentally relevant quantities.
Please see our response to this issue above under “Essential revisions.”
– The comparison to other methods is minimal. The identified codewords are relatively short (third, fourth order stats) and N=20 is easy to fit an Ising model; so I'd think that it would be relatively straightforward to fit the Ganmor model jointly. At least on artificial data one should be able to explore a larger N regime where there is a more immediate gain relative to existing models.
Please see our response to this issue above under “Essential revisions.”
[Editors’ note: what follows is the authors’ response to the third round of review.]
Essential revisions:
Thank you for the revised manuscript. The reviewers found the manuscript much improved and close to acceptance. Nevertheless, reviewers are suggesting a few additional analyses to demonstrate the advantages of the method and increase its impact.
1) Please consider the null model that includes pairwise correlations.
This is not a simple request. For starters, most methods that aim to reconstruct features of multivariate probability distributions from data use the product of marginal distributions as the null model. In this context, our approach is consistent with the best practices. Further, if we were to follow this request, we think that this would create more issues than it would solve. In the regime we operate in, where the number of pairwise correlations is only a few times smaller than the dataset size, any method for constructing a null model that respects pairwise correlations in data would result in a null model that itself has substantial error bars. In our subsequent estimate of deviation of frequencies of various patterns from the null model, we would then be hampered by these large error bars, and whether some words should be or shouldn’t be included into the dictionary would depend on these fluctuations. To be sure, we agree that extending our method to a different null model is a worthy research project – but this is a research project that will take months of work to do right, and we would prefer to publish the current paper first, before studying such complicated extensions.
In summary, consistent with the second decision letter, we chose not to perform these additional analyses. However, in the revised Discussion, we now suggest that this expansion of the algorithm would be a worthy goal, but also argue briefly that it is not an easy task.
2) Please test the method to a larger dataset.
Again, while we appreciate where the comment is coming from, we would like to push back on this. Our understanding is that by “larger dataset”, the Reviewers mean larger N, the number of units, rather than M, the number of samples, and they are requesting us to list which specific changes to the algorithm’s parameters we need to adopt in this case. Our concern is that we already provided a recipe for parameter adjustment: count the number of patterns that are identified as predictive of both presence and absence of behaviors. Those are obviously wrong, and parameters should be set to avoid them. Any more specific suggestions will depend on not just the size of the dataset, but also on how many patterns really contribute to the underlying distribution, and what is the distribution of the coupling constants associated with these patterns. While doing some runs over the dataset size is relatively easy, we do not see this providing a substantial insight, since all of our findings would be specific to the statistical properties of the data we analyze. In other words, we would like to request the permission to publish the manuscript with no additional simulations of larger datasets. When published, we will then focus on adapting the methods to specific datasets of interest to us and others. For example, one such interesting dataset comes from recording multiple single motor units in marmosets. There N is naturally larger, so that we will be further developing our methods for a concrete application, rather than for solving toy models.
Nonetheless, we took this request very seriously. Now there are two additional paragraphs in the Discussion, which focus on how one would set the detection threshold and the inverse regularization strength, the two main hyperparameters of our algorithm.
Reviewer #2 (Recommendations for the authors):
I generally find that the revisions have improved the overall clarity of the text, and addressed most of my questions.
Thank you! We are glad to hear this.
Nonetheless, I found the answer to two of my concerns underwhelming:
– Discussing the practical considerations about translating the approach to other datasets (new discussion paragraph) is quite vague, essentially saying that the hyperparameters will more or less have to be figured out de novo. I was expecting a quantitative description of how different statistical properties of the data affect the process, which would translate into something closer to a concrete recipe for hyperparameters adjustment. The fact that this is not straightforward to do makes a broad adoption of the tool less likely.
As we tried to emphasize in the response to the Editors above, we do not expect there to be a practical guide at this point. We have analyzed many datasets in our careers, and we firmly believe that interesting discoveries require one to operate at the edge of applicability of any method, so that no standard, simple prescription would work. In our specific case, there are simply too many parameters to explore: the number of measurements N, the number of units M, the number and the order of the true words contributing to the dictionary, the statistics of the coupling constants associated with these words, and so on. We do not expect there to be simple heuristics, and the Reviewer is absolutely correct that one will have to work hard to adjust the algorithm to each new qualitatively different dataset.
The Reviewer is rightly worried that this may decrease the adoption of the algorithm, but we believe that this is not going to be the case. Indeed, as a part of the SimonsEmory International Consortium on Motor Control, we ran training workshops to disseminate the method in the community. The workshops were attended, cumulatively, by many dozens of trainees from within and outside the Consortium. That is, the method is being adopted by a number of labs. This has created a weird dynamics where these labs, in their own publications, want to refer to an article with the original description of the method, and this article (the current one!) is not yet published.
In other words, we would prefer to publish the manuscript now with minimal changes and explore different data regimes in future work. That said, we added a shortened version of the above arguments, and a succinct description of how one should set the hyperparameters of the algorithm to the Discussion section of the manuscript. Specifically, there are now two new paragraphs where we focus on setting correct values for the detection threshold m and the inverse regularization strength ϵ, the two most important hyperparameters of uBIA.
– The concern about the overall interpretability of the extracted statistics remains unaddressed. I don't agree with the position is that "it only matters to estimate these statistics well, interpretation can come later." Being able to estimate a quantity does not necessarily make it useful to do. In the context of songbird specifically, the paper make the case for the results being difficult to interpret even when the statistics are well estimated. This seems like a pretty big problem, at least big enough to warrant a minimum amount of thought and a couple of sentences in the discussion.
We agree with the general sentiment and address the Reviewer’s comment in two ways. First, we would like to point out that every statistical analysis approach will suffer from the same problem: statistical significance is not the same as biological significance. The best way to interpret statistical significance is through perturbation experiments, as we now explain in the revised “Validation of the inferred dictionaries” section. However, doing such experiments is beyond both the scope of the paper and the limit of current experimental technology, since precisely stimulating individual neurons in behaving songbirds is not yet achievable. Therefore, the field is unable to fully validate uBIA (or, really, any correlationbased analysis) using perturbations.
Second, we have added a paragraph to our discussion of the interpretability of the spiking patterns identified by uBIA. The revised “Dictionaries for exploratory vs. typical behaviors” now ends with a paragraph that includes an expanded discussion of the role that brain nucleus LMAN might play in motor exploration (per the suggestion of Reviewer 2 in their recent comments).
Reviewer #3 (Recommendations for the authors):
The authors already made a great effort to improve their manuscript. However, from my side, I provide the following concerns about the significance and broad impact of the current revised manuscript.
We thank the Reviewer for these helpful suggestions.
Strength of the manuscript
The traditional model of maximum entropy type requires a large amount of data to predict effective interactions among elements (e.g., neurons), and fails to identify statistically over or underrepresented (relative to a null model) codewords, and thus the neuralbehavior prediction remains elusive. This manuscript provides a novel approach that can systematically search for the significant codewords, yet requiring fewer data samples (compared to the number of higherorder model parameters).
Thank you, we agree with this assessment.
The method first writes all orders of interactions in a probability distribution form; finding nonzero higher order interaction parameters could tell which codewords should be included in the neural dictionary and how they matter. To make the method algorithmically solvable, the authors turn the optimization into an Ising model of interacting indicator variables. This indicator variable is responsible for identification of key codewords. The inferred value of the external fields signals the importance of the codeword, while the sign of the highorder interaction parameter reflects whether the word is over or underrepresented. To explain the neuralbehavior correlates, the authors also include the behavior variable into the components of the activity. This framework works consistently in the songbird motor experiments.
Thank you for this correct summary.
Weakness of the manuscript
To relate the neural activity and behavior, the authors test their method in both synthetic data and songbird real data. There are two weakness, considering the impact of the method in a broad context. First, the variable size is limited to N~20, which is severely below the number of neurons the neural data scientists get interested in (e.g., for modeling retinal ganglion cells, and even cortical cells). This may be a tradeoff between all order of interactions to be considered and the number of model parameters.
Indeed, this is a tradeoff between being systematic in detecting codewords of all orders, and the dataset sizes. Systematic analysis of codewords to all orders will necessarily require smaller N at the same number of recordings, M, compared to methods that focus on linear or, at most, pairwise codes. That said, we also point out that in many applications, specifically in recording from motor units, N of just a few is the state of the art – recording even from a handful of units at the same time is already hard. This and similar contexts is what our methods are devised for, as not all of biological data have become as high dimensional rapidly as, say, neuropixels. We have introduced the relevant discussion in the Discussion section of the manuscript, as well as in the synthetic data sections, where we explain why we chose to focus on datasets with specific properties.
Second, a null model is chosen to be an independent model that neglects correlations in the data, which may affect the characterization of the redundancy of the uBIA model. Even if an irreducible ensemble of codewords is obtained, what is the predictive power of this ensemble with respect to control the macroscopic behavior (e.g., skilled motor control or sensorimotor learning), rather than showing statistics of neuralbehavior activity, remains obscure in the current manuscript.
These are absolutely valid concerns. However, as we pointed out above, the same questions can be asked of essentially all neural motor control papers. We are eager to publish this manuscript precisely for the reason that it’s time to leave the development of statistical methods behind us, and to focus on the biology that can be discovered by applying them to learn new biology. As we explained above, we introduced a few sentences to discuss the interpretability questions in the sections on experimental data.
Considering the strength and weakness of the manuscript, I recommend the following issues for the authors to improve their idea.
1. The proposed model is limited to symmetric couplings and i.i.d data samples. First, in a neural circuit, the asymmetric coupling is common among neurons; Second, the temporal order of each spiking activity may play a key role in encoding information. For example, a paper "Blindfold learning of an accurate neural metric" (PNAS, 2018) takes the temporal information into account. When putting in the context of neuralbehavior relationship, these two factors would become important. But they seem neglected in the current manuscript.
The Reviewer is absolutely correct that temporal correlations matter, that (symmetric) statistical correlations are not the same as (asymmetric) neural interactions, and so on. However, any statistical analysis method must have a well delineated domain of applicability. There are certainly a lot of scenarios where symmetric correlations without time ordering are sufficient to answer questions. Our approach focuses on those. We added these caveats to the Discussion, when we summarize pros and cons of the method. However, we chose not to reference the above mentioned PNAS article, as it is only tangentially related to our work in our opinion.
2. To achieve a higherorderinteractionincluded and nonredundant model is really important problem in the field. For example, the reliable interaction model proposed in ref. 26 is not fully compared in the current manuscript, i.e., what is the new advance?
Please see section “Direct application of MaxEnt methods to synthetic and experimental data”, which deals with comparisons to the Ganmor et al. work.
This is better to clearly demonstrated in a plot in the main text.
As we have discussed above in the response to the Reviewer 2 minor comment about the structure of the paper, we strongly believe that this comparison should be kept out of the main paper text. The main reason is that, as we have argued consistently, the Ganmor approach and uBIA are not competing, but are complementary: uBIA identifies the dictionary, and the Ganmor approach can build a model on that dictionary. Since, unaided by the constraints of uBIA, the latter makes assumptions incompatible with the statistics of the data we have, it cannot work well on our data, and the above referenced section in the Methods demonstrates this beyond doubt. However, taking this as an evidence of success of uBIA over other methods would be disingenuous. As an analogy, if a Jeep gets through the mud where a Corvette gets stuck, it’s only an indication that these cars are designed for different tasks, and people would not consider a comparison of the two newsworthy. We similarly think that the comparison of uBIA and the Ganmor’s approach is not newsworthy enough to be in the main text.
Second, a recent paper (Phys Rev E, 104, 024407, 2021) used an informationbased criterion to identify statistically significant couplings, which may be applicable to the context the authors consider here. In this eLife submission, the method relies on the magnetization inclusion threshold, which may be expensive for real data analysis.
This is an interesting paper (we will refer to it as the PRE method hereafter). Thank you for alerting us to it. There are substantial differences between this method and ours. Most importantly, the PRE method aims to build a generative model of the data, not just to identify which features of must be accounted for in the model. Thus, other things being equal, on very general grounds we know that it would require more data. However, other things are not equal. The PRE method is a pairwise methods (though the decimation discussed there may introduce some effective higher order couplings, but only of a very specific structure), and thus it cannot account for higher order structures, which uBIA notices. Finally, the PRE method detects important interactions by starting with a fully connected model, and decimating the weakest interactions turn by turn. Such approach will not work when the number of interactions, which scales as N^{2} increases; in contrast, we expect uBIA to be limited by the number of patterns actually seen in the data, and hence depend largely on M and not N in its complexity. We decided not to repeat this long discussion in the paper text, but we now mention this PRE article in the “Overview of prior related methods in the literature” section, and discuss the reduction in the number of interaction terms that it provides.
3. On the equation 1, the authors seem to use the local model parameter to detect the global significance of the collective codeword, which I could not understand very well, because a codeword is determined by all order of interactions considered in the loglinear model.
Equation 1 does not define the significance. It is just the allorders model of the underlying probability distribution, and the significance is not defined till we introduce s_{µ} and calculate their postrior expectations much later in the paper, up to Equation 16.
4. Technically, the logic going from Equation 7 to Equation 8 is broken, since the sdependence in Equation 8 does not naturally arise from Equation7.
Thank you! This has been fixed now by reintroducing the indicator variables between Equations 7 and 8 explicitly.
5. I could not understand well how the sign of reflects whether the word is over or underrepresented, and even, how the parameter account for the reducibility of the dictionaries. This part is better to be expanded in the manuscript, although these parameters have highly nonlinear function relationship.
Equation 10 shows that, to include a word or not is determined largely by the deviation of its frequency from the null model. By comparing the expected and the empirical frequencies (overbarred and anglebracketed quantities), we know if the word is over or underrepresented. This is described in the paragraph that starts with “Equation (10) has a straightforward interpretation.”
https://doi.org/10.7554/eLife.68192.sa2Article and author information
Author details
Funding
National Institutes of Health (R01EB022872)
 Damián G Hernández
 Samuel J Sober
 Ilya Nemenman
National Institutes of Health (R01NS084844)
 Samuel J Sober
 Ilya Nemenman
National Institutes of Health (R01NS099375)
 Damián G Hernández
 Samuel J Sober
 Ilya Nemenman
National Science Foundation (BCS1822677 (CRCNS Program))
 Samuel J Sober
 Ilya Nemenman
Simons Foundation (The SimonsEmory International Consortium on Motor Control)
 Samuel J Sober
 Ilya Nemenman
Simons Foundation (Simons Investigator in MPS)
 Ilya Nemenman
The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.
Acknowledgements
We thank David Hoffman and Pankaj Mehta for valuable discussions. This work was supported in part by NIH Grants R01EB022872, R01NS084844, and R01NS099375, NSF grant BCS1822677 (CRCNS Program), and a grant from the Simons Foundation as part of the SimonsEmory International Consortium on Motor Control. IN acknowledges hospitality of the Kavli Institute for Theoretical Physics, supported in part by NSF Grant PHY1748958, NIH Grant R25GM067110, and the Gordon and Betty Moore Foundation Grant 2919.01. IN and SJS further acknowledge hospitality of the Aspen Center for Physics, which is supported by NSF grant PHY1607611.
Senior Editor
 Aleksandra M Walczak, CNRS LPENS, France
Reviewing Editor
 Tatyana O Sharpee, Salk Institute for Biological Studies, United States
Version history
 Preprint posted: November 20, 2019 (view preprint)
 Received: March 9, 2021
 Accepted: March 19, 2022
 Accepted Manuscript published: March 22, 2022 (version 1)
 Version of Record published: April 7, 2022 (version 2)
Copyright
© 2022, Hernández et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.
Metrics

 1,040
 Page views

 213
 Downloads

 2
 Citations
Article citation count generated by polling the highest count across the following sources: PubMed Central, Crossref, Scopus.
Download links
Downloads (link to download the article as PDF)
Open citations (links to open the citations from this article in various online reference manager services)
Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)
Further reading

 Neuroscience
Analysis of neuronal activity in the hippocampus of behaving animals has revealed cells acting as ‘Time Cells’, which exhibit selective spiking patterns at specific time intervals since a triggering event, and ‘Distance Cells’, which encode the traversal of specific distances. Other neurons exhibit a combination of these features, alongside place selectivity. This study aims to investigate how the task performed by animals during recording sessions influences the formation of these representations. We analyzed data from a treadmill running study conducted by Kraus et al., 2013, in which rats were trained to run at different velocities. The rats were recorded in two trial contexts: a ‘fixed time’ condition, where the animal ran on the treadmill for a predetermined duration before proceeding, and a ‘fixed distance’ condition, where the animal ran a specific distance on the treadmill. Our findings indicate that the type of experimental condition significantly influenced the encoding of hippocampal cells. Specifically, distanceencoding cells dominated in fixeddistance experiments, whereas timeencoding cells dominated in fixedtime experiments. These results underscore the flexible coding capabilities of the hippocampus, which are shaped by overrepresentation of salient variables associated with reward conditions.

 Neuroscience
Because of their close relationship with humans, nonhuman apes (chimpanzees, bonobos, gorillas, orangutans, and gibbons, including siamangs) are of great scientific interest. The goal of understanding their complex behavior would be greatly advanced by the ability to perform videobased pose tracking. Tracking, however, requires highquality annotated datasets of ape photographs. Here we present OpenApePose, a new public dataset of 71,868 photographs, annotated with 16 body landmarks of six ape species in naturalistic contexts. We show that a standard deep net (HRNetW48) trained on ape photos can reliably track outofsample ape photos better than networks trained on monkeys (specifically, the OpenMonkeyPose dataset) and on humans (COCO) can. This trained network can track apes almost as well as the other networks can track their respective taxa, and models trained without one of the six ape species can track the heldout species better than the monkey and human models can. Ultimately, the results of our analyses highlight the importance of large, specialized databases for animal tracking systems and confirm the utility of our new ape database.