Model discovery to link neural activity to behavioral tasks
Abstract
Brains are not engineered solutions to a well-defined problem but arose through selective pressure acting on random variation. It is therefore unclear how well a model chosen by an experimenter can relate neural activity to experimental conditions. Here, we developed ‘model identification of neural encoding’ (MINE). MINE is an accessible framework using convolutional neural networks (CNNs) to discover and characterize a model that relates aspects of tasks to neural activity. Although flexible, CNNs are difficult to interpret. We use Taylor decomposition approaches to understand the discovered model and how it maps task features to activity. We apply MINE to a published cortical dataset as well as experiments designed to probe thermoregulatory circuits in zebrafish. Here, MINE allowed us to characterize neurons according to their receptive field and computational complexity, features that anatomically segregate in the brain. We also identified a new class of neurons that integrate thermosensory and behavioral information that eluded us previously when using traditional clustering and regression-based approaches.
Editor's evaluation
This useful article describes a sensitive method for identifying the contributions of different behavioral and stimulus parameters to neural activity. The method has been convincingly validated using simulated data and applied to example state-of-the-art datasets from mouse and zebrafish. The method could be productively applied to a wide range of experiments in behavioral and systems neuroscience.
https://doi.org/10.7554/eLife.83289.sa0

Introduction
Contemporary neuroscience generates large datasets of neural activity in behaving animals (Engert, 2014; Musall et al., 2019b; Urai et al., 2022). To gain insight from these large-scale recordings, it is desirable to identify neurons with activity related to the behavioral task at hand. A common approach to this problem is to intuit a functional form (‘a model’) that relates predictors such as sensory stimuli, motor actions, and internal states to neural activity. Neurons can subsequently be classified into those with activity explained by features of the task and the chosen model and those with activity likely to be unrelated or background noise. A simple yet powerful approach is to use linear regression to explain neural activity as the weighted sum of sensory and motor features recorded during the experiment (Miri et al., 2011; Harvey et al., 2012; Portugues et al., 2014; Musall et al., 2019a). Since linear regression can accommodate nonlinear transformations of input variables, this technique can encompass diverse relationships between predictors and neural activity. Similarly, more flexible solutions using basis functions (Poggio, 1990; Hastie et al., 2009) can be used to identify task-related neurons. However, brains are not engineered solutions to a well-defined problem but arose through selective pressure acting on random variation (Eliasmith and Anderson, 2002; Niven and Laughlin, 2008; Zabihi et al., 2021). It is therefore unclear how well neural activity can be captured by a well-defined function chosen by the experimenter as the test model.
Artificial neural networks (ANNs) can in principle accommodate any mapping of predictors to neural activity (Hornik et al., 1989; Gorban and Wunsch, 1998). At the same time, they can be designed to generalize well to untrained inputs (Anders and John, 1991; Srivastava et al., 2014), often overcoming problems related to explosion of variance and overfitting to training data that plague other solutions incorporating large numbers of parameters (Hastie et al., 2009; James et al., 2013). Owing to this flexibility, ANNs have yielded insights into the nonlinear receptive fields of visual and auditory neurons (Lehky et al., 1992; Lau et al., 2002; Prenger et al., 2004; Ukita et al., 2019; Keshishian et al., 2020) and into the encoding of limb motion in somatosensory cortex (Lucas et al., 2019). However, an obvious drawback of ANNs is that they are much harder to interpret than models based on intuition and data exploration.
Here, we introduce ‘model identification of neural encoding’ (MINE). MINE combines convolutional neural networks (CNNs), which learn mappings from predictors (stimuli, behavioral actions, internal states) to neural activity (Figure 1), with a deep characterization of this relationship. This allows a model or functional form relating predictors to activity to be discovered from the data and subsequently described, thereby inverting the usual approach. Using Taylor expansion approaches, MINE reveals the computational complexity, such as the nonlinearity, of the relationship (Figure 2), characterizes receptive fields as indicators of processing (Figure 3), and reveals on which specific predictors, or interactions thereof, the neural activity depends (Figure 4). By incorporating a convolutional layer, temporal or spatial transformations of inputs introduced by the technique (such as calcium indicator effects) or by the brain (such as differentiation of a signal or edge detection) are learned seamlessly by MINE. These transformations therefore do not have to be captured through a priori transformations of the task variables. While the architecture and hyperparameters of the CNN used by MINE impose limits on which relationships can be modeled, we consider the convolutional network largely ‘model-free’ because it does not make any explicit assumptions about the underlying probability distributions or functional forms of the data.
Here, we demonstrate the utility of MINE using a ground-truth dataset (Figures 1–4) and a cortical mouse wide-field imaging dataset (Figure 5). We then designed a set of experiments to exhaustively probe thermoregulatory circuits in larval zebrafish (Figures 6 and 7). Specifically, we exploit the flexibility of MINE to provide randomly varied temperature stimuli across zebrafish and imaging planes while maintaining the ability to identify functional groups of neurons based on features of the trained CNNs. Using MINE, we discover a new functional class of neurons integrating thermosensory with behavioral information. Combining MINE with anatomical analysis, we also map functional features derived with MINE to a standard zebrafish brain.
Results
A model discovery approach to identify task-relevant neurons
MINE uses CNNs to overcome the challenges of predefined models while maintaining interpretability. Feedforward ANNs are capable of approximating any function (Cybenko, 1989; Hornik et al., 1989; Gorban and Wunsch, 1998) and therefore afford great flexibility in capturing the relationship between ‘predictors’ (sensory stimuli, behavioral actions, internal states, etc.) and the activity of individual neurons. We designed a simple three-layered network architecture (Figure 1A). It consists of a linear convolutional layer (to accommodate temporal transformations such as calcium indicator effects) and two dense layers that capture nonlinear transformations. The architecture’s simplicity speeds up training and eases interpretation while capturing transformations across time (via the convolutional layer) and nonlinear effects (via the dense layers with nonlinear activation functions). We chose a continuously differentiable activation function for our dense network layers. Unlike the popular ReLU nonlinearity, this choice allows us to calculate the higher-order derivatives that capture interaction effects and quantify the computational complexity of transformations. Most network hyperparameters, including the specific activation function (‘swish,’ Ramachandran et al., 2017), were determined by minimizing test error on a small dataset (see ‘Methods’). To capture temporal effects in all subsequent experiments, we chose a fixed length of 10 s for our convolutional filters; this length can be adjusted as needed.
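In outline, the forward pass of such a network can be sketched in a few lines of NumPy. This is an illustrative re-implementation, not the authors' code: the filter count, dense-layer width, history length, and weight scales below are arbitrary placeholders (MINE's actual hyperparameters were tuned on a test dataset, as described in 'Methods').

```python
import numpy as np

def swish(x):
    # swish activation: x * sigmoid(x) (Ramachandran et al., 2017)
    return x / (1.0 + np.exp(-x))

class MineCnnSketch:
    """Forward pass of the three-layer architecture: one linear
    convolutional layer over the input history, followed by two dense
    layers with swish activations and a linear readout. All sizes are
    illustrative placeholders."""

    def __init__(self, n_predictors, hist_len=50, n_conv=8, n_dense=64, seed=0):
        rng = np.random.default_rng(seed)
        # linear convolutional layer: one bank of temporal filters per predictor
        self.conv = rng.standard_normal((n_conv, n_predictors, hist_len)) * 0.1
        self.w1 = rng.standard_normal((n_dense, n_conv)) * 0.1
        self.b1 = np.zeros(n_dense)
        self.w2 = rng.standard_normal((n_dense, n_dense)) * 0.1
        self.b2 = np.zeros(n_dense)
        self.w_out = rng.standard_normal(n_dense) * 0.1

    def __call__(self, x):
        # x: (n_predictors, hist_len) history of all predictors at one timepoint
        c = np.tensordot(self.conv, x, axes=([1, 2], [0, 1]))  # linear conv layer
        h1 = swish(self.w1 @ c + self.b1)                      # dense layer 1
        h2 = swish(self.w2 @ h1 + self.b2)                     # dense layer 2
        return float(self.w_out @ h2)                          # scalar activity prediction
```

Because every layer is differentiable (swish, unlike ReLU, is smooth), the gradient and Hessian of this map with respect to the input history exist everywhere, which is what the Taylor analyses below rely on.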
To test MINE’s ability to capture neural transformations and encoding of predictors, we generated a ground-truth dataset. This dataset consists of four randomly generated predictors (Figure 1B). Two of these, S1 and S2, vary continuously in time, mimicking sensory stimuli. The other two, M1 and M2, are discrete like motor or decision variables. From these predictors, we generated ‘neural responses’ that depend on single predictors or interactions with and without intervening nonlinear transformations (Figure 1C and Figure 1—figure supplement 1A). We added Gaussian noise to all neural responses after convolution with a calcium kernel to approach conditions that might be observed in functional calcium imaging experiments (Figure 1D). For each neuron in our ground-truth dataset, we subsequently trained a CNN to predict the activity from the predictors (Figure 1E). To assess generalization, we split our dataset into a training set (two-thirds of the data) and a validation set (last third) (Figure 1B and D). Training for 100 epochs led to excellent predictions of ‘neural activity’ while maintaining generalizability as assessed by the squared correlation (${r}^{2}$ value) to the test data (Figure 1E and F).
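A minimal version of such a ground-truth generator might look as follows. The predictor statistics, calcium-kernel time constant, noise level, and the specific example response are illustrative assumptions, not the values used in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 3000  # timepoints

# continuous "sensory" predictors S1, S2: smoothed Gaussian noise
s1 = np.convolve(rng.standard_normal(n), np.ones(25) / 25, mode="same")
s2 = np.convolve(rng.standard_normal(n), np.ones(25) / 25, mode="same")
# discrete "motor" predictors M1, M2: sparse binary events
m1 = (rng.random(n) < 0.05).astype(float)
m2 = (rng.random(n) < 0.05).astype(float)

# one example response type: a nonlinear interaction of S1 and M1
response = np.tanh(s1) * m1

# convolve with an exponential calcium kernel, then add Gaussian noise
tau = 1.8 * 5  # assumed decay in frames (1.8 s at an assumed 5 Hz frame rate)
kernel = np.exp(-np.arange(50) / tau)
activity = np.convolve(response, kernel)[:n]
activity = activity + rng.standard_normal(n) * 0.1 * activity.std()

# two-thirds train / last-third validation split, as in the text
train, test = activity[: 2 * n // 3], activity[2 * n // 3 :]
```

Other response categories of the ground-truth dataset (single linear predictors, derivatives, products of predictors) can be produced by swapping the `response` line.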
We sought to compare our ANN-based approach to another widespread approach to model single-neuron activity, linear regression. Using the four predictors (Figure 1B) as inputs, a simple linear regression model fails to explain activity in most cases (Figure 1F and Figure 1—figure supplement 1C). This is expected since the chosen type of linear regression model, unlike the CNN, cannot learn the dynamics of the calcium indicator. We therefore constructed an alternative model in which we convolved the predictors with the known ‘calcium kernel’ (Figure 1—figure supplement 1B). In this expanded linear model, we also included all first-order interaction terms by including pairwise products between predictors (Figure 1—figure supplement 1B). This type of model, capturing interactions and accounting for an estimated indicator effect, is popular in the analysis of large-scale calcium imaging datasets (Miri et al., 2011; Ahrens et al., 2012; Portugues et al., 2014; Chen et al., 2018; Stringer et al., 2019). As expected, this model matches the performance of the CNN in more response categories, including nonlinear interactions (Figure 1F). However, this model was designed using a posteriori knowledge about the responses. Nonetheless, the expanded linear model is poor at capturing some nonlinear transformations of predictors and fails to capture responses that relate to the time derivative of an input, for example, as expected in adapting neurons (Figure 1F). While other models could clearly be designed to overcome these challenges, this further illustrates the point that a model-based approach is limited by the constraints of the chosen model.
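The expanded linear comparison model can be sketched as a design-matrix construction. The helper names below are hypothetical, and the calcium kernel is assumed to be known, as in the text.

```python
import numpy as np
from itertools import combinations

def expanded_design(predictors, kernel):
    """Design matrix for the expanded linear model: each predictor convolved
    with the (known) calcium kernel, plus all pairwise products of the
    convolved predictors as first-order interaction terms, plus an
    intercept column."""
    n = len(predictors[0])
    conv = [np.convolve(p, kernel)[:n] for p in predictors]
    inter = [a * b for a, b in combinations(conv, 2)]
    return np.column_stack([np.ones(n)] + conv + inter)

def fit_linear(X, y):
    # ordinary least-squares fit of the expanded model
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef
```

With four predictors this yields 4 convolved regressors plus 6 pairwise interactions (and an intercept); the responses it cannot express, such as derivatives of the input or saturating nonlinearities, are exactly the categories where it underperforms the CNN in Figure 1F.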
As shown, MINE can identify responses that depend on predictors independent of the linearity of these relationships. The underlying CNN is able to learn temporal transformations of inputs such as shifts in time or convolutions. Otherwise, these transformations have to be explicitly provided to a regression model, or the predictor matrix has to be augmented for the model to learn them implicitly (see the comparison model used in the analysis of zebrafish data below) (Miri et al., 2011; Musall et al., 2019a). MINE removes the requirement of estimating calcium response kernels, which might differ across neurons, and can also identify responses that depend on derivatives of the input. This means that a single predictor such as position can be used to identify neurons whose activity depends on the velocity or acceleration of a stimulus, without the need to augment the predictor matrix.
MINE characterizes computational complexity
Linear computations are limited in their expressivity, and it is generally believed that the computational power of the nervous system depends on nonlinear computation (Hubel and Wiesel, 1968; Churchland et al., 1994; Carandini et al., 2005). In spite of this, the responses of many neurons tend to depend almost linearly on sensory or behavioral features (Miri et al., 2011; Thompson et al., 2016; Pho et al., 2018; Musall et al., 2019a). The latter idea aligns with the hypothesis that information important to the animal should be linearly decodable by neural circuits (Marder and Abbott, 1995; Eliasmith and Anderson, 2002; Shamir and Sompolinsky, 2006). Disambiguating linear from nonlinear processing therefore provides important insight into circuit structure and function. Once fit to neural data, our CNNs model the transformations from predictors to neural activity (or from neural activity to action). We therefore set out to quantify the complexity of the function these networks implement as a proxy to classifying the actual transformations between stimuli, neural activity, and behavior.
To arrive at a metric of ‘computational complexity,’ we used Taylor expansion to approximate the function implemented by the CNN. The Taylor expansion approximates a function, such as the CNN, around a specific input by a polynomial of up to infinite order. The weights of individual terms in this polynomial are determined by the values of the derivatives of the CNN output with respect to each input element (in our case predictors and timepoints). These derivatives describe the expected change in the output of the CNN given a small change in the predictor input. Calculating derivatives of increasing order at the point of expansion allows predicting the value of the CNN output at other points (akin to predicting the position of a car in the near future based on its current location, velocity, and acceleration). We specifically compute the first- and second-order partial derivatives of the CNN output with respect to each feature (predictor and timepoint) in the input at the average of the training data. This allows formulating the Taylor expansion of the network around the data mean. It is then possible to compare the quality of Taylor expansions with variable numbers of terms in predicting the true network output. If the network were to implement a linear function, the first-order term (the gradient $J$ of the function with respect to the input) should suffice to explain a large fraction of the variance in the output (Figure 2A). Nonlinear functions should depend on higher-order terms such as the second-order partial derivatives, $H$. We chose to define ‘computational complexity’ based on the requirement of the second- or higher-order terms in the Taylor expansion (Figure 2A). Specifically, we assign complexity 0 if the linear term in the expansion is enough to explain the activity, 1 if the quadratic term is needed, and 2 if higher-order terms are required. Nonlinear neurons are therefore split into two categories, depending on the complexity of the transformation.
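The logic of this complexity metric can be illustrated on a toy function standing in for the trained CNN. Here derivatives are taken by finite differences rather than by automatic differentiation, and the $R^{2}$ is simply one minus the residual-to-total variance ratio; this is a sketch, not the paper's exact implementation.

```python
import numpy as np

def taylor_r2(f, X, order, eps=1e-3):
    """R^2 of a Taylor truncation of scalar function f around the data mean.
    The gradient J and Hessian H are estimated by central finite differences,
    standing in for the CNN's analytic derivatives."""
    x0 = X.mean(axis=0)
    d = X.shape[1]
    f0 = f(x0)
    E = np.eye(d) * eps
    # gradient J (first-order term)
    J = np.array([(f(x0 + E[i]) - f(x0 - E[i])) / (2 * eps) for i in range(d)])
    # Hessian H (second-order term)
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            H[i, j] = (f(x0 + E[i] + E[j]) - f(x0 + E[i] - E[j])
                       - f(x0 - E[i] + E[j]) + f(x0 - E[i] - E[j])) / (4 * eps ** 2)
    dx = X - x0
    pred = f0 + dx @ J                                   # linear truncation
    if order >= 2:
        pred = pred + 0.5 * np.einsum("ni,ij,nj->n", dx, H, dx)
    true = np.array([f(x) for x in X])
    return 1.0 - np.var(true - pred) / np.var(true)
```

For a linear function the linear truncation already explains essentially all variance (complexity 0); for a quadratic function the linear truncation fails but adding the Hessian term recovers it (complexity 1), mirroring the classification scheme in Figure 2A.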
We tested these metrics on ground-truth data in the following manner: we mixed predictors to varying degrees with either linear (as a control) or nonlinear transformations of the same predictors. By increasing the contribution of the transformed predictor, we thereby inject varying amounts of nonlinearity into the response. We then applied MINE, training a CNN to learn the transformation between the predictor and the mixed output and calculating the coefficients of determination (${R}^{2}$) for truncations of the Taylor series after the linear as well as the second-order term. Increasing the contribution of an arbitrary nonlinear function (i.e. ${\mathrm{tanh}}^{2}\left({x}^{3}\right)$) leads to a marked decrease in both of these ${R}^{2}$ values (Figure 2B). Importantly, these metrics do not simply quantify a loss of linear correlation. Calculating the derivative of a function is a linear operation, and increasing the contribution of a derivative of the predictor indeed does not decrease either ${R}^{2}$ value in spite of reducing linear correlation between the predictor and the response (Figure 2C and Figure 2—figure supplement 1A–C). When increasing the contribution of a rectified linear version of the predictor, the amount of variance explained by the linear truncation drops while this is not the case for the truncation after the second-order term (Figure 2D). This is in contrast to the contribution of ${\mathrm{tanh}}^{2}\left({x}^{3}\right)$ above and justifies our interpretation of computational complexity as a metric for how much a relationship diverges from linearity.
To quantify its usefulness in distinguishing linear and nonlinear transformations, we systematically evaluated the classification performance of the fraction of variance explained by the linear truncation of the Taylor expansion. To this end, we generated random linear and nonlinear transformations of the predictors (see ‘Methods’), trained the CNN, and calculated the ${R}^{2}$ value of the linear truncation. We compared this ${R}^{2}$ value to two other metrics of nonlinearity: the curvature of the function implemented by the network and the nonlinearity coefficient (Philipp and Carbonell, 2018; Philipp, 2021) of the network (see ‘Methods’). Notably, a decrease in the explained variance by the linear truncation led on average to increases in both these metrics of nonlinearity (Figure 2—figure supplement 1D and E). To quantify classification performance, we used ROC analysis. This analysis revealed that the ${R}^{2}$ value ranks a nonlinear transformation lower than a linear one in 99% of cases (Figure 2E). Importantly, this metric allows disambiguating linear and nonlinear processing with acceptable false-positive and false-negative rates ($<0.05$) at a cutoff of ${R}^{2}<0.8$ as a nonlinearity threshold (Figure 2—figure supplement 1F), a feature that is highly desirable when applying the classification to real-world data.
In summary, MINE increases the interpretability of the CNN model and classifies the transformations the model encapsulates according to their computational complexity. Since the CNN encapsulates the transformations that are enacted by neural circuits between stimuli and neural activity or neural activity and behavior, this information provides important insight into the computations that give rise to neural activity.
MINE characterizes neural receptive fields
Receptive fields compactly describe the stimulus features that drive a neuron’s response (Dayan and Abbott, 2001). They reveal integration times (the times over which the coefficients of the receptive field differ from zero), while zero-crossings within the receptive field signal adapting responses, indicating that a neuron encodes the derivative of a stimulus across time (e.g. velocity-encoding neurons) or that it detects edges across space. It is therefore desirable to use MINE to extract receptive fields of the neurons fit by the CNN. Because of their descriptive nature, different methods have been developed to extract receptive fields, commonly referred to as ‘system identification’ approaches. The Wiener/Volterra expansion of functions provides a powerful framework for system identification (Poggio and Reichardt, 1973; Aertsen and Johannesma, 1981; Friston et al., 1998; Mammano, 1990; Marmarelis, 2004; Mitsis et al., 2007; Mitsis, 2011). Since the input to the CNN at the heart of MINE contains information about each predictor across time, the Taylor expansion introduced in the previous section (Figure 2A) is equivalent to the Volterra expansion of a system processing information across a time window equal to the history length of the CNN. We therefore wondered whether we could extract receptive fields from the gradient $J$ and Hessian $H$ of the CNN in a manner similar to how they can be derived from the first- and second-order Volterra kernels (${k}_{1}$ and ${k}_{2}$; see ‘Methods’) (Marmarelis, 1997; Marmarelis, 2004; Mitsis et al., 2007). To this end, we simulated a system that uses two parallel receptive fields to process an input. The filtered responses are subsequently passed through two differently structured nonlinearities to yield the output (Figure 3A and ‘Methods’).
Notably, due to the structure of the nonlinearities, we expect the first receptive field (here called ‘linear receptive field’) to be equivalent to $J$/${k}_{1}$ and the second receptive field (here called ‘nonlinear receptive field’) to appear as an eigenvector of $H$ and ${k}_{2}$.
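This expectation can be checked on a miniature stand-in for the trained network. The code below is an illustrative sketch: the two filters, the 50-sample history length, and the finite-difference derivatives are all assumptions, standing in for the simulated receptive fields and the CNN's analytic gradient and Hessian.

```python
import numpy as np

t = np.arange(50)
k_lin = np.exp(-t / 10.0)                    # hypothetical "linear" receptive field
k_non = np.sin(t / 5.0) * np.exp(-t / 15.0)  # hypothetical "nonlinear" receptive field
k_lin /= np.linalg.norm(k_lin)
k_non /= np.linalg.norm(k_non)

def system(x):
    # two parallel pathways: a linear one and a squared (nonlinear) one
    return k_lin @ x + (k_non @ x) ** 2

# finite-difference stand-ins for the network's gradient J and Hessian H at x = 0
x0, eps, I = np.zeros(50), 1e-3, np.eye(50)
J = np.array([(system(x0 + eps * I[i]) - system(x0 - eps * I[i])) / (2 * eps)
              for i in range(50)])
H = np.array([[(system(eps * (I[i] + I[j])) - system(eps * (I[i] - I[j]))
                - system(-eps * (I[i] - I[j])) + system(-eps * (I[i] + I[j])))
               / (4 * eps ** 2) for j in range(50)] for i in range(50)])

rf_lin = J                              # gradient recovers the linear RF (J ~ k1)
w, v = np.linalg.eigh(H)
rf_non = v[:, np.argmax(np.abs(w))]     # leading eigenvector recovers the nonlinear RF (up to sign)
```

For this toy system, $J = k_{\mathrm{lin}}$ and $H = 2\,k_{\mathrm{non}}k_{\mathrm{non}}^{\top}$, so the leading eigenvector of $H$ is $\pm k_{\mathrm{non}}$; the sign ambiguity is inherent to eigenvector-based recovery.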
As a comparison, we used regression to directly fit the Volterra kernels (see ‘Methods’). Notably, just like MINE, the Volterra analysis is highly flexible apart from the truncation of the series after the second term, since most scalar-valued functions can be approximated using an infinite Volterra series (Volterra, 1959). System identification often employs specifically designed stimuli such as Gaussian white noise (Marmarelis and Marmarelis, 1978; Korenberg and Hunter, 1990; Rieke et al., 1999; Schwartz et al., 2006; Gollisch and Meister, 2008), which are difficult to realize in practice and severely restrict experimental conditions. Nonetheless, we first benchmarked both approaches using Gaussian white noise as input (Figure 3—figure supplement 1A). As expected, both MINE and directly fitting the Volterra kernels yielded receptive fields that were nearly indistinguishable from the ground truth (Figure 3—figure supplement 1B). This demonstrates that our simulation indeed results in receptive fields that can be discovered using MINE or a second-order Volterra model. However, for the analysis to be useful, it is critical that receptive fields can be extracted on arbitrary and slowly varying stimuli as expected, for example, during naturalistic behavior. We therefore repeated the analysis using slowly varying stimuli (Figure 3—figure supplement 1A). Under these more naturalistic conditions, MINE still yielded well-fitting receptive fields (Figure 3B and Figure 3—figure supplement 1C and D). An ordinary regression fit of the Volterra kernels, on the other hand, failed to recover the receptive fields (Figure 3B), which is in line with the observation that ANNs form an efficient route to Volterra analysis (Wray and Green, 1994). We assumed that this failure was due to a lack of constraints, meaning that the ordinary regression model could not handle the departure from Gaussian white noise stimuli.
We therefore refit the Volterra kernels using Ridge regression (Hastie et al., 2009) with increasing penalties. Indeed, for high penalties, the direct fit of the Volterra kernels yielded receptive fields of almost comparable quality to MINE (Figure 3B and Figure 3—figure supplement 1E–H). The CNN model fit by MINE also had slightly higher predictive power than the best-fit Volterra model, as indicated by higher correlations of predicted activity on a test dataset (Figure 3C). Repeating the same analysis with a second set of filters yielded similar results: MINE proved to be a slightly superior approach for extracting the receptive fields (Figure 3D–F). This is not to say that no other way could be found to extract the receptive fields, but it underlines the fact that MINE is a practical solution to this problem, yielding high-quality receptive fields in addition to other information about the relationship between predictors and neural responses.
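A minimal ridge fit of a first-order kernel under a slowly varying (non-white) stimulus illustrates the approach. The stimulus statistics, kernel shape, noise level, and penalty below are illustrative assumptions, and the paper's full comparison model also includes the second-order kernel, which is omitted here for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
n, L = 4000, 30

# slowly varying stimulus: white noise smoothed over 10 samples
stim = np.convolve(rng.standard_normal(n + L), np.ones(10) / 10, mode="same")

# hypothetical first-order kernel (receptive field) to recover
k_true = np.exp(-np.arange(L) / 8.0) * np.cos(np.arange(L) / 10.0)

# design matrix of stimulus histories (first-order Volterra term)
X = np.stack([stim[i : i + L][::-1] for i in range(n)])
y = X @ k_true + rng.standard_normal(n) * 0.05

def ridge_fit(X, y, lam):
    # closed-form Ridge solution: (X'X + lam I)^-1 X'y
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

k_hat = ridge_fit(X, y, lam=1.0)
```

The penalty `lam` shrinks the kernel components lying in directions of the stimulus space that the smoothed stimulus barely excites; it is exactly these poorly constrained directions that blow up in an unpenalized fit under non-white stimuli.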
We used the same simulation to assess how the predictive power of the different models (MINE and Ridge regression), as well as the quality of the extracted receptive fields, depends on the amount of training data. As expected, predictive power on a validation set increases with the number of training samples for all methods. The Ridge model with the lower penalty outperforms MINE for lower numbers of training samples (Figure 3—figure supplement 1I). However, this model does not yield usable receptive fields, and in fact both Ridge models show a deterioration of the extracted receptive fields for large numbers of training samples (Figure 3—figure supplement 1J). This is likely a result of the competition between the regularization penalty and the overall error of the fit. MINE appears more robust to this effect, but for very long sample lengths it may be necessary to adjust the CNN’s regularization as well (Figure 3—figure supplement 1J).
Since Ridge regression constrains linear regression models, we wondered how the effective degrees of freedom of the CNN and the different regression models would compare. Interestingly, in spite of having nearly 14,000 parameters, the effective degrees of freedom of the CNN model are <50, and the Ridge regression models that are successful in identifying the receptive fields approach similar effective degrees of freedom (Figure 3—figure supplement 1K). This is in line with successful approaches to system identification employing constraints such as Laguerre basis functions (Friston et al., 1998; Marmarelis, 2004; Mitsis et al., 2007). In the case of the CNN used by MINE, the effective degrees of freedom are limited by the L1 penalty on the weights (sparsity constraint), the dropout, and the limited number of training epochs (Figure 3—figure supplement 1L).
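For the Ridge models, the effective degrees of freedom have a standard closed form, the trace of the hat matrix $X(X^{\top}X+\lambda I)^{-1}X^{\top}$ (Hastie et al., 2009), which can be computed from the singular values of the design matrix:

```python
import numpy as np

def ridge_edof(X, lam):
    """Effective degrees of freedom of Ridge regression with penalty lam:
    trace of the hat matrix X (X'X + lam I)^-1 X', i.e. sum of
    s_i^2 / (s_i^2 + lam) over singular values s_i of X."""
    s = np.linalg.svd(X, compute_uv=False)
    return float(np.sum(s ** 2 / (s ** 2 + lam)))
```

As `lam` approaches 0 this recovers the full column count of `X`, while large penalties drive it toward 0, mirroring how the L1 penalty, dropout, and limited training keep the ~14,000-parameter CNN below 50 effective degrees of freedom.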
Overall these results suggest that MINE can recover the types of receptive fields of neurons that can be obtained with system identification approaches under a broader set of biologically relevant stimulus conditions.
Taylor analysis identifies predictors driving the response
Compared to regression models, the CNNs seemingly have a drawback: a lack of interpretability. Statistical methods can identify the factors that significantly contribute to a regression model’s output. Similarly, the magnitude of individual weights in models fit to data can give an idea of the importance of specific predictors. Corresponding overt parameters do not exist in the CNN model. In principle, the extracted receptive fields could be used to identify contributing predictors since it would be expected that they are ‘unstructured’ for unimportant inputs. However, what to consider as ‘unstructured’ is not clearly defined. Theoretically, it would be possible to refit successful CNN models in a stepwise manner, leaving out or adding in specific predictors (Benjamin et al., 2018). However, since we are interested in uncovering the relationships between predictors, this could only succeed if all individual combinations between inputs are tested. This would be prohibitive for large sets of sensory and behavioral predictors as it would require repeatedly retraining all networks.
We therefore again utilized Taylor expansion. To account for relationships that cannot be sufficiently explained by the second-order Taylor expansion shown in Figure 2A, we perform local expansions at various points instead of one expansion around the data average (Figure 4A). This allows for variation in the network gradient $J$ and Hessian $H$ in cases of high computational complexity. Since the Taylor decomposition is a sum of terms that depend on individual inputs or their combinations, we can use it to determine how important individual predictors, such as sensory stimuli, are in shaping the neural response (Figure 4A and Figure 4—figure supplement 1A). Across our ground-truth dataset, we find a high correlation between the change in output of the Taylor expansion and the true change in network output (Figure 4B, left panel), indicating that the local expansions using first- and second-order derivatives sufficiently capture the relationship between predictors and neural activity. Importantly, the decomposition is a way to efficiently test the importance of different predictors in contributing to network output and hence neural activity. We calculate a ‘Taylor metric’ score, which measures the fraction of explained variance of the response that is lost when terms are removed from the full Taylor expansion (Figure 4B).
On our ground-truth dataset, the Taylor metric correctly identifies the contributing predictors and their interactions. Sorting individual terms by this metric consistently ranks those that we expect to contribute (Figure 4C–E, red bars) higher than those that should not (Figure 4C–E, blue bars). This is true both for individual predictors and for interaction terms in the case where the response depends on the product of inputs (Figure 4C–E and Figure 4—figure supplement 1B–H).
In summary, MINE was able to correctly identify the contributions of predictors, such as sensory stimuli or behavioral actions, to neural responses through local expansions of the trained CNNs. MINE also correctly identified nonlinear interactions in generating the neural responses on our ground-truth dataset. This indicates that we can further mitigate the CNN models’ lack of interpretability and approach the expressivity of linear regression models while maintaining the ability to model nonlinear transformations and interactions of task variables.
MINE characterizes cortical sensorimotor processing
Encouraged by MINE’s ability to identify responses related to behavior and stimuli and its ability to characterize the nature of that relationship, we wanted to test the method on biological data. We applied MINE to a publicly available wide-field calcium imaging dataset recorded in the mouse cortex (Musall et al., 2019a). The dataset consists of 22 task-related predictors (stimuli, reward states, instructed, and non-instructed movements) and calcium activity time series across 200 temporal components that were used to compress each session’s wide-field imaging data (Musall et al., 2019a) from 13 mice. It had previously been analyzed using linear regression (Musall et al., 2019a). We analyzed one session from each of the 13 mice present in the dataset with MINE (Figure 5A). As in the ground-truth data, we split the data into a training set (first two-thirds of the session time) and a test set (last third). To be able to capture not only the activity caused by past stimuli but also the preparatory activity leading up to movements, we shifted the predictor traces with respect to the activity trace (see ‘Methods’). As expected, correlations between MINE predictions and true activity on the test dataset are overall smaller than on the training set; however, in the majority of cases the CNN generalizes well to the new data (Figure 5B). Given that many individual neurons contribute to each pixel and temporal component within this dataset, and that a lot of brain activity is likely unrelated to the behavioral task, we expect that a large fraction of variance in each component is likely unexplainable by any model that does not take internal brain states into account. We therefore chose a lenient correlation cutoff of $r=0.1$ for the test data (red line in Figure 5B) to decide that a component had been successfully fit by MINE (Figure 5B and C).
On average, this led to the identification of >91% of all components per session (Figure 5C) on which Taylor metric and complexity were computed. Notably, MINE assigns low complexity to the majority of components; only 3% of fit components have a linear approximation score <0.8. This means that the relationships between predictors and neural activity are largely linear (Figure 5D). While this may seem surprising given the high nonlinearity of cortical processing, it is likely caused by the low resolution of the data. Each component blends the activity of hundreds of thousands of neurons. Averaging across many individual neurons that each may have their own nonlinear responses likely obfuscates nonlinear effects.
Across all sessions, we found a broad representation of predictors (Figure 5E). As expected from the findings of Musall et al., 2019a, uninstructed movements appear overrepresented, sporting the largest Taylor metric in more than half of the fit components. We next mapped the results of MINE back into acquisition space and recalculated the Taylor metrics for each imaging pixel for one example session. The obtained results further validate MINE’s utility on biological data. Specifically, visual stimuli contribute most strongly to responses in visual cortical areas with the expected left–right asymmetry (Figure 5F). At the same time, comparing visual sensory responses to whisk responses shows that cortical regions enhanced for whisking (Figure 5G) correspond very well to the regions strongly marked with a whisk event kernel in Figure 2 of Musall et al., 2019a. Furthermore, instructed left and right grab events largely contribute to motor cortical regions, again with the expected left–right asymmetry (Figure 5H). Repeating this procedure on all sessions (Figure 5—figure supplement 1) reveals general agreement, but we note that not all sessions have regions as clearly related to the chosen inputs according to the Taylor analysis; this is especially apparent for left and right instructed grabs (Figure 5—figure supplement 1C).
We next sought to determine the receptive fields that govern stimulus processing in individual pixels (Lehky et al., 1992; Dayan and Abbott, 2001). We extracted the receptive fields across pixels that are strongly related to either left/right visual stimuli or left/right instructed grabs. We broadly clustered the receptive fields into two groups (see ‘Methods’) to separate excitatory and inhibitory effects. Receptive fields for left and right visual stimuli in contralateral brain regions are highly similar to each other (Figure 5I, left). The excitatory effect is stronger than the inhibitory effect (larger coefficients in the positive receptive fields) and both are clearly biphasic. This indicates that the events around the time of the stimulus as well as $\sim 1.5\mathrm{s}$ in the past strongly influence the activity of the visual neurons. We note that this biphasic structure mimics the linear regression kernel in Figure 3b of Musall et al., 2019a. The visual receptive fields have essentially zero weight after the current time, which is expected since future stimuli are unlikely to influence current neural activity. The receptive fields of left and right grab neurons, on the other hand, are much sharper, indicating that these neurons influence movement over short timescales (Figure 5I, right). Furthermore, the grabrelated receptive fields contain coefficients different from baseline for up to 100 ms into the future. This suggests preparatory activity, that is, that future movements are reflected in current neural activity.
In summary, the results presented above demonstrate the applicability of our method to biological data and the potential for identifying diverse feature sets of sensorimotor processing.
Functional characterization of thermoregulatory circuits
Encouraged by MINE’s performance on ground-truth and mouse cortical data, we sought to use it to gain novel insight into zebrafish thermoregulatory circuits. Temporal transformations of stimuli are a notable feature of how zebrafish process temperature information. In previous research, we identified neurons and centers in the larval zebrafish brain that process temperature: specifically, neurons that compute the rate of change of the temperature stimulus (Haesemeyer et al., 2018) as well as neurons whose activity is consistent with integration of temperature fluctuations (Haesemeyer et al., 2019). Because the nature of these transformations was unknown a priori, simple regression-based approaches failed to identify neurons involved in temperature processing, and we previously resorted to clustering instead. However, because behavior is stochastic, one particular drawback of clustering was that it precluded identifying neurons that integrate thermosensory stimuli and behavioral actions. Since MINE can learn and capture both temporal transformations and interactions, we set out to gain deeper insight into thermoregulatory processing. To this end, we used MINE to identify and characterize neurons during a thermosensory task, revealing the predictors contributing to their activity, extracting their receptive fields, and characterizing their computational complexity.
Classifying neurons by clustering requires either presenting the same stimulus to every animal and neuron while imaging or a strategy of clustering event-triggered activity. With MINE we do not need to enforce any stimulus constraint. As a result, we were able to probe a wider variety of stimulus conditions. We imaged a total of 750 planes across 25 larval zebrafish that expressed the nuclear calcium indicator H2B:GCaMP6s (Freeman et al., 2014) in all neurons and the excitatory neuron marker vglut2a:mCherry (Satou et al., 2013) in presumed glutamatergic neurons. This dataset provided slightly more than fourfold coverage of the larval zebrafish brain. On each imaging plane, we presented a heat stimulus generated by randomly composing sine waves with frequencies between 0.005 Hz and 0.075 Hz (Figure 6A and Figure 6—figure supplement 1A). We concurrently recorded elicited tail motion at 250 Hz. We segmented the imaging data using CaImAn (Giovannucci et al., 2019), which identified $\sim 433,000$ active neurons across our dataset. From the tail motion data, we extracted (1) swim starts and (2) tail features that correlate with swim displacement (‘vigor’) and turn angle (‘direction’) in free-swimming experiments (Figure 6A and Figure 6—figure supplement 1C).
We used the information about stimulus and elicited behaviors across time as inputs to MINE, fitting CNNs to each neural calcium response (Figure 6B and C). The initial two-thirds of the recording served as training data and the last third as a test/validation set. Notably, due to the random nature of both our stimulus and the elicited behavior, test and training data were largely uncorrelated (Figure 6—figure supplement 1B). Predictive power over the test data therefore indicates that the neural network model generalizes and truly captures how stimulus and behavioral data are related to neural activity. We chose 50% of explained variance on the test data, corresponding to a correlation of $r=\sqrt{0.5}$, as a stringent cutoff to determine whether a neuron can be modeled by MINE. A considerably higher threshold was used on this dataset than on the cortical data because we recorded single-cell activity, making it less likely that the activity of a true responder is mixed with unrelated background activity. Using cyclic permutations of the data as a control revealed that this cutoff corresponds to a 93-fold enrichment over control data (Figure 6—figure supplement 1D).
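The cyclic-permutation control can be sketched as follows: rotating the response trace relative to the model prediction destroys their temporal alignment while preserving autocorrelation structure, yielding a null distribution of test correlations. The function below is an illustrative stand-in (names and parameters are ours), not the exact procedure used in the paper:

```python
import numpy as np

def cyclic_null(prediction, response, n_shifts=100, min_shift=50, seed=0):
    """Null distribution of correlations from cyclic rotations of the response."""
    rng = np.random.default_rng(seed)
    n = len(response)
    shifts = rng.integers(min_shift, n - min_shift, size=n_shifts)
    return np.array([np.corrcoef(prediction, np.roll(response, s))[0, 1]
                     for s in shifts])

rng = np.random.default_rng(1)
pred = rng.normal(size=2000)
resp = pred + 0.5 * rng.normal(size=2000)   # a correlated toy "neuron"
true_r = np.corrcoef(pred, resp)[0, 1]      # well above the null
null = cyclic_null(pred, resp)              # null correlations near zero
```

A cutoff far in the tail of such a null distribution (here, $r=\sqrt{0.5}$) corresponds to a strong enrichment over chance.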
As a comparison, we fit a linear regression model to this dataset. The inputs included time-shifted versions of stimulus and behavioral variables to allow the model to learn temporal transformations (Musall et al., 2019a) and thereby put it on the same footing as the CNN model (Figure 6C). We used Ridge regression (Hastie et al., 2009) to improve generalization of the linear model. At the designated cutoff, a total of $\sim 42,000$ neurons were identified by either MINE or regression. However, 40% of these neurons were exclusively identified by MINE (Figure 6D), indicating the superiority of the model-discovery approach. In fact, MINE consistently identifies more neurons regardless of the cutoff imposed on test data (Figure 6—figure supplement 1E).
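The linear comparison model can be sketched in numpy: build a design matrix of time-shifted predictor copies so the linear model can learn temporal filters, then fit Ridge regression in closed form. Details (lag count, regularization strength) are illustrative assumptions, not the paper's exact settings:

```python
import numpy as np

def lagged_design(x, n_lags):
    """Rows: time points; columns: x(t), x(t-1), ..., x(t-n_lags+1)."""
    X = np.stack([np.roll(x, k) for k in range(n_lags)], axis=1)
    return X[n_lags - 1:]                  # drop rows containing wrapped values

def ridge_fit(X, y, alpha=1.0):
    """Closed-form ridge solution: (X'X + alpha*I)^-1 X'y."""
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

rng = np.random.default_rng(0)
stim = rng.normal(size=1000)
kernel = np.array([0.5, 0.3, 0.2])         # hidden temporal filter
resp = np.convolve(stim, kernel, mode="full")[:1000]
X = lagged_design(stim, 3)
w = ridge_fit(X, resp[2:], alpha=1e-3)     # recovers the hidden filter
```

The fitted weights `w` approximate the temporal kernel, which is how time-shifted regressors let a linear model capture temporal transformations.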
We used MINE to assign the identified neurons to functional types according to (1) whether their encoding of stimulus or behavioral features is linear or nonlinear (computational complexity) and (2) which stimulus and behavioral features drove their activity (Taylor analysis). As before, we scored a neuron as nonlinear when truncating the Taylor expansion after the linear term did not explain at least 80% of the variance of the CNN output. For the Taylor analysis, we chose a stringent cutoff, requiring the Taylor metric to be significantly >0.1 with a p-value $<0.05$ on the whole dataset (Bonferroni correction; effective p-value $\sim 1.26 \times 10^{-6}$), to determine whether a predictor influenced the activity of a neuron. Since interaction terms had very small Taylor metrics, we ignored them for neuron classification purposes. In total, we identified 33 functional neuron classes, all of which were identified across multiple fish (Figure 6E). As a control, we labeled reticulospinal neurons in six fish via spinal backfills; as expected, the vast majority of these neurons were classified by MINE as driven by behavioral features (Figure 6F). Across the brain, we identified multiple functional classes of neurons with mixed selectivity, that is, neurons with activity jointly driven by the temperature stimulus and behavioral outputs (Figure 6E, blue filled bars). These mark a previously unidentified class of neurons in the thermoregulatory circuit that might play a key role in behavioral thermoregulation: they could allow the animal to relate behavioral output to temperature changes and thereby characterize the thermal landscape. Analyzing the success of the linear comparison model by functional class revealed a bias toward the identification of behavior-related activity and, as expected, low computational complexity.
Stimulus- or mixed-selectivity neurons, on the other hand, were underrepresented in the pool identified by the linear model (Figure 6G). This points to potential problems of bias when identifying neurons by means of such linear models.
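The classification rule above can be sketched as a simple predicate over per-predictor Taylor metrics. Note that this sketch applies only the threshold; the paper additionally requires statistical significance with Bonferroni correction, which we omit here. All names are illustrative:

```python
def classify(taylor_metrics, predictor_names, threshold=0.1):
    """Assign a functional class from the predictors passing the Taylor cutoff."""
    passed = [name for name, m in zip(predictor_names, taylor_metrics)
              if m > threshold]
    return "+".join(passed) if passed else "unclassified"

names = ["temperature", "swim_start", "vigor", "direction"]
# A hypothetical mixed-selectivity neuron: stimulus and swim-start driven
cls = classify([0.4, 0.25, 0.02, 0.01], names)
```

Enumerating such predictor combinations (together with the linear/nonlinear flag) yields the kind of discrete functional classes counted in Figure 6E.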
We registered all our imaging data to a standard zebrafish reference brain (Z-Brain; Randlett et al., 2015). This allowed us to assess the distribution of identified functional neuron types throughout the brain using two complementary methods. We performed anatomical clustering (see ‘Methods’) to visualize regions with high densities of each functional type (Figure 6H) and determined in which annotated anatomical regions a functional neuron type is enriched (Figure 6—figure supplement 1G–I). While Stimulus, Behavior, and Mixed-selectivity neurons are generally broadly distributed, they are enriched in specific brain regions (Figure 6H and Figure 6—figure supplement 1G–I). We found stimulus-driven neurons to be enriched in telencephalic regions as well as the habenula and its output region, the interpeduncular nucleus (Figure 6—figure supplement 1G). As expected, behavior-driven neurons are enriched in hindbrain regions and the midbrain nucleus of the medial longitudinal fasciculus (Figure 6—figure supplement 1H; Severi et al., 2014; Thiele et al., 2014). Mixed-selectivity neurons occupy some regions shared with stimulus-driven neurons, such as telencephalic area M4 and the interpeduncular nucleus, but are also strongly enriched in the raphe superior as well as the caudal hypothalamus and the torus longitudinalis (Figure 6—figure supplement 1I). Our current dataset did not cover sensory ganglia well (especially not the trigeminal ganglion), with the exception of the olfactory epithelium, where we found temperature-sensitive neurons in a medial zone (Figure 6—figure supplement 1I). This is in line with reports in mice and frogs that describe specific thermosensitive neurons in the olfactory epithelium of these species (Schmid et al., 2010; Kludt et al., 2015; Fleischer, 2021). Overall, these results demonstrate that functional cell types identified by MINE segregate spatially within the brain, likely forming organizational units within the thermoregulatory circuit.
Computational features of thermoregulatory circuits
We next sought to use MINE to analyze computational features of thermoregulatory circuits. We subdivided neural classes according to the behavioral features they control, the sensory features they extract (their receptive fields), and their computational complexity. In the case of mixed-selectivity neurons, we used the predictive power of MINE to gain insight into how they integrate thermosensory and behavioral information. Mapping these computational features onto the zebrafish brain subsequently revealed their anatomical organization. Analyzing brain regions for enrichment of neurons encoding different behavioral features suggests a segregation in the control of swim starts, swim speed (vigor), and turn angle (directionality). Even though behavior-related neurons are enriched in the hindbrain overall (Figure 6H), subdividing the neurons reveals a broader distribution (Figure 7A and Figure 7—figure supplement 1A–C). Swim start neurons are enriched in the dorsal thalamus as well as the medial hindbrain, while the ventral thalamus, the forebrain, and the locus coeruleus preferentially encode swim speed (Figure 7 and Figure 7—figure supplement 1A and B). Neurons that encode turning are enriched in mid- and hindbrain regions (Figure 7 and Figure 7—figure supplement 1C). Notably, these neurons could be involved in controlling these behavioral features or, alternatively, maintain efference copies of the motor commands for other purposes (Odstrcil et al., 2022).
To investigate the stimulus features that drive temperature-encoding neurons, we performed clustering on their receptive fields, which we retrieved using MINE as described above (Figure 7B and C and Figure 7—figure supplement 1D). A clustering approach based on the cosine similarity of receptive fields (see ‘Methods’) resulted in 15 clusters, each containing neurons from at least 15 fish across our dataset. Each cluster represents a functional neuron class that extracts specific features of the temperature stimulus (Figure 7B, left). Localizing the different clusters in the brain reveals an anatomical organization of processing. Clusters 1 and 4 are enriched in the forebrain (Figure 7B, right), and their receptive fields suggest that they are most responsive to heating and cooling stimuli over slow timescales, respectively (Figure 7C). Their response to temperature increases or decreases is evidenced by the zero-crossing of their receptive fields, while the slow rise/fall of the coefficients suggests computation over longer timescales. Cluster 3, on the other hand, computes the rate of heating over shorter timescales (Figure 7C) and is almost exclusively localized to the hindbrain (Figure 7B, right). Cluster 12 represents neurons that are excited by colder temperatures, since its receptive field has exclusively negative coefficients (Figure 7C). This neuron type is found predominantly in hind- and midbrain regions (Figure 7B, right). We note that some of the uncovered receptive fields (e.g. clusters 1 and 4) have the largest departures from 0 at the start and end of the receptive field. This might indicate that the chosen history length (here 10 s) is too short and does not cover the entire timescale of processing, which could be a result of the rather slow nuclear calcium indicator we chose for this study.
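Clustering receptive fields by cosine similarity can be sketched as below. This greedy variant (threshold and procedure are our simplification; the paper's actual algorithm is described in its Methods) groups fields whose normalized shapes align, regardless of amplitude:

```python
import numpy as np

def cosine_cluster(rfs, threshold=0.9):
    """Greedy clustering: assign each field to the first center it matches."""
    units = rfs / np.linalg.norm(rfs, axis=1, keepdims=True)
    labels = -np.ones(len(rfs), dtype=int)
    centers = []
    for i, u in enumerate(units):
        sims = [c @ u for c in centers]      # cosine similarity to each center
        if sims and max(sims) > threshold:
            labels[i] = int(np.argmax(sims))
        else:
            centers.append(u)                # field seeds a new cluster
            labels[i] = len(centers) - 1
    return labels

t = np.linspace(0, 10, 100)
rf_fast = np.exp(-t)                         # sharp, short-timescale field
rf_slow = np.exp(-t / 5)                     # slow-timescale field
rfs = np.stack([rf_fast, 1.2 * rf_fast, rf_slow, 0.8 * rf_slow])
labels = cosine_cluster(rfs)                 # fast and slow fields separate
```

Because cosine similarity ignores scale, fields differing only in gain (e.g. `rf_fast` and `1.2 * rf_fast`) land in the same cluster while fields with different temporal shapes separate.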
Mixed-selectivity neurons could provide zebrafish with important information about the thermal environment, such as the slope of temperature gradients the fish can use to thermoregulate. We therefore investigated how these neurons combine thermosensory and behavioral information by visualizing response landscapes (akin to Heras et al., 2019). Here, we specifically focused on neurons that are driven by both swim starts and temperature inputs and that were classified as nonlinear (indicated by a red arrowhead in Figure 6E). We extracted the neurons’ receptive fields for these quantities and analyzed predicted responses of the CNN fit by MINE to combinations of swim and sensory drive. We defined ‘drive’ as scaled versions of the receptive field since it represents the ideal stimulus driving a neuron (Dayan and Abbott, 2001) (see ‘Methods’). Clustering neurons according to these response landscapes (Figure 7D and Figure 7—figure supplement 1E) revealed diverse modes of integration across 10 clusters. Some neurons appear to add behavioral inputs linearly with thermosensory inputs (Figure 7 and Figure 7—figure supplement 1E, clusters 8 and 9). Other neurons, however, show gating (Figure 7D, left, and Figure 7—figure supplement 1E, clusters 0 and 1): here, swim drive effectively scales the sensory responsiveness. Importantly, since these are temporal receptive fields, the swim drive input will scale with the relative timing of swims, being strongest when behavior occurs within the time window of the receptive field. These neurons could therefore act as coincidence detectors that inform larval zebrafish that temperature changes are linked to behavior, as expected in a temperature gradient. We also found a cluster of neurons that display more complex integration of sensory and motor information (Figure 7D, right). At lower-than-average swim drive, these neurons respond symmetrically to both low and high sensory drive.
However, at high swim drive, the response to positive sensory drive is enhanced, making the response of the neuron to temperature asymmetric in this regime.
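A response landscape can be sketched as evaluating a fitted model at a grid of scaled receptive-field inputs. The toy model below implements pure gating (motor drive multiplicatively scales sensory responsiveness); everything here, including the model form, is an illustrative assumption rather than a fitted CNN:

```python
import numpy as np

def response_landscape(model, sensory_rf, motor_rf, scales):
    """Model output at combinations of scaled sensory and motor drive."""
    grid = np.empty((len(scales), len(scales)))
    for i, a in enumerate(scales):          # rows: motor (swim) drive
        for j, b in enumerate(scales):      # columns: sensory drive
            grid[i, j] = model(b * sensory_rf, a * motor_rf)
    return grid

# Toy gating model: motor drive scales the sensory response
model = lambda s, m: (1 + m.sum()) * s.sum()
scales = np.linspace(-1, 1, 5)
rf = np.ones(10) / 10                       # placeholder receptive fields
grid = response_landscape(model, rf, rf, scales)
```

In such a landscape, gating appears as the sensory tuning curve steepening with increasing motor drive, whereas linear addition would produce parallel shifts.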
Lastly, we sought to gain insight into the computational features themselves, that is, the transformations that occur between the sensory input and the observed activity of the neurons. Specifically, we used MINE to analyze the complexity of these computations according to truncations of the Taylor expansion after the linear and second-order terms (Figure 2A). We scored linear neurons as having complexity 0; neurons for which the linear truncation explains <80% of variance (nonlinear) but where the second-order expansion explains at least 50% of variance as having complexity 1; and the remainder, requiring higher-order terms of the expansion, as complexity 2. We found that complexity maps onto both functional (Figure 7E) and anatomical features (Figure 7F), indicating that this division is not arbitrary but rather reflects a meaningful difference in neuron function. Specifically, averaging the complexity score for different receptive field clusters reveals that three of these clusters have lower complexity than expected, indicating that most neurons in these clusters have responses that depend linearly on the temperature stimulus (Figure 7E). One cluster (cluster 6) had an average complexity score >1, indicating that the majority of neurons in this cluster compute highly nonlinear transformations of the stimulus. Anatomical clustering of neurons by computational complexity reveals that different complexity classes are generally intermingled (Figure 7G). However, there appears to be some regionalization. Within the hindbrain, neurons of complexity class 2 are found medially in rhombomeres 1 and 2. We previously found that neurons in this region carry little information about ongoing behavior, which might indicate that they process thermosensory stimuli for other functions (Haesemeyer et al., 2018).
Prominent clusters of complexity 2 neurons can also be found in the habenula and dorsal telencephalon, consistent with the idea that these structures are involved in higher-order computations rather than the short-time control of thermoregulatory behaviors. Linear neurons, on the other hand, are enriched in the posterior and medial hindbrain.
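The complexity scoring rule described above reduces to two thresholded variance-explained values per neuron. A direct sketch (thresholds from the text; function name and inputs are ours):

```python
def complexity_score(lin_r2, second_order_r2):
    """Complexity from variance explained by Taylor truncations."""
    if lin_r2 >= 0.8:
        return 0   # linear: first-order truncation suffices
    if second_order_r2 >= 0.5:
        return 1   # mildly nonlinear: second-order truncation suffices
    return 2       # requires higher-order terms of the expansion

# Three hypothetical neurons: linear, mildly nonlinear, highly nonlinear
scores = [complexity_score(*p) for p in [(0.95, 0.99), (0.4, 0.7), (0.2, 0.3)]]
```

Averaging such scores within a receptive field cluster gives the per-cluster complexity values discussed for Figure 7E.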
In summary, these data reveal the power of MINE to identify computational features leading to a detailed classification of neurons into functional types. This automated, unbiased, and highly flexible analysis framework has the potential to greatly aid the analysis of largescale neural data. To facilitate the adoption of MINE, we expose its functionality through a simple Python interface. This interface allows fitting the corresponding CNN to neural data and extracting all metrics (Table 1).
Discussion
A common goal in analyzing neural data is to relate the activity of neurons to sensory stimuli, behavioral actions, internal states, or other features observed during ongoing behavior. To explore this connection, we neuroscientists often define a model dictating the structure of this relationship. When well thought out, these predefined models have the great advantage that their parameters can be interpreted in a biologically meaningful manner. However, even flexible models, if defined a priori, run the risk of not being able to match transformations occurring in the less-than-intuitive biological brain. Model selection based on probabilistic programs overcomes some of these challenges, allowing for a model-free and flexible approach while maintaining predictive power (Saad et al., 2019). One drawback of this approach, however, is reduced interpretability of the nature of the relationship between inputs (in our case, stimuli, behaviors, etc.) and the output (in our case, neural activity). Here, we have presented MINE to overcome some of these challenges and offer an accessible framework for comprehensive analysis. Despite MINE’s increased flexibility over predefined models, it maintains interpretability: through Taylor decomposition approaches, MINE (1) characterizes the computational complexity of neural processing giving rise to the activity of an individual neuron, (2) extracts linear and nonlinear receptive fields that define the ideal inputs driving a neuron, and (3) yields information about which predictors are most important in driving the neural response.
Discovering encoding of information by neurons
Different paradigms are used to interpret neural activity. MINE specifically supports an encoding view, inspecting which information is encoded by the neurons of interest. ANNs have, however, also been used from a decoding perspective, probing what information can be gleaned about ongoing behavior from neural activity (Frey et al., 2021; Schneider et al., 2022). Here, we restricted the question of information encoding to individual neurons. Nonetheless, MINE could easily be extended to model joint population encoding. The most straightforward approach to this problem would be to decompose the population activity through methods such as principal or independent component analysis. These components would then be fit by MINE instead of single-neuron activity. Similarly, we only demonstrate the relationship of neural activity to external features (stimuli, behavioral actions), but neural population activity or local field potentials could also be used as predictors. We currently do not exploit or characterize joint processing strategies. The efficiency of model-fitting could likely be improved by having the CNN model the relationships of multiple neurons (network outputs) to the same predictor inputs. Given the general applicability of the Taylor decomposition strategies, such networks could be analyzed in the same way as the current simpler CNN underlying MINE.
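The suggested population extension can be sketched with a PCA step: decompose the time × neurons activity matrix and treat the component time series as the responses to be fit. This is an illustrative numpy sketch of the preprocessing only, not part of MINE itself:

```python
import numpy as np

def pca_components(activity, n_components):
    """PCA via SVD; returns component time series (time x n_components)."""
    centered = activity - activity.mean(axis=0)
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_components] * s[:n_components]

rng = np.random.default_rng(0)
shared = rng.normal(size=(500, 1))                 # one shared population signal
activity = shared @ rng.normal(size=(1, 50)) + 0.1 * rng.normal(size=(500, 50))
comps = pca_components(activity, 2)                # 1st component ~ shared signal
```

Each column of `comps` would then be fit by MINE in place of a single neuron's calcium trace.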
ANNs have been used extensively as models of brains to identify the principles underlying neural processing (McClelland et al., 1987). In the present work, CNNs serve exclusively as tools to find the relationships between features of interest and the activity of individual neurons. The network is a stand-in for the circuits transforming these features upstream (for sensation) or downstream (for behavior), as well as for the computational properties of the analyzed neuron. We are therefore interested in the properties of the whole CNN instead of its individual units. Extracting the CNN properties aims to characterize the neuron of interest in terms of how it encodes the features. Such an approach has been used previously to characterize neural receptive fields, recognizing the power of ANNs in capturing nonlinear transformations. Examples include characterizations of nonlinear visual receptive fields (Lehky et al., 1992; Lau et al., 2002; Prenger et al., 2004; Ukita et al., 2019) and nonlinear auditory receptive fields (Keshishian et al., 2020). In retinal processing, work by Baccus, Ganguli, and colleagues (McIntosh et al., 2016; Tanaka et al., 2019) has used the increased flexibility of neural network models over classical linear–nonlinear models to characterize how retinal circuits process natural scene inputs. Internal neurons in the network matched retinal interneurons that were not used for data fitting, and the resulting models generalized better to novel data than linear–nonlinear models in spite of their increased flexibility. Our comparison to Volterra analysis suggests that this might be because the effective degrees of freedom of CNN models are much lower than the total number of parameters, while the space in which these limited degrees of freedom act is better matched to neural data. We go beyond previous approaches in this study by combining modalities and providing an easily accessible framework for analysis of arbitrary neural data.
MINE handles and disambiguates multiple inputs originating from different domains (such as sensory stimuli and behavioral actions). MINE, furthermore, performs a more detailed characterization, producing information about the linearity of the relationship between inputs and neural activity. These advances are made possible through mathematical analysis of the network.
Characterizing the model implemented by the CNN
Different techniques aim at understanding how ANNs arrive at their outputs given a set of inputs (Samek et al., 2019). These techniques are generally referred to as ‘explainable machine learning’ and are particularly important in understanding what features classifiers use to make their decisions. They include visualization techniques as well as layer-wise relevance propagation (Binder et al., 2016), which aims to identify the important input features driving the output of a classifier. In our case, however, we wanted to know which inputs persistently (across experimental time) drive the output of the network, that is, the activity of the modeled neuron. Notably, once the network is trained, the different inputs are no longer separable. This means that there is no guarantee that leaving out an unimportant input will not affect the output of the trained network. A workaround would be to iteratively train the network with subsets of inputs (Benjamin et al., 2018). However, this becomes infeasible even with a moderate number of inputs since all combinations would need to be tested. We therefore chose a continuously differentiable activation function for our network layers. This enabled Taylor decomposition to transform the network, point by point, into a separable sum of terms depending on individual inputs. This approach allowed us to infer which specific inputs drive the network output and hence the activity recorded in the modeled neuron. Notably, this technique can also identify multiplicative interactions, which indicate coincidence detection. Yet, while such interactions were evident in our ground-truth dataset, we found little evidence of multiplicative interactions in the biological data. This is likely due to added noise obfuscating the contribution of these terms, which is comparatively small even on ground-truth data (Figure 4D).
In fact, most information about a multiplicative interaction would be carried in the first derivative: the derivative with respect to one input would vary according to another input. While we did not directly exploit this insight in the present work, we do find evidence of multiplicative interactions in the mixed-selectivity neurons, some of which show gating of sensory responses by behavior (Figure 7D and Figure 7—figure supplement 1E).
To characterize the complexity of the relationship between features (stimuli, behavior, etc.) and neural activity, we analyzed truncations of the Taylor expansion of the network around the data average. While not guaranteed for all functions, we expect that the Taylor series converges for our network and hence that given enough terms the expansion would fully characterize the computation performed by the CNN. The ANN fit by MINE must encapsulate the transformations between the chosen features and the neural activity in order to be able to predict the activity. The classification of linear versus nonlinear processing by the network should therefore apply to the neural transformations as well. Especially in the absence of connectomics data, the metric of computational complexity can serve to group neurons into functional circuits. One might specifically expect that sensory processing proceeds from representations of lower complexity to those of higher complexity, that is, further departures from a simple encoding of stimuli. This type of extraction of more and more complex features is, for example, observed in cortical processing of visual stimuli (Felleman and Van Essen, 1991; D’Souza et al., 2022). We note, however, that computational complexity is not an absolute metric. Comparing the linearity for neurons responding to the same feature (e.g. a stimulus) is useful, but making the comparison across different features likely is not. In our zebrafish dataset, the majority of temperaturedriven neurons were classified as nonlinear while most behaviordriven neurons were classified as linear. This, however, does not mean that the stimulus is processed less linearly than the behavior – it is likely a result of our stimulus feature being the raw temperature while our behavior features are removed from the nonlinear transformation that changes swim commands into appropriate tail motion.
We currently determine linear receptive fields (Dayan and Abbott, 2001; Schwartz et al., 2006) of fit neurons using the first-order derivatives of the network output with respect to the input. Similar to the analysis of higher-order Volterra kernels (Marmarelis, 2004; Sandler and Marmarelis, 2015), one could analyze eigenvectors of the matrix of second- or higher-order derivatives to gain a deeper understanding of how this receptive field changes with stimuli in the case of neurons classified as nonlinear. Comparing direct fitting of Volterra kernels with extraction of first- and second-order kernels from MINE using Taylor analysis indeed highlights the feasibility of this approach (Figure 3). One crucial advantage of MINE is that it is considerably more robust to departures of stimuli from uncorrelated white noise with Gaussian distributed intensities, the ideal stimuli for Volterra analysis (Marmarelis and Marmarelis, 1978; Paninski, 2002; Marmarelis, 2004; Schwartz et al., 2006). Such stimuli are impractical, especially when the goal is to study neural activity in freely behaving animals. However, our comparison to directly fitting the Volterra kernels also revealed that it is in fact possible to obtain the receptive fields this way. Volterra analysis, unlike MINE, can be fully unconstrained, which could be an advantage. However, while MINE is constrained by architecture and hyperparameters, our comparisons of extracting receptive fields with MINE versus an unconstrained fit of Volterra kernels revealed the limitations of not imposing any constraints (Figure 3 and Figure 3—figure supplement 1). MINE offers a simple-to-use alternative that maintains flexibility while limiting effective degrees of freedom to allow for meaningful fits to real-world data. Independent of MINE, the structure of receptive fields is influenced by the acquisition modality. For our zebrafish data, we used a nuclear-localized GCaMP6s.
This variant has a long decay time (Freeman et al., 2014), and since MINE is fit on calcium data, effects of this decay time will appear in the extracted receptive fields. This likely explains why the receptive fields shown for the zebrafish data extend over many seconds. The issue could be overcome by better deconvolution of the imaging data, which would require a higher acquisition frame rate.
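The receptive field as a first-order derivative can be illustrated on a toy model. MINE differentiates the trained CNN exactly via backpropagation; the sketch below substitutes central finite differences on a hypothetical filter-plus-nonlinearity neuron so it stays self-contained:

```python
import numpy as np

def receptive_field(model, x0, eps=1e-5):
    """Gradient of model output w.r.t. the input history, via finite differences."""
    g = np.empty_like(x0)
    for i in range(len(x0)):
        d = np.zeros_like(x0)
        d[i] = eps
        g[i] = (model(x0 + d) - model(x0 - d)) / (2 * eps)
    return g

kernel = np.exp(-np.arange(20) / 5.0)   # hidden linear filter of the toy neuron
model = lambda x: np.tanh(x @ kernel)   # toy "neuron": filter + static nonlinearity
x0 = np.zeros(20)                       # expand around the (zero-mean) data average
rf = receptive_field(model, x0)         # proportional to the hidden filter
```

Because tanh has unit slope at zero, the recovered gradient here equals the hidden filter; for a nonlinear neuron, the same derivative evaluated at different inputs would reveal how the receptive field changes with the stimulus.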
Novel insight into biological circuits
Applying MINE to biological data revealed its potential to discover the relationships between stimuli, behaviors, and neural activity across species and imaging modalities. On a mouse cortical widefield imaging dataset, MINE recovered previously published structure in the neural data. Task elements driving neural activity could be mapped to expected cortical regions, and receptive field analysis revealed expected structure in visual receptive fields, while the receptive fields associated with neurons encoding instructed movements showed preparatory activity. Performing two-photon calcium imaging in larval zebrafish while simultaneously presenting random temperature stimuli and recording behavior allowed a deeper dive into thermoregulatory circuits. MINE allowed us for the first time to identify neurons that integrate temperature stimuli and information about behavior. This was not possible previously, since regression-based approaches failed to identify the neurons encoding temperature while clustering to identify these neurons relied on removing stochastic motor-related activity (Haesemeyer et al., 2018). Neurons integrating stimuli and behavior might serve an important role in thermoregulation. Specifically, they might allow larval zebrafish to estimate thermal gradients by comparing behavioral actions and their consequences in terms of changes in environmental temperature. While we have only begun to analyze the responses of these mixed-selectivity neurons, a simple clustering according to their response to combinations of sensory and motor drive revealed a variety of computed features (Figure 7D and Figure 7—figure supplement 1E). Combining receptive field analysis with the idea of computational complexity, on the other hand, provides a way to functionally subdivide temperature-responsive neurons in great detail. This will greatly aid future studies trying to understand thermoregulatory circuits.
Notably, the functional features we identified map to anatomical features in the brain, pointing to structural subdivisions that mirror function. The ability to compare neurons even though we used randomized stimuli highlights another advantage of MINE. When recording neural activity during free exploration, different neurons across animals will be recorded under very different stimulus and behavior conditions. This means that their responses can usually be compared only by choosing trigger events (certain behaviors, specific stimulus conditions) and analyzing neural activity around those. MINE, on the other hand, allows comparing networks fit on the whole activity time series, removing the requirement, and the bias, of choosing specific triggering events.
One surprising finding when comparing the cortical and zebrafish brain responses was the prevalence of nonlinear processing in zebrafish and its apparent absence in the cortical dataset. This is likely due to the different spatial resolution of the datasets. While the zebrafish data has single-neuron resolution, this is not the case for the cortical wide-field data, especially since we apply MINE not to individual pixels but to activity components. Each of these components combines the activity of thousands of neurons. It is likely that this averaging obscures individual nonlinear effects.
In summary, MINE allows for flexible analysis of the relationship between measured quantities such as stimuli, behaviors, or internal states and neural activity. The flexibility comes from MINE turning the analysis on its head: instead of predefining a model, MINE approximates the functional form that relates predictors to neural activity and subsequently characterizes this discovered model. This is possible since MINE does not make any assumptions about the probability distribution or functional form underlying the data it models. This allowed MINE to identify more neurons than regression approaches on our zebrafish dataset (Figure 6D) while doing so in a less biased manner. The regression approach especially failed on stimulus-driven units, leading to an overrepresentation of behavior-related activity (Figure 6G). Testing how many neurons in our receptive field clusters were also identified by the linear model (Figure 7 and Figure 7—figure supplement 1D) revealed that the linear model identified neurons belonging to each cluster. This argues that MINE increases sensitivity for identifying neurons but, at this level of analysis, did not necessarily identify functional types that were entirely missed by the linear model. However, in four clusters the linear model identified fewer than five neurons. This effectively means that these clusters would have been missed entirely had the linear model data served as the basis for analysis. Accordingly, applying our clustering to the neurons identified by the linear model recovers only 4 of the 15 unique receptive field clusters obtained with MINE.
We propose that the fit CNN models the computation of circuits upstream or downstream of the neuron in question, but we do not claim, nor did we investigate, that it models the circuitry itself. This could be investigated in the future, but given the simplicity of the CNN used in MINE, it is unlikely to provide interesting details in this regard. Instead, bringing more advanced network-wide analysis methods to bear on the fit CNN could reveal other interesting features of the computation ‘surrounding’ the neuron in question.
Methods
Animal handling and experimental procedures were approved by the Ohio State University Institutional Animal Care and Use Committee (IACUC protocol # 2019A00000137 and 2019A00000137R1).
Design and training of convolutional neural network
A network that was simple, and therefore quick to train, but that maintained expressivity was sought for this study, and a three-layer CNN was chosen. We note that the input to the network is a chunk of time equal to the size of the convolutional layers. A similar architecture could therefore be realized without separate convolutional layers by simply expanding the number of input nodes. The first layer was built to contain 80 convolutional filters. We do not expect that the network will have to learn 80 separate filters; however, the advantage is that these filters contain a large amount of pre-training variation. This eases learning of temporal features important for the generation of the neural response. The convolutional layer was kept linear and followed by two 64-unit dense layers with swish activation function, which subsequently fed into a linear output unit. This in effect means that each unit in the first dense layer performs a convolution on the input data with a filter that is a weighted average of the 80 convolutional filters.
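A minimal forward pass of this architecture can be sketched in NumPy (parameter shapes here are hypothetical stand-ins for the TensorFlow implementation in model.py; because the linear 'convolution' spans the full input chunk, it reduces to a dot product per filter):

```python
import numpy as np

def swish(x):
    # Swish activation: x * sigmoid(x), continuously differentiable
    return x / (1.0 + np.exp(-x))

def mine_cnn_forward(x, params):
    """Forward pass of a MINE-style network (NumPy sketch).

    x: (history, n_predictors) chunk of predictor history.
    params: dict with 'conv' (history, n_predictors, 80) linear filters,
            'w1' (80, 64), 'b1' (64,), 'w2' (64, 64), 'b2' (64,),
            'w3' (64,), 'b3' (scalar).
    """
    # Linear convolutional layer: each of the 80 filters spans the whole
    # input chunk, so the convolution becomes one dot product per filter.
    conv_out = np.tensordot(x, params['conv'], axes=([0, 1], [0, 1]))  # (80,)
    h1 = swish(conv_out @ params['w1'] + params['b1'])                 # (64,)
    h2 = swish(h1 @ params['w2'] + params['b2'])                       # (64,)
    return float(h2 @ params['w3'] + params['b3'])                     # scalar
```

Each first-dense-layer unit effectively sees a weighted average of the 80 filters, matching the description above.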
The ground-truth dataset as well as part of a previous zebrafish dataset (Haesemeyer et al., 2018) were used to optimize the following hyperparameters: a sparsity constraint to aid generalization, the learning rate, the number of training epochs, and the activation function. The activation function had to be continuously differentiable, a requirement for subsequent analyses. Dropout was added during training to further help with generalization; however, the rate was not optimized and was set at 50%. The history length (10 s) of the network input was set somewhat arbitrarily, but initial tests with shorter lengths led to problems in fitting the larval zebrafish dataset, likely because of the slow calcium indicator time constant of nuclear GCaMP6s (Freeman et al., 2014). We note that we did not optimize the overall architecture of the network. This could certainly be done, and all subsequent analysis methods are architecture-agnostic as long as the network is a feedforward architecture. Related Python file in the repository: model.py.
Characterizing the linearity and complexity of the transformation
To compute measures of the complexity of the relationship between the predictors and neural activity, a Taylor expansion around the predictor average $\overline{x}$ was performed and the predictions of truncations of this Taylor expansion were compared to the true model output, sampling predictors every second. The Jacobian (vector of first-order derivatives of the output with respect to the input) and the Hessian (matrix of second-order derivatives of the output with respect to the input) were computed using TensorFlow. We note that since our network has a single output, the Jacobian is not a matrix but simply the gradient vector of the output with respect to the input.
The Jacobian and Hessian were used to compute the following first- and second-order approximations of the network output using the Taylor expansion:
These truncations were used to compute a ‘linear approximation score’ $LA{S}_{f}$ and a ‘second-order approximation score’ $SO{S}_{f}$ as the coefficients of determination quantifying the variance of the true network output explained by each truncation.
We note that for a purely linear function we expect $LA{S}_{f}$ to be 1, and this score is therefore a measure of the linearity of the relationship between predictors and the neural response. $SO{S}_{f}$, on the other hand, quantifies how good a secondorder model is in predicting the neural response, and we therefore use it to further define the complexity of the relationship between the predictors and the neural activity.
Based on ROC analysis on ground-truth data, $LA{S}_{f}$ was thresholded at ${R}^{2}=0.8$. Neurons for which the linear expansion explained <80% of the variance of the true network output were considered nonlinear (see ‘Results’). To assign complexity classes to zebrafish neurons, $SO{S}_{f}$ was thresholded at ${R}^{2}=0.5$; in other words, neurons for which the second-order expansion did not explain at least 50% of the variance of the true network output were considered complex. Neurons were then assigned to one of three complexity classes: 0 for neurons for which $LA{S}_{f}$ indicated linearity, 1 if $LA{S}_{f}<0.8$ and $SO{S}_{f}\ge 0.5$, or 2 if the neuron was deemed nonlinear and $SO{S}_{f}<0.5$.
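The two scores can be illustrated on a toy function with analytic derivatives standing in for the TensorFlow-computed Jacobian and Hessian (a sketch, not the repository implementation):

```python
import numpy as np

def approximation_scores(f, jac, hess, X):
    """Linear (LAS) and second-order (SOS) approximation scores.

    f: callable R^n -> R; jac/hess: callables returning J (n,) and H (n, n)
    at a point. X: (samples, n) points; the expansion is around the mean.
    Returns (LAS, SOS): R^2 of each Taylor truncation vs. the true output.
    """
    x0 = X.mean(axis=0)
    J, H = jac(x0), hess(x0)
    d = X - x0
    true = np.array([f(x) for x in X])
    lin = f(x0) + d @ J                                   # first-order truncation
    quad = lin + 0.5 * np.einsum('si,ij,sj->s', d, H, d)  # add second-order term
    ss_tot = np.sum((true - true.mean()) ** 2)
    las = 1.0 - np.sum((true - lin) ** 2) / ss_tot
    sos = 1.0 - np.sum((true - quad) ** 2) / ss_tot
    return las, sos
```

For a purely linear function both scores are 1; for a quadratic function the second-order truncation is exact while the linear one is not.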
As a comparison, two metrics related to the nonlinearity of the function implemented by the CNN were calculated: a ‘curvature metric’ that quantifies the magnitude of the influence of the second-order derivative, and the ‘nonlinearity coefficient’ that approximates the error made by linearizing the network. It is important to note that given random inputs that are unrelated to the training data, it is very likely that the CNN will produce outputs that depend nonlinearly on these inputs. Both metrics are therefore calculated with data that forms part of the training manifold (Raghu et al., 2023). We also note that the ‘curvature metric’ does not in fact compute the curvature of the multidimensional function implemented by the CNN. Instead, the average magnitude of the vector induced by the Hessian on a unit-magnitude step in the data space is used as a proxy. With $x$, ${x}^{\prime}$, and $f$ defined as in ‘Identifying contributing predictors by Taylor decomposition’, the curvature metric is calculated according to
The nonlinearity coefficient (NLC) was computed according to Philipp and Carbonell, 2018 quantifying
where $D$ is the data distribution, $Tr$ is the trace operator, and $Co{v}_{\overrightarrow{x}}$ is the data covariance, while $Va{r}_{f}$ is the output variance (since the output in the present case is a single value). The right-hand form highlights that the NLC compares the variance of the linear approximation of the output ($J{\left(\overrightarrow{x}\right)}^{T}\left(\overrightarrow{{x}^{\prime}}-\overrightarrow{x}\right)$) with the true output variance.
Related Python files in the repository: taylorDecomp.py, perf_nlc_nonlin.py and utilities.py.
Comparison to Volterra analysis/determination of receptive fields
The discrete-time Volterra expansion with finite memory of a function up to order 2 is defined as
where $r\left(t\right)$ is the response and $s\left(t\right)$ is the stimulus, and ${k}_{0}$, ${k}_{1}$, and ${k}_{2}$ are the zeroth-, first-, and second-order Volterra kernels, respectively.
This can be rewritten in matrix form where
such that
We note that since our network takes predictors across time as input, it can easily be seen that $\overrightarrow{x}$ is equivalent to $\overrightarrow{s}$ and therefore that the Taylor expansion (Equation 2) of the CNN is equivalent to the Volterra series with finite memory above (Equation 11). This indicates that there is a direct correspondence between the elements of the Taylor expansion and those of the Volterra expansion, where $J$ corresponds to ${k}_{1}$ and $H$ corresponds to ${k}_{2}$, and that our CNN would be an effective means to fit the Volterra kernels (Wray and Green, 1994). This suggests that neural receptive fields can be extracted from $J$ and $H$ of the Taylor expansion in the same manner as they can be obtained from ${k}_{1}$ and ${k}_{2}$ of the Volterra expansion.
To determine whether MINE could be used to extract system-level characteristics (such as receptive fields) akin to Volterra analysis, a model was used to turn a randomly generated stimulus into a response by passing the input through two orthogonal filters followed by two nonlinearities. The outputs of these two parallel stages were subsequently summed to form the response. The stimulus was either generated as a Gaussian white noise stimulus (by drawing values ${S}_{t}\sim \mathcal{N}(0,1)$) or as a smoothly varying random stimulus as described in ‘Ground-truth datasets.’ The model consisted of two orthogonal filters, ${f}_{linear}$ and ${f}_{nonlinear}$, of length 50 timepoints, as well as two nonlinearities ${g}_{linear}$ and ${g}_{nonlinear}$ (see Figure 3). The names ${f}_{linear}$ and ${f}_{nonlinear}$ were chosen because the former is expected to be contained within $J$/${k}_{1}$, that is, the linear term of the Volterra expansion, while the latter is expected to be contained within $H$/${k}_{2}$, that is, the quadratic term of the Volterra expansion. We note, however, that ${f}_{linear}$ can in addition be extracted from $H$/${k}_{2}$ due to the particular function chosen for ${g}_{linear}$. Given a stimulus $s$, the response $r$ was constructed according to
, where * denotes convolution.
The two nonlinearities were constructed such that ${g}_{linear}$ was asymmetric and therefore led to a shift in the average spike rate, while ${g}_{nonlinear}$ was symmetric around 0 so that it predominantly influences the variance of the spike rate:
The constant multiplier in ${g}_{nonlinear}$ was chosen such that ${r}_{nonlinear}$ is in a similar range as ${r}_{linear}$.
The model therefore generates a response from which the two filters should be recoverable when fitting the first and secondorder Volterra kernels either through direct means akin to canonical system identification approaches or via the CNN used in MINE.
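A minimal sketch of such a two-pathway model (the filters and nonlinearities below are illustrative stand-ins, not the published ${g}_{linear}$/${g}_{nonlinear}$ forms):

```python
import numpy as np

def two_pathway_response(s, f_lin, f_nonlin):
    """Toy two-pathway model: convolve the stimulus with two orthogonal
    filters, apply an asymmetric and a symmetric nonlinearity, then sum.
    The specific nonlinearities here are illustrative choices."""
    r_lin = np.convolve(s, f_lin, mode='full')[:len(s)]
    r_nl = np.convolve(s, f_nonlin, mode='full')[:len(s)]
    # asymmetric nonlinearity shifts the mean rate; the symmetric one
    # (even around 0) predominantly affects the variance
    return np.maximum(r_lin, 0.0) + 0.5 * r_nl ** 2

# Example: orthogonal sine/cosine filters over 50 timepoints
t = np.arange(50)
f_lin = np.sin(2 * np.pi * t / 50)
f_nonlin = np.cos(2 * np.pi * t / 50)
```

Fitting MINE (or Volterra kernels) to stimulus–response pairs generated this way lets one test whether both filters are recoverable from $J$ and $H$.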
Given the Volterra kernels ${k}_{0}$, ${k}_{1}$, ${k}_{2}$ ($CNN\left(\overline{x}\right)$, $J$, $H$), we attempted to recover ${f}_{linear}$ and ${f}_{nonlinear}$ by extracting the principal dynamic modes (PDM) from the kernels (Marmarelis, 1997), performing an eigendecomposition on an intermediate matrix $Q$ that combines all kernels in the following manner:
The PDM were then defined as the eigenvectors with the largest eigenvalues (Marmarelis, 1997). The eigenvectors with the three largest (positive) eigenvalues were compared to ${f}_{linear}$ and ${f}_{nonlinear}$ by computing the cosine similarity according to Equation 36. The best matches were reported in all plots in Figure 3 and Figure 3—figure supplement 1. We note that for MINE, it was sufficient to compare the eigenvectors with the largest two eigenvalues; however, in the case of directly fitting the Volterra kernels (see below), the eigenvectors with the largest two eigenvalues often included an eigenvector related to the constant term ${k}_{0}$.
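The PDM extraction can be sketched as follows, assuming the common block construction $Q=\left[\left[{k}_{0},{k}_{1}^{T}/2\right],\left[{k}_{1}/2,{k}_{2}\right]\right]$ (this block form is an assumption following Marmarelis, 1997, not a transcription of the repository code):

```python
import numpy as np

def principal_dynamic_modes(k0, k1, k2, n_modes=3):
    """Extract principal dynamic modes (PDM) from Volterra kernels.

    Assembles Q = [[k0, k1^T/2], [k1/2, k2]] and returns the filter parts
    of the eigenvectors with the largest-magnitude eigenvalues."""
    k1 = np.asarray(k1, dtype=float)
    n = len(k1)
    Q = np.zeros((n + 1, n + 1))
    Q[0, 0] = k0
    Q[0, 1:] = Q[1:, 0] = 0.5 * k1
    Q[1:, 1:] = k2
    w, v = np.linalg.eigh(Q)                 # Q is symmetric by construction
    order = np.argsort(np.abs(w))[::-1]      # sort by eigenvalue magnitude
    # drop the leading entry (constant-term dimension) of each eigenvector
    return [v[1:, i] for i in order[:n_modes]]
```

With ${k}_{1}$ built from one filter and ${k}_{2}$ as the outer product of an orthogonal filter, both filters appear among the top modes (up to sign), which is why cosine similarity is compared in absolute value.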
As a comparison to MINE, a system identification approach, directly fitting the Volterra kernels, was followed. We note that by creating an appropriate design matrix $X$, the elements of ${k}_{0}$, ${k}_{1}$, and ${k}_{2}$ can be directly obtained through a linear regression fit:
A linear regression fit was used accordingly, either using ordinary least squares or Ridge regression with varying penalties $\alpha $ as indicated in Figure 3 and Figure 3—figure supplement 1. We note that the direct fits were also performed after prewhitening the design matrix as this might help with model fitting. However, this did not improve median filter quality but increased the variance across simulations (data not shown).
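The direct fit can be sketched as follows (an illustration of the design-matrix construction described above, not the repository code):

```python
import numpy as np

def volterra_design_matrix(s, mem):
    """Design matrix with a constant column, lagged stimulus values, and
    all pairwise lag products, so that k0, k1, and the (flattened) k2
    drop out of a single linear regression fit."""
    T = len(s)
    lags = np.stack([np.concatenate([np.zeros(l), s[:T - l]])
                     for l in range(mem)], axis=1)
    quad = [lags[:, i] * lags[:, j]
            for i in range(mem) for j in range(i, mem)]
    return np.column_stack([np.ones(T), lags] + quad)

def fit_kernels_ridge(X, r, alpha):
    # Ridge solution: beta = (X^T X + alpha * I)^-1 X^T r
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ r)
```

For a purely linear response, the entries of ${k}_{1}$ appear directly in the coefficients of the lag columns.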
Effective degrees of freedom were calculated/estimated according to Hastie et al., 2009. Specifically, for Ridge regression models, the effective degrees of freedom were calculated analytically according to
where $\alpha $ is the Ridge penalty, $\mathbf{I}$ is the identity matrix, and $tr$ is the trace operator.
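For illustration, the analytic formula above is a direct trace computation:

```python
import numpy as np

def ridge_edf(X, alpha):
    """Effective degrees of freedom of a Ridge fit:
    df(alpha) = tr(X (X^T X + alpha * I)^-1 X^T)  (Hastie et al., 2009)."""
    n = X.shape[1]
    hat = X @ np.linalg.solve(X.T @ X + alpha * np.eye(n), X.T)
    return float(np.trace(hat))
```

With orthonormal columns this reduces to $p/(1+\alpha)$, recovering $p$ (ordinary least squares) at $\alpha = 0$.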
For the CNN used in MINE, the effective degrees of freedom were estimated using simulations according to
where $\widehat{y}$ is the model prediction, $y$ is the true output, ${\sigma}^{2}$ is the error variance, and $Cov$ is the covariance. Intuitively, this definition of degrees of freedom measures how small variations in the output affect variations in the predicted output. In other words, models with high degrees of freedom are expected to match predictions to any change in output while models with low degrees of freedom are limited in this regard.
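A Monte-Carlo version of this estimator might look as follows (a sketch; the repository's mine_edf.py is the reference implementation):

```python
import numpy as np

def simulated_edf(fit_predict, y_clean, sigma, n_sim=1000, seed=0):
    """Monte-Carlo estimate of effective degrees of freedom:
    df = sum_i Cov(yhat_i, y_i) / sigma^2, with the covariance estimated
    over repeated draws of observation noise."""
    rng = np.random.default_rng(seed)
    ys = np.array([y_clean + sigma * rng.normal(size=y_clean.shape)
                   for _ in range(n_sim)])
    yhats = np.array([fit_predict(y) for y in ys])
    cov = ((ys - ys.mean(0)) * (yhats - yhats.mean(0))).sum(0) / (n_sim - 1)
    return float(cov.sum() / sigma ** 2)
```

For a linear smoother $\widehat{y}=Sy$, this estimate converges to $tr\left(S\right)$, e.g. the subspace dimension for a projection, which links the simulation-based and analytic definitions.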
All fits were repeated for 100 randomly generated stimuli to obtain different estimates for both MINE and Volterra analysis.
Related Python file in the repository: cnn_sta_test.py, mine_edf.py.
Identifying contributing predictors by Taylor decomposition
To assess the contribution of individual predictors to the network output, a metric based on the Taylor decomposition of the network was computed. The Taylor decomposition was truncated after the second derivative. Below, Taylor decompositions that include all terms for first- and second-order derivatives will be designated as ‘full.’ For all analyses presented in the article, Taylor expansions were performed every second (every five timepoints at our data rate of 5 Hz) and predictions were made 5 s into the future (25 timepoints at our data rate of 5 Hz).
The state of the predictors at timepoint $t$ and the state of the predictors 5 s (in our case, 25 samples) later was obtained. Subsequently, these quantities, together with the Jacobian at timepoint $t$ and the Hessian at timepoint $t$, were used to compute a prediction of the change in network output over 5 s. At the same time, the true change in output was computed. Comparing the predicted changes and true changes via correlation resulted in an ${r}^{2}$ value. This value measures how well the full Taylor expansion approximates the change in network output across experimental time. We note that predicted changes in the output were compared via correlation rather than predicted outputs since we specifically wanted to understand which predictors are important in driving a change in output rather than identifying those that the network might use to set some baseline value of the output. In the following, $X\left(t\right)$ is a vector that contains the values of all predictors across all timepoints fed into the network (e.g. in our case 50 timepoints of history for each predictor) in order to model the activity at time $t$, $f\left(x\right)$ is the CNN applied to a specific predictor input, $J\left(x\right)$ is the Jacobian at that input, and $H\left(x\right)$ the Hessian:
The Taylor expansion around $\overrightarrow{x}$:
The predicted change in network output according to the Taylor expansion:
The true change in network output:
Variance of the output change explained by the Taylor prediction using Equation 24 and Equation 25:
After the computation of ${r}_{full}^{2}$, terms were removed from the Taylor series. We note that both individual timepoints fed into the network and predictors are separated at this level. However, only the effect of removing predictors on the quality of the prediction was tested, not the importance of specific timepoints across predictors (i.e. all timepoints of a given predictor or interaction were removed to test the importance of that predictor, instead of removing all predictors at one timepoint to test the importance of a specific timepoint). For the removal of a given predictor, both its linear (via $J$) and its squared contribution (via $H$) were subtracted from the full Taylor prediction, while for the removal of interaction terms only the corresponding interaction term (via $H$) was subtracted. The following quantities were computed to arrive at the Taylor metric of single predictors ${x}_{n}$ or interactions of two predictors ${x}_{n,m}$, where ${x}_{n}$ and ${J}_{n}$ identify subvectors that contain all elements belonging to the specific predictor $n$, while ${H}_{n,m}$ identifies all elements in the Hessian at the intersection of all rows belonging to the second derivatives with respect to predictor $n$ and columns with respect to predictor $m$: change predicted by only considering ${x}_{n}$:
Change predicted when removing ${x}_{n}$ using Equation 24 and Equation 27:
Change predicted when only considering interaction between ${x}_{n}$ and ${x}_{m}$:
Change predicted when removing ${x}_{n}{x}_{m}$ interaction using Equation 24 and Equation 29:
Variance explained by Taylor prediction after removing ${x}_{n}$ using Equation 25 and Equation 28:
Variance explained by Taylor prediction after removing ${x}_{n}{x}_{m}$ interaction using Equation 25 and Equation 30:
The Taylor metric was then defined as
and
respectively. For the zebrafish data, we additionally calculated an overall ‘Behavior’ Taylor metric in which all terms (regular and interaction) that belonged to any behavioral predictor were removed and the Taylor metric was subsequently calculated as outlined above. Related Python file in the repository: taylorDecomp.py.
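As an illustration of the removal logic, the following sketch computes a normalized drop in explained variance when single additive terms are removed from the full prediction; the normalization used here is an assumption for illustration, not the exact published definition:

```python
import numpy as np

def taylor_metrics(true_change, pred_terms):
    """Illustrative Taylor-metric sketch.

    true_change: observed change in network output over the horizon.
    pred_terms: dict mapping a term name (predictor or interaction) to
    its additive contribution to the full Taylor prediction.
    Returns, per term, the normalized drop in r^2 when it is removed.
    """
    full = sum(pred_terms.values())
    r2_full = np.corrcoef(true_change, full)[0, 1] ** 2
    metrics = {}
    for name, term in pred_terms.items():
        # remove this term's contribution and re-score the prediction
        r2_without = np.corrcoef(true_change, full - term)[0, 1] ** 2
        metrics[name] = (r2_full - r2_without) / r2_full
    return metrics
```

A term that drives most of the output change yields a metric near 1; a negligible term yields a metric near 0.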
Clustering of neurons according to receptive field
Jacobians were extracted for each neuron at the data average and used as proxies of the receptive field (see section on ‘Comparison to Volterra analysis/determination of receptive fields’). Accordingly, the receptive field of a network and therefore neuron ($R{F}_{f}$) was defined as
This is, of course, merely an approximation for nonlinear neurons. The first-order derivative of a nonlinear function is not constant, and therefore the receptive field will depend on the inputs. To cluster zebrafish sensory neurons according to their receptive fields, the cosine of the angle between the receptive fields of different neurons ($R{F}_{i}$, $R{F}_{j}$, $i\ne j$) (and, during the process, clusters) was computed (cosine similarity, $CS$):
Clustering was subsequently performed by a greedy approach as outlined in Bianco and Engert, 2015. Briefly: all pairwise CS values were computed. Then, the pair of receptive fields with the highest CS was aggregated into a cluster. The average RF in that cluster was computed. Subsequently all pairwise CS values, now containing that new cluster average, were computed again and the procedure was iteratively repeated until all receptive fields were clustered or no pairwise similarities $C{S}_{R{F}_{i},R{F}_{j}}\ge 0.8$ were observed. The advantages of this clustering method are that it is (1) deterministic and (2) the number of clusters does not have to be specified beforehand. In the case of the receptive fields on the mouse dataset, the matrix of pairwise cosine similarities was used to perform spectral clustering. Spectral clustering finds groups of strong connection (high CS in this case) within a similarity graph that are separated from other groups by weak connections (low CS). The rationale for changing clustering methods was computational speed, which is much higher for spectral clustering.
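The greedy scheme can be sketched as follows (a deliberately simple toy version; the repository's utilities.py is the reference implementation):

```python
import numpy as np

def cosine_sim(a, b):
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def greedy_rf_clustering(rfs, threshold=0.8):
    """Greedy agglomeration of receptive fields by cosine similarity,
    following the scheme of Bianco and Engert, 2015 (sketch).

    Repeatedly merges the most similar pair of cluster averages until no
    pairwise similarity reaches the threshold."""
    clusters = [[i] for i in range(len(rfs))]
    means = [np.array(rf, dtype=float) for rf in rfs]
    while len(clusters) > 1:
        best, pair = -1.0, None
        for i in range(len(means)):
            for j in range(i + 1, len(means)):
                s = cosine_sim(means[i], means[j])
                if s > best:
                    best, pair = s, (i, j)
        if best < threshold:
            break
        i, j = pair
        clusters[i] += clusters.pop(j)
        means[i] = np.mean([rfs[k] for k in clusters[i]], axis=0)
        means.pop(j)
    return clusters
```

Because the merge order is fully determined by the similarities, the result is deterministic, and the number of clusters falls out of the threshold rather than being fixed in advance.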
Related Python file in the repository: utilities.py.
Groundtruth datasets
To test the capability of MINE to predict the relationships between predictors and neural activity, four predictors were computer-generated. Two of these were generated by combining sine, square, and triangle waves of random frequencies and amplitudes, simulating smoothly varying sensory inputs. The third predictor was generated as a Poisson process, simulating stochastic, unitary events such as bouts of movement. The last predictor was generated as a Poisson process as well, but events were scaled with a random number drawn from a unit normal distribution with mean zero, simulating directed, stochastic movements.
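An illustrative generator for these four predictor types (the frequency ranges and event rates below are hypothetical, not the published parameters):

```python
import numpy as np

def make_ground_truth_predictors(n_t, dt=0.2, seed=0):
    """Generate four ground-truth predictor types (sketch): two smooth
    'sensory' traces from summed random waves, an event train, and an
    event train with random Gaussian scaling."""
    rng = np.random.default_rng(seed)
    t = np.arange(n_t) * dt

    def smooth_trace():
        # sum of a sine, a square, and a triangle wave of random
        # frequency, amplitude, and phase
        out = np.zeros(n_t)
        for shape in (np.sin,
                      lambda p: np.sign(np.sin(p)),
                      lambda p: 2.0 / np.pi * np.arcsin(np.sin(p))):
            freq = rng.uniform(0.01, 0.1)
            amp = rng.uniform(0.5, 2.0)
            out += amp * shape(2 * np.pi * freq * t + rng.uniform(0, 2 * np.pi))
        return out

    sens1, sens2 = smooth_trace(), smooth_trace()
    events = (rng.random(n_t) < 0.02).astype(float)  # Poisson-like bouts
    scaled = events * rng.normal(size=n_t)           # scaled, directed bouts
    return np.stack([sens1, sens2, events, scaled])
```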
Model responses were subsequently created by linearly or nonlinearly transforming individual predictors and combining them multiplicatively. After this step, all responses were convolved with a ‘calcium kernel’ and i.i.d. Gaussian noise was added to more closely resemble real-world acquisition. MINE was applied to these data: fitting a CNN on two-thirds of the data and calculating the correlation between the CNN prediction and the last third (test set) of the data. As a comparison, two linear regression models were fit to the data as well, again split into training and test sets. One of these models used the raw predictors as inputs to predict the response (equivalent to the inputs given to MINE). The other used transformed predictors, convolved with the same calcium kernel applied to the responses above, along with all first-order interaction terms of the predictors to predict the response. Correlations on the test data were subsequently used for comparisons.
Related Python file in the repository: cnn_fit_test.py.
To test the capability of MINE in predicting the nonlinearity of predictor–response relationships, for each test case a new ‘sensory-like’ predictor (introduced above) was generated. For the effect of varying nonlinearity (Figure 3B–D), the predictor was transformed by (1) squaring the hyperbolic tangent of the predictor, (2) differencing the predictor, or (3) rectifying the predictor using a ‘softplus’ transformation. These transformed versions were subsequently mixed in varying fractions with the original predictor to generate a response. The coefficients of determination of truncations of the Taylor expansion after the linear and quadratic terms were then computed as indicated above. Linear correlations were also computed for some of the mixtures to illustrate the effect of both linear and nonlinear transformations on this metric. For ROC analysis of the linearity metric, 500 sets of 5 random stimuli were generated. For each set, four responses were generated depending only on the first of the five random stimuli (the others served as noisy distractors to challenge the CNN fit). Two of the four responses were linear transformations: one was a differencing of the first predictor and the other was a convolution with a randomly generated double-exponential filter. The other two responses were randomized nonlinear transformations. The first was a variable-strength rectification obtained by shifting and scaling a softplus function. The second was a random integer power between one and five of a hyperbolic tangent transformation of the predictor. The known category (linear vs. nonlinear) was subsequently used to calculate false- and true-positive rates at various thresholds of the linearity metric (see above) to perform ROC analysis.
Related Python file in the repository: cnn_nonlin_test.py.
Processing of data from Musall et al., 2019b
The data (Musall et al., 2019a) was downloaded from the published dataset. Temporal components (Musall et al., 2019a) were loaded from ‘interpVc.mat,’ spatial components from ‘Vc.mat,’ and predictors from ‘orgregData.mat’ for each analyzed session. Predictors were shifted by 5 s relative to the activity traces. The 10-s-long convolutional filters thereby allowed for observing the influence of events 5 s into the past as well as preparatory activity for actions occurring in the next 5 s. MINE was subsequently used to determine the relationship between predictors and temporal components. The neural activity and the individual terms of the Taylor decomposition were mapped into pixel space. Spatial maps were then generated by computing Taylor metrics, as above, for each individual pixel time series. This was necessary since the components in ‘interpVc.mat’ are highly linearly dependent, making it impossible to carry measures of explained variance directly from ‘component space’ into ‘pixel space.’ Receptive fields, on the other hand, were directly mapped into pixel space using the spatial component matrix as these are linear. Related Python files in the repository: processMusall.py, mine.py, plotMusall.py.
Zebrafish experiments
All calcium imaging was performed on a custom-built two-photon microscope at 2.5 Hz with a pixel resolution of 0.78 $\mu m/pixel$ and an average power of 12 mW (at sample) at 950 nm (the measured optical resolution of the system is $<1\phantom{\rule{thinmathspace}{0ex}}\mu m$ lateral and $<4\phantom{\rule{thinmathspace}{0ex}}\mu m$ axial). The brain was divided into six overlapping regions: two divisions along the dorsoventral axis and three divisions along the anterior–posterior axis. The most anterior point in the most anterior region was set to the olfactory epithelium, while the most posterior point in the most posterior region was the start of the spinal cord. Larval zebrafish expressing nuclear-localized GCaMP in all neurons and mCherry in all glutamatergic neurons (mitfa -/-; Elavl3-H2B:GCaMP6s +/-; vGlut2a-mCherry) between 5 and 7 days post fertilization were used for all experiments. Larval zebrafish were mounted and stimulated as previously described (Haesemeyer et al., 2018), including scan stabilization, as previously described, to avoid artifacts associated with heating-induced expansion of the preparation. The tail of the larva was free and tracked at 250 Hz using an online tracking method previously described (Haesemeyer et al., 2018). Whenever our online tail tracking missed a point along the tail, an image was saved from which the tail was subsequently re-tracked offline as in Mearns et al., 2020. Since the experiments were performed under a new microscope, we recalculated a model translating stimulus laser strength to fish temperature as described in Haesemeyer et al., 2015; Haesemeyer et al., 2018, subsequently referred to as the ‘temperature model.’
For each imaging plane, 495 s of a ‘random wave’ stimulus were generated by superimposing sine waves of randomized amplitudes with periods between 13 s and 200 s. The stimuli were set to generate temperatures between 22°C and 30°C, outside the noxious range. Imaging planes were spaced 5 $\mu m$ apart, and 30 planes were imaged in each fish.
For reticulospinal backfills, Texas Red Dextran, 10,000 MW (Invitrogen D1828), at 12.5% w/v was pressure-injected into the spinal cord of larval zebrafish under Tricaine (Syndel MS222) anesthesia. Injections were performed in fish expressing nuclear-localized GCaMP6s in all neurons (mitfa -/-; Elavl3-H2B:GCaMP6s) that were embedded in low-melting-point agarose, using a glass capillary with a 15 $\mu m$ tip opening. Fish were unembedded after the procedure, woken from anesthesia, and rested overnight. Fish were screened the next day for labeling, embedded again, and imaged under the two-photon microscope.
Processing of zebrafish data
Raw imaging data was preprocessed using CaImAn for motion correction and to identify units and extract associated calcium traces (Giovannucci et al., 2019). Laser stimulus data was recorded at 20 Hz and converted to temperatures using the temperature model (see above). The 250 Hz tail data was used to extract bout starts based on the absolute angular derivative of the tail crossing an empirically determined threshold. The vigor was calculated as the sum of the absolute angular derivative across the swim bout, a metric of the energy of the tail movement. The directionality was calculated as the sum of the angles of the tip of the tail across the swim bout, a metric of how one-sided the tail movement is. Extracted calcium data, stimulus temperatures, and the three behavior metrics were subsequently interpolated/downsampled to a common 5 Hz timebase. To reduce linear dependence, the behavior metrics were subsequently orthogonalized using a modified Gram–Schmidt process.
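The modified Gram–Schmidt step can be sketched as follows (the normalization of each column is an assumption of this sketch):

```python
import numpy as np

def modified_gram_schmidt(M):
    """Orthonormalize the columns of M (samples x k) with the modified
    Gram-Schmidt process, which is numerically more stable than the
    classical variant for nearly dependent columns."""
    Q = np.array(M, dtype=float)
    for i in range(Q.shape[1]):
        Q[:, i] /= np.linalg.norm(Q[:, i])
        for j in range(i + 1, Q.shape[1]):
            # remove the component of column j along the finished column i
            Q[:, j] -= (Q[:, i] @ Q[:, j]) * Q[:, i]
    return Q
```

Applied to the three behavior metrics, this leaves bout starts unchanged in direction while vigor and directionality retain only the variance not already explained by the preceding metrics.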
Components outside the brain as well as very dim components inside the brain were identified by CaImAn. To avoid analyzing components that likely do not correspond to neurons labeled by GCaMP, all components for which the average brightness across imaging time was <0.1 photons were excluded. The number of analyzed components was thereby reduced from 706,543 to 432,882.
The CNN was subsequently fit to the remaining components using two-thirds of imaging time as training data. For each neuron, the stimulus presented during recording and the observed behavioral metrics were used as inputs. Both the inputs and the calcium activity traces were standardized to zero mean and unit standard deviation before fitting. For every neuron for which the correlation reached $r\ge \sqrt{0.5}$, all discussed MINE metrics were extracted.
To generate the cyclically permuted controls, all calcium data was rotated forward by a third of the experiment length with wrap-around. The rationale for constructing the control data in this manner was that it maintains the overall structure of the data, but since both our stimuli and the elicited behavior are stochastic, it should remove any true relationship between predictors and responses. The CNN was fit to the control data in the same manner as to the real data, but no further metrics were extracted. To identify which predictors significantly contributed to neural activity, ${R}_{full}$, ${R}_{Pn}$, and ${R}_{Pn,m}$ were bootstrapped, computing the Taylor metric for each variate. The standard deviation of the bootstrap variates was subsequently used to estimate confidence intervals according to a normal distribution approximation. While confidence intervals could have been computed directly from the bootstrap variates, this would have meant storing all samples for all neurons in question to retain flexibility in the significance cutoff. The significance level was set to p<0.05 after Bonferroni correction across all fit neurons (effective p-value of $1.26\times {10}^{-6}$). To be considered a driver of neural activity, the Taylor metric had to be larger than 0.1 at this significance threshold.
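The bootstrap procedure can be sketched generically (the metric and the normal quantile z are left as parameters; the values used in the example are illustrative, not the corrected quantile from the text):

```python
import numpy as np

def bootstrap_metric_interval(samples, metric, z, n_boot=1000, seed=0):
    """Normal-approximation confidence interval from a bootstrap SD.

    Only the bootstrap standard deviation is kept, not the variates
    themselves, mirroring the storage-saving approach described above.
    z is the normal quantile for the chosen (corrected) significance level.
    """
    rng = np.random.default_rng(seed)
    n = len(samples)
    variates = np.array([metric(samples[rng.integers(0, n, size=n)])
                         for _ in range(n_boot)])
    est = metric(samples)
    sd = variates.std()
    return est - z * sd, est + z * sd
```

Storing only the standard deviation means the significance cutoff can be changed later without re-running the bootstrap, at the cost of assuming approximate normality of the variates.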
The linear comparison model was constructed akin to Musall et al., 2019a. All predictors were time-shifted by between 0 and 50 timesteps for a total of 200 predictor inputs to the regression model. This setup emulates the convolutional filters present in the CNN used by MINE. A modified Gram–Schmidt process was used for orthogonalization to avoid the problems of QR decomposition observed in cases of near-singular design matrices (singular up to the floating-point precision used). Ridge regression (Hastie et al., 2009) was used to improve generalization. Setting the ridge penalty to ${10}^{-4}$ led to a 25–50% increase in the number of identified neurons on small test datasets compared to a standard regression model.
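A bare-bones version of the lagged-predictor design matrix and a closed-form ridge fit might look like this. This is illustrative only: it omits the modified Gram–Schmidt orthogonalization used in the paper, and the default penalty value is merely indicative.

```python
import numpy as np

def lagged_design(predictors, n_lags=50):
    """Time-shift each predictor by 0..n_lags-1 steps (emulating the
    CNN's convolutional filters) and stack the shifts as columns."""
    T, n_pred = predictors.shape
    cols = [np.roll(predictors[:, i], lag)
            for i in range(n_pred) for lag in range(n_lags)]
    X = np.stack(cols, axis=1)
    return X[n_lags:]  # drop rows contaminated by wrap-around

def ridge_fit(X, y, penalty=1e-4):
    """Closed-form ridge regression: w = (X'X + penalty*I)^-1 X'y."""
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + penalty * np.eye(n), X.T @ y)
```

With four predictors and 50 lags each, this yields the 200 regression inputs described above.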
All zebrafish stacks (except those acquired after spinal backfills) were first registered to an internal Elavl3-H2B:GCaMP6s reference using CMTK (Rohlfing and Maurer, 2003). A transformation from the internal reference to the Z-Brain (Randlett et al., 2015) was subsequently performed using ANTS (Avants et al., 2009). Micrometer coordinates transformed into Z-Brain pixel coordinates, together with the region masks present in the Z-Brain, were subsequently used to assign all neurons to brain regions.
To identify Z-Brain regions in which a specific functional type is enriched, the following process was used. The total number of all types under consideration in each region was computed (i.e., if the comparison was between stimulus, behavior, and mixed-selectivity neurons, all these neurons were summed, but if the comparison was between different classes of behavior-related neurons, only those were summed). Subsequently, the fraction of the type of interest among this total was computed for each region (plotted as bars in Figure 5—figure supplement 1 and Figure 6—figure supplement 1). Similarly, an overall fraction of the type across the whole brain was computed (plotted as vertical blue lines in Figure 5—figure supplement 1 and Figure 6—figure supplement 1). This overall fraction was used to simulate a 95% confidence interval of observed fractions given the total number of neurons considered in each region (horizontal blue lines in Figure 5—figure supplement 1 and Figure 6—figure supplement 1). Any observed fraction above this interval was considered significant enrichment. We note that exact confidence intervals could have been computed from the binomial distribution; however, the simulation approach allowed us to use the same method when computing confidence intervals around the average complexity of receptive field clusters.
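The simulation-based enrichment interval can be sketched as follows (hypothetical function names; the paper does not specify the simulation count or seeding used here):

```python
import numpy as np

def enrichment_interval(overall_fraction, n_region, n_sim=10000, rng=None):
    """Simulate the 95% interval of the fraction of a functional type
    expected in a region containing n_region neurons if types were
    drawn at the brain-wide rate."""
    rng = np.random.default_rng(rng)
    sims = rng.binomial(n_region, overall_fraction, size=n_sim) / n_region
    return np.percentile(sims, [2.5, 97.5])

def is_enriched(observed_fraction, overall_fraction, n_region):
    """A region counts as enriched if its observed fraction exceeds the
    upper bound of the simulated interval."""
    lo, hi = enrichment_interval(overall_fraction, n_region, rng=0)
    return observed_fraction > hi
```

The same simulation machinery generalizes directly to non-binomial statistics such as the average complexity of receptive field clusters, which is the stated reason for preferring it over exact binomial intervals.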
For reticulospinal backfill experiments, neurons labeled by the backfill were manually annotated. This annotation was used to classify neurons (CaImAn identified components) as reticulospinal.
To study how mixed-selectivity neurons integrate information about the temperature stimulus and swim starts, the following strategy was used. Receptive fields for the temperature and swim-start predictors were extracted for each neuron as described above and normalized to unit norm. The 'drive' for an input was then defined as a scaled version of these unit-norm receptive fields. Synthetic inputs at different combinations of stimulus and swim drive were subsequently generated and fed into the CNN of the neuron of interest, and the response of the CNN was recorded to construct the response landscapes shown in Figure 7 and Figure 7—figure supplement 1. To cluster similar response landscapes, pairwise 2D cross-correlations were computed between the response landscapes of all neurons. Similarity was defined as the maximal value of the cross-correlation in order to preferentially cluster on similar shapes of the landscape rather than on exact alignment. Spectral clustering was used to form the clusters, and for each cluster a randomly selected exemplar is shown in the figures. Because of the lack of alignment, cluster averages are not meaningful.
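The landscape-similarity computation (maximum of the pairwise 2D cross-correlation) could be implemented along these lines. This is a sketch: the subsequent spectral clustering step is omitted, and the unit-norm normalization is our assumption.

```python
import numpy as np
from scipy.signal import correlate2d

def landscape_similarity(landscapes):
    """Pairwise similarity between response landscapes, defined as the
    maximum of the full 2D cross-correlation of unit-normalized
    landscapes, so that similarly shaped but shifted landscapes score
    high regardless of exact alignment."""
    norm = [l / np.linalg.norm(l) for l in landscapes]
    n = len(norm)
    S = np.eye(n)  # self-similarity of a unit-norm landscape is 1
    for i in range(n):
        for j in range(i + 1, n):
            S[i, j] = S[j, i] = correlate2d(norm[i], norm[j], mode='full').max()
    return S
```

The resulting similarity matrix would then be passed to a spectral clustering routine to form the clusters described above.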
To perform anatomical clustering, each identified neuron was clustered based on the spatial proximity of its centroid to the centroids of other neurons. A radius of 5 $\mu m$ (roughly the radius of a neuron in the zebrafish brain) was chosen empirically as the clustering radius. Centroid coordinates were placed in a 3D volume with a spatial resolution of 1 $\mu m$, each centroid occupying one voxel. A kernel of the chosen radius was then convolved with the centroids to generate a ball around each of them, such that two neurons whose balls overlap, or are adjacent, would be clustered together. This was accomplished using connected-components-3d (William Silversmith, version 3.12, available on PyPI as connected-components-3d; code: https://github.com/seung-lab/connected-components-3d/, copy archived at Silversmith, 2023), which takes the 3D volume and assigns each occupied voxel to a cluster; the corresponding centroids were then given the same cluster label. For plotting, each centroid was assigned an alpha value based on the number of neurons in its cluster, and the centroid coordinates were plotted with these alpha values.
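In outline, this proximity clustering amounts to a spherical dilation followed by connected-component labeling. The sketch below substitutes `scipy.ndimage` for the connected-components-3d package used in the paper and uses illustrative volume dimensions:

```python
import numpy as np
from scipy import ndimage

def anatomical_clusters(centroids, radius=5, shape=(64, 64, 64)):
    """Cluster neurons by spatial proximity: place centroids in a 1 um
    voxel grid, dilate each by a spherical kernel of the given radius,
    and label connected components. Returns one cluster label per
    centroid; overlapping or adjacent balls share a label."""
    vol = np.zeros(shape, dtype=bool)
    for c in centroids:
        vol[tuple(c)] = True
    r = np.arange(-radius, radius + 1)
    zz, yy, xx = np.meshgrid(r, r, r, indexing='ij')
    ball = zz**2 + yy**2 + xx**2 <= radius**2   # spherical kernel
    dilated = ndimage.binary_dilation(vol, structure=ball)
    labels, _ = ndimage.label(dilated)
    return [labels[tuple(c)] for c in centroids]
```

Dilation with a spherical structuring element is equivalent to the convolution-with-a-ball step described above; any two centroids within twice the radius end up in one connected component.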
Experiments were excluded from further analysis for the following reasons: five fish died during functional imaging; eight fish could not be registered to the reference; tail tracking failed during the acquisition of six fish; and problems in the acquisition hardware led to imaging artifacts in four fish. Related Python files in the repository: rwave_data_fit.py, rwave_decompose.py, utilities.py.
Data availability
All data generated in this study is publicly available. All code used in this study is available in the following repositories: MINE: https://github.com/haesemeyer/mine_pub (copy archived at Haesemeyer, 2023). CaImAn preprocessing: https://github.com/haesemeyer/imaging_pipeline (copy archived at Haesemeyer, 2021). Tail behavior processing: https://bitbucket.org/jcostabile/2p/. Raw experimental data has been deposited to DANDI: https://dandiarchive.org/dandiset/000235/0.230316.1600; https://dandiarchive.org/dandiset/000236/0.230316.2031; https://dandiarchive.org/dandiset/000237/0.230316.1655; https://dandiarchive.org/dandiset/000238/0.230316.1519. Processed data has been deposited on Zenodo: Zebrafish and mouse final processed data: https://doi.org/10.5281/zenodo.7737788. Zebrafish and mouse fit CNN weights part 1/2: https://doi.org/10.5281/zenodo.7738603. Zebrafish and mouse fit CNN weights part 2/2: https://doi.org/10.5281/zenodo.7741542. All data analysis was performed in Python; Tensorflow (Abadi, 2016) was used as the deep learning platform.

Zenodo. Processed data for "Model-free identification of neural encoding (MINE)" publication. https://doi.org/10.5281/zenodo.7737788

DANDI Archive. Thermoregulatory Responses Forebrain. https://doi.org/10.48324/dandi.000235/0.230316.1600

DANDI Archive. Thermoregulatory Responses Midbrain. https://doi.org/10.48324/dandi.000236/0.230316.2031

DANDI Archive. Thermoregulatory Responses Hindbrain. https://doi.org/10.48324/dandi.000237/0.230316.1655

DANDI Archive. Thermoregulatory Responses Reticulospinal system. https://doi.org/10.48324/dandi.000238/0.230316.1519

Zenodo. CNN weight data for "Model-free identification of neural encoding (MINE)" publication, Set 1. https://doi.org/10.5281/zenodo.7738603

Zenodo. CNN weight data for "Model-free identification of neural encoding (MINE)" publication, Set 2. https://doi.org/10.5281/zenodo.7741542

CSHL. Dataset from: Single-trial neural dynamics are dominated by richly varied movements. https://doi.org/10.14224/1.38599
References

Conference: TensorFlow: a system for large-scale machine learning. Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation. pp. 265–283.

The spectro-temporal receptive field: a functional characteristic of auditory neurons. Biological Cybernetics 42:133–143. https://doi.org/10.1007/BF00336731

A simple weight decay can improve generalization. Advances in Neural Information Processing Systems 4:950–957.

Modern machine learning as a benchmark for fitting neural responses. Frontiers in Computational Neuroscience 12:56. https://doi.org/10.3389/fncom.2018.00056

Layer-wise relevance propagation for deep neural network architectures. In: Information Science and Applications 1:913–922. https://doi.org/10.1007/9789811005572

Do we know what the early visual system does? The Journal of Neuroscience 25:10577–10597. https://doi.org/10.1523/JNEUROSCI.372605.2005

Book: The Computational Brain. Computational Neuroscience series, MIT Press. https://doi.org/10.7551/mitpress/9780262533393.001.0001

Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems 2:303–314. https://doi.org/10.1007/BF02551274

Book: Neural Engineering: Computation, Representation, and Dynamics in Neurobiological Systems. Cambridge, MA, USA: MIT Press.

Mapping brain activity at scale with cluster computing. Nature Methods 11:941–950. https://doi.org/10.1038/nmeth.3041

Nonlinear event-related responses in fMRI. Magnetic Resonance in Medicine 39:41–52. https://doi.org/10.1002/mrm.1910390109

Modeling convergent ON and OFF pathways in the early visual system. Biological Cybernetics 99:263–278. https://doi.org/10.1007/s004220080252y

Conference: The general approximation theorem. ICNN '98 – International Conference on Neural Networks. pp. 1271–1274. https://doi.org/10.1109/IJCNN.1998.685957

Deep attention networks reveal the rules of collective motion in zebrafish. PLOS Computational Biology 15:e1007354. https://doi.org/10.1371/journal.pcbi.1007354

Receptive fields and functional architecture of monkey striate cortex. The Journal of Physiology 195:215–243. https://doi.org/10.1113/jphysiol.1968.sp008455

Book: An Introduction to Statistical Learning. New York, NY: Springer US. https://doi.org/10.1007/9781461471387

Integrating temperature with odor processing in the olfactory bulb. The Journal of Neuroscience 35:7892–7902. https://doi.org/10.1523/JNEUROSCI.057115.2015

The identification of nonlinear biological systems: Wiener kernel approaches. Annals of Biomedical Engineering 18:629–654. https://doi.org/10.1007/BF02368452

Predicting responses of nonlinear neurons in monkey striate cortex to complex patterns. The Journal of Neuroscience 12:3568–3581. https://doi.org/10.1523/JNEUROSCI.120903568.1992

Neural networks for modeling neural spiking in S1 cortex. Frontiers in Systems Neuroscience 13:13. https://doi.org/10.3389/fnsys.2019.00013

Modeling auditory system nonlinearities through Volterra series. Biological Cybernetics 63:307–313. https://doi.org/10.1007/BF00203454

Theory in motion. Current Opinion in Neurobiology 5:832–840. https://doi.org/10.1016/09594388(95)801138

Book: Analysis of Physiological Systems. Boston, MA: Springer Verlag. https://doi.org/10.1007/9781461339700

Modeling methodology for nonlinear physiological systems. Annals of Biomedical Engineering 25:239–251. https://doi.org/10.1007/BF02648038

Book: Nonlinear Dynamic Modeling of Physiological Systems. Institute of Electrical and Electronics Engineers. https://doi.org/10.1002/9780471679370

Book: Parallel Distributed Processing. Cambridge: MIT Press. https://doi.org/10.7551/mitpress/5237.001.0001

Deep learning models of the retinal response to natural scenes. Advances in Neural Information Processing Systems 29:1369–1377.

Principal dynamic mode analysis of action potential firing in a spider mechanoreceptor. Biological Cybernetics 96:113–127. https://doi.org/10.1007/s0042200601082

Conference: The Volterra-Wiener approach in neuronal modeling. Proceedings of the IEEE Engineering in Medicine and Biology Society 1:5912–5915. https://doi.org/10.1109/IEMBS.2011.6091462

Single-trial neural dynamics are dominated by richly varied movements. Nature Neuroscience 22:1677–1686. https://doi.org/10.1038/s4159301905024

Harnessing behavioral diversity to understand neural computations for cognition. Current Opinion in Neurobiology 58:229–238. https://doi.org/10.1016/j.conb.2019.09.011

Energy limitation as a selective pressure on the evolution of sensory systems. Journal of Experimental Biology 211:1792–1804. https://doi.org/10.1242/jeb.017574

Convergence properties of some spike-triggered analysis techniques. Advances in Neural Information Processing Systems 15:437–464.

A theory of how the brain might work. Cold Spring Harbor Symposia on Quantitative Biology 55:899–910. https://doi.org/10.1101/SQB.1990.055.01.084

Conference: Proceedings of the 34th International Conference on Machine Learning, Proceedings of Machine Learning Research. pp. 2847–2854.

Whole-brain activity mapping onto a zebrafish brain atlas. Nature Methods 12:1039–1046. https://doi.org/10.1038/nmeth.3581

Nonrigid image registration in shared-memory multiprocessor environments with application to brains, breasts, and bees. IEEE Transactions on Information Technology in Biomedicine 7:16–25. https://doi.org/10.1109/titb.2003.808506

Bayesian synthesis of probabilistic programs for automatic data modeling. Proceedings of the ACM on Programming Languages 3:1–32. https://doi.org/10.1145/3290350

Book: Explainable AI: Interpreting, Explaining and Visualizing Deep Learning. Cham: Springer International Publishing. https://doi.org/10.1007/9783030289546

Grueneberg ganglion neurons are finely tuned cold sensors. The Journal of Neuroscience 30:7563–7568. https://doi.org/10.1523/JNEUROSCI.060810.2010

Implications of neuronal diversity on population coding. Neural Computation 18:1951–1986. https://doi.org/10.1162/neco.2006.18.8.1951

Software: cc3d: connected components on multilabel 3D images, version swh:1:rev:5562f4181fc82466b7a94209e34d14909f9b7f8e. Software Heritage.

Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15:1929–1958.

From deep learning to mechanistic understanding in neuroscience: the structure of retinal prediction. Advances in Neural Information Processing Systems 32:8537–8547. https://doi.org/10.48550/arXiv.1912.06207

Book: Theory of Functionals and of Integral and Integro-Differential Equations. New York: Dover Publications.
Decision letter

Damon A ClarkReviewing Editor; Yale University, United States

Michael J FrankSenior Editor; Brown University, United States
Our editorial process produces two outputs: (i) public reviews designed to be posted alongside the preprint for the benefit of readers; (ii) feedback on the manuscript for the authors, including requests for revisions, shown below. We also include an acceptance summary that explains what the editors found interesting or important about the work.
Decision letter after peer review:
Thank you for submitting your article "A model-free approach to link neural activity to behavioral tasks" for consideration by eLife. Your article has been reviewed by 3 peer reviewers, and the evaluation has been overseen by a Reviewing Editor and Michael Frank as the Senior Editor. The reviewers have opted to remain anonymous.
The reviewers have discussed their reviews with one another, and the Reviewing Editor has drafted this to help you prepare a revised submission.
Essential revisions:
The reviewers were positive about the paper's approach. Below are several points agreed upon by the reviewers that would be important to address before potential publication.
1) R1 had several critiques of the comparison to STA/STC. All reviewers agree that a more complete exposition of the differences between this method and STA/STC would benefit the paper. This could include synthetic cases where the proposed method works better than STA/STC (for instance requiring less data) or real examples from the datasets where the proposed method is better at classifying cell response types. This paper should describe or show the advantages of this type of analysis over prior ones, ideally using good apples-to-apples comparisons.
a. Relatedly, R1 was also skeptical of the utility of the linear methods shown in Figure 1; the other reviewers convinced R1 that the limited linear models used were in keeping with standard methods in the zebrafish field. The authors should better describe why these particular (limited) linear models were used, and perhaps cite examples from the field in support of these sorts of models.
2) The paper was a bit dense in some sections. The reviewers recommend moving some more mathematical details to the supplement and focusing on the most important results in the main text. (An example might be the curvature metric, which was a little opaque and did not seem central.)
3) The authors should draw finer distinctions between their method and some other, similar, prior methods, like the Ganguli/Baccus papers that used convolutional networks to fit retinal ganglion cell responses, in an approach that seems similar in some ways to the one used here. This relates also to several reviewer comments about what method exactly is being referred to as 'MINE'.
4) Figure 7: The authors should explain the significance of the result in Figure 7 with respect to signal processing. Are the authors saying that higher order interactions occur deeper in processing? This complexity analysis may also be an example of an approach that is possible with what is presented here but not with STA/STC.
Reviewer #1 (Recommendations for the authors):
1) In MINE's first mention (line 43) and in showing that it can capture various response features, the authors seem to imply that MINE is simply the idea of using ANNs to model neural activity as a function of either independent or dependent variables. But surely this has already been done many times. I'm having some trouble seeing what precisely is claimed as new in MINE. My thought was that it is also the first and second order interpretation, but that does not come through in the early descriptions of the method.
2) Related to this, in L55, the authors remark "Aside from network architecture and initialization conditions MINE is essentially model-free." This is a pretty big constraint, by the model architecture, choice of activation function, etc. STA and STC and systems identification methods more generally seem far more truly model-free. This suggests that the framing and title here might need significant reconsideration.
3) The authors make several claims about linear models in Figure 1 that I cannot figure out. In particular, at L137, the authors say that the linear model cannot learn the dynamics of the calcium indicator. But why not? If the stimulus is s_t, the kernel is k_t, and the response is r_t = (s*k)_t + eta_t, then this is exactly what a linear model should be able to learn perfectly, directly from the data through a least squares fit at each time point. I don't understand why an expanded linear model should be required to learn the calcium kernel. In 1F, why is the LM not working as well as the ANN for S1 and M2? Those appear to both be simple linear transformations where r_t = (s*k)_t + eta_t: perfect for a linear model and least squares fitting. (In fact, with Gaussian noise, no model should be able to outperform the least squares fit linear model to estimate k, given sufficient data.) Similarly, dS1/dt is still just a linear transformation of S1 and should be easily learned by a linear model, since it's now just r_t = (s*k')_t + eta_t, where k' is the derivative of the calcium kernel. These results – that the authors' linear model is not capturing linear transformations that exactly match the form of the linear model – are truly puzzling.
4) One of the big proposed advances of this paper is the interpretability of the fitted ANN model. By doing "Taylor expansion" on the model, the authors pull out linear and quadratic terms that describe the dependency of neurons on different non-neural variables (dependent and independent) in these experiments. I have a few comments on this, some more critical than others. Overall, I was not left with a strong sense of what the benefit is, if any, of this method over standard systems identification approaches.
a. This sort of expansion of a functional has more typically been called a Volterra expansion, and there exists a giant literature on estimating Volterra kernels for nonlinear systems. It would be appropriate to cite some of that work and adopt the terminology used in prior work in the field.
b. In comparing their methods to spike-triggered analyses, Figure 4 is not a fair comparison, since it has not decorrelated the stimuli in time, as has been common practice in systems identification for quite a while (see Korenberg in the 1990s, Baccus and Meister, 2002). Thus the fair comparison is with S4, and Figure 4 should not be viewed as comparable and should quite likely be removed.
c. In the comparison between the Jacobian/Hessian of the ANN and the STA/STC equivalents in Figures 4 and S4, I don't understand the authors' choice of nonlinearities, which are described after line 1105. In particular, g_sta(x) has a first derivative at x=0, but its second derivative evaluated at x=0 is also nonzero. By adding this to the other function and putting them through a ReLU (equation 19), there will be both linear and quadratic components for both RFa and RFv, even if g_stc is symmetric, simply because the ReLU is approximated by a linear + quadratic (+ higher) terms. This means that with these functions, I would expect the STA to be a linear combination of the two linear projections (RFa and RFv) and I would expect the STC matrix to have eigenvectors spanning the 2-dimensional space of RFa and RFv. It's less clear why this isn't happening in the simulations. RFa could dominate the STA. In Figure 4B, it seems clear that the STC is contaminated by the STA, and given the analysis above, that is not surprising. Is that what is lowering the cosine similarity for the STC analysis in S4? Is only the first eigenvector being considered? If so, it's frequently the case that the STC dimensions also include the STA dimension; often, one reports only the STC components after the STA has been subtracted off. (See for instance Aljadeff et al., 2016.) It looks here as though the authors just report the eigenvector with the largest eigenvalue. Is the STC really not working well, or is it just including components of the other filter (appropriately)? It's hard to believe that STC applied to a point nonlinearity acting on two projections of the stimulus would not return eigenvectors that span those two projections of the stimulus.
d. This brings up a conceptual point. The decomposition that the authors perform of the ANN into zeroth, first, second, and higher order terms in Figure 7D could just as easily be written for the raw responses of the neurons. In such a case, with an excellent model, the two expansions should be closely related to one another, and in some cases equivalent. The loworder terms in the Volterra series, when estimated experimentally, depend on higher order terms. (For instance a cubic Volterra term would show up in the linear filter.) Wiener series are typically estimated instead, since these produce an orthogonalized basis, so that this dependence is eliminated (for a given amplitude of inputs X). When computing the STC, one can eliminate the linear component first; that would then be the Wiener estimate of the second order kernel. The second order kernel computed as the Hessian of an ANN is not orthogonalized in the same way, so it might produce different results from STC, but it is not clear that this is what the authors show, or intend to show.
e. The authors on L346 say that higher order terms are contained in the ANN, so that the Jacobian will be different from spike triggered covariance analysis. I think this is correct. However, I'm not sure that makes the Hessian or Jacobian in the ANN a better analysis than the systems identification approaches. In my mind, this is potentially the only advantage of doing this ANN method over STA and STC; but if that's the advantage, then I'd want to be shown very clearly an example where this occurs and where the insight into the system was improved by using the Jacobian/Hessian rather than STA/STC. I imagine that such benefits could accrue when the fitted ANN is close to the true set of transformations of the system, but that would reduce the advantages of being able to do this with any old ANN. Overall, these comments get to the point of: under what conditions would the Jacobian/Hessian/ANN method be better or somehow more informative than STA/STC? And why fit to the ANN if one is going to interpret it with methods equivalent to STA/STC that could be performed directly on the data?
Reviewer #2 (Recommendations for the authors):
While the paper could certainly be published in the current form, there are some points that as a reader I would like to have seen addressed further. I therefore offer some comments and suggestions that the authors might like to consider.
I believe there is a mistake in the formulae given for the Taylor metric in the Methods. Since the R^2 Full is always greater than the R^2 with fewer parameters, the formulae, as written, appear to give a number between 0 and infinity, whereas I think the intention is that they should give a number between 0 and 1 (as appears in the paper).
An important consideration when comparing approaches to modeling neuronal responses is how much data is required to fit the model. In the comparison with regression models, or with the spiketriggered average it would be useful to show how the performance of each approach develops with the length of recording.
The paper could be improved, and the method made more appealing, by more clearly showing qualitative improvements that their approach offers over variations of linear regression methods. In the zebrafish part, there is a convincing argument made that the method allows more neurons to be fit at some threshold, but direct comparison of the outputs of both approaches is restricted to counting the numbers of ROIs identified in various classes. One possible explanation could be that these differences arise because the model-free method has greater sensitivity for identifying neurons within the same classes that are found with the LM method, but it would be more interesting if the model-free method finds particular types of neurons that are clearly missed by the LM method. What could be compelling to see, for example, would be (1) differences in the spatial distributions of neuron classes found with the two methods, e.g. regions where neurons are only found with the model-free method or (2) evidence that there are specific functional classes that are missed with the LM method – e.g. do the RF clusters vary in how well they are picked up by the LM?
It would be helpful to have some more information about the choice of network architecture, and how this was optimized. For example, 80 features in the convolutional layer seems like a lot for these relatively simple data sets, and it would be good to understand better how this number was chosen and how this choice can affect the outcome.
Intuitively, it seems like the temporal effects of the calcium indicator response would ideally be applied downstream of the computation of the neural response, but, in the text, this step is described as happening in the first layer of linear convolutional filters. Is this a constraint of the network design, and are there any practical consequences for the possible computations captured by the model?
I don't like that the distributions of neuronal types seem to be shown in the form of renderings of anatomical masks highlighting regions of enrichment. This rather obscures the true distributions, and makes them appear more cleanly separated than they are. For me, these visualizations don't offer much information that goes beyond the tabulation of regions. A direct representation of the density of different functional classes, as is often used in the field, would be richer in information, and might be a way to highlight qualitative differences between the populations identified using the LM and model-free approaches. Even if it is the case that there are not strong differences in the distributions of functional types, this by itself would be an interesting observation that merits discussion.
I would like to see more discussion of how the neural network models could be explored or leveraged to give more insight into the responses of individual neurons. For example, having identified the relevant variables, and model complexity, could this information be used to explore simpler models in a more constrained way? Could visualization approaches like those used in Heras et al. 2019 https://doi.org/10.1371/journal.pcbi.1007354, be used to explore how the network is combining task variables?
Since the function is scalar valued, I believe that what is referred to in the paper as the Jacobian is also simply the gradient of the function. Perhaps this could be mentioned explicitly to aid understanding for a broad audience.
Overall, I congratulate the authors on developing this useful method, and I look forward to seeing it published.
Reviewer #3 (Recommendations for the authors):
1. Overall, as mentioned in my Public Review section, I believe the paper is very valuable and merits publication in eLife. I do think that it would be a pity if the manuscript was not carefully reread and rewritten to make it more accessible for the average reader who may want to implement this analysis on their data. An option would be to shift more material to the Methods and use the main text to convey the method and compare it with the available ones for the 3 datasets that are treated.
2. It would help to understand the various thresholds that appear throughout. I am, for example, not sure what to conclude from the sentence that starts on line 224 or how to interpret the red dashed line in Figure 5B (line 421).
3. Given that the zebrafish experiments were performed by the authors, could they comment on the reliability (across animals) in finding the functional activity described in Figures 6 and 7?
4. The code and data availability is exemplary.
[Editors' note: further revisions were suggested prior to acceptance, as described below.]
Thank you for resubmitting your work entitled "A model-free approach to link neural activity to behavioral tasks" for further consideration by eLife. Your revised article has been evaluated by Michael Frank (Senior Editor) and a Reviewing Editor.
The manuscript has been much improved, but there is one large and a few small remaining issues that need to be addressed, as outlined below:
There is only one issue that remains, and it is the issue of calling this a "model-free" approach, noted by R1. Since no analysis is truly model-free, there is a spectrum of how much constraint a model puts on an analysis. There are other approaches discussed in the paper that are closer to the model-free end than this one (like STA, STC, MID), so this term does not seem like the right one to use in this case. The authors should find a more accurate description of what their method brings to the field.
Reviewer #1 (Recommendations for the authors):
I appreciate the care that the authors have taken in addressing the concerns raised in the first round. I believe that the major hurdles were cleared in the revision. However, I do have a few comments that I believe should be addressed before final acceptance/publication.
1) This method is definitely not "model-free". It is a relaxed model requirement, or potentially a very general model, or a model with few assumptions, but I don't think it's just a matter of opinion as to whether it's "model-free". Especially in comparison with STA and STC and maximally informative dimensions, which all have even less of an underlying model framework… At a very basic level, the title and acronym must be accurate. Later in the paper, the authors argue that this method's model is actually providing important regularizing power in computing STC, etc. So it is definitely not model-free, and the authors must find a different description for the title (and the acronym). I'm sure they can do this, but some suggestions: "Flexible INE"? CNN-INE? "Canine"? "Convolutional neurAl Network Identification of Neural Encoding"? Something else? "Model-agnostic"?
2) Figure 7B-C. The temporal kernels found by this method are a bit weird and seem not very biological – much of the power lies at 0 and 10 seconds, the two ends of the filter… Is this because the filters aren't long enough? These look quite different from true filters I'd expect to find in neurons, both in their duration and shape. I think this deserves some kind of explanation. I apologize that I didn't note this on the first round – I clearly got distracted by some other parts of the analysis.
https://doi.org/10.7554/eLife.83289.sa1

Author response
Essential revisions:
The reviewers were positive about the paper's approach. Below are several points agreed upon by the reviewers that would be important to address before potential publication.
We thank the reviewers for carefully engaging with our manuscript and providing constructive criticism. We believe that our revisions address the raised concerns and considerably strengthen the paper. We provide detailed responses to individual criticisms below. We have reorganized the paper and simplified our definition of nonlinearity with the intention to make the paper less dense and easier to follow. We have also rewritten our comparison to STA/STC, explicitly comparing MINE to Volterra analysis to demonstrate that MINE provides a simpler and more robust way to extract receptive fields. We specifically note that it wasn’t our intention to claim that MINE could do something that all wellcrafted modelbased methods fail at, but rather that MINE provides an accessible framework for performing crucial analyses on the relationship of predictors and neural activity. We specifically highlight these points in different parts of the paper.
1) R1 had several critiques of the comparison to STA/STC. All reviewers agree that a more complete exposition of the differences between this method and STA/STC would benefit the paper. This could include synthetic cases where the proposed method works better than STA/STC (for instance requiring less data) or real examples from the datasets where the proposed method is better at classifying cell response types. This paper should describe or show the advantages of this type of analysis over prior ones, ideally using good apples-to-apples comparisons.
We have changed our strategy of comparing MINE to system-identification approaches. We specifically point out that because of the structure of the CNN underlying MINE (taking predictors across time as inputs) the Taylor expansion of the network is in fact equivalent to the Volterra expansion of a system with finite memory corresponding to the time length of the inputs given to MINE (Lines 150-153).
We use synthetic data simulating the action of two different filters and nonlinearities on the data to compare MINE to directly fitting the Volterra kernels via linear regression, an approach that in theory is truly model-free. Just like MINE, ordinary linear regression recovers the filters without problem when the input is Gaussian white noise. This highlights the fact that the comparison is indeed fair: given constraints on the stimulus, our simulation allows full recovery of the filters. With more naturalistic inputs, however, ordinary linear regression fails to recover Volterra kernels from which the filters can be extracted, while the filters can still be derived using MINE. We then show that increasing an L2 penalty using Ridge regression allows us to indeed recover the filters from directly fitting the Volterra kernels. However, depending on the structure of the filter, MINE still recovers more faithful representations, and MINE is overall better at generalizing to new data when compared with the best-fit Ridge regression models (Lines 169-183).
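The direct Volterra-kernel fit via regularized regression can be sketched in a few lines. This is a minimal illustration, not the analysis code from the paper: the toy system (a filtered input passed through a quadratic nonlinearity), the lag count, and the penalty value are all assumptions chosen for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy system: a single input filtered by a short kernel,
# followed by a quadratic nonlinearity.
n_t, n_lags = 5000, 10
s = rng.standard_normal(n_t)                  # white-noise stimulus
k_true = np.exp(-np.arange(n_lags) / 3.0)     # decaying "receptive field"
lin = np.convolve(s, k_true, mode="full")[:n_t]
r = lin + 0.5 * lin ** 2                      # response with a 2nd-order term

# Second-order Volterra design matrix: lagged copies of the input plus all
# pairwise products of lags (upper triangle avoids duplicate terms).
lags = np.stack([np.roll(s, i) for i in range(n_lags)], axis=1)
lags[:n_lags] = 0.0                           # discard wrapped-around samples
iu, ju = np.triu_indices(n_lags)
X = np.hstack([np.ones((n_t, 1)), lags, lags[:, iu] * lags[:, ju]])

# Ridge solution w = (X'X + lam*I)^-1 X'r; lam = 0 is the ordinary fit.
lam = 10.0
w = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ r)
k1_hat = w[1:1 + n_lags]                      # recovered first-order kernel

# Effective degrees of freedom of the ridge fit: trace of the hat matrix,
# which shrinks from the number of columns toward 0 as lam grows.
eigvals = np.linalg.eigvalsh(X.T @ X)
edof = float(np.sum(eigvals / (eigvals + lam)))
```

With white-noise input, `k1_hat` closely matches the true filter, and `edof` makes the degrees-of-freedom comparison discussed below concrete: increasing `lam` moves `edof` from the full column count toward zero.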
We furthermore show that in the process of increasing the Ridge penalty, the effective degrees of freedom of the Ridge regression model approach the effective degrees of freedom of MINE. In other words, system analysis can indeed be truly model-free, e.g., by imposing no constraints on a regression model that fits the Volterra kernels; however, such an approach does not yield useful results. To obtain useful results, the degrees of freedom of the model have to be reduced, either through penalties on a regression fit or by imposing a structured neural network (Lines 193-200). We would also like to point out that the reduction in degrees of freedom is a general theme when applying system analysis to cases in which inputs are not Gaussian white noise, whether through the use of basis functions, artificial neural networks, or constrained regression.
a. Relatedly, R1 was also skeptical of the utility of the linear methods shown in Figure 1; the other reviewers convinced R1 that the limited linear models used were in keeping with standard methods in the zebrafish field. The authors should better describe why these particular (limited) linear models were used, and perhaps cite examples from the field in support of these sorts of models.
We better clarify our choice of model in the first section of the paper. Convolving inputs with a presumed calcium kernel and subsequently generating a model that contains all possible interaction terms is indeed common, and we provide corresponding citations from the zebrafish and mouse literature. We also specifically state in the text now (Lines 80-82) that it is not our goal to show that MINE can do something no other model can but rather that MINE has advantages over predefined models by flexibly learning data transformations. We would also like to point out that in the comparison to the zebrafish data we already use a regression model that can learn temporal transformations, and in spite of that, MINE identified more neurons.
2) The paper was a bit dense in some sections. The reviewers recommend moving some more mathematical details to the supplement and focusing on the most important results in the main text. (An example might be the curvature metric, which was a little opaque and did not seem central.)
We have attempted to make the paper less dense. We hope that the reordering of sections as well as the simplification of measuring nonlinearity help with this goal. After showing the capability of MINE to fit arbitrary data relationships, we now introduce the Taylor expansion in a section covering the derivation of nonlinearity and complexity. Both of these metrics are now based on truncations of the Taylor series, which removes the need to introduce curvature and the NLC in detail and which should also increase the interpretability of our definition of nonlinearity. We subsequently use the same Taylor expansion to explain the extraction of receptive fields, which we contrast with Volterra analysis. Subsequently, we provide details for identifying contributing predictors, again via Taylor expansion but this time in a pointwise manner to account for nonlinear relationships, motivated by the section on nonlinearity and complexity. We hope that this flow will make it easier for readers to follow along.
3) The authors should draw finer distinctions between their method and some other, similar, prior methods, like the Ganguli/Baccus papers that used convolutional networks to fit retinal ganglion cell responses, in an approach that seems similar in some ways to the one used here. This relates also to several reviewer comments about what method exactly is being referred to as 'MINE'.
We have addressed this point. We specifically discuss the Ganguli/Baccus papers in the discussion (Lines 482-488). We believe that this discussion leads to new insight in relation to some of our discoveries in the comparison to Volterra analysis. We also fully agree that we never explicitly clarified what MINE is. We now specifically state in the introduction (Lines 21-26) and at the beginning of the discussion (Lines 446-452) that we understand MINE as the combination of obtaining a predictive model of neural activity given predictor inputs, characterizing the nonlinearity/complexity of the relationship between inputs and neural activity, identifying receptive fields, and characterizing which inputs drive neural activity. As such, MINE provides an accessible and easy-to-use interface for a nearly model-free characterization of how neurons encode task-relevant variables.
4) Figure 7: The authors should explain the significance of the result in Figure 7 with respect to signal processing. Are the authors saying that higher order interactions occur deeper in processing? This complexity analysis may also be an example of an approach that is possible with what is presented here but not with STA/STC.
We now explicitly discuss the significance of mapping computational complexity for understanding functional neural circuits (Lines 525-530). We have also increased our analysis of mixed-selectivity neurons, highlighting another feature of MINE: using the power of the predictive model to better understand how neurons integrate different features, in our case thermosensory and motor features (Lines 393-411). We believe this to be unique over system-identification approaches as well. Specifically, our analysis in the comparison with Volterra approaches suggests that it would be very difficult to use system analysis if multiple predictors were involved, as this further increases the challenge of obtaining a reasonably fitting model when omitting constraints.
Reviewer #1 (Recommendations for the authors):
1) In MINE's first mention (line 43) and in showing that it can capture various response features, the authors seem to imply that MINE is simply the idea of using ANNs to model neural activity as function of either independent or dependent variables. But surely this has already been done many times. I'm having some trouble seeing what precisely is claimed as new in MINE. My thought was that it is also the first and second order interpretation, but that does not come through in the early descriptions of the method.
We agree with the reviewer that it was never properly spelled out what MINE is. We now clarify this point in the introduction (Lines 21-26) where we explicitly state: “Here we introduce “Model-free identification of neural encoding” (MINE). MINE combines convolutional neural networks (CNN) to learn mappings from predictors (stimuli, behavioral actions, internal states) to neural activity (Figure 1) with a deep characterization of this relationship. Using Taylor expansion approaches, MINE reveals the computational complexity such as the nonlinearity of the relationship (Figure 2), characterizes receptive fields as indicators of processing (Figure 3) and reveals the dependence of neural activity on specific predictors or their interactions (Figure 4).”
2) Related to this, in L55, the authors remark "Aside from network architecture and initialization conditions MINE is essentially model-free." This is a pretty big constraint, by the model architecture, choice of activation function, etc. STA and STC and systems identification methods more generally seem far more truly model-free. This suggests that the framing and title here might need significant reconsideration.
While we agree that the architecture of the model, the type of training performed, etc. are indeed constraints, we stand by our point that MINE is essentially model-free when compared with other approaches, in particular in the domain of approximating various functional forms. Specifically, while systems identification approaches can in theory be entirely model-free, we demonstrate that in practice (outside a white-noise regime) they really aren’t. We compare effective degrees of freedom of the CNN used by MINE to models that directly fit the Volterra kernels to explicitly address this point (Lines 193-200). We would also like to point out that it is a general theme to reduce the degrees of freedom in systems identification when stimuli depart from Gaussian white noise in order for the analysis to still be useful. One example is the use of Laguerre basis functions, which effectively reduce the degrees of freedom of the fit, or constrained regression approaches, which we used here in the form of Ridge regression.
3) The authors make several claims about linear models in Figure 1 that I cannot figure out. In particular, at L137, the authors say that the linear model cannot learn the dynamics of the calcium indicator. But why not? If the stimulus is s_t, the kernel is k_t, and the response is r_t = (s*k)_t + eta_t, then this is exactly what a linear model should be able to learn perfectly, directly from the data through a least squares fit at each time point. I don't understand why an expanded linear model should be required to learn the calcium kernel. In 1F, why is the LM not working as well as the ANN for S1 and M2? Those appear to both be simple linear transformations where r_t = (s*k)_t + eta_t: perfect for a linear model and least squares fitting. (In fact, with Gaussian noise, no model should be able to outperform the least squares fit linear model to estimate k, given sufficient data.) Similarly, dS1/dt is still just a linear transformation of S1 and should be easily learned by a linear model, since it's now just r_t = (s*k')_t + eta_t, where k' is the derivative of the calcium kernel. These results – that the authors' linear model is not capturing linear transformations that exactly match the form of the linear model – are truly puzzling.
We are sorry about causing this confusion. Our explicit goal in section one was to compare MINE to a bare-bones linear regression model (no time-shifts, no interactions), which is bound to fail, as well as a regression model that is often used in the analysis of calcium imaging data: pre-convolution with a calcium kernel and inclusion of interaction terms but no time-shifted regressors. We now provide citations to works using this type of model, and given its prevalence we do believe it to be a useful benchmark. It is however absolutely clear that linear models can learn linear convolutions of the data (such as calcium kernels, computations of the derivative, etc.). We now specifically state that “While other models could clearly be designed to overcome these challenges, this further illustrates the point that a model-based approach is limited to the constraints of the chosen model.” (Line 80), to clarify that we do not claim that MINE does something magic. In the zebrafish section however, we use a linear comparison model that can indeed learn temporal dependencies, which is the only reason why it can fit ~40% of the data. As a side note: we initially tried to include a linear model that would include all time-shifts and all interactions in section one; however, this model never had any predictive power over validation data after fitting. Again, this could likely have been overcome by carefully fitting a regularization parameter, but our main point is that MINE provides a convenient and accessible solution to this problem.
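The premise underlying this exchange — that a linear model with time-shifted regressors recovers a convolution kernel such as a calcium kernel directly by least squares, while a model without time shifts cannot — is easy to verify numerically. The sketch below is purely illustrative; the kernel shape, noise level, and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated response: stimulus convolved with a decaying "calcium kernel"
# plus noise. All names and sizes here are illustrative assumptions.
n_t, n_lags, tau = 4000, 20, 4.0
s = rng.standard_normal(n_t)
k = np.exp(-np.arange(n_lags) / tau)
r = np.convolve(s, k, mode="full")[:n_t] + 0.1 * rng.standard_normal(n_t)

# Design matrix of time-shifted copies of the stimulus: this is the model
# that CAN learn the convolution, unlike an instantaneous-only regressor.
X = np.stack([np.roll(s, i) for i in range(n_lags)], axis=1)
X[:n_lags] = 0.0                              # drop wrapped-around samples
k_hat, *_ = np.linalg.lstsq(X, r, rcond=None)

# Compare against a model using only the instantaneous stimulus value.
b = (s @ r) / (s @ s)
ss_tot = np.sum((r - r.mean()) ** 2)
r2_lagged = 1.0 - np.sum((r - X @ k_hat) ** 2) / ss_tot
r2_inst = 1.0 - np.sum((r - b * s) ** 2) / ss_tot
```

The lagged design recovers the kernel up to the noise level, while the instantaneous model captures only the zero-lag component — consistent with the reviewer's point and with the clarification above that the bare-bones benchmark deliberately omitted time shifts.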
4) One of the big proposed advances of this paper is the interpretability of the fitted ANN model. By doing "Taylor expansion" on the model, the authors pull out linear and quadratic terms that describe the dependency of neurons on different non-neural variables (dependent and independent) in these experiments. I have a few comments on this, some more critical than others. Overall, I was not left with a strong sense of what the benefit is, if any, of this method over standard systems identification approaches.
a. This sort of expansion of a functional has more typically been called a Volterra expansion, and there exists a giant literature on estimating Volterra kernels for nonlinear systems. It would be appropriate to cite some of that work and adopt the terminology used in prior work in the field.
We agree with the reviewer and now explicitly state “Since the input to the CNN at the heart of MINE contains information about each predictor across time, the Taylor expansion introduced in the previous section (Figure 2A) is equivalent to the Volterra expansion of a system processing information across time with a filter memory equivalent to the history length of the CNN.” In line with this statement we now also reframe our former STA/STC comparison in terms of Volterra analysis, e.g., stating: “As a comparison, we used regression to directly fit the Volterra kernels (see Methods). Notably, just like MINE, apart from the truncation of the series after the 2nd term, the Volterra analysis is essentially model-free since any function can be approximated using an infinite Volterra series (Volterra, 1959).” We also provide a more comprehensive citation of literature.
b. In comparing their methods to spike-triggered analyses, Figure 4 is not a fair comparison, since it has not decorrelated the stimuli in time, as has been wide practice in systems identification for quite a while (see Korenberg in the 1990s, Baccus and Meister, 2002). Thus the fair comparison is with S4, and Figure 4 should not be viewed as comparable and should quite likely be removed.
We agree with the reviewer. Notably, our new approach, directly fitting the Volterra kernels using linear regression circumvents this problem. We have also expanded our comparison to ensure equal footing between MINE and the comparison approach by including regularization of the fit using Ridge regression.
c. In the comparison between the Jacobian/Hessian of the ANN and the STA/STC equivalents in Figures 4 and S4, I don't understand the authors' choice of nonlinearities, which are described after line 1105. In particular, g_sta(x) has a first derivative at x=0, but its second derivative evaluated at x=0 is also nonzero. By adding this to the other function and putting them through a ReLU (equation 19), there will be both linear and quadratic components for both RFa and RFv, even if g_stc is symmetric, simply because the ReLU is approximated by a linear + quadratic (+ higher) terms. This means that with these functions, I would expect the STA to be a linear combination of the two linear projections (RFa and RFv) and I would expect the STC matrix to have eigenvectors spanning the 2-dimensional space of RFa and RFv. It's less clear why this isn't happening in the simulations. RFa could dominate the STA. In Figure 4B, it seems clear that the STC is contaminated by the STA, and given the analysis above, that is not surprising. Is that what is lowering the cosine similarity for the STC analysis in S4? Is only the first eigenvector being considered? If so, it's frequently the case that the STC dimensions also include the STA dimension; often, one reports only the STC components after the STA has been subtracted off. (See for instance Aljadeff et al., 2016.) It looks here as though the authors just report the eigenvector with largest eigenvalue. Is the STC really not working well, or is it just including components of the other filter (appropriately)? It's hard to believe that STC applied to a point nonlinearity acting on two projections of the stimulus would not return eigenvectors that span those two projections of the stimulus.
In the previous comparison we allowed the STA/STC method access to the first two eigenvectors with the largest eigenvalues because of this seeming complication. We would like to note that we did not specifically design the simulation to be as compatible as possible with STA/STC. This would just have repeated a vast body of literature without being practically that relevant. We believe that the important question is whether given a certain system, its filters can be recovered by the analysis at hand. That being said, we now changed our approach to further alleviate this comparison problem: For both MINE and the linear regression fits we combine the first-order vector (J in the case of the CNN, k1 in the case of Volterra) and the second-order matrix (H in the case of the CNN, k2 in the case of Volterra) and extract the principal dynamic modes from this matrix according to Marmarelis, 2004. Here we extract the three eigenvectors with the highest eigenvalues and assign the best-fit cases to the “linear” and “nonlinear” filter (Figure 3). Notably, these filters are always contained within the eigenvectors with the largest two eigenvalues for MINE while the regression-based fit sometimes produces a third vector with a larger eigenvalue. Therefore, by allowing it to choose from the highest three, we in fact give an advantage to the regression-based fit. The fit on white-noise input data clearly validates this approach as it flawlessly recovers the filters for both methods. As expected, an ordinary linear regression model fails in the case of highly correlated stimuli, but filters can be successfully recovered under strongly regularizing conditions, albeit with slightly worse quality than MINE. Furthermore, as regularization increases, the predictive power of the linear model decreases as well. MINE again has a slight edge in this regard.
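A sketch of the principal-dynamic-mode extraction described above: the first-order vector (J or k1) and second-order matrix (H or k2) are combined into one symmetric matrix whose leading eigenvectors serve as candidate filters. The augmented construction below follows one common convention after Marmarelis (2004); the toy filters are chosen orthogonal so the recovery is exact, and all names are illustrative assumptions rather than the paper's actual procedure.

```python
import numpy as np

# Toy first- and second-order terms built from two known filters. Real
# Jacobians and Hessians from a fitted model would of course be noisy.
n = 30
t = np.arange(n)
f_lin = np.sin(2 * np.pi * t / n)             # "linear" filter
f_non = np.cos(2 * np.pi * t / n)             # "nonlinear" filter
J = 2.0 * f_lin                               # first-order term (J / k1)
H = np.outer(f_non, f_non)                    # second-order term (H / k2)

# Augmented symmetric matrix [[0, J^T/2], [J/2, H]]; its eigenvectors
# (spatial part, i.e. ignoring the first coordinate) are the candidate
# principal dynamic modes.
Q = np.zeros((n + 1, n + 1))
Q[0, 1:] = Q[1:, 0] = J / 2.0
Q[1:, 1:] = H
w, v = np.linalg.eigh(Q)
order = np.argsort(-np.abs(w))                # rank by |eigenvalue|
modes = v[1:, order[:3]]                      # top-3 candidate filters

def best_match(target, modes):
    """Largest sign-invariant cosine similarity between target and modes."""
    t_unit = target / np.linalg.norm(target)
    return max(abs(t_unit @ m) / np.linalg.norm(m) for m in modes.T)
```

Because cosine similarity is length- and sign-invariant, both underlying filters are recovered among the leading eigenvectors, mirroring the assignment of best-fit modes to the "linear" and "nonlinear" filters described above.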
d. This brings up a conceptual point. The decomposition that the authors perform of the ANN into zeroth, first, second, and higher order terms in Figure 7D could just as easily be written for the raw responses of the neurons. In such a case, with an excellent model, the two expansions should be closely related to one another, and in some cases equivalent. The low-order terms in the Volterra series, when estimated experimentally, depend on higher order terms. (For instance a cubic Volterra term would show up in the linear filter.) Wiener series are typically estimated instead, since these produce an orthogonalized basis, so that this dependence is eliminated (for a given amplitude of inputs X). When computing the STC, one can eliminate the linear component first; that would then be the Wiener estimate of the second order kernel. The second order kernel computed as the Hessian of an ANN is not orthogonalized in the same way, so it might produce different results from STC, but it is not clear that this is what the authors show, or intend to show.
We fully agree with the reviewer and have addressed this point as shown above. Again, as we point out in multiple places of the manuscript, we do not believe MINE to be a magical tool that can do something a carefully crafted model cannot. It simply provides a practical solution to a real-world problem. This is also further illustrated in the greater robustness of MINE to changes in the size of training data compared to directly fitting the Volterra kernels via regression (Figure S3I-J).
e. The authors on L346 say that higher order terms are contained in the ANN, so that the Jacobian will be different from spike triggered covariance analysis. I think this is correct. However, I'm not sure that makes the Hessian or Jacobian in the ANN a better analysis than the systems identification approaches. In my mind, this is potentially the only advantage of doing this ANN method over STA and STC; but if that's the advantage, then I'd want to be shown very clearly an example where this occurs and where the insight into the system was improved by using the Jacobian/Hessian rather than STA/STC. I imagine that such benefits could accrue when the fitted ANN is close to the true set of transformations of the system, but that would reduce the advantages of being able to do this with any old ANN. Overall, these comments get to the point of: under what conditions would the Jacobian/Hessian/ANN method be better or somehow more informative than STA/STC? And why fit to the ANN if one is going to interpret it with methods equivalent to STA/STC that could be performed directly on the data?
The message we intended with L346 was that the nonlinearity, which would be estimated separately for STA/STC, will be contained within the ANN itself which likely influences the retrieved filters. We did not mean to imply anything in relation to the quality of extracting filters and have struck this comment. The possibility of modeling higher-order terms of the CNN model however likely underlies the fact that it better generalizes to validation data (Figure 3C and F). This of course partly results because our nonlinearities are not designed to be well approximated by a quadratic function, i.e. the function that the fit of the Volterra kernels could fully approximate. While one might argue that this makes the comparison “unfair” we would argue that in real-life problems nonlinearities will seldom be constrained to a specific functional form (as we highlight in the introduction). We do believe that our new analysis provides a clear idea of the advantages of MINE, especially when taken in context of the entire manuscript: MINE allows extracting high-quality filters under more conditions than a direct fit of the Volterra kernels, the resulting model has more predictive power and MINE can extract more information from the fits than can easily be obtained by system identification approaches.
Reviewer #2 (Recommendations for the authors):
While the paper could certainly be published in the current form, there are some points that as a reader I would like to have seen addressed further. I therefore offer some comments and suggestions that the authors might like to consider.
I believe there is a mistake in the formulae given for the Taylor metric in the Methods. Since the R^2 Full is always greater than the R^2 with fewer parameters, the formulae, as written, appear to give a number between 0 and infinity, whereas I think the intention is that they should give a number between 0 and 1 (as appears in the paper).
We fully agree with the reviewer and apologize for this oversight. We computed the correct quantities in our code but miswrote the formulas in the method section. We have corrected this mistake.
An important consideration when comparing approaches to modeling neuronal responses is how much data is required to fit the model. In the comparison with regression models, or with the spike-triggered average it would be useful to show how the performance of each approach develops with the length of recording.
We agree with the reviewer and have included a corresponding analysis in our discussion of receptive field extraction. Figure S3I-J and Lines 184-192 in the text.
The paper could be improved, and the method made more appealing, by more clearly showing qualitative improvements that their approach offers over variations of linear regression methods. In the zebrafish part, there is a convincing argument made that the method allows more neurons to be fit at some threshold, but direct comparison of the outputs of both approaches is restricted to counting the numbers of ROIs identified in various classes. One possible explanation could be that these differences arise because the model-free method has greater sensitivity for identifying neurons within the same classes that are found with the LM method, but it would be more interesting if the model-free method finds particular types of neurons that are clearly missed by the LM method. What could be compelling to see, for example, would be (1) differences in the spatial distributions of neuron classes found with the two methods, e.g. regions where neurons are only found with the model-free method or (2) Evidence that there are specific functional classes that are missed with the LM method – e.g. do the RF clusters vary in how well they are picked up by the LM?
We agree with the reviewer that this is an important point to address. However, whether MINE identifies qualitatively new types or not will also depend on the level of analysis detail. We currently only perform some coarse divisions of neural types to demonstrate various types of information that can be obtained with MINE. As the reviewer suggested, we did compare the anatomical distribution of identified types. Making cluster maps (see below, where we address our change in visualization), it does indeed appear as though there are parts of the brain where the linear model does not identify stimulus-related neurons (dorsal cerebellum, anterior hindbrain). However, we were not comfortable adding this data to the paper. The lack of neurons in these regions could be because the linear model cannot identify “these types of neurons” or simply because it is less sensitive and by chance did not identify neurons within sparse regions.
We did assess how many neurons the linear model identifies in each receptive field cluster and added this information to Figure S7D. While there are no clusters that exclusively contain neurons identified by MINE, there are five clusters where the linear model identifies much fewer neurons than expected based on the overall identified fraction. We discuss this result (Lines 594-601) and point to the fact that this has consequences if the linear model data were used for clustering in the first place, reducing the number of uniquely identified receptive field clusters from 15 to 4. So while MINE largely increases sensitivity, this has real-world consequences on the analysis which we believe to be important.
It would be helpful to have some more information about the choice of network architecture, and how this was optimized. For example, 80 features in the convolutional layer seems like a lot for these relatively simple data sets, and it would be good to understand better how this number was chosen and how this choice can affect the outcome.
We did not systematically explore network architectures and did not optimize the architecture, in part because we wanted to avoid preconceiving an architecture that we thought might be best for the data at hand. We have now clarified this in the methods section “Design and training of convolutional network.” In that section we also provide more detail on our rationale for using 80 convolutional filters. We do not think that 80 features are in fact extracted, but having a large number of filters means a large variation in the random features that are extracted prior to training, which can aid network training. Since our convolutional layers are linear, each node in the first dense layer in fact uses one weighted average filter that acts on the input data.
Intuitively, it seems like the temporal effects of the calcium indicator response would ideally be applied downstream of the computation of the neural response, but, in the text, this step is described as happening in the first layer of linear convolutional filters. Is this a constraint of the network design, and are there any practical consequences for the possible computations captured by the model?
We agree with the reviewer that in measurements of brain activity, the convolution by the calcium indicator occurs after any other transformations. We perform all convolutions at the input layer to increase the simplicity of the network. After the initial convolutional layers all timing information is lost in the network which reduces the number of required parameters over a design that allows for convolution at the input and just before the output node. While this does limit expressivity of the network somewhat (since we do not allow for convolutions of the results of the nonlinear operations as would be expected in a calcium imaging experiment) we think that these effects are small and justified by the gain in simplicity in the network. Since we fit separate networks for each neuron, having fewer parameters leads to substantial speed gains. We note however, that all procedures implemented in MINE are agnostic to network architecture and it wouldn’t be a fundamental problem to make this change.
I don't like that the distributions of neuronal types seem to be shown in the form of renderings of anatomical masks highlighting regions of enrichment. This rather obscures the true distributions, and makes them appear more cleanly separated than they are. For me, these visualizations don't offer much information that goes beyond the tabulation of regions. A direct representation of the density of different functional classes, as is often used in the field, would be richer in information, and might be a way to highlight qualitative differences between the populations identified using the LM and modelfree approaches. Even if it is the case that there are not strong differences in the distributions of functional types, this by itself would be an interesting observation that merits discussion.
We agree with the reviewer and have adjusted our visualizations. While we keep the bar-plot-style tabulations for the comparison of stimulus/behavior/mixed and swim start/vigor/direction, we now plot centroid maps in all cases. We created these maps using spatial clustering: in brief, we merge neurons into spatial clusters based on proximity (threshold of 5 µm) and then display all clusters that contain at least 10 neurons. While this does not display the entire population of neurons, it gives a better sense of the overall distribution. Structure is also visible when all neurons are shown; however, 2D projections obscure the difference between areas where neurons truly group together and areas that merely contain many isolated neurons spread across 3D space. We therefore opted for this middle ground between thresholding based on anatomical regions and a plot that disregards spatial grouping. We hope that this addresses the point raised by the reviewer.
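As a concrete illustration of the spatial clustering described above, here is a minimal single-linkage sketch in numpy (the function name, the toy coordinates, and the exact merge strategy are our assumptions for illustration; the published analysis may differ in detail):

```python
import numpy as np

def spatial_clusters(centroids, merge_dist=5.0, min_size=10):
    """Single-linkage clustering of neuron centroids via union-find.

    Neurons closer than `merge_dist` (here micrometres) end up in the
    same cluster; only clusters with at least `min_size` members are
    returned, mirroring the thresholds described in the response.
    """
    n = len(centroids)
    parent = np.arange(n)

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    # union all pairs within the merge distance (O(n^2) pairwise
    # distances; fine for per-fish neuron counts)
    d = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=-1)
    for i, j in zip(*np.where(np.triu(d <= merge_dist, k=1))):
        parent[find(i)] = find(j)

    labels = np.array([find(i) for i in range(n)])
    groups = [np.where(labels == lab)[0] for lab in np.unique(labels)]
    return [idx for idx in groups if len(idx) >= min_size]

# toy example: one tight group of 12 neurons plus 3 isolated neurons
rng = np.random.default_rng(1)
group = rng.normal(0.0, 0.5, size=(12, 3))  # all within ~5 µm of each other
stray = np.array([[100.0, 0.0, 0.0], [0.0, 100.0, 0.0], [0.0, 0.0, 100.0]])
clusters = spatial_clusters(np.vstack([group, stray]))
print(len(clusters), len(clusters[0]))  # → 1 12
```

The three isolated neurons form singleton clusters below the 10-neuron minimum and are dropped, which is exactly the trade-off noted above: not every neuron is displayed, but spatial grouping is made explicit.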
I would like to see more discussion of how the neural network models could be explored or leveraged to give more insight into the responses of individual neurons. For example, having identified the relevant variables, and model complexity, could this information be used to explore simpler models in a more constrained way? Could visualization approaches like those used in Heras et al. 2019 https://doi.org/10.1371/journal.pcbi.1007354, be used to explore how the network is combining task variables?
We agree with the reviewer and now exploit a similar visualization for the mixed-selectivity neurons (Figure 7D and S7E). Since our inputs are stimuli/behaviors across time, we cannot directly visualize the effect of one variable on the output. We therefore exploit the receptive fields to provide inputs of varying strength along the different axes, which we refer to as drive. We think that this is also another good example of the power of MINE, combining the ability to extract receptive fields with a predictive model that encapsulates nonlinear relationships.
Since the function is scalar-valued, I believe that what is referred to in the paper as the Jacobian is also simply the gradient of the function. Perhaps this could be mentioned explicitly to aid understanding for a broad audience.
We fully agree with the reviewer and have adjusted our writing. We still refer to the first-order derivatives as J in the formulas but refer to it as the gradient of the function (e.g., Lines 101–111 in the section about computational complexity).
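The reviewer's point can be checked numerically: for a scalar-valued function the 1 × n Jacobian and the gradient contain the same entries. A small sketch (the function `f` and weights `w` are hypothetical stand-ins for a fitted network output, not MINE's actual model):

```python
import numpy as np

# scalar-valued function of a vector input, a stand-in for the
# CNN's output as a function of its (flattened) input
w = np.array([0.5, -1.0, 2.0])
f = lambda x: np.tanh(w @ x)

def numerical_gradient(f, x, eps=1e-6):
    # central differences along each input dimension
    g = np.zeros_like(x)
    for i in range(len(x)):
        dx = np.zeros_like(x)
        dx[i] = eps
        g[i] = (f(x + dx) - f(x - dx)) / (2 * eps)
    return g

x0 = np.array([0.1, 0.2, -0.3])
g = numerical_gradient(f, x0)
# For a scalar function the 1 x n Jacobian row equals the gradient:
# J[0, i] == g[i]; analytically, g = (1 - tanh(w @ x)**2) * w
analytic = (1 - np.tanh(w @ x0) ** 2) * w
print(np.allclose(g, analytic))  # → True
```

For a vector-valued function the Jacobian would instead stack one such gradient row per output, which is why the distinction matters for a broad audience.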
Overall, I congratulate the authors on developing this useful method, and I look forward to seeing it published.
We thank the reviewer for their assessment and hope that our additions have improved the paper.
Reviewer #3 (Recommendations for the authors):
1. Overall, as mentioned in my Public Review section, I believe the paper is very valuable and merits publication in eLife. I do think that it would be a pity if the manuscript were not carefully reread and rewritten to make it more accessible for the average reader who may want to implement this analysis on their data. An option would be to shift more material to the Methods and use the main text to convey the method and compare it with the available ones for the 3 datasets that are treated.
We greatly appreciate the reviewer’s positive assessment of our work. We hope that our rewrite addresses the concerns with respect to readability and accessibility. We think that readers will now have an easier time following along after we rearranged the sections of the paper and simplified our analysis of nonlinearity. We also attempted to shift more details to the methods; however, this wasn’t entirely possible in all places. We had considered presenting MINE as a black-box model, essentially demonstrating its usefulness on the three types of data we present (ground truth, cortex and zebrafish). We would then have split details into some form of appendix, as the reviewer suggests. However, ultimately we felt that this is not a satisfying solution either: while it certainly increases accessibility, we worried that it would make it harder to judge the validity of our approach. We hope that we found a satisfying middle ground in the revision.
2. It would help to understand the various thresholds that appear throughout. I am, for example, not sure what to conclude from the sentence that starts on line 224 or how to interpret the red dashed line in Figure 5B (line 421).
We agree with the reviewer that it is important to clarify thresholds. We have removed the threshold from the Taylor decomposition section (now around line 226) and the corresponding figures, as it is indeed not very useful. What we tried to convey is that it is possible to identify thresholds that separate true contributing predictors from noise; however, since we do not use this threshold later, it serves no real purpose.
We have clarified our choice of threshold for the cortical dataset (now around lines 248–252), which is much lower than the zebrafish one, since the data does not have cellular resolution. It is therefore very likely that every component we fit contains activity from many neurons that aren’t task related, limiting the amount of variance that could possibly be explained by a model.
We also added an explanation for our threshold used on the zebrafish data and why it is considerably higher than that used on cortical data (Line 319).
3. Given that the zebrafish experiments were performed by the authors, could they comment on the reliability (across animals) in finding the functional activity described in Figures 6 and 7?
We thank the reviewer for this question and we have added this important information to the paper. Specifically, in Figure 6E we now state for each functional type in how many fish it was identified. All but four types were identified across at least 10 fish, with the lowest number being four fish. Accordingly, we state in the text that each type was identified in multiple fish (Line 337). We also counted how many fish contribute to each receptive field cluster. We report in the text that each type was identified in at least 15 fish (Line 383).
4. The code and data availability is exemplary.
We thank the reviewer for this assessment.
[Editors' note: further revisions were suggested prior to acceptance, as described below.]
The manuscript has been much improved, but there is one large and a few small remaining issues that need to be addressed, as outlined below:
There is only one issue that remains, and it is the issue of calling this a "model-free" approach, noted by R1. Since no analysis is truly model-free, there is a spectrum of how much constraint a model puts on an analysis. There are other approaches discussed in the paper that are closer to the model-free end than this one (like STA, STC, MID), so this term does not seem like the right one to use in this case. The authors should find a more accurate description of what their method brings to the field.
We thank the reviewers for again engaging with our manuscript and providing constructive criticism. We have made textual changes (and one figure clarification suggested by R1) to address the remaining concerns. This specifically includes a change to the title and moniker of the method to refrain from calling it “model-free”. We agree with R1 that it is important to be precise about this point, and now focus on the idea that the neural network allows the method to be flexible in terms of the function/model it can implement. This was our reason for calling it “model-free” in the first place – not that it is free from constraints on parameters but that it is free(er) from constraints on the functional form. We therefore changed the title to “Model discovery to link neural activity to behavioral tasks”. We kept the acronym “MINE” and simply struck the term “free”, referring to it as “Model identification of neural encoding” throughout the text. We only kept one reference to the “model-free” idea, when we talk about the lack of assumptions on the underlying distribution of data, but added the caveat “largely model-free” to this statement. All other references to the method being “model-free” have been struck from the text.
Reviewer #1 (Recommendations for the authors):
I appreciate the care that the authors have taken in addressing the concerns raised in the first round. I believe that the major hurdles were cleared in the revision. However, I do have a few comments that I believe should be addressed before final acceptance/publication.
We thank the reviewer for their assessment and believe that our new changes have addressed the remaining concerns.
1) This method is definitely not "model-free". It is a relaxed model requirement, or potentially a very general model, or a model with few assumptions, but I don't think it's just a matter of opinion as to whether it's "model-free". Especially in comparison with STA and STC and Maximally informative dimensions, which all have even less of an underlying model framework… At a very basic level, the title and acronym must be accurate. Later in the paper, the authors argue that this method's model is actually providing important regularizing power in computing STC, etc. So it is definitely not model-free, and the authors must find a different description for the title (and the acronym). I'm sure they can do this, but some suggestions: "Flexible INE"? CNNINE? "Canine"? "Convolutional neurAl Network Identification of Neural Encoding"? Something else? "Model-agnostic"?
We have changed the title of the manuscript as well as the text to be precise about this point. We now emphasize the idea that the advantage of MINE is that it is flexible with respect to the functional form it can implement, turning model-based analysis on its head: instead of defining a model and subsequently fitting parameters, MINE will discover a model which we subsequently characterize. As stated above, we therefore changed the title to “Model discovery to link neural activity to behavioral tasks”. We kept the acronym “MINE” and simply struck the term “free”, referring to it as “Model identification of neural encoding” throughout the text. We only kept one reference to the “model-free” idea, when we talk about the lack of assumptions on the underlying distribution of data, but added the caveat “largely model-free” to this statement. All other references to the method being “model-free” have been struck from the text. The changes made to address this point are highlighted in the tracked version of the manuscript at the end of this reviewer response.
2) Figure 7B,C. The temporal kernels found by this method are a bit weird and seem not very biological – much of the power lies at 0 and 10 seconds, the two ends of the filter… Is this because the filters aren't long enough? These look quite different from true filters I'd expect to find in neurons, both in their duration and shape. I think this deserves some kind of explanation. I apologize that I didn't note this on the first round – I clearly got distracted by some other parts of the analysis.
We agree with the reviewer that this should be addressed. To this end, we have added a caveat to the result section, where we now explicitly state:
“We note that some of the uncovered receptive fields (e.g., clusters 1 and 4) have the largest departures from 0 at the starts and ends of the receptive fields. This might indicate that the chosen history length (here 10 s) is too short and does not cover the entire timescale of processing which could be a result of the rather slow nuclear calcium indicator we chose for this study.”
https://doi.org/10.7554/eLife.83289.sa2
Article and author information
Author details
Funding
National Institutes of Health (5R01NS123887-02)
 Jamie D Costabile
 Kaarthik A Balakrishnan
 Martin Haesemeyer
The Ohio State University Wexner Medical Center
 Sina Schwinn
 Martin Haesemeyer
The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.
Acknowledgements
We thank Andrew D Bolton and Bradley Cutler for critical comments on the manuscript and the 'Zebrafish Neural Circuits & Behavior' seminar community for early stage feedback and discussion. We would also like to thank participants of NeuroDataShare 2023 for helpful discussions about this work and suggested improvements to our sharing of data. This work was supported by funding from the Ohio State University and NIH BRAIN Initiative R01NS123887 through NINDS and NIDA to MH.
Ethics
Animal handling and experimental procedures were approved by the Ohio State University Institutional Animal Care and Use Committee (IACUC Protocol #: 2019A00000137 and 2019A00000137R1).
Senior Editor
 Michael J Frank, Brown University, United States
Reviewing Editor
 Damon A Clark, Yale University, United States
Version history
 Preprint posted: September 3, 2022 (view preprint)
 Received: September 6, 2022
 Accepted: June 5, 2023
 Accepted Manuscript published: June 6, 2023 (version 1)
 Version of Record published: June 29, 2023 (version 2)
Copyright
© 2023, Costabile, Balakrishnan et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.