Abstract
Responses of sensory neurons are often modeled using a weighted combination of rectified linear subunits. Since these subunits often cannot be measured directly, a flexible method is needed to infer their properties from the responses of downstream neurons. We present a method for maximum likelihood estimation of subunits by softclustering spiketriggered stimuli, and demonstrate its effectiveness in visual neurons. For parasol retinal ganglion cells in macaque retina, estimated subunits partitioned the receptive field into compact regions, likely representing aggregated bipolar cell inputs. Joint clustering revealed shared subunits between neighboring cells, producing a parsimonious population model. Closedloop validation, using stimuli lying in the null space of the linear receptive field, revealed stronger nonlinearities in OFF cells than ON cells. Responses to natural images, jittered to emulate fixational eye movements, were accurately predicted by the subunit model. Finally, the generality of the approach was demonstrated in macaque V1 neurons.
Introduction
The functional properties of sensory neurons are often probed by presenting stimuli and then inferring the rules by which inputs from presynaptic neurons are combined to produce receptive field selectivity. The most common simplifying assumption in these models is that the combination rule is linear, but it is wellknown that many aspects of neural response are highly nonlinear. For example, retinal ganglion cells (RGCs) are known to be driven by nonlinear 'subunits' (Hochstein and Shapley, 1976), which reflect signals from presynaptic bipolar cells that are rectified at the synapse onto RGCs (Demb et al., 1999; Demb et al., 2001; Borghuis et al., 2013). This subunit computation is fundamentally nonlinear because hyperpolarization of one bipolar cell input does not cancel depolarization of another. The subunit architecture endows RGCs with sensitivity to finer spatial detail than would arise from a linear receptive field (Hochstein and Shapley, 1976; Demb et al., 2001;; Baccus et al., 2008; Crook et al., 2008; Schwartz et al., 2012) and has been implicated in the processing of visual features like object motion and looming (Olveczky et al., 2007; Münch et al., 2009). As another example, complex cells in primary visual cortex are thought to perform subunit computations on simple cell inputs, producing invariance to the spatial location of a stimulus. Indeed, subunits appear to be a common computational motif in the brain, and inferring their functional properties from neural recordings requires the development of effective estimation methods.
To this end, simple techniques have been used to reveal the presence and typical sizes of subunits (Hochstein and Shapley, 1976), and more elaborate techniques have been used in specific experimental settings (Emerson et al., 1992; Paninski, 2003; Sharpee et al., 2004; Rust et al., 2005; Schwartz et al., 2006; Pillow and Simoncelli, 2006; Rajan and Bialek, 2013; Park et al., 2013; Theis et al., 2013; Kaardal et al., 2013; Freeman et al., 2015; Liu et al., 2017; Maheswaranathan et al., 2018; Shi et al., 2019). However, no widely accepted general computational tools exist for inferring the structure of nonlinear subunits providing inputs to a neuron, such as their individual sizes, shapes, and spatial arrangement. Such a tool must be robust, efficient, and applicable, possibly with simple modifications, in a variety of different experimental contexts.
Here, we present a novel method for estimating the properties of subunits providing inputs to a recorded neuron. The method was derived as a maximum likelihood procedure for estimating subunit model parameters and took the form of a spiketriggered clustering algorithm. As such, it provided a natural generalization of spiketriggered averaging, which is widely used to estimate linear receptive fields because of its conceptual simplicity, robustness, and straightforward interpretation. In populations of recorded parasol RGCs in macaque retina, the new method revealed a gradual partitioning of receptive fields with a hierarchical organization of spatially localized and regularly spaced subunits, consistent with the input from a mosaic of bipolar cells, and generalized naturally and parsimoniously to a population of cells. Unlike other approaches, the technique optimized a simple and explicit model of neural response, required no assumptions about the spatiotemporal properties of subunits, and was validated on an extensive data set. The structure of the model permitted prior information, such as spatial locality, to be incorporated naturally in cases of limited data. A novel closedloop 'null' stimulus revealed that the subunit model was substantially more accurate than a linear model in explaining RGC responses, a finding that also extended to natural stimuli. Responses to null stimuli were stronger in OFF parasol cells than ON parasol cells, consistent with their stronger nonlinear responses (Demb et al., 2001; Chichilnisky and Kalmar, 2002; Turner and Rieke, 2016). Application of the estimation method to complex cells in primary visual cortex revealed subunits with expected spatiotemporal structure, demonstrating that the approach generalizes to other experimental contexts and neural circuits.
Results
The goal is to develop a method to estimate the nonlinear subunits of recorded neurons, examine the spatial and temporal properties of the estimated subunits, and verify their predictions for a wide range of stimuli.
Subunit response model and parameter estimation
The method is developed as an optimal estimation procedure for a generic subunit model in which responses are described as an alternating cascade of two linearnonlinear (LN) stages (Figure 1A). In the first LN stage, visual inputs over space and recent time are weighted and summed by linear subunit filters, followed by an exponential nonlinearity. In the second LN stage, the subunit outputs are weighted by nonnegative scalars, summed, and passed through a final output nonlinearity. This yields the neuron's firing rate, which drives a Poisson spike generator.
To predict neural responses using the model, the model parameters (subunits, weights, and nonlinearity) must be estimated from recorded data. The estimation problem is difficult because the negative loglikelihood is not a convex function of the parameters. An efficient solution is presented with two steps (see Materials and methods): (1) obtain a maximum likelihood estimate of the subunits by identifying clusters in the space of spiketriggered stimuli (Figure 1B), (2) estimate the parameters of the output nonlinearity using standard optimization methods. This spiketriggered clustering solution offers a simple and robust generalization of the commonly used spiketriggered averaging method for receptive field estimation (see Chichilnisky, 2001). Intuitively, clustering is effective because stimuli that produce spikes will typically fall into distinct categories based on which subunit(s) they activate. Simulations with a RGC model show that this procedure yields accurate estimates of model parameters when sufficient data are available (Figure 1—figure supplement 1).
The remaining model parameter, the number of subunits, is discrete and is therefore typically selected by evaluating model predictions on heldout data (i.e. crossvalidation) (Figure 2C). However, in some cases, it is desirable to either compare the results obtained with different numbers of estimated subunits (in order to study the strongest sources of nonlinearity; Figure 2A,B, 4, 5–8), or to use a fixed number of subunits (to compare variants of the algorithm; Figure 3).
In what follows, results are presented using the core estimation approach, followed by several modifications appropriate for analyzing data from different experimental conditions. Specifically, consider two situations with data limitations: short recording durations, and biased stimulus statistics. For short durations, the core method is modified to include regularization that assumes localized spatial structure of subunits (see Materials and methods, Figures 3,4). For biased stimuli (e.g. natural scenes), subunit filters are estimated using responses from white noise, then the output nonlinearities are finetuned (Figures 6,7).
Estimated subunits emerge hierarchically and are spatially localized and nonoverlapping
To test the subunit model and estimation procedure on primate RGCs, light responses were obtained using largescale multielectrode recordings from isolated macaque retina (Chichilnisky and Kalmar, 2002; Litke et al., 2004; Frechette et al., 2005). Spiking responses of hundreds of RGCs to a spatiotemporal white noise stimulus were used to classify distinct cell types, by examining the properties of the spiketriggered average (STA) stimulus (Frechette et al., 2005; Field et al., 2007). Complete populations of ON and OFF parasol RGCs covering the recorded region were examined.
Fitting a subunit model reliably to these data was aided by decoupling the spatial and temporal properties of subunits. Specifically, although the fitting approach in principle permits estimation of full spatiotemporal subunit filters, the high dimensionality of these filters requires substantial data (and thus, long recordings). The dimension of the parameter space was reduced by assuming that subunit responses are spatiotemporally separable, and that all subunits have the same temporal filter (consistent with known properties of retinal bipolar cells that are thought to form the RGC subunits, see EnrothCugell et al., 1983 and Cowan et al., 2016). The common temporal filter was estimated from the STA (see Materials and methods). The stimulus was then convolved with this time course to produce an instantaneous spatial stimulus associated with each spike time. This temporally prefiltered spatial stimulus was then used to fit a model with purely spatial subunits.
The core spike triggered clustering method (no regularization) was applied to study the spatial properties of subunits with increasing numbers of subunits (N). As expected, setting N = 1 yielded a single subunit with a receptive field essentially identical to the STA, i.e. the spatial filter of a LN model. Interestingly, a model with two subunits partitioned this receptive field into spatially localized regions (Figure 2A, second row). Fitting the model with additional subunits (Figure 2A, subsequent rows) typically caused one of the subunits from the preceding fit to be partitioned further, while other subunits were largely unchanged. In principle this procedure could be repeated many times to capture all subunits in the receptive field. However, the number of model parameters increases with N, and thus the estimation accuracy decreases, with estimated subunits becoming noisier and exhibiting larger overlap (Figure 2A, last row). The partitioning observation suggests the possibility of an estimation approach in which the hierarchical introduction of subunits is built in, with the potential for greater efficiency (see Materials and methods). Explicitly incorporating the hierarchical assumption produces a similar collection of subunits at each level (Figure 2—figure supplement 1), reinforcing the idea of subunit partitioning as the number of subunits increases. However, for clarity and brevity, the hierarchical estimation approach is not used for the remainder of this paper.
An optimal number of subunits was chosen for each cell to maximize the crossvalidated likelihood (i.e. the likelihood measured on a held out test set  Figure 2B). The typical optimum for OFF parasol cells in the recorded data was 4–6 subunits (Figure 2C). This optimal value is governed by the true number of subunits, the resolution of the stimulus (which determines the dimensionality of the parameter space, see Figure 3), and the amount of data (i.e. number of spikes). Since the ON parasol cells had a smaller optimum number of subunits (one or two subunits provided the optimum fit for 48% of ON parasol cells, while this held for only 3% of OFF parasol cells) and exhibited much smaller increases in likelihood of the subunit model compared to a LN model (not shown), subsequent analysis focuses only on the OFF parasol cells. Note, however, that a model with multiple subunits frequently explained ON parasol data more accurately than a model with one subunit (three or more subunits for 52% of cells), implying that ON parasol cells do have spatial nonlinearities.
The estimated subunits were larger and fewer in number than expected from the size and number of bipolar cells providing input to parasol RGCs at the eccentricities recorded (Jacoby et al., 2000; Schwartz and Rieke, 2011; Tsukamoto and Omi, 2015). However, two other features of estimated subunits suggested a relationship to the collection of underlying bipolar cells contributing to the receptive field. First, estimated subunits were compact in space as shown in Figure 2A. This was quantified by comparing each subunit with the collection of subunits derived by randomly permuting the filter values at each pixel location across different subunits of the same cell. Spatial locality for estimated and permuted subunits was then evaluated by the meansquared error (MSE) of 2D Gaussian fits. Compared to the permuted subunits, the estimated subunits had substantially lower MSE (Figure 2D,E). Second, the subunits 'tiled' the RGC receptive field, in that they covered it with minimal gaps and modest overlap. This was quantified by noting that on average, the neighboring subunits for a given cell were separated by ∼ 1.5 times their width, with relatively little variation (Figure 2F).
Given these observations, a natural interpretation of subunits observed with coarse stimulation is that they correspond to aggregates of neighboring bipolar cells. To test this intuition, the algorithm was applied to white noise responses generated from a simulated RGC with a realistic size and number of subunits in the receptive field (Jacoby et al., 2000; Schwartz and Rieke, 2011). For data simulated using coarse pixelated stimuli matched to those used in the experiments, the estimated subunits were fewer and larger than the bipolar cells, with each subunit representing an aggregate of several bipolar cells (Figure 1—figure supplement 1). The recovered subunits also exhibited spatial locality and tiling, as in the data.
Regularization for spatially localized subunit estimation
To estimate subunits with a resolution approaching that of the underlying bipolar cell lattice would require much higher resolution stimuli. But higher resolution stimuli typically require more data. Specifically, to obtain a constant quality estimate of subunits as the stimulus resolution/dimensionality is increased requires a proportional increase in the number of spikes (e.g. see Dayan and Abbott, 2001). Furthermore, higher resolution stimuli typically produce lower firing rates, because the effective stimulus contrast is lower. To illustrate the problem, the impact of increased resolution and smaller duration of data were examined separately. Higher resolution stimuli led to worse subunit estimates, even when all the recorded data (97 min) were used (Figure 3A, top row). Specifically, of the five estimated subunits, one resembled the receptive field, and the others were apparently dominated by noise and had low weights; thus, the algorithm effectively estimated a LN model. For coarse resolution stimuli, using only 10% of the recorded data led to noisy subunit estimates (Figure 3B, top row) compared to the estimates obtained with 81% of the data (24 min, Figure 3B, last row). These observations suggest that it would be useful to modify the core clustering algorithm by incorporating prior knowledge about subunit structure, and thus potentially obtain more accurate estimation with limited data.
To accomplish this, a regularizer was introduced to encourage the spatial locality of estimated subunits (Figure 2). Previous studies (Liu et al., 2017; Maheswaranathan et al., 2018; McFarland et al., 2013) have used L1 regularization (a common means of inducing sparsity), but L1 is agnostic to spatial locality, and indeed is invariant to spatial rearrangement. Thus, a novel locally normalized L1 (LNL1) regularizer was developed that penalizes large weights, but only if all of the neighboring weights are small: $L({w}_{i})=\frac{{w}_{i}}{\u03f5+{\sum}_{j\in \text{neighbor}(i)}{w}_{j}}$ for $\u03f5=0.01$, a value smaller than typical nonzero weights which are on the order of unity. This penalty is designed to have a relatively modest influence on subunit size compared to L1, which induces a strong preference for smaller subunits. The LNL1 penalty was incorporated into the estimation procedure by introducing a proximal operator in the clustering loop (see Materials and methods). In this case, the appropriate projection operator is an extension of the softthresholding operator for the L1 regularizer, where the threshold for each pixel depends on the strength of neighboring pixels. The regularization strength is chosen to maximize crossvalidated likelihood (see Materials and methods).
Introducing the LNL1 prior improved subunit estimates for both of the limited data situations (high resolution, or small duration) presented above. Since the optimal number of subunits can vary for different regularization methods, the subunits estimated from different regularization methods were compared after fixing the total number of subunits to the most likely optimum across cells (Figure 2). In the case of the coarse resolution data, both L1 and LNL1 priors produced spatially localized subunits (Figure 3B, middle rows) whose locations matched the location of the subunits estimated using more data without regularization (Figure 3B, last row). This suggests that the proposed prior leads to efficient estimates without introducing significant biases. Relative to L1 regularization, LNL1 regularization yielded fewer apparent noise pixels in the background (Figure 3B, middle rows) and a larger optimum number of subunits (not shown). This improvement in spatial locality is reflected in more accurate response prediction with LNL1 regularization (Figure 3C).
In the case of high dimensional stimuli (Figure 3A, top row), both L1 and LNL1 priors were successful in estimating compact subunits (Figure 3A, middle rows) and matched the location of subunits obtained with a coarser stimulus (Figure 3A, bottom row). The estimated subunits tiled the receptive field, as would be expected from an underlying mosaic of bipolar cells. The LNL1 prior also led to subunit estimates with fewer spurious pixels compared to the L1 prior. Note that although these spurious pixels can also be suppressed by using a larger weight on the L1 prior, this also yielded smaller subunit size, which produced less accurate response predictions, while LNL1 had a much smaller effect on estimated subunit size (not shown). Hence, in both the conditions of limited data examined, the novel LNL1 prior yielded spatially localized subunits, fewer spurious pixels, and a net improvement in response prediction accuracy.
Parsimonious modeling of population responses using shared subunits
To fully understand visual processing in retina it is necessary to understand how a population of RGCs coordinate to encode the visual stimulus in their joint activity. The core spike triggered clustering method extends naturally to joint identification of subunits in populations of neurons. Since neighboring OFF parasol cells have partially overlapping receptive fields (Gauthier et al., 2009), and sample a mosaic of bipolar cells, neighboring cells would be expected to receive input from common subunits. Indeed, in mosaics (lattices) of recorded OFF parasol cells, estimated subunits of different cells were frequently closer than the typical nearestneighbor subunit distances obtained from singlecell data. For example, in one data set, the fraction of subunits from different cells closer than 1SD of a Gaussian fit was 14%, substantially higher than the predicted value of 9% based on the separation of subunits within a cell (Figure 2F). Examination of the mosaic suggests that these closelyspaced subunits correspond to single subunits shared by neighboring RGCs (Figure 4A), that is, that they reflect a shared biological substrate. If so, then joint estimation of subunits from the pooled data of neighboring RGCs may more correctly reveal their collective spatial structure.
This hypothesis was tested by fitting a common 'bank' of subunits to simultaneously predict the responses for multiple nearby cells. Specifically, the firing rate for the ${i}^{\text{th}}$ neuron was modeled as: ${R}_{i}={g}_{i}({\sum}_{n}{w}_{n,i}exp({K}_{n}\cdot {X}_{t}))$. Here ${X}_{t}$ is the stimulus, ${K}_{n}$ are the filters for the common population of subunits, ${g}_{i}$ is the output nonlinearity for the ${i}^{\text{th}}$ neuron,and ${w}_{n,i}$ is the nonnegative weight of the ${n}^{\text{th}}$ subunit to the ${i}^{\text{th}}$ neuron, which is 0 if they are not connected. The model parameters were estimated by maximizing the sum of loglikelihood across cells, again in two steps. For the first step, the output nonlinearities were ignored for each cell and a common set of subunit filters was found that clustered the spiketriggered stimuli for all the cells simultaneously. Specifically, on each iteration, a) the relative weights for different subunits were computed for each spiketriggered stimulus and b) the cluster centers were updated using the weighted sum of spiketriggered stimuli across different cells (see Materials and methods). Since the subunit filters could span the receptive fields of all the cells, leading to higher dimensionality, the spatial locality (LNL1) prior was used to efficiently estimate the subunits (Figure 3). In the second step of the algorithm, the nonlinearities were then estimated for each cell independently (similar to Figure 1).
To quantify the advantages of joint subunit estimation, the method was applied to a group of four nearby OFF parasol cells. As a baseline, models with different numbers of subunits were estimated for each cell separately, and the combination of 15 subunits that maximized the crossvalidated loglikelihood when summed over all four cells was chosen (Figure 4B). Then, a joint model with 12 subunits was fitted. Similar test loglikelihood was observed using the joint model, which effectively reduced model complexity by three subunits (Figure 4E, horizontal line). Examination of the subunit locations revealed that pairs of overlapping subunits from neighboring cells in separate fits were replaced with one shared subunit in the joint model (Figure 4C). Also, a joint model with 15 subunits showed a higher prediction accuracy than a model in which the 15 subunits were estimated separately (Figure 4E, vertical line). In this case, sharing of subunits leads to larger number of subunits per cell (Figure 4D).
This trend was studied for different number of subunits, by comparing the total test negative loglikelihood associated with the joint model (Figure 4E, green) and the best combination of subunits from separate fits to each cell based on validation data (Figure 4E, blue). Joint fitting provided more accurate predictions than separate fits, when using a large total number of subunits (>12). For a smaller number of subunits, a horizontal shift in the observed likelihood curves revealed the reduction in the number of subunits that was obtained with joint estimation, without a reduction in prediction accuracy. In sum, jointly estimating a common collection of subunits for nearby cells resulted in a more parsimonious explanation of population responses.
Subunit model explains spatial nonlinearities revealed by null stimulus
To more directly expose the advantage of a subunit model over commonly used LN models of RGC light response, experiments were performed using a visual stimulus that elicits no response in a LN model (or other model that begins with linear integration of the visual stimulus over space, for example Pillow et al., 2008). An example of such a stimulus is a contrastreversing grating stimulus centered on the RF (Hochstein and Shapley, 1976; Demb et al., 1999). However, to avoid recruiting additional nonlinearities, a stimulus was developed that has spatiotemporal statistics very similar to the highentropy white noise stimulus used for model characterization. This 'null' stimulus was obtained by manipulating white noise so as to eliminate responses of the linear RF.
Specifically, the computation performed by a hypothetical LN neuron can be represented in a stimulus space in which each axis represents stimulus contrast at a single pixel location (Figure 5A). In this representation, the LN response is determined by the projection of the stimulus vector onto the direction associated with the weights in the neuron's spatial RF. Thus, a null stimulus orthogonal to the RF can be computed by subtracting a vector with the same RF projection from the original stimulus (Figure 5B). In addition to the orthogonality constraint, two other constraints are imposed in constructing the stimulus: (i) the range of pixel values is bounded for the stimulus display, and (ii) the variance of each pixel over time is held constant (as in white noise), to prevent gain changes in photoreceptors (see Materials and methods). Given these constraints, the resulting stimulus is the closest stimulus to the original white noise stimulus that yields zero response in a LN neuron. Note that although this approach is described purely in the spatial domain, it generalizes to the full spatiotemporal RF, and the procedure based on the spatial RF alone will produce a null stimulus for a spatiotemporal linear filter that is spacetime separable (see Materials and methods).
The ability to silence RGC responses with null stimuli was tested as follows. A 10 s movie of white noise was projected into the intersection of the spatial null spaces corresponding to all cells in a region, of a single type (e.g. OFF parasol). Responses were recorded to 30 repeated presentations of both the white noise and the null stimulus. RGC firing rates showed a substantial modulation for both white noise and the null stimulus that was highly reproducible across repeats (Figure 5C,D), inconsistent with linear integration by the RGC over its RF. Note that the null stimulus modulated OFF parasol cells more strongly than ON parasol cells, consistent with previous results obtained with white noise and natural scene stimuli (Demb et al., 2001; Chichilnisky and Kalmar, 2002; Turner and Rieke, 2016); henceforth, analysis is focused on OFF parasol cells.
The subunit model substantially captured response nonlinearities revealed by the null stimulus. Because the biased statistics of the null stimulus induce errors in the estimated subunits, subunit filters were first estimated from responses to white noise (step 1 of model estimation, without regularization, Figure 1), and response nonlinearities were then estimated from responses to the null stimulus (step 2 of model estimation, Figure 1). As expected, the single subunit (LN) model failed to capture responses to the heldout null stimulus (Figure 6A, second row). As the number of subunits was increased, model performance progressively increased (Figure 6A, subsequent rows). Prediction accuracy, defined as the correlation between the recorded and predicted firing rate, was evaluated across entire populations of OFF parasol cells in three recordings (Figure 6B). The single subunit model failed, with prediction accuracy near 0 (mean accuracy : 0.02, 0.12, 0.03 for three retinas). The prediction accuracy gradually increased with the addition of more subunits, saturating at roughly 4–6 subunits (Figure 6B, red). In contrast, the addition of subunits showed marginal improvements for white noise stimuli of the same duration (Figure 6B, black), despite being measurable with longer stimuli (Figure 2B). Thus, the response nonlinearities captured by the subunit model are more efficiently exposed by the null stimulus.
Subunits enhance response prediction for naturalistic stimuli
A fuller understanding of the biological role of spatial nonlinearities requires examining how they influence responses elicited by natural stimuli. For this purpose, segments of images from the Van Hateren database were used, with a new image presented every second to emulate saccadic eye movements. Between simulated saccades, the image was jittered according to the eye movement trajectories recorded during fixation by awake macaque monkeys (ZM Hafed and RJ Krauzlis, personal communication, van Hateren and van der Schaaf, 1998; Heitman et al., 2016). Because the biased statistics of natural stimuli make it infeasible to directly estimate subunits, subunit filters and time courses were first estimated using white noise stimuli (1st step, without regularization Figure 1). Then, the nonlinearities and subunit magnitudes were fitted using natural stimuli (2nd step, Figure 1) (see Materials and methods, similar to above). Subsequently, the ability of the subunit model to capture responses to natural stimuli was measured.
Responses to natural stimuli showed a distinct structure, with strong periods of firing or silence immediately following saccades and less frequent firing elicited by image jitter (Heitman et al., 2016, Figure 7A black raster). This structure was progressively better captured by the model as the number of subunits increased, as shown by the rasters for an OFF parasol cell in Figure 7A. This finding was replicated in OFF parasol populations in three retinas, and quantified by the correlation between the recorded and predicted PSTH (Figure 7B). However, some of the firing events could not be captured even with a large number of subunits, suggesting that models with additional mechanisms such as gain control (Heeger, 1992; Carandini and Heeger, 2012) or nonlinear interactions between subunits (Kuo et al., 2016; Manookin et al., 2018) may be required to describe natural scene responses more completely.
Interestingly, further examination revealed that addition of subunits produced greater prediction improvements for responses elicited by image jitter compared to those elicited immediately following simulated saccades (< 250 ms) (Figure 7B). This could be explained by the fact that changes in luminance are usually more spatially uniform at saccades, resulting in more linear signaling by RGCs. Indeed, after temporally filtering the visual stimulus, there were fewer stimulus samples in the null space of the receptive field (90 degrees) during saccades than between saccades (Figure 7C), and the subunit model provided a more accurate explanation of responses to stimuli in the null space (Figure 6). Thus, these results indicate that spatial nonlinearities contribute to explaining responses to natural stimuli, especially those elicited by jittering images.
Application to simple and complex cells in primary visual cortex
Identification of subunits using spiketriggered stimulus clustering ideally would generalize to other neural systems with similar cascades of linear and nonlinear operations. As a test of this generality, spike triggered clustering (Figure 1) was applied to obtain estimates of spacetime subunit filters from responses of V1 simple and complex cells to flickering bars (Rust et al., 2005). The estimated subunits showed oriented spatiotemporal structure similar to that observed with other methods that imposed more restrictions (Rust et al., 2005; Vintch et al., 2015) (because of this similarity, additional priors on the filters were not explored). Increasing the number of subunits typically caused one of the subunits from the preceding fits to be partitioned into two, while other subunits were largely unchanged (Figure 8); that is, a hierarchical structure similar to that observed in RGCs (Figure 2). The number of crossvalidated subunits for the complex cell (N = 8) was greater than for the simple cell (N = 4), consistent with previous work (Rust et al., 2005). All subunits for a given cell had similar structure in space and time, but were approximately translated relative to each other, consistent with the similar frequency spectrum (Figure 8A,B, bottom row). For the complex cell, this is consistent with the hypothesis that it receives inputs from multiple simple cells at different locations (Hubel and Wiesel, 1962; Adelson and Bergen, 1985; Vintch et al., 2015). Thus, the method presented here extends to V1 data, producing subunit estimates that are not constrained by assumptions of orthogonality (Rust et al., 2005) or convolutional structure (Vintch et al., 2015), yet broadly resembling the results of previous work (see Discussion and Figure 10 for a detailed comparison).
Comparison to other methods
Comparisons with recent studies reveal advantages of the spiketriggered clustering approach in terms of efficiency of estimation, and interpretation of estimated subunits. A relevant comparison is the recently published subspacebased approach, spiketriggered nonnegative matrix factorization (SNMF) (Liu et al., 2017; Jia et al., 2018). This method was applied to salamander RGCs, and estimated subunits were shown to be wellmatched to independent physiological measurements of bipolar cell receptive fields, an impressive validation. This approach works by collecting the spike triggered stimuli in a matrix and then factorizing the matrix. Because this method differs in a variety of ways from the spiketriggered clustering approach, only the core aspects of the two methods were compared (see Discussion). Both methods were applied without any regularization, and were run without perturbing the parameters after convergence (contrary to the suggestion in Liu et al., 2017). The number of subunits used was equal to the true number of subunits in the simulated cell. In these conditions, recovery of subunits revealed two trends. First, for a realistic amount of data, the SNMF approach often produced mixtures of underlying subunits, while the spiketriggered clustering approach produced unique and separate subunits (see Figure 9). Second, the spatial structure of extracted subunit was substantially more variable using SNMF across different initializations. These results indicate that, at least in these simulated conditions, the spiketriggered clustering method produces more consistent and accurate estimates of subunit spatial structure than SNMF. Note, however, that choices of initialization and regularization can alter the efficiency of both procedures. A summary of other differences with SNMF is presented in Discussion.
Subunit behaviors have also been captured using convolutional models that rely on the simplifying assumption that all subunit filters are spatially shifted copies of one another (Eickenberg et al., 2012; Vintch et al., 2015; Wu et al., 2015). To perform the comparison, RGC responses were analyzed with a modification of the spiketriggered clustering approach that enforces convolutional subunit structure, with model parameters fitted by gradient descent. The method was applied with locally normalized L1 regularization, shown above to be effective for retinal data (Figure 10A,B). As expected, the convolutional model was substantially more efficient, exhibiting lower prediction error for limited training data. However, for longer durations of training data (within the range of typical experiments), the unconstrained model exhibited lower error (Figure 10A). The spatial location and resulting layout of subunits obtained from the two methods were similar (Figure 10B). However, spiketriggered clustering captured the deviations from a purely convolutional structure, leading to improved performance. Similar results were obtained using spiketriggered clustering without LNL1 regularization (not shown). Similar results were also obtained using data from a V1 complex cell (Figure 10C,D). Note that the nearly identical power spectra (Figure 8) of the subunits estimated using spiketriggered clustering suggested nearly convolutional structure, without explicitly enforcing this constraint. Note that further improvement in both approaches could potentially be achieved with modifications. For example in the V1 neurons, replacing the exponential nonlinearity with a quadratic nonlinearity (as in Vintch et al., 2015) provided more accurate fits for both models, but the performance of spiketriggered clustering was still superior when more data were available (Figure 10—figure supplement 1).
Discussion
We developed an efficient modeling and estimation approach to study a ubiquitous motif of neural computation in sensory circuits: summation of responses of rectified subunits. The approach allowed efficient subunit estimation by clustering of spiketriggered stimuli, and the inclusion of spatial localization regularizers to additionally constrain the fitting under conditions of limited data. The model and fitting procedure extended naturally to populations of cells, allowing joint fitting of shared subunits. The effectiveness of the method was demonstrated by its ability to capture subunit computations in macaque parasol RGCs, specifically in stimulus conditions in which linear models completely fail, as well in V1 neurons. Below, we discuss the properties of the approach compared to others, and the implications of the application to retinal data.
Modeling linearnonlinear cascades
The subunit model is a natural generalization of linearnonlinear (LN) cascade models, and the estimation procedure is a natural generalization of the spiketriggered estimation procedures that facilitated the widespread use of LN models. In the present model framework, a maximumlikelihood estimate of model parameters leads to a spiketriggered clustering algorithm for obtaining the early linear filters that represent model subunits. These linear components are the key, highdimensional entities that define stimulus selectivity, and, in the absence of a robust computational framework, would be difficult to estimate.
The closest prior work in retina is spiketriggered nonnegative matrix factorization (SNMF) (Liu et al., 2017; Jia et al., 2018). In addition to the greater efficiency of spiketriggered clustering (Figure 9), there are several technical differences worth noting. First, the assumptions of the SNMF algorithm are not directly matched to the structure of the subunit response model: the present model assumes nonnegative subunit outputs, while the SNMF factorization method assumes that the subunit filters are nonnegative. This assumption of nonnegative filters is inconsistent with suppressive surrounds previously reported in retinal bipolar cells (Dacey et al., 2000; Fahey and Burkhardt, 2003; Turner et al., 2018), and limits the application of the method to other systems such as V1 (Figure 8). In contrast, the spiketriggered clustering method makes no assumptions about the structure of the subunit filers. Second, the SNMF method requires a sequence of additional steps to relate the estimated matrix factors to the parameters of an LNLN model. In contrast, in the present work the clustering of spiketriggered stimuli is shown to be equivalent to optimizing the likelihood of a LNLN model, eliminating extra steps in relating model parameters to biology. Third, the comparison to bipolar cell receptive fields using SNMF relied on a regularization parameter that trades off sparseness and reconstruction error. This regularization can influence the size of the estimated subunits, and thus the comparison to bipolar receptive fields. In contrast, the present method includes a novel regularizer that encourages spatial contiguity of receptive fields, while exerting minimal influence on the size of estimated subunits, and the strength of this regularizer is chosen using a crossvalidation procedure.
Another study (Freeman et al., 2015) presented an approach to estimating subunits from OFF midget RGCs with visual inputs presented at singlecone resolution, and was used to obtain a successful cellular resolution dissection of subunits. However, this method assumed that subunits receive inputs from nonoverlapping subsets of cones, which is likely to be incorrect for other cell types, including parasol RGCs. Moreover, the cone to subunit assignments were learned using a greedy procedure, which was effective for subunits composed of 1–2 cone inputs, but is more likely to converge to incorrect local minima in more general conditions.
Recent studies fitted a large class of LNLN models to simulated RGC responses (McFarland et al., 2013), to recorded ONOFF mouse RGC responses (Shi et al., 2019), and to flickering bar responses in salamander RGCs (Maheswaranathan et al., 2018) using standard gradient descent procedures. These approaches have the advantage of flexibility and simplicity. In contrast, this work assumes a specific subunit nonlinearity that leads to a specialized efficient optimization algorithm, and applies it to recorded data from primate RGCs. The choice of an exponential subunit nonlinearity enables likelihood maximization for Gaussian stimuli, which leads to more accurate estimates with limited data, and reduces the computational cost of each iteration during fitting because the loss depends only on the collection of stimuli that elicit a spike (Ramirez and Paninski, 2014). Additionally, the softclustering algorithm requires far fewer iterations than stochastic gradient descent algorithms used in fitting deep neural networks (Kingma and Ba, 2014; Duchi et al., 2011). Depending on the experimental conditions (e.g. type of stimuli, duration of recording, number of subunits to be estimated), the present approach may provide more accurate estimates of subunits, with less data, and more quickly.
Other recent studies applied convolutional and recurrent neural network models, fitted to natural scene responses, to mimic a striking variety of retinal response properties (McIntosh et al., 2016; Batty et al., 2016). The general structure of the model – convolutions and rectifying nonlinearities – resembles the subunit model used here. However, the number of layers and the interactions between channels do not have a direct correspondence to the known retinal architecture, and to date the specific properties of the inferred subunits, and their contributions to the behaviors of the model, have not been elucidated. The complexity of the underlying models suggest that a comparison to the biology may prove difficult.
In the context of V1 neurons, the present approach has advantages and disadvantages compared to previous methods. Spiketriggered covariance (STC) methods (Rust et al., 2005) estimate an orthogonal basis for the stimulus subspace captured by the subunits. However, the underlying subunits are unlikely to be orthogonal and are not guaranteed to align with the basis elements. The method also requires substantial data to obtain accurate estimates. To reduce the data requirements, other methods (Eickenberg et al., 2012; Vintch et al., 2015; Wu et al., 2015) imposed an assumption of convolutional structure on the subunits. This approach is efficient, because it effectively fits only one subunit filter that is then applied convolutionally. However, as shown above using recorded data from RGCs and V1 complex cells, the convolutional restriction may not be necessary when enough data are available, and under these conditions, the present method can more accurately reveal subunit structure in receptive fields that is not precisely convolutional (Figure 10).
A generalization of the STC approach (Sharpee et al., 2004; Paninski, 2003; Eickenberg et al., 2012) involves finding a subspace of stimulus space which is most informative about a neuron's response. This approach also yields a basis for the space spanned by subunits, but the particular axes of the space may or may not correspond to the subunits. Unlike the method presented here, this approach requires minimal assumptions about stimulus statistics, making it very flexible. However, methods based on informationtheoretic measures typically require substantially more data than are available in experimental recordings (Paninski, 2003).
Although constraints on the structure of the subunit model (e.g. convolutional), and the estimation algorithm (e.g. spiketriggered covariance or clustering), both influence the efficiency of subunit estimation, the choice of priors used to estimate the subunits efficiently with limited data (i.e. regularization) is also important. In the present work, a regularizer was developed that imposes spatial contiguity, with minimal impact on the size of estimated subunits. A variety of other regularization schemes have been previously proposed for spatial contiguity for the estimation of V1 subunits (Park and Pillow, 2011). These could be incorporated in the estimation of the present model using a corresponding projection operator.
As suggested by the results, another algorithmic modification that could improve the subunit estimation is the hierarchical variant of the clustering approach (Figure 2—figure supplement 1). By splitting one of the subunits into two subunits at each step, models with different number of subunits can be estimated with greater efficiency, and preliminary exploration indicated that this may offer improvements in data efficiency and speed. However, potential tradeoffs emerging from the additional assumptions will require further study.
Revealing the impact of nonlinear computations using null stimuli
The null stimulus methodology revealed the failure of LN models, and isolated the nonlinear behaviors arising from receptive field subunits. Previous studies have developed specialized stimuli to examine different aspects of the cascaded linearnonlinear model. Contrast reversing gratings (CRGs) have been the most commonly used stimuli for identifying the existence of a spatial nonlinearity (Hochstein and Shapley, 1976; Demb et al., 2001). Recent studies (Schwartz et al., 2012; Turner and Rieke, 2016) extended this logic to more naturalistic stimuli, with changes in responses to translations and rotations of texture stimuli revealing the degree of spatial nonlinearity. Another (Bölinger and Gollisch, 2012) developed a closedloop experimental method to measure subunit nonlinearities, by tracing the intensities of oppositely signed stimuli on two halves of the receptive field to measure isoresponse curves. More recent work (Freeman et al., 2015) developed stimuli in closed loop that targeted pairs of distinct cones in the receptive field, and used these to identify linear and nonlinear summation in RGCs depending on whether the cones provided input to the same or different subunits.
The null stimulus approach introduced here offers a generalization of the preceding methods. Null stimuli can be generated starting from any desired stimulus ensemble (e.g. natural images), and the method does not assume any particular receptive field structure (e.g. radial symmetry). Construction of the null stimuli does, however, require fitting a linear (or LN) model to the data in closed loop. Projecting a diverse set of stimuli (e.g. white noise) onto the null space yields a stimulus ensemble with rich spatiotemporal structure that generates diverse neural responses. By construction, the null stimulus is the closest stimulus to the original stimulus that is orthogonal to the receptive field, and thus minimally alters other statistical properties of the stimulus that affect neural response. This property is useful for studying the effect of spatial nonlinearities in the context of specific stimuli, such as natural scenes. Whereas contrast reversing gratings do not reveal stronger nonlinearities in OFF parasol cells compared to ON parasol cells in responses with natural stimuli (Turner and Rieke, 2016), the null stimulus captures these differences (Figure 6), highlighting the advantages of tailored stimulation with rich spatiotemporal responses.
Further applications and extensions
The assumptions behind the proposed subunit model enable efficient fitting and interpretability, but also lead to certain limitations that could potentially be overcome. First, the choice of an exponential subunit nonlinearity improves efficiency by allowing the maximum expected likelihood approximation. However, other nonlinearities that allow this approximation, such as a rectified quadratic form (Bölinger and Gollisch, 2012), could be explored. Second, in fitting RGCs, the assumption of spacetime separable subunits significantly reduces the number of parameters that need to be estimated, but could be relaxed with more data. The spacetime separability assumption can be verified by checking if the spatiotemporal STA had rank one. In the present data (not shown), the STA has rank one when only pixels corresponding to the receptive field center were considered but had higher rank when the surround was also taken into consideration (EnrothCugell and Pinto, 1970; Cowan et al., 2016). Thus, replacing the spacetime separability assumption may be desirable. For example, a modification to the procedure that enforces subunits with rank two (not shown) could capture the centersurround structure of bipolar cell receptive fields (Dacey et al., 2000; Turner et al., 2018).
The methods developed here may also prove useful in tackling additional challenging problems in neural circuitry and nonlinear coding, in the retina and beyond. First, applying the model to highresolution visual stimulation of the retina could be used to reveal how individual cones connect to bipolar cells, the likely cellular substrate of subunits, in multiple RGC types (see Freeman et al., 2015). Second, the incorporation of noise in the model, along with estimation of shared subunits (Figure 4), could help to probe the origin of noise correlations in the firing of nearby RGCs (see Rieke and Chichilnisky, 2014). Modeling the nonlinear interaction between subunits (Kuo et al., 2016; Manookin et al., 2018) could also improve prediction accuracy, especially for natural scenes. Third, extending the subunit model to include gain control, another ubiquitous nonlinear computation in the retina and throughout the visual system, could provide a more complete description of the responses in diverse stimulus conditions (Heeger, 1992; Carandini and Heeger, 2012). In addition to these potential scientific advances, a more accurate functional subunit model may be critical for the development of artificial retinas for treating blindness. Finally, the successful application of the method to V1 data (Figure 8) suggests that the method could be effective in capturing computations in other neural circuits that share the nonlinear subunit motif.
Materials and methods
Recordings
Request a detailed protocolDetailed preparation and recording methods are described elsewhere (Litke et al., 2004; Frechette et al., 2005; Chichilnisky and Kalmar, 2002). Briefly, eyes were enucleated from seven terminally anesthetized macaque monkeys (Macaca sp.) used by other experimenters in accordance with institutional guidelines for the care and use of animals. Immediately after enucleation, the anterior portion of the eye and the vitreous were removed in room light. The eye was stored in darkness in oxygenated Ames' solution (Sigma, St Louis, MO) at 33°C, pH 7.4. Segments of isolated or RPEattached peripheral retina (6–15 mm temporal equivalent eccentricity [Chichilnisky and Kalmar, 2002], approximately 3 mm x 3 mm) were placed flat, RGC side down, on a planar array of extracellular microelectrodes. The array consisted of 512 electrodes in an isosceles triangular lattice, with 60 µm interelectrode spacing in each row, covering a rectangular region measuring 1800 µm x 900 µm. While recording, the retina was perfused with Ames' solution (35°C for isolated recordings and 33°C for RPEattached recordings), bubbled with 95% O_{2}% and 5% CO_{2}, pH 7.4. Voltage signals on each electrode were bandpass filtered, amplified, and digitized at 20 kHz.
A custom spikesorting algorithm was used to identify spikes from different cells (Litke et al., 2004). Briefly, candidate spike events were detected using a threshold on each electrode, and voltage waveforms on the electrode and nearby electrodes in the 5 ms period surrounding the time of the spike were extracted. Candidate neurons were identified by clustering the waveforms using a Gaussian mixture model. Candidate neurons were retained only if the assigned spikes exhibited a 1 ms refractory period and totaled more than 100 in 30 min of recording. Duplicate spike trains were identified by temporal crosscorrelation and removed. Manual analysis was used to further select cells with a stable firing rate over the course of the experiment, and with a spatially localized receptive field.
Visual stimuli
Request a detailed protocolVisual stimuli were delivered using the optically reduced image of a CRT monitor refreshing at 120 Hz and focused on the photoreceptor outer segments. The optical path passed through the mostly transparent electrode array and the retina. The relative emission spectrum of each display primary was measured with a spectroradiometer (PR701, PhotoResearch) after passing through the optical elements between the display and the retina. The total power of each display primary was measured with a calibrated photodiode (UDT Instruments). The mean photoisomerization rates for the L, M, and S cones were estimated by computing the inner product of the display primary power spectra with the spectral sensitivity of each cone type, and multiplying by the effective collecting area of primate cones (0.6 μm^{2}) (Angueyra and Rieke, 2013; Schnapf et al., 1990). During white noise and null stimulus, the mean background illumination level resulted in photoisomerization rates of (800–2200, 800–2200, 400–900) for the (L, M, S) cones. The pixel size was either 41.6 microns (eight monitor pixels on a side) or 20.8 microns (four monitor pixels on a side). New black binary white noise frames were drawn at a rate of 60 Hz (Figure 2) or 30 Hz (Figures 5,6). The pixel contrast (difference between the maximum and minimum intensities divided by the sum) was either 48% or 96% for each display primary, with mean luminance at 50%.
The details of natural stimuli used are presented in Heitman et al. (2016). Briefly, the stimuli consisted of images from the Van Hateren database, shown for one second each, jittered according to eye movement trajectories recorded during fixation by awake macaque monkeys (Z.M. Hafed and R.J. Krauzlis, personal communication), (Heitman et al., 2016; van Hateren and van der Schaaf, 1998). Image intensities produced mean photoisomerization rates of (1900, 1800, 700) for the (L, M, S) cones, respectively. The pixel width was 10.4 microns (two monitor pixels on a side), and image frames refreshed at 120 Hz. The training data, consisting of 59 groups of 60 distinct natural scenes, was interleaved with identical repeats of testing data, consisting of 30 distinct natural scenes.
Subunit estimation
Request a detailed protocolThe light responses of RGCs were modeled using a cascade of two linearnonlinear (LN) stages, followed by a Poisson spike generator. Specifically, the instantaneous stimulusconditional firing rate was modeled as ${\lambda}_{t}{X}_{t}=g({\sum}_{n}{w}_{n}\mathrm{exp}({K}_{n}^{T}{X}_{t}))$ where ${X}_{t}$ is the visual stimulus at time $t$ (expressed as a vector), ${K}_{n}$ are the spatial subunit filters (also vectors), ${w}_{n}$ are nonnegative weights on different subunits, and $g(x)=\frac{{x}^{a}}{bx+1}$, $(b\ge 0)$ is the output nonlinearity.
The parameters $\{{K}_{n},{w}_{n},a,b\}$ are estimated by minimizing their negative loglikelihood given observed spiking responses ${Y}_{t}$ to visual stimuli ${X}_{t}$, in two steps. In the first step, the parameters $\{{K}_{n},{w}_{n}\}$ are estimated using a clustering procedure while ignoring the nonlinearity ($g$). In the second step, $\{g,{w}_{n}\}$ are estimated using gradient descent, with the ${K}_{n}$ held fixed (up to a scale factor).
The clustering procedure is derived by minimizing an approximate convex upper bound of the negative loglikelihood based on current parameter estimates $\{{K}_{n}^{0},{b}_{n}^{0}\}$. The convex upperbound is derived as follows:
where ${\alpha}_{t,n}^{0}=\frac{{w}_{n}^{0}{e}^{{K}_{n}^{0}.{X}_{t}}}{{\sum}_{m=1}^{m=N}{w}_{j}^{0}{e}^{{K}_{m}^{0}.{X}_{t}}}$ and $c({K}_{n}^{0},{w}_{n}^{0})$ does not depend on parameters $\{{K}_{n},{w}_{n}\}$. For simplicity, we present the equations for a standard Gaussian stimulus $({X}_{t}\sim \mathcal{N}(0,I))$ (modifications for arbitrary Gaussian stimuli are straightforward). Since the first term in the loglikelihood (Equation 1) does not depend on recorded responses (${Y}_{t}$), it is replaced with its expected value across stimuli (Equation 2). Because the projections of the input distribution onto filters ${K}_{n}$ are assumed to be Gaussian, this expectation can be computed using the moment generating function of a Gaussian distribution (Equation 3). The second term only depends on the stimuli for which the cell generated spikes (${Y}_{t}>0$), reducing computational cost. Finally, to optimize the resulting nonconvex function, a convex upper bound was minimized at each step. Specifically, the second term is replaced by a firstorder Taylor approximation (Equation 4) (the first term is already convex). This upper bound is separable across parameters of different subunits, as can be seen by rearranging the summation over time and subunits (Equation 5). Therefore, the parameters for each subunit can be updated separately by minimizing the corresponding term in Equation 5.
The parameters for each subunit are optimized in three steps. First, the subunit activations, ${\alpha}_{t,n}$ are updated based on the parameters $\{{K}_{n},{w}_{n}\}$ from the previous iteration. Then, setting to zero the partial derivatives of Equation 5 with respect to ${K}_{n}$ and ${w}_{n}$ produces a system of two equations, which are rearranged to yield the remaining two steps:
Update subunit activations (cluster weights) for each stimulus : ${\alpha}_{t,n}\leftarrow \frac{{w}_{n}{e}^{{K}_{n}.{X}_{t}}}{{\sum}_{m=1}^{m=N}{w}_{m}{e}^{{K}_{m}.{X}_{t}}}$.
Update linear subunit filters (cluster centers) : ${K}_{n}\leftarrow \frac{{\sum}_{t}{Y}_{t}{\alpha}_{t,n}{X}_{t}}{{\sum}_{t}{Y}_{t}{\alpha}_{t,n}}$.
Update relative weights for different subunits : ${w}_{n}\leftarrow \frac{{\sum}_{t}{Y}_{t}{\alpha}_{t,n}}{T}{e}^{\frac{1}{2}{K}_{n}^{T}{K}_{n}}$.
Hence, applying the three steps repeatedly results in a clustering algorithm that iteratively minimizes a convex upper bound on the approximate negative loglikelihood function. Since the upper bound is tight at the current parameter estimates, the successive minimization of the convex upper bound reduces the approximate negative loglikelihood monotonically, leading to guaranteed convergence. However, due to nonconvexity of the approximate negative loglikelihood function, the estimated parameters might not correspond to a global minimum. In the limit of large recording duration, the approximation in Equation 2 is exact.
The memory requirement for each step of the algorithm is proportional to the number of spike triggered stimuli, which could be large for fine resolution stimuli and/or high firing rates. Computational runtime depends linearly on the number of spikes, the stimulus dimensions, the number of subunits, and the number of steps for convergence of the clustering procedure. For most of the results presented in this paper, each fit converged in ∼ 100 steps and required a few minutes on a singlecore computer. Additional speed improvements would be possible by parallelizing each step of clustering across subunits, making runtime independent of the number of subunits. The resulting subunits can be interpreted as a decomposition of the spike triggered average (STA), since the sum of estimated subunits (${K}_{n}$), weighted by their expected contribution to the response (${w}_{n}{e}^{\frac{1}{2}{K}_{n}^{T}{K}_{n}}$), is proportional to the spiketriggered average:
where (6) comes form (3) (at convergence), (7) comes from (2) (at convergence), and (8) arises because the ${\alpha}_{t,n}$ sum to one. Step (6) may be interpreted as replacing the average spike rate, multiplied by the contribution of each subunit. Step (7) may be interpreted as a weighted average of spiketriggered stimuli, where the weighting assigns responsibility for each spike across the subunits. The above formulae ignore the output nonlinearity, justified by the observation that it is approximately linear after fitting.
Incorporating priors
Request a detailed protocolThe loglikelihood can be augmented with a regularization term, to incorporate prior knowledge about the structure of subunits, thereby improving estimation in the case of limited data. Regularization is incorporated into the algorithm by projecting the subunit kernels (${K}_{n}$) with a proximal operator (${P}_{\lambda}$), derived from the regularization function, after each update of the cluster centers (Figure 11). For a regularization term of the form $\lambda f(K)$, the proximal operator is defined as ${\mathcal{P}}_{\lambda ,f}({K}^{0})={\text{argmin}}_{K}\left(f(K)+(1/2\lambda ){\parallel K{K}^{0}\parallel}_{2}^{2}\right)$. Regularization was applied for estimating subunits for both single cell (Figure 3) as well as population (Figure 4) data and in each case, the regularization strength $(\lambda )$ was chosen by crossvalidation (see below).
For ${L}_{1}$norm regularization (section 3), $f(K)={\sum}_{i}{K}^{i}$, where ${K}^{i}$ is the value of $i$th pixel, and the proximal operator is a softthresholding operator ${\mathcal{P}}_{\lambda ,.}(K)=\mathrm{max}(K\lambda ,0)\mathrm{max}(K\lambda ,0)$. For the locally normalized ${L}_{1}$norm regularization, the proximal operator of the Taylor approximation was used at each step. The Taylor approximation is a weighted ${L}_{1}$norm with the weight for each pixel determined by the parameter value of the neighboring pixels. Hence, the proximal operator for the $i$th pixel is ${\mathcal{P}}_{\lambda ,.}({K}^{i})=\mathrm{max}({K}^{i}{a}^{i}\lambda ,0)\mathrm{max}({K}^{i}{a}^{i}\lambda ,0)$, where ${a}^{i}=1/(\u03f5+{\sum}_{j\in \text{Neighbors}(i)}{K}^{j})$ for $\u03f5=0.01$ black (smaller than typical nonzero values in $K$, which are on the order of unity). Simulations were used to verify that this regularizer induces a preference for spatially contiguous subunits, while being relatively insensitive to subunit area (compared with ${L}_{1}$ regularization).
The entire procedure is summarized in Figure 11, with simplified notation illustrating the four key steps.
Subunit estimation using hierarchical clustering
Request a detailed protocolAs shown in Figure 2, a hierarchical organization of subunits was observed in fits to RGC data, with one subunit broken down into two subunits each time the total number of subunits was increased by one. This hierarchical structure can be enforced explicitly to make the estimation procedure more efficient.
Since the softmax weights ${\alpha}_{m}$ (between 0 and 1) can be interpreted as the ‘activation probability’ of subunit $m$, hierarchical clustering is performed by estimating two 'child’ subunits ${m}_{1},{m}_{2}$ with factorized activation probability : ${\alpha}_{{m}_{1}}={\alpha}_{m}{\alpha}_{{m}_{1}m}$, where ${\alpha}_{{m}_{1}m}$ is the conditional probability of child ${m}_{1}$ given the activation of parent subunit with ${\alpha}_{{m}_{1}m}=\frac{{e}^{{K}_{{m}_{1}}^{T}X+{b}_{{m}_{1}}}}{{e}^{{K}_{{m}_{1}}^{T}X+{b}_{{m}_{1}}}+{e}^{{K}_{{m}_{2}}^{T}+{b}_{{m}_{2}}}}$. The factorization is accurate if the activation of the parent subunit is the sum of activation of the child subunits (${e}^{{K}_{m}.X+{b}_{m}}={e}^{{K}_{{m}_{1}}.X+{b}_{{m}_{1}}}+{e}^{{K}_{{m}_{2}}.X+{b}_{{m}_{2}}}$).
The child subunits were initialized by adding independent random noise to parent subunits (${K}_{{m}_{i}}={K}_{m}+{\u03f5}_{i}$) and equally dividing the subunit weight into two (${e}^{{b}_{{m}_{i}}}=\frac{{e}^{{b}_{m}}}{2}$). Hierarchical clustering was performed by iteratively updating the cluster assignments (${\alpha}_{{m}_{i}}$) and cluster centers (${K}_{{m}_{i}},{b}_{{m}_{i}}$) for the child subunits using the factored $\alpha $ as outlined above.
The parent subunit for splitting was chosen greedily, to yield maximum increase in loglikelihood at each step. Hierarchical subunit estimation provides a computationally efficient way to estimate the entire solution path with different $N$. The estimated subunits obtained with this procedure are shown in Figure 2—figure supplement 1.
Population model
Request a detailed protocolIn Section 4, a method to estimate a common bank of subunits jointly using multiple nearby cells is described. For each cell $c$, the firing rate is given by ${\lambda}_{c,t}={g}_{c}({\sum}_{n=1}^{n=N}{w}_{n,c}{e}^{{K}_{n}.{X}_{t}})$ where ${w}_{n,c}$ are the cellspecific subunit weights. The model parameters were estimated by maximizing the summation of loglikelihood across cells.
Similar to the singlecell case, the estimation was performed in two steps: 1) Ignoring the output nonlinearity ${g}_{c}$, estimate $\{{K}_{n},{w}_{n,c}\}$ by clustering, and 2 ) Estimate ${g}_{c}$, ${w}_{n,c}$ and the magnitude of ${K}_{n}$ for each cell independently using gradient descent, with the vector direction of $\{{K}_{n}\}$ fixed. Clustering in the first step can be interpreted as finding a common set of centroids to cluster the spiketriggered stimuli for multiple cells simultaneously. The following steps are repeated iteratively:
Update subunit activations (cluster weights) for each stimulus and cell: ${\alpha}_{t,n,c}\leftarrow \frac{{w}_{n,c}{e}^{{K}_{n}.{X}_{t}}}{{\sum}_{m=1}^{m=N}{w}_{j,c}{e}^{{K}_{m}.{X}_{t}}}$
Update linear subunit filters (cluster centers) using population responses: ${K}_{n}\leftarrow {\mathcal{P}}_{\lambda}\left(\frac{{\sum}_{t}{\sum}_{c}{Y}_{t,c}{\alpha}_{t,n,c}{X}_{t}}{{\sum}_{t}{\sum}_{c}{Y}_{t,c}{\alpha}_{t,n,c}}\right)$
Update subunit weights $({w}_{n,c})$ for each cell: ${w}_{n,c}\leftarrow \frac{{\sum}_{t}{\alpha}_{t,n,c}}{T}{e}^{\frac{1}{2}{K}_{n}^{T}{K}_{n}}$
As with subunit estimation using singlecell responses, prior knowledge about subunit structure can be incorporated with a proximal operator ($\mathcal{P}(.)$) in the clustering loop. For RGC data, locally normalized L1 regularization was used (Figure 4).
Application to neural responses
Request a detailed protocolThe subunit estimation algorithm was applied to a rectangular segment of stimulus covering the receptive field (along with a one stimulus pixel boundary around it) and to the RGC spike counts over time (8.33 ms bin size). The binary white noise stimulus was temporally filtered and the resulting normally distributed stimulus (zero mean) was used for subunit estimation (${X}_{t}$). The receptive field was estimated as a contiguous set of pixels black in the STA with magnitude larger than 2.5σ, where σ is the standard deviation of the background noise in the STA. The time course for filtering (125 ms length) was estimated by averaging the time course of significant pixels in the STA. The data were partitioned into three groups: testing data consisted of a contiguous segment of recording (last 10%), and the rest of the recording was randomly partitioned for training (90%) and validation (10%) data.
Models were fit with different numbers of subunits $N$ (112) and regularization values $\lambda $. Selection of hyperparameters ($N$ and $\lambda $) was performed by averaging the performance of multiple fits with different training/validation partitions and random initializations (see Figure captions for details). For a given number of subunits, the regularization value was chosen by optimizing the performance on validation data. Similarly, the most accurate model was chosen by optimizing over the number of subunits as well as regularization strength. When data with repeated presentations of the same stimulus were available (e.g. Figures 6,7 ), the prediction accuracy was evaluated by measuring the correlation between peristimulus time histogram (PSTH) of recorded data and the predicted spikes for the same number of repetitions. The PSTH was computed by averaging the discretized responses over repeats and gaussian smoothing with standard deviation 16.67 ms.
For null stimuli, responses from 30 repeats of a 10 s long stimulus were divided into training (first 5 s) and testing (last 5 s) data. Since the training data were limited, the subunit filters and weights (phase 1) were first estimated from responses to a long white noise stimulus and only the subunit scales and output nonlinearity (phase 2) were estimated using the null stimulus.
For natural scenes stimuli, the spatial kernel for subunits was first estimated from responses to the white noise stimulus (phase 1), and the scales of subunits and output nonlinearity were then fit using responses to a natural stimulus (phase 2). This procedure was used because subunits estimated directly from natural scenes did not show any improvement in prediction accuracy, and all the subunits were essentially identical to the RF, presumably because of the strong spatial correlations in natural images.
V1 neurons were fit to previously published responses to flickering bars (see Rust et al., 2005 for details). In contrast to the retinal data, the two dimensions of the stimulus were time and one dimension of space (orthogonal to the preferred orientation of the cell). The estimation procedure (without regularization) was applied for different numbers of subunits. For successive numbers of subunits ($N$ and $N+1$), the hierarchical relationship was evaluated by greedily matching $N1$ filters across the two fits using the inner product of their spatiotemporal filters and splitting the remaining subunit in the $N$subunit fit into the two remaining subunits in the $N+1$subunit fit. The maximum number of subunits was chosen by cross validation as described above.
Simulated RGC model
Request a detailed protocolTo test the estimation procedure in conditions where the answer is known, the procedure was applied to responses generated from a simulated RGC (Appendix I). In the simulation, each RGC received exponentiated inputs from bipolar cells, which in turn linearly summed inputs from photoreceptors. The spatial dimensions of the simulation and the number of cones and bipolar cells were approximately matched to parasol RF centers at the eccentricities of the recorded data (Schwartz and Rieke, 2011; Jacoby et al., 2000). Below, all the measurements are presented in grid units (g.u.), with 1 g.u. = 1 micron resulting in a biologically plausible simulation. The first layer consisted of 64 photoreceptors arranged on a jittered hexagonal lattice with nearestneighbor distance 5 g.u., oriented at 60° with respect to the stimulus grid. The location of each cone was independently jittered with a radially symmetric Gaussian with standard deviation black 0.35 g.u. Stimuli were pooled by individual photoreceptors with a spatiotemporally separable filter, with time course derived from a typical parasol cell and a Gaussian spatial filter (standard deviation = black 1 g.u.).
The second layer of the cascade consisted of 12 model bipolar cells, each summing the inputs from a spatially localized set of photoreceptors, followed by an exponential nonlinearity. The photoreceptors were assigned to individual bipolars by kmeans clustering (12 clusters) of their locations. Finally, the RGC firing rate is computed by a positively weighted summation of bipolar activations. The bipolar weights were drawn from a Gaussian distribution with standard deviation equal to 10% of the mean. The choice of photoreceptors and bipolar cell weights give a mean firing rate of ≈19 spikes/s.
First, the subunit model was fitted to 24 min of responses to a white noise stimulus, such that the receptive field was roughly covered by 6 × 6 stimulus pixels (Figure S1B), as in the recorded experimental data (Figure 2B). Subsequently, the subunit model was fitted to 6 hr of stimuli that excites each cone individually (Figure S1C). In this case, subunits are represented as weights on individual photoreceptors. In a retinal preparation, individual cone activation is possible using a fine resolution white noise such that the locations of individual photoreceptors can be identified (see Freeman et al., 2015).
Null stimulus
Request a detailed protocolNull stimuli were constructed to experimentally verify the nonlinear influence of estimated subunits on light response. Below, a general approach is first presented to generate a spatiotemporal null stimulus which can be used for any neural system. Subsequently, it is shown that if the STA is spacetime separable, the spatiotemporal null stimulus can be derived with a simpler and faster method that only uses spatial information . Since this separability assumption is approximately accurate for RGCs, the spatial null stimulus is used for the results presented in main paper (Figures 5,6).
The purpose of spatiotemporal nulling is to find the closest stimulus sequence such that convolution with a specific spatiotemporal linear filter gives 0 for all time points. For simplicity, the algorithm is described only for a single cell (the extension to multiple cells is straightforward). Let $a(x,y,t)$ represent the spatiotemporal linear filter, estimated by computing the STA on white noise responses. The orthogonality constraint for a null stimulus $S(x,y,t)$ is written as:
The constraints for successive frames are not independent, but they become so when transformed to the temporal frequency domain. Writing $a(.,.,\omega )$ and $S(.,.,\omega )$ for the Fourier transform of $a(.,.,t)$ and $S(.,.,t)$ respectively, the constraints may be rewritten as:
Hence for each temporal frequency, the spatial content can be projected onto the subspace orthogonal to the estimated spatial receptive field. Specifically, for a given temporal frequency $\omega $, given a vectorized stimulus frame ${S}_{w}$ and a matrix ${A}_{\omega}$ with columns corresponding to the vectorized receptive field, the null frame is obtained by solving the following optimization problem:
The solution may be written in closed form:
For the results presented in this paper, binary white noise stimuli (${S}_{w}$) are projected onto the null space. However, in general, any visual stimulus can be used instead of white noise. The receptive field (linear filter) for each cell was estimated by computing a spacetime separable approximation of the spiketriggered average (STA), with the support limited to the pixels with value significantly different from background noise (absolute value of pixel value > 2.5 σ, with σ a robust estimate of standard deviation of background noise).
In addition to the orthogonality constraint, two additional constraints are imposed: a) each pixel must be in the range of monitor intensities (−0.5 to 0.5) and b) the variance of each pixel over time must be preserved, to avoid the possibility of contrast adaption in photoreceptors (Clark et al., 2013, but see Rieke, 2001). These two constraints were incorporated using Dykstra’s algorithm (Boyle and Dykstra, 1986), which iteratively projects the estimated null stimulus into the subspace corresponding to the individual constraints until convergence. Setting the contrast of initial white noise black (${S}_{w}$) to a value lower than 100% was necessary to satisfy these additional constraints. Finally, the pixel values of the resulting null stimuli were discretized to 8bit integers, for display on a monitor with limited dynamic range.
For a spacetime separable (rank one) STA (as is approximately true for RGCs), spatiotemporal nulling gives the same solution as spatial nulling. To see this, assume $a(x,y,t)={a}_{1}(x,y){a}_{2}(t)$, a spacetime separable linear filter. Hence, the null constraint can be rewritten as
where $l(t\tau )={\sum}_{x,y}{a}_{1}(x,y)S(x,y,t\tau )$. In the frequency domain this is written as
Since ${a}_{2}(t)$ has limited support in time, it has infinite support in frequency. Hence, $l(\omega )=0\forall \omega $ implying spatialonly nulling $l(t)=0\forall t$.
Data availability
Data has been deposited with Dryad at: https://doi.org/10.5061/dryad.dncjsxkvk. Code available on Github: https://github.com/ChichilniskyLab/shahelife2020 (copy archived at https://github.com/elifesciencespublications/shahelife2020).

Dryad Digital RepositoryData from: Inference of nonlinear receptive field subunits with spiketriggered clustering.https://doi.org/10.5061/dryad.dncjsxkvk
References

Spatiotemporal energy models for the perception of motionJournal of the Optical Society of America 2:284–299.https://doi.org/10.1364/JOSAA.2.000284

Origin and effect of phototransduction noise in primate cone photoreceptorsNature Neuroscience 16:1692–1700.https://doi.org/10.1038/nn.3534

A retinal circuit that computes object motionJournal of Neuroscience 28:6807–6817.https://doi.org/10.1523/JNEUROSCI.420607.2008

TwoPhoton imaging of nonlinear glutamate release dynamics at bipolar cell synapses in the mouse retinaJournal of Neuroscience 33:10972–10985.https://doi.org/10.1523/JNEUROSCI.124113.2013

Advances in Order Restricted Statistical Inference28–47, A method for finding projections onto the intersection of convex sets in hilbert spaces, Advances in Order Restricted Statistical Inference, Springer, 10.1007/9781461399407_3.

Normalization as a canonical neural computationNature Reviews Neuroscience 13:51–62.https://doi.org/10.1038/nrn3136

A simple white noise analysis of neuronal light responsesNetwork: Computation in Neural Systems 12:199–213.https://doi.org/10.1080/713663221

Functional asymmetries in ON and OFF ganglion cells of primate retinaThe Journal of Neuroscience 22:2737–2747.https://doi.org/10.1523/JNEUROSCI.220702737.2002

Dynamical adaptation in photoreceptorsPLOS Computational Biology 9:e1003289.https://doi.org/10.1371/journal.pcbi.1003289

YCell receptive field and collicular projection of parasol ganglion cells in macaque monkey retinaJournal of Neuroscience 28:11277–11291.https://doi.org/10.1523/JNEUROSCI.298208.2008

BookTheoretical Neuroscience: Computational and Mathematical Modeling of Neural SystemsMIT press.

Functional circuitry of the retinal ganglion cell's nonlinear receptive fieldThe Journal of Neuroscience 19:9756–9767.https://doi.org/10.1523/JNEUROSCI.192209756.1999

Adaptive subgradient methods for online learning and stochastic optimizationJournal of Machine Learning Research 12:2121–2159.

Spatiotemporal interactions in cat retinal ganglion cells showing linear spatial summationThe Journal of Physiology 341:279–307.https://doi.org/10.1113/jphysiol.1983.sp014806

Spatial properties and functional organization of small bistratified ganglion cells in primate retinaJournal of Neuroscience 27:13261–13272.https://doi.org/10.1523/JNEUROSCI.343707.2007

Fidelity of the ensemble code for visual motion in primate retinaJournal of Neurophysiology 94:119–135.https://doi.org/10.1152/jn.01175.2004

Normalization of cell responses in cat striate cortexVisual Neuroscience 9:181–197.https://doi.org/10.1017/S0952523800009640

Linear and nonlinear spatial subunits in Y cat retinal ganglion cellsThe Journal of Physiology 262:265–284.https://doi.org/10.1113/jphysiol.1976.sp011595

Receptive fields, binocular interaction and functional architecture in the cat's visual cortexThe Journal of Physiology 160:106–154.https://doi.org/10.1113/jphysiol.1962.sp006837

Diffuse bipolar cells provide input to OFF parasol ganglion cells in the macaque retinaThe Journal of Comparative Neurology 416:6–18.https://doi.org/10.1002/(SICI)10969861(20000103)416:1<6::AIDCNE2>3.0.CO;2X

Identifying functional bases for multidimensional neural computationsNeural Computation 25:1870–1890.https://doi.org/10.1162/NECO_a_00465

What does the eye tell the brain?: development of a system for the largescale recording of retinal output activityIEEE Transactions on Nuclear Science 51:1434–1440.https://doi.org/10.1109/TNS.2004.832706

Inferring hidden structure in multilayered neural circuitsPLOS Computational Biology 14:e1006291.https://doi.org/10.1371/journal.pcbi.1006291

Inferring nonlinear neuronal computation based on physiologically plausible inputsPLOS Computational Biology 9:e1003143.https://doi.org/10.1371/journal.pcbi.1003143

ConferenceDeep learning models of the retinal response to natural scenesAdvances in Neural Information Processing Systems. pp. 1369–1377.

Approach sensitivity in the retina processed by a multifunctional neural circuitNature Neuroscience 12:1308–1316.https://doi.org/10.1038/nn.2389

ConferenceConvergence properties of some spiketriggered analysis techniquesAdvances in Neural Information Processing Systems.. pp. 189–196.

ConferenceSpectral methods for neural characterization using generalized quadratic models.Advances in Neural Information Processing Systems. pp. 2454–2462.

Receptive field inference with localized priorsPLOS Computational Biology 7:e1002219.https://doi.org/10.1371/journal.pcbi.1002219

Fast inference in generalized linear models via expected loglikelihoodsJournal of Computational Neuroscience 36:215–234.https://doi.org/10.1007/s1082701304664

Temporal contrast adaptation in salamander bipolar cellsThe Journal of Neuroscience 21:9445–9454.https://doi.org/10.1523/JNEUROSCI.212309445.2001

The New Visual Neurosciences145–162, Correlated activity in the retina, The New Visual Neurosciences, 12, MIT Press.

Visual transduction in cones of the monkey Macaca fascicularisThe Journal of Physiology 427:681–713.https://doi.org/10.1113/jphysiol.1990.sp018193

The spatial structure of a nonlinear receptive fieldNature Neuroscience 15:1572–1580.https://doi.org/10.1038/nn.3225

Nonlinear spatial encoding by retinal ganglion cells: when 1 + 1 ≠ 2The Journal of General Physiology 138:283–290.https://doi.org/10.1085/jgp.201110629

Beyond GLMs: a generative mixture modeling approach to neural system identificationPLOS Computational Biology 9:e1003356.https://doi.org/10.1371/journal.pcbi.1003356

Independent component filters of natural images compared with simple cells in primary visual cortexProceedings of the Royal Society of London. Series B: Biological Sciences 265:359–366.https://doi.org/10.1098/rspb.1998.0303

A convolutional subunit model for neuronal responses in macaque V1Journal of Neuroscience 35:14829–14841.https://doi.org/10.1523/JNEUROSCI.281513.2015

ConferenceConvolutional spiketriggered covariance analysis for neural subunit modelsAdvances in Neural Information Processing Systems.. pp. 793–801.
Decision letter

Tatyana O SharpeeReviewing Editor; Salk Institute for Biological Studies, United States

Joshua I GoldSenior Editor; University of Pennsylvania, United States
In the interests of transparency, eLife publishes the most substantive revision requests and the accompanying author responses.
Thank you for submitting your article "Inference of nonlinear spatial subunits in primate retina with SpikeTriggered Clustering" for consideration by eLife. Your article has been reviewed by Joshua Gold as the Senior Editor, a Reviewing Editor, and three reviewers. The reviewers have opted to remain anonymous.
The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission.
Summary:
This manuscript introduces a new method for the estimation of nonlinear spatial subunits in neural data using an LNLN model (a cascade of two LN stages). The paper has clear merits by showing that it is beneficial to utilize subunit models for predicting parasol RGC responses and V1 neural responses to naturalistic stimuli. Parasol spike data has been previously mostly analyzed without taking this into account.
The main concern to address is that for a "tools and resources" papers needs to include clear benchmarking of the advantages of the current method against other existing and/or published approaches for subunit identification. The consensus among reviewers was that one would need to include comparison with at least one other method for both the retina and the V1 dataset. Some of the relevant methods that you could use are provided below, which offer different tradeoffs between accuracy and ease of implementation and the amount of background work needed to set the hyperparameters. It would be helpful if the authors could include a discussion to justify their choice of the alternative method. The alternative method might be different for the V1 and the retina dataset. Including a more detailed discussion of the available methods and the authors' perspective on which would work best at each circumstance or generalize across the light/stimulus conditions would be very helpful to the field.
The second concern was about the apparent inconsistencies in how the steps of the proposed algorithm were chosen (as detailed below) and that more methodological details need to be provided.
Comparison with existing methods:
While the manuscript includes some discussion of the existing methods, it is still not clear which model is the best taken into account that the subunits of the current model do not correlate with the actual bipolar cell receptive fields but rather with their clusters in some way. For the reader to take up this method, one would need to clearly see the pros and cons in speed, resolution, easiness of implementation, neural correlates of subunits and/or other key features of this model against other nonlinear subunit models (e.g. Liu et al., 2017 and the simpler earlier version by Kaardal et al., 2013) and/or simple convolution neural network models (Eickenberg et al., 2013, Vintch et al., 2015). It is critical for the reader to be able to assess the value of the current model. Assuming this is done properly, and these clarifications are added, this paper would be significantly improved.
One would like to understand and document clearly performance of the subunitidentification method across stimulus conditions (white noise vs. naturalistic stimuli) against conventional currently used methods. This could perhaps be done by applying the new model to existing datasets of parasol RGCs in different experimental conditions. If such data are not available, then perhaps the method could be tested using simulated responses constructed to mimic properties of the retinal neurons.
Questions about the proposed methodology:
There are some inconsistencies in the design of the proposed methodology. For examples, the method is motivated with an example that involves hierarchical splitting of subunits, but this hierarchical version is actually not used further on. Then the bare method (without regularization and fairly coarse stimuli) is used to analyze populations of OFF parasol cells (Figure 2BF), but Figure 3 shows that much better subunits are obtained with regularization and finer white noise. One wonders why this improved method wasn't used to obtain, for example, the distribution of subunit numbers. Further analyses are then again (partly) performed without regularization (e.g. joint fitting of multiple cells, V1 data). In particular when discussing the V1 data, where the subunits have very different structure than in the retina, one wonders whether regularization still works in the same way. Also, it is emphasized how the optimal number of subunits can be determined, but some of the subsequent analyses are performed for a fixed number of subunits (e.g. Figure 2 and Figure 3).
There are also several other parts where that information is difficult to extract or missing, which hinders understanding of the material.
The description of the subunit estimation approach via approximate maximum likelihood (subsection “Subunit estimation”) is central to the manuscript but quite difficult to follow. Some terms in the equation are not defined (K and w with superscript 0, c(K,w)), and the explanations for the derivations are cursory or absent. For example, when Equations 15 are derived, there is no mentioning of an iterative procedure (this only comes later; at this point, this seems to be only a reformulation of the loglikelihood), but some manipulations and terms here only make sense in the context of an iterative estimation procedure. This needs to be disentangled.
Relatedly, it seems important to spell out how the clustering algorithm (Equations 68) solves the maximum likelihood problem, in particular since this connection is central to the manuscript.
The application of the method to V1 data makes an important point regarding the general applicability of the method. This is nice, but only very little analysis is presented. Maybe the authors could at least show how the model performs for different numbers of subunits as in other analyses, to see how well the simple cell and the complex cell are separated in this respect. Also, the effect of regularization would be interesting here.
This assumption of spacetime separability needs further discussion and justification. This assumption might be necessary to reduce the number of parameters to manageable levels, but it would be helpful if the authors could offer guidelines for when it might be appropriate.
About the response prediction to naturalistic stimuli. The use of white noise stimuli rather than natural stimuli might provide a good approximation for estimating of subunit filters, However, bipolar cells, which provide inputs to parasol RGCs, are strongly affected by stimulus correlation through gapjunction connectivity generating a nonlinear enhancement of bipolar activity, which could explain the differences found with real activity (Kuo et al., 2016, Manookin et al., 2018). These elements should be included in the Discussion section as one of the potential pitfalls of the approach.
The nullstimulus description indicated in the Results section needs further discussion and justification. This part should be rewritten extracting elements present in the methods. Also, the results are confusing. According to what is presented in Figure 5, the PSTH variance obtained from the whitenoise response is larger than the variance of nullstimulus response, and at some extent, this variance would be reflecting intertrial variability. At this level, this figure leads to confusion because it is not helping to understand the methodology. What is the role of the nullstimulus in this method paper? The method seems to predict the nullstimulus response better, but again, it is not clear how this clarifies the methodology neither provides insights to understand this behavior. It is counterintuitive the fact that adding more units in the model subunits the response prediction for whitenoise stimulus is not improving (Figure 6).
More detailed comments:
The manuscript was first submitted as original research before the submission as a method paper where essential changes in the structure and content were incorporated. Nevertheless, there is still information more related to retina physiology than the methodology description. Besides, if this is a method valid both for retina and V1 neurons, it is important to mention this in the title, where only the retina application is indicated. Therefore, please consider revising the article title.
In the Introduction, the authors claim there is no widely accepted general computation to infer the structure of nonlinear subunits inputs to a neuron. Nevertheless, relevant references have been published in the last years addressing the same problem noted by the authors. For instance, Liu et al., 2017, which appears in the discussion but not in the introduction, proposes a spiketriggered nonnegative matrix factorization, to estimate the nonlinear subunits conforming the RGC receptive field in salamander retina, where bipolar cells conforming RGCs can be also encountered.
It would be interesting to obtain and include some information about computational runtime.
The presentation of the analysis of the simulated ganglion cell under fine white noise stimulation (Figure 1—figure supplement 1C) is odd. Why are the subunits shown as sets of photoreceptors, not as pixel layouts? Does the method yield 12 subunits as the optimal number, or is this just the result of what came out when enforcing 12 subunits? A better discussion of how the method performs for simulated data could help better understand the strengths and limitations of the approach.
The figures show the "relative strength" of subunits by a blue scale bar. Is this the weight w? Or the norm of the subunit filter? Both seem to be related to "strength". And both seem to be optimized in the second step of the fitting approach along with the output nonlinearity.
Subsection “Estimated subunits are spatially localized and nonoverlapping: The text refers to the Materials and methods section for explaining how the temporal filter was obtained, but the Materials and methods section seem to lack information about receptive field and temporal filter estimation.
Subsection “Regularization for spatially localized subunit estimation”: The chosen value of epsilon in the LNL1 regularization should somehow be related to the typical range of weights. Are the latter of order unity so that a value of epsilon=0.01 is small compared to typical weights?
Figure 4E: It seems odd that the negative loglikelihood keeps decreasing with the number of subunits in the joint fitting procedure. Unlike for the separate fitting. Is there a reason for this? Relatedly: how was the total number of subunits constrained for the separate fitting?
Subsection “Regularization for spatially localized subunit estimation”: The text states that a joint model with 15 subunits gives higher prediction accuracy, referring to Figure 4D. Does this relate to predictions as analyzed in Figure 5? Or to the maximum likelihood of Figure 4E?
Subsection “Subunit model explains spatial nonlinearities revealed by null stimulus”: The text refers to "spatial null spaces", but the Materials and methods section relates to spatiotemporal null spaces. Please clarify.
Subsection “Subunit model explains spatial nonlinearities revealed by null stimulus”, fourth paragraph: Maybe I missed it, but I wasn't sure which version of the subunit estimation is meant by "the subunit model". With or without regularization? Fine or coarse stimuli? A specific instance of the different initial conditions? Same for Figure 7.
Figure 7: The figure and the text around it use the term saccade, but the stimulus seems just to be a switch to a new image without a real motion segment between fixations. This is a bit confusing. Also, how the stimulus is segmented into periods around and between image switches should be better explained and not only mentioned in the legend, in particular how this is used to compute the projection on the receptive field (Figure 7C; is this a spatiotemporal projection?). Furthermore, the difference in Figure 7C seems quite subtle. Is it significant?
In the Materials and methods section, maybe state that the spatiotemporal stimulus was binary white noise (not Gaussian) if this is the case and that the X_{t} signal that goes into the subunit estimation is the contrast signal (zero mean).
Could the authors explain why the weights and the filter magnitudes need to be refitted in the second model fitting step?
Subsection “Application to neural responses”: What does the 5.2 µm boundary refer to?
For the simulated RGC model, the 180 µm (= 180 g.u.) spacing between cones seems large. Is that a typo? Also, in the description, it is not quite clear what the photoreceptor weights are (used for normalizing the firing rate) and how bipolar weights were varied around 10% of their mean (which mean?).
When describing the null stimulus, it would be good to explain where the original stimulus S_{w} comes from and how the σ for computing the support of the receptive field is obtained.
The references Maherwaranathan et al., 2018, and Shi et al., 2018 are missing.
Figures are in general of poor quality and the color choice misleading. In Figure 2D is not possible to see the black arrows over the blue bars. In Figure 2B blue line is not well visualized.
Figure 5D, use the same bar width for the inset histograms.
"… was highly reproducible across repeats (Figure 6C,D)," > "… was highly reproducible across repeats (Figure 5C,D),"
Figure 6B: dashed are filled lines are not indicated in the caption.
Subsection "Application to neural response". The last three paragraphs should be incorporated into the Results section.
[Editors' note: further revisions were requested prior to acceptance, as described below.]
Thank you for resubmitting your work entitled "Inference of Nonlinear Receptive Field Subunits with SpikeTriggered Clustering" for further consideration at eLife. Your revised article has been favorably evaluated by Joshua Gold (Senior Editor), a Reviewing Editor, and three reviewers.
The manuscript has been improved but there are some remaining issues that need to be addressed before acceptance, as outlined below:
The comparison to the convolutional subunit model is not completely convincing for the following reasons: (1) The comparison done on V1 data uses a restricted version of the convolutional subunit model (the subunit nonlinearity is forced to be an exponential, despite that Vintch et al., (2015) and others have showed that it is often quadratic when learned from data). (2) No comparison is made on the retinal dataset where the fixed exponential subunit nonlinearity probably is a more realistic assumption. The reader is thus left wondering whether the new method truly has better predictive performance than the simpler convolutional subunit model, and whether the main benefit with the new method is that it can fit multiple subunit filters easily. It would be beneficial at least to clarify these concerns by giving a clearer justification to the selected methodology regarding the above issues.
A description for how/why the clustering algorithm's three step procedure solves the likelihood problem is still missing. It would be helpful if the authors could briefly spell out why the threestep prescription subsection “Subunit Estimation” “minimizes the upper bound of the loglikelihood (Equation 5). Are the update rules derived from the partial derivatives of Equation 5? Is this guaranteed to converge and converge to the actual minimum?
Regarding the validation in V1 data: More than a validation, where there is no groundtruth to compare, the results should be focused to validate the model showing that it reproduces neural data better than previous approaches. For instance, Figure 10 compares the resulting subunits obtained by a convolutional model and the model proposed in this article. It is tough to compare both; it should be more useful to see how the two of them reproduce neural response.
The Discussion section requires a new organization. The comparison with the existing models, such as SNMF and V1 data, shouldn't be placed here. The paragraph describing the work of McIntosh et al., 2016 can be better articulated with the rest of the Discussion section.
 For instance, the sentence "Note, however, that a model with two subunits frequently explained ON parasol data more accurately than a model with one subunit, implying that ON cells do have spatial nonlinearities". This is not related to the method validation, and moreover, there is no result to validate this affirmation.
 Subsection “Modeling linearnonlinear cascades” states that the paper by Jia et al., (2018) applies a nonnegativity constraint on the weights instead of the subunit filters. Looking at their methods and sample subunits in the figures, this doesn't seem to be the case.
 For the receptive field estimation in subsection “Application to neural responses”, the pixels with magnitude larger than 2.5 σ probably refer to magnitude **of the STA**; similarly, averaging the time course of pixels probably refers to averaging the pixels **in the STA**.
Figure 2 is still hard for me to follow. Is there a relationship between the grayscale images and the inner subunits? For instance, the subunits sizes go beyond the stimulus resolution, so, is their spatial location reliably estimated? Why level 3 only divides in two the left component, while the right one is still increasing the subunits number? In fact, the authors, in response to reviewers document and the article, indicate that the hierarchical partitioning is not used for the remainder in the paper. If this is not used, why is it introduced?
Figure 5D. In the inset histograms, please use the same binwidth.
 Figure 6B. In the text, the authors talk about prediction accuracy. Nevertheless, the figure indicates Δ correlation. Same in Figure 7B.
https://doi.org/10.7554/eLife.45743.sa1Author response
The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission.
We thank the reviewers for a thorough evaluation of the manuscript. In response to their feedback, we have modified the paper, and have provided responses to their feedback below.
Summary:
[…]The main concern to address is that for a "tools and resources" papers needs to include clear benchmarking of the advantages of the current method against other existing and/or published approaches for subunit identification. The consensus among reviewers was that one would need to include comparison with at least one other method for both the retina and the V1 dataset. Some of the relevant methods that you could use are provided below, which offer different tradeoffs between accuracy and ease of implementation and the amount of background work needed to set the hyperparameters. It would be helpful if the authors could include a discussion to justify their choice of the alternative method. The alternative method might be different for the V1 and the retina dataset. Including a more detailed discussion of the available methods and the authors' perspective on which would work best at each circumstance or generalize across the light/stimulus conditions would be very helpful to the field.
The second concern was about the apparent inconsistencies in how the steps of the proposed algorithm were chosen (as detailed below) and that more methodological details need to be provided.
We have substantially improved the manuscript regarding both of these concerns. For the first, in addition to a detailed discussion of relationships to previous methods, we have included new analyses comparing the performance of our method to two others: the semi nonnegative matrix factorization method (for retinal data), and the convolutional method (for V1 data). For the second concern, we have added an overview of the different algorithmic variations of the basic subunit estimation approach at the end of subsection “Subunit response model and parameter estimation”, and improved the text transitions at various places throughout the Results section to make the modifications for different experimental conditions clearer and better motivated. Details are provided below in response to the specific critiques.
Comparison with existing methods:
While the manuscript includes some discussion of the existing methods, it is still not clear which model is the best taken into account that the subunits of the current model do not correlate with the actual bipolar cell receptive fields but rather with their clusters in some way. For the reader to take up this method, one would need to clearly see the pros and cons in speed, resolution, easiness of implementation, neural correlates of subunits and/or other key features of this model against other nonlinear subunit models (e.g. Liu et al., 2017 and the simpler earlier version by Kaardal et al., 2013) and/or simple convolution neural network models (Eickenberg et al., 2013, Vintch et al., 2015). It is critical for the reader to be able to assess the value of the current model. Assuming this is done properly, and these clarifications are added, this paper would be significantly improved.
We have added the following detailed comparisons:
Semi Nonnegative Matrix Factorization (SNMF – similar to Liu et al., 2017): We have incorporated Figure 9 and an accompanying paragraph in the Discussion section comparing our method to SNMF in a simulated cell (for which ground truth is known). Briefly, we observed that our method is more efficient than SNMF at estimating subunits from limited data. Specifically, the fits using our method were more accurate and less variable.
A convolutional subunit model (similar to Vintch et al., 2015): We have incorporated Figure 10 and an accompanying paragraph in Discussion section comparing our method to an approach with convolutional (spatiallyrepeated, homogeneous) subunits. The convolutional method performs better for very small amounts of training data (as expected – it has far fewer parameters), but for larger amounts of data (comparable to what is available in typical experiments) our method performs better. The subunits estimated with our method roughly match both the locations and spatiotemporal selectivity of subunits estimated using the convolutional approach, but our method offers improved performance by capturing the distinct (nonconvolutional) structure of individual subunits.
One would like to understand and document clearly performance of the subunitidentification method across stimulus conditions (white noise vs. naturalistic stimuli) against conventional currently used methods. This could perhaps be done by applying the new model to existing datasets of parasol RGCs in different experimental conditions. If such data are not available, then perhaps the method could be tested using simulated responses constructed to mimic properties of the retinal neurons.
If we understand correctly, the reviewer is asking us to (a) assess how well our method performs when applied to stimuli other than white noise (e.g., natural scenes), and (b) compare the performance of our method to the performance of previous methods (on at least two kinds of stimuli).
For (a), a direct comparison would involve obtaining and testing subunit model fits on natural scenes. We did not attempt to directly estimate subunits from natural scenes data, because natural scenes are biased toward large spatial scales, and are therefore illsuited for estimating the fine structure of subunits in the RF. Indeed, even the linear estimate of the RF obtained using spiketriggered averaging is highly biased by natural scene structure. Because this problem is about the stimulus itself, it applies to all subunit estimation techniques. So, to test the effectiveness of the method for natural scenes, we instead estimated the subunits using white noise, then tested the success of these subunits in predicting the responses to natural scenes (Figure 7).
Regarding (b), we now provide a direct comparison of our method to the SNMF method (for RGCs) and the convolutional method (for V1), with white noise data. However, for technical reasons, we think comparison of natural scenes predictions would have little value. For both the SNMF approach and ours, extension to natural scenes data requires a modification to account for different response statistics in natural scenes. In both studies, this was handled with an ad hoc addon procedure intended to demonstrate only that the subunits indeed help in response estimation (see Figure 7), but was not intended to fully capture responses to natural scenes (which likely entail much more complex nonlinearities than just subunits, e.g. various forms of gain control). We note that the authors of the SNMF study point this out explicitly. Our approach involves a modification of the output nonlinearity, and their approach involves an added response gain term, which are not easy to reconcile. In both studies, there is no exploration of alternatives or experimental test of the particular choice made. Furthermore, because the subunits themselves are estimated from white noise, a comparison of responses to natural scenes would hinge primarily on these ad hoc addons. For these reasons, we have not attempted this analysis  regardless of the outcome, we don’t think it would do justice to either study.
Questions about the proposed methodology:
There are some inconsistencies in the design of the proposed methodology. For examples, the method is motivated with an example that involves hierarchical splitting of subunits, but this hierarchical version is actually not used further on.
The basic method is not intended to be hierarchical. The hierarchical analysis is intended to illustrate (1) how the method behaves when the number of estimated subunits is less than the number of true subunits; (2) that the solutions for different numbers of estimated subunits are “nested”, in the sense that each successive increase in the number of subunits leads (approximately) to a splitting of a subunit into two. This observation is formalized in an analysis where hierarchy is enforced, confirming the result (Figure 2—figure supplement 1). This hierarchical partitioning can also be used as the basis for an incremental algorithm that provides a more robust/efficient method of estimating the subunits (Figure 2—figure supplement 1). Despite this, in the interest of simplicity, this is not the approach we use in the remainder of the paper. This has been clarified in the text. If the reviewer thinks this is too confusing, Figure 2—figure supplement 1 and accompanying analysis could be omitted.
Then the bare method (without regularization and fairly coarse stimuli) is used to analyze populations of OFF parasol cells (Figure 2BF), but Figure 3 shows that much better subunits are obtained with regularization and finer white noise.
Early in the manuscript, we use the most basic form of the method (without priors/regularization) to establish its properties, without introducing any inadvertent biases or relying on any features of the particular data used. This resulted in subunits that were spatially localized, despite the fact that the method incorporates no preferences or knowledge regarding the spatial location of the stimulus pixels. This motivated the development of a prior that favors spatial locality (locally normalized L1), which we demonstrate can regularize RGC subunit estimates more efficiently than the more commonly used sparsity (L1) prior when data are limited. We have tried to clarify the motivation for this sequenced presentation in the text.
One wonders why this improved method wasn't used to obtain, for example, the distribution of subunit numbers.
As shown early in the manuscript, when the pixel size is large, the estimated subunits are aggregates of true underlying bipolar cell subunits. For the data with finer pixels, even though the estimated number of subunits is slightly higher, the pixel size and estimated subunits are still too large to reveal bipolar cell inputs. As such, reanalyzing the true number of subunits with regularization did not seem sufficiently informative given the space it would take.
Further analyses are then again (partly) performed without regularization (e.g. joint fitting of multiple cells, V1 data). In particular when discussing the V1 data, where the subunits have very different structure than in the retina, one wonders whether regularization still works in the same way.
Joint fitting of multiple cells was actually performed using the locally normalized L1 regularization that we showed to be effective for single cells. This was originally indicated only in the caption of Figure 4; We now also mention it in the main text and Methods section.
V1 subunits estimated without any regularization already revealed interesting structure in relation to previous findings, so we did not think it was necessary to incorporate any regularization (either the localization, or a new V1specific variation). We have tried to be clearer about the rationale in the text.
Also, it is emphasized how the optimal number of subunits can be determined, but some of the subsequent analyses are performed for a fixed number of subunits (e.g. Figure 2 and Figure 3).
The appropriate approach for setting this parameter (the number of subunits) depends on what the analysis is being used for. The most straightforward case is when the user seeks to determine the number and structure of subunits for a particular cell. For this, we advocate crossvalidated optimization  choose the number of subunits that best describes the data (while not overfitting). This is empirically the “best” model fit, and is used in Figure 2C and Figure 8.
However, Figure 2A and Figure 3 have different goals:
 The goal of Figure 2A is to identify strong sources of spatial nonlinearity. By comparing the fits for different numbers of subunits, we find that the subunits are related hierarchically, suggesting that weaker subunits are partitions of stronger subunits.
 The goal of Figure 3 is to compare different regularization methods for subunit estimation. To avoid comparing different optimal number of subunits for each regularization method, the number of subunits are fixed to a value that works best across multiple cells.
We have tried to clarify the motivation for these analyses in the text.
There are also several other parts where that information is difficult to extract or missing, which hinders understanding of the material.
The description of the subunit estimation approach via approximate maximum likelihood (subsection “Subunit estimation”) is central to the manuscript but quite difficult to follow. Some terms in the equation are not defined (K and w with superscript 0, c(K,w)), and the explanations for the derivations are cursory or absent. For example, when Equations 15 are derived, there is no mentioning of an iterative procedure (this only comes later; at this point, this seems to be only a reformulation of the loglikelihood), but some manipulations and terms here only make sense in the context of an iterative estimation procedure. This needs to be disentangled.
Relatedly, it seems important to spell out how the clustering algorithm (Equations 68) solves the maximum likelihood problem, in particular since this connection is central to the manuscript.
We appreciate these comments and corrections, and have rewritten subsection “Subunit estimation” to make it more clear. Briefly, each step of the clustering algorithm results from minimizing a convex upper bound based on current parameter estimates. Equations 15 derive this upper bound and Equations 68 present the steps to minimize this upper bound. Omission of term definitions has been corrected.
The application of the method to V1 data makes an important point regarding the general applicability of the method. This is nice, but only very little analysis is presented. Maybe the authors could at least show how the model performs for different numbers of subunits as in other analyses, to see how well the simple cell and the complex cell are separated in this respect. Also, the effect of regularization would be interesting here.
We have incorporated the reviewer’s comments by adding analysis with varying number of subunits for V1 data, showing similar hierarchical splitting as RGCs (Figure 8).
Although the suggestion about simple and complex cells is an interesting one, given that the paper is now essentially a methods paper, and that such an analysis would require more space (as well as substantially more data), we have chosen not to pursue this direction.
Although in principle we could explore and test various regularization methods on V1 data, our goal with this analysis (and limited data set) was simply to demonstrate that the basic method works as well as or better than previous methods for subunit identification.
This assumption of spacetime separability needs further discussion and justification. This assumption might be necessary to reduce the number of parameters to manageable levels, but it would be helpful if the authors could offer guidelines for when it might be appropriate.
We have added text in Subsection “Further applications and extensions” on how to verify the spacetime separability assumption empirically. We have performed this verification on our data, but it would substantially lengthen the manuscript to include these data.
Note that the separability assumption is also made in other work that focuses on spatial subunits (Liu et al., 2017) and implicitly or explicitly in much of the literature on RGCs (but see EnrothCugell and Pinto., 1970 for a deeper investigation of separability). Note also that separability is not required by our subunit model or fitting procedure  for example, in the V1 cells, the method extracts spatiotemporally oriented (and thus, nonseparable) receptive fields.
About the response prediction to naturalistic stimuli. The use of white noise stimuli rather than natural stimuli might provide a good approximation for estimating of subunit filters, However, bipolar cells, which provide inputs to parasol RGCs, are strongly affected by stimulus correlation through gapjunction connectivity generating a nonlinear enhancement of bipolar activity, which could explain the differences found with real activity (Kuo et al., 2016, Manookin et al., 2018). These elements should be included in the Discussion section as one of the potential pitfalls of the approach.
Thank you for pointing this out. We now explain the possible role of gap junctions in natural scenes responses in Results section and Discussion section.
The nullstimulus description indicated in the Results section needs further discussion and justification. This part should be rewritten extracting elements present in the methods.
We have rewritten this section with more specifics and detail as suggested.
Also, the results are confusing. According to what is presented in Figure 5, the PSTH variance obtained from the whitenoise response is larger than the variance of nullstimulus response, and at some extent, this variance would be reflecting intertrial variability.
We thank the reviewer for pointing this out. Indeed, the PSTH variance we measure has a component of response variation over time (the thing we are interested in) and intertrial variability (which is of less interest here). On evaluating the effect of the number of trials, we find that PSTH variance converges with 30 trials, in 3 datasets (Author response image 1); thus, the intertrial variability is not significant in the analysis performed in the paper. We have now indicated this in Results.
At this level, this figure leads to confusion because it is not helping to understand the methodology. What is the role of the nullstimulus in this method paper? The method seems to predict the nullstimulus response better, but again, it is not clear how this clarifies the methodology neither provides insights to understand this behavior. It is counterintuitive the fact that adding more units in the model subunits the response prediction for whitenoise stimulus is not improving (Figure 6).
Null stimuli provide a direct means of demonstrating the importance of the subunit nonlinearity. They expose spatial nonlinearities by specifically “zeroing out” the component of light response that can be explained with a linear model. In this sense, they are analogous to the commonlyused counterphase gratings and second harmonic analysis (Hochstein and Shapley, 1976), which have been used in dozens of studies to focus on bipolar cell nonlinearities. The null stimuli, however, have much richer stimulus statistics (contrast, temporal frequency, spatial scale) very similar to the white noise stimuli that are used for estimating the subunit model that is being tested. The counterintuitive result that the reviewer points out is that the efficiency of the null stimuli at revealing spatial nonlinearities allows them to reveal a clear improvement of subunit models even for short duration stimuli (Figure 6), while white noise stimuli require longer stimulus presentations (Figure 2).
We have attempted to clarify these points in the text.
More detailed comments:
The manuscript was first submitted as original research before the submission as a method paper where essential changes in the structure and content were incorporated. Nevertheless, there is still information more related to retina physiology than the methodology description. Besides, if this is a method valid both for retina and V1 neurons, it is important to mention this in the title, where only the retina application is indicated. Therefore, please consider revising the article title.
We agree, and have revised the title accordingly, – “Inference of Nonlinear Receptive Field Subunits with SpikeTriggered Clustering”
In the Introduction, the authors claim there is no widely accepted general computation to infer the structure of nonlinear subunits inputs to a neuron. Nevertheless, relevant references have been published in the last years addressing the same problem noted by the authors. For instance, Liu et al., 2017, which appears in the discussion but not in the introduction, proposes a spiketriggered nonnegative matrix factorization, to estimate the nonlinear subunits conforming the RGC receptive field in salamander retina, where bipolar cells conforming RGCs can be also encountered.
We agree that the recent literature on this topic is active: various groups have realized that subunits are a common computational motif, and that new tools are required to study their role in neural computation. We have revised the introduction to position our contribution relative to these studies (although we note that the Liu et al., 2017 study was already mentioned in paragraph 2 of our original Introduction). In particular, although previous efforts to identify subunits have been advanced in various experimental conditions, none of the existing papers have developed a single method and applied it to multiple stimulus conditions, with different statistics, different systems, and large numbers of recorded neurons.
It would be interesting to obtain and include some information about computational runtime.
Thanks for this suggestion – we have added information about computational runtime in Materials and methods section.
The presentation of the analysis of the simulated ganglion cell under fine white noise stimulation (Figure 1—figure supplement 1C) is odd. Why are the subunits shown as sets of photoreceptors, not as pixel layouts? Does the method yield 12 subunits as the optimal number, or is this just the result of what came out when enforcing 12 subunits? A better discussion of how the method performs for simulated data could help better understand the strengths and limitations of the approach.
Unfortunately, the text in the caption was confusing. While the coarse pixel approach is described in panels A and B, in panel C we simulate what would happen if each cone were driven independently. Although this would require a closedloop experiment identifying individual cone locations (Freeman, 2015), we show here that it would reveal the correct subunits with the estimation approach in this paper. We have attempted to explain the simulation better in the caption (including indication that the correct number of subunits (12) was identified), and motivate this choice of representation better in the text.
The figures show the "relative strength" of subunits by a blue scale bar. Is this the weight w? Or the norm of the subunit filter? Both seem to be related to "strength". And both seem to be optimized in the second step of the fitting approach along with the output nonlinearity.
The blue bars correspond to the average contribution to cell response for different subunits given by: $w}_{n}{e}^{\frac{{Kn}^{2}}{2}$. We now provide a pointer to Equation (6) in the methods. This formula ignores the learned output nonlinearity, which saturates only for the strongest 0.5% of the stimuli. This information has been added in Results section and Materials and methods section.
Subsection “Estimated subunits are spatially localized and nonoverlapping: The text refers to the Materials and methods section for explaining how the temporal filter was obtained, but the Materials and methods section seem to lack information about receptive field and temporal filter estimation.
Thank you for spotting this. We’ve added the explanation to the Materials and methods section.
Subsection “Regularization for spatially localized subunit estimation”: The chosen value of epsilon in the LNL1 regularization should somehow be related to the typical range of weights. Are the latter of order unity so that a value of epsilon=0.01 is small compared to typical weights?
Yes, typically, the nonzero weights were ~2 and the maximum weights were ~4. This information has been added to the paper.
Figure 4E: It seems odd that the negative loglikelihood keeps decreasing with the number of subunits in the joint fitting procedure. Unlike for the separate fitting. Is there a reason for this?
Unlike other figures, the negative loglikelihood for the joint model is decreasing for the range of subunits plotted in Figure 4E. This can be attributed to the greater efficiency of joint estimation procedure compared to separate fitting. For separate fitting, a subunit connected to two nearby cells needs to be estimated by both the cells independently.
We believe that the negative loglikelihood for the joint model will indeed saturate, but for larger number of subunits than currently shown. For presenting a concise figure, we did not analyze the model for larger number of subunits.
Relatedly: how was the total number of subunits constrained for the separate fitting?
First, for each cell, models with different number of subunits are estimated. Then, for a fixed total number of subunits, these separate fits are used to find a combination of subunits (N_{i} for cell i) that gives maximum loglikelihood on heldout validation data and $\sum {N}_{i}=N$. Performance is analyzed for different N. This has been clarified in the main text.
Subsection “Regularization for spatially localized subunit estimation”: The text states that a joint model with 15 subunits gives higher prediction accuracy, referring to Figure 4D. Does this relate to predictions as analyzed in Figure 5? Or to the maximum likelihood of Figure 4E?
Thank you for pointing out this typo. It refers to Figure 4E.
Subsection “Subunit model explains spatial nonlinearities revealed by null stimulus”: The text refers to "spatial null spaces", but the Materials and methods section relates to spatiotemporal null spaces. Please clarify.
The methods section presents the steps involved in a more general, spatiotemporal null stimulus, as well as a method for spatialonly null stimulus, and the relationship between them. In case of RGCs, where the STA is spacetime separable, the spatialonly nulling is equivalent to spatiotemporal nulling. However, for the V1 data, the more general spatiotemporal nulling method needs to be applied, and this is why we have included it in the Materials and methods section.
Subsection “Subunit model explains spatial nonlinearities revealed by null stimulus”, fourth paragraph: Maybe I missed it, but I wasn't sure which version of the subunit estimation is meant by "the subunit model". With or without regularization? Fine or coarse stimuli? A specific instance of the different initial conditions? Same for Figure 7.
For both Figure 6 and Figure 7, the unregularized method was applied on coarse stimuli; the first phase of fitting was applied on white noise and second phase was applied on null stimulus or natural scenes respectively. This has now been clarified in the text.
Figure 7: The figure and the text around it use the term saccade, but the stimulus seems just to be a switch to a new image without a real motion segment between fixations. This is a bit confusing.
This has been clarified in the main text. We have emphasized that this is only a crude approximation of visual input interrupted by saccades.
Also, how the stimulus is segmented into periods around and between image switches should be better explained and not only mentioned in the legend, in particular how this is used to compute the projection on the receptive field (Figure 7C; is this a spatiotemporal projection?).
Yes, it is spatiotemporal filtering. This has been clarified in the main text.
Furthermore, the difference in Figure 7C seems quite subtle. Is it significant?
We agree that the difference in the histograms seems subtle – one has to look for a moment to see that the black bars are lower in the center, and higher at the extremes. However, the two distributions are significantly different as evaluated by comparing their variances using a twosided Ftest. The variances for saccadic and intersaccadic stimuli were 233 (N=5e6) and 155 (N=15e6), respectively.
In the Materials and methods section, maybe state that the spatiotemporal stimulus was binary white noise (not Gaussian) if this is the case and that the X_{t} signal that goes into the subunit estimation is the contrast signal (zero mean).
This has been added.
Could the authors explain why the weights and the filter magnitudes need to be refitted in the second model fitting step?
Since the output nonlinearity changes from linear to a more saturating one in natural scenes, the overall firing rate of the model cell is lower, and the weights on subunits and/or their magnitudes must be increased to match the observed firing rate. Moreover, the saturating nonlinearity could change the relative weights for different subunits – subunits that drive the cell more strongly would be more affected by saturation. However, on applying to data, we observed that the final nonlinearity is nearly linear, confirming that the first model fitting step is sufficient in most cases.
Subsection “Application to neural responses”: What does the 5.2 µm boundary refer to?
This refers to the region of stimulus around the receptive field which is included for subunit estimation. This has been clarified in the Materials and methods section.
For the simulated RGC model, the 180 µm (= 180 g.u.) spacing between cones seems large. Is that a typo? Also, in the description, it is not quite clear what the photoreceptor weights are (used for normalizing the firing rate) and how bipolar weights were varied around 10% of their mean (which mean?).
It is a typo – thanks for pointing it out. The spacing between photoreceptors is 5 µms. The details of photoreceptor and bipolar weights have been corrected. The bipolar weights were drawn from a gaussian distribution with standard deviation equal to 10% of the mean.
When describing the null stimulus, it would be good to explain where the original stimulus S_{w} comes from and how the σ for computing the support of the receptive field is obtained.
The original stimulus S_{w} is white noise used for computing the null stimulus and σ is a robust estimate of the standard deviation of background noise in STA. We have clarified in the text.
The references Maherwaranathan et al., 2018, and Shi et al., 2018 are missing.
These recent references have been added.
Figures are in general of poor quality and the color choice misleading.
We apologize for the lowresolution figures. They have been replaced with high resolution versions. We chose to focus on as few colors as possible and the color choices are consistent across figures. The color choices for white noise and null stimulus in Figure 5 and Figure 6 were interchanged. This is fixed now.
In Figure 2D is not possible to see the black arrows over the blue bars. In Figure 2B blue line is not well visualized.
For Figure 2D, we moved the black arrows to the top for clarity. The blue line is made continuous (instead of dotted) for better visualization.
Figure 5D, use the same bar width for the inset histograms.
We agree that having different bin widths makes the figure visually awkward. However, since the range of values for ON and OFF parasols is different, a common binwidth does not provide a good depiction of both distributions. Instead, we chose the same number of bins to describe both distributions equally accurately.
"… was highly reproducible across repeats (Figure 6C,D)," > "… was highly reproducible across repeats (Figure 5C,D),"
This has been fixed.
Figure 6B: dashed are filled lines are not indicated in the caption.
We had mistakenly referred to them as ‘dotted’ lines in caption. Now changed to ‘dashed’ lines.
Subsection "Application to neural response". The last three paragraphs should be incorporated into the Results section.
The information from these paragraphs is now included in the main text in Results section and/or captions.
[Editors' note: further revisions were requested prior to acceptance, as described below.]
The manuscript has been improved but there are some remaining issues that need to be addressed before acceptance, as outlined below:
Thank you for your favourable review. Below, we address the remaining issues.
The comparison to the convolutional subunit model is not completely convincing for the following reasons: (1) The comparison done on V1 data uses a restricted version of the convolutional subunit model (the subunit nonlinearity is forced to be an exponential, despite that Vintch et al., (2015) and others have showed that it is often quadratic when learned from data). (2) No comparison is made on the retinal dataset where the fixed exponential subunit nonlinearity probably is a more realistic assumption. The reader is thus left wondering whether the new method truly has better predictive performance than the simpler convolutional subunit model, and whether the main benefit with the new method is that it can fit multiple subunit filters easily. It would be beneficial at least to clarify these concerns by giving a clearer justification to the selected methodology regarding the above issues.
(1) We agree with the reviewers, and have now added a relevant comparison (Figure 10—figure supplement 1). For the convolutional model, we estimate the common subunit filter, locationspecific weights and biases for quadratic nonlinearity using standard gradient descent methods. For our model, we first estimate the subunit filters with spiketriggered clustering (which assumes an exponential nonlinearity), and finetune the subunits filters and bias for quadratic nonlinearity using gradient descent. We find that both methods perform better with a quadratic nonlinearity, consistent with previous studies (Vintch et al., 2015). But as with the exponential nonlinearity, the convolutional method outperforms the spiketriggered clustering method when trained on small amounts of data, but the opposite holds for larger amounts of data. (2) As requested, we have added a comparison on the retinal data, which produces similar results to the V1 data (Figure 10).
A description for how/why the clustering algorithm's three step procedure solves the likelihood problem is still missing. It would be helpful if the authors could briefly spell out why the threestep prescription subsection “Subunit estimation” “minimizes the upper bound of the loglikelihood (Equation 5). Are the update rules derived from the partial derivatives of Equation 5? Is this guaranteed to converge and converge to the actual minimum?
We have expanded the description and derivation of the algorithm in Methods. We now explain how the update rules are derived from partial derivatives of Equation 5. In brief: we are minimizing an approximation to the negative loglikelihood (Equation 2). As the amount of data grows, the approximation converges to the negative loglikelihood. The algorithm is guaranteed to converge since it successively minimizes a tight upper bound to the approximate loglikelihood, leading to a monotonic increase in the approximate loglikelihood. However, due to the nonconvexity of the approximate negative loglikelihood, we cannot guarantee that the point of convergence will be a global optimum.
Regarding the validation in V1 data: More than a validation, where there is no groundtruth to compare, the results should be focused to validate the model showing that it reproduces neural data better than previous approaches. For instance, Figure 10 compares the resulting subunits obtained by a convolutional model and the model proposed in this article. It is tough to compare both; it should be more useful to see how the two of them reproduce neural response.
We agree with the reviewer that in addition to the qualitative comparison of the derived filters, it is valuable to compare predictions of neural response. This is done by plotting the loglikelihood of the recorded response as a function of the amount of training data in Figure 10C (and in the new Figure 10A as well for retinal data). We have attempted to clarify this in the figure caption.
The Discussion section requires a new organization. The comparison with the existing models, such as SNMF and V1 data, shouldn't be placed here. The paragraph describing the work of McIntosh et al., 2016 can be better articulated with the rest of the Discussion section.
As suggested, we have moved the comparison to SNMF and convolutional models to Results section and reorganized the relevant section of Discussion section.
 For instance, the sentence "Note, however, that a model with two subunits frequently explained ON parasol data more accurately than a model with one subunit, implying that ON cells do have spatial nonlinearities". This is not related to the method validation, and moreover, there is no result to validate this affirmation.
This statement was intended to make the point that, even though the optimum number of subunits for ON parasol cells was lower than for OFF parasol cells, spatial nonlinearity was still present in ON parasol cells. To support this claim, we have now provided statistics in the text, and rephrased it slightly.
 Subsection “Modeling linearnonlinear cascades” states that the paper by Jia et al., (2018) applies a nonnegativity constraint on the weights instead of the subunit filters. Looking at their methods and sample subunits in the figures, this doesn't seem to be the case.
The reviewer is absolutely correct: we made a mistake in our interpretation of the 2018 paper. It has now been corrected in the manuscript.
 For the receptive field estimation in subsection “Application to neural responses”, the pixels with magnitude larger than 2.5 σ probably refer to magnitude **of the STA**; similarly, averaging the time course of pixels probably refers to averaging the pixels **in the STA**.
Thanks for pointing this out. It has been corrected.
Figure 2 is still hard for me to follow. Is there a relationship between the grayscale images and the inner subunits? For instance, the subunits sizes go beyond the stimulus resolution, so, is their spatial location reliably estimated? Why level 3 only divides in two the left component, while the right one is still increasing the subunits number? In fact, the authors, in response to reviewers document and the article, indicate that the hierarchical partitioning is not used for the remainder in the paper. If this is not used, why is it introduced?
We apologize for lack of clarity in this figure. We suspect the confusion is due to a poor graphics choice on our part that made it difficult for a reader to see the Gaussian fits to the subunits. We have modified the figure so that the Gaussian fit for the fitted subunit shown in the image is indicated with a green ellipse that is more visible on darker pixels, and used red ellipses to represent the other estimated subunits. We hope that this makes the partitioning of receptive field clearer.
As the reviewer points out, the subunit estimates are limited by stimulus pixel resolution. For this reason, we measure the separation between subunits using the peaks of their Gaussian fits which, assuming that the true subunits are roughly Gaussian, effectively interpolates their center locations.
Regarding the last comment, we thought it was important to mention that enforcing hierarchical partitioning (Figure 2—figure supplement 1) leads to similar subunits, because this reinforces the idea that subunit estimates effectively “split” as the number of subunits in the model increases. Also, enforcing this hierarchical organization could lead to improved estimates by providing a form of regularization. We have further clarified this in the text. If the reviewers still think that these points are not important, we could remove ‘Figure 2—figure supplement 1’.
Figure 5D. In the inset histograms, please use the same binwidth.
Done.
 Figure 6B. In the text, the authors talk about prediction accuracy. Nevertheless, the figure indicates Δ correlation. Same in Figure 7B.
We apologize for the confusion about the metric. Throughout the paper, we use negative loglikelihood for measuring accuracy for singletrial data, and correlation for raster data. For studying the variation of performance with increasing numbers of subunits, the change in correlation compared to one subunit is presented in the figures, because there is a comparatively large variation across cells in the correlation obtained with one subunit, which would obscure the trends seen in the figure. We have modified the figure caption to make this connection clearer.
https://doi.org/10.7554/eLife.45743.sa2Article and author information
Author details
Funding
National Science Foundation (NSF IGERT Grant 0801700)
 Nora Jane Brackbill
National Eye Institute (NIH NEI R01EY021271)
 EJ Chichilnisky
Howard Hughes Medical Institute
 Eero Simoncelli
Pew Charitable Trusts
 Alexander Sher
National Eye Institute (NIH NEI F31EY027166)
 Colleen Rhoades
National Science Foundation (NSF GRFP DGE114747)
 Colleen Rhoades
National Eye Institute (NIH NEI P30EY019005)
 E J Chichilnisky
The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.
Ethics
Animal experimentation: Eyes were removed from terminally anesthetized macaque monkeys (Macaca mulatta, Macaca fascicularis) used by other laboratories in the course of their experiments, in accordance with the Institutional Animal Care and Use Committee guidelines. All of the animals were handled according to approved institutional animal care and use committee (IACUC) protocols (#28860) of the Stanford University. The protocol was approved by the Administrative Panel on Laboratory Animal Care of the Stanford University (Assurance Number: A321301).
Senior Editor
 Joshua I Gold, University of Pennsylvania, United States
Reviewing Editor
 Tatyana O Sharpee, Salk Institute for Biological Studies, United States
Publication history
 Received: February 3, 2019
 Accepted: October 29, 2019
 Version of Record published: March 9, 2020 (version 1)
 Version of Record updated: March 13, 2020 (version 2)
 Version of Record updated: April 3, 2020 (version 3)
Copyright
© 2020, Shah et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.
Metrics

 1,175
 Page views

 182
 Downloads

 5
 Citations
Article citation count generated by polling the highest count across the following sources: Crossref, PubMed Central, Scopus.