Abstract
Animals detect motion using a variety of visual cues that reflect regularities in the natural world. Experiments in animals across phyla have shown that motion percepts incorporate both pairwise and triplet spatiotemporal correlations that could theoretically benefit motion computation. However, it remains unclear how visual systems assemble these cues to build accurate motion estimates. Here, we used systematic behavioral measurements of fruit fly motion perception to show how flies combine local pairwise and triplet correlations to reduce variability in motion estimates across natural scenes. By generating synthetic images with statistics controlled by maximum entropy distributions, we show that the triplet correlations are useful only when images have lightdark asymmetries that mimic natural ones. This suggests that asymmetric ONOFF processing is tuned to the particular statistics of natural scenes. Since all animals encounter the world’s lightdark asymmetries, many visual systems are likely to use asymmetric ONOFF processing to improve motion estimation.
Introduction
For any visual system, motion estimation is an important but computationally challenging task. To accurately extract motion signals from complex natural inputs, visual systems should take advantage of all useful information. One source of information lies in the stable statistics of the visual input, that is, in the regularities of natural scenes (Geisler, 2008). A strong version of this hypothesis is that visual systems are tuned, through evolution and experience, to the statistics of natural environments (Chichilnisky and Kalmar, 2002; Olshausen and Field, 1996; Simoncelli and Olshausen, 2001; Srinivasan et al., 1982). However, it remains unclear how visual systems use the statistics of natural scenes and the motion signals in them to aid in motion estimation (Salisbury and Palmer, 2016; Sinha et al., 2018).
Motion computation has long been understood algorithmically as selective responses to specific spatiotemporal correlations (Fitzgerald et al., 2011; Poggio and Reichardt, 1973; Potters and Bialek, 1994). For example, canonical models propose that animals extract motion from visual signals by detecting pairwise spatiotemporal correlations (Adelson and Bergen, 1985; Hassenstein and Reichardt, 1956). Higher order correlations could also contribute to motion computation, and Bayes optimal visual motion estimators can be written as a sum of terms specialized for detecting different correlation types (Potters and Bialek, 1994; Fitzgerald et al., 2011). This mathematical result follows from a Volterra series expansion, which provides a general and systematic way to represent nonlinear computational systems. Higher order correlations are also empirically relevant. For example, triplet spatiotemporal correlations are the next lowestorder terms after pairwise correlations, and both humans and flies perceive motion in ‘glider’ stimuli that isolate triplet spatiotemporal correlations (Clark et al., 2014; Hu and Victor, 2010). The sensitivity to triplet spatiotemporal correlations shows that motion perception incorporates cues neglected by canonical motion detectors.
Interestingly, perceptual sensitivities to triplet spatiotemporal correlations prove that visual systems must consider the polarity of contrast when computing motion. This is because triplet correlations flip sign when contrast polarities are inverted, which means that the perceptual contribution of triplet correlations to a motion estimator reverses when all input contrasts are inverted. This contrastpolarity dependent motion processing has been hypothesized to be an adaptation to natural scenes, especially to the lightdark asymmetry of the contrast distribution (Clark et al., 2014; Fitzgerald and Clark, 2015; Fitzgerald et al., 2011; Leonhardt et al., 2016; Nitzany and Victor, 2014). For example, simulated motion detectors that were optimized to estimate motion in natural scenes exhibited contrastpolaritydependent responses similar to flies (Fitzgerald and Clark, 2015; Leonhardt et al., 2016). These modeling studies suggest that contrastpolaritydependent responses emerge from performance optimization in natural scenes, but they do not show that real visual systems use their sensitivity to triplet spatiotemporal correlations to improve motion estimates. This limitation arises because previous experimental studies measured sensitivities to only a few triplet correlations (Clark et al., 2014; Leonhardt et al., 2016). However, one cannot assess the utility of individual correlations in isolation (Clark et al., 2014; Nitzany et al., 2016), and naturalistic visual signals contain many spatiotemporal correlations with diverse spatiotemporal structures. Moreover, although previous analyses recognized that some kind of lightdark asymmetry is required for triplet correlation sensitivity to emerge in optimized motion estimators (Clark et al., 2014; Fitzgerald and Clark, 2015; Fitzgerald et al., 2011; Leonhardt et al., 2016), they did not discover which statistical regularities within natural scenes were sufficient for the observed motion signals to improve accuracy.
Here, we filled these gaps by systematically measuring the nonlinearities in Drosophila visual motion detection and relating them to lightdark asymmetries in natural scenes. We first systematically characterized loworder components of the fly’s motion computation algorithm by modeling its visually evoked turning behavior with a Volterra series expansion (Clark et al., 2011; Clark et al., 2014; Fitzgerald et al., 2011; Marmarelis and McCann, 1973; Poggio and Reichardt, 1973; SalazarGatzimas et al., 2016). Through this framework, we extended canonical pairwise (secondorder) motion computation models by adding a triplet (thirdorder) component that accounts for contrastpolaritydependent motion computation. We evaluated the performance of the inferred algorithm across an ensemble of moving natural images and discovered that the thirdorder component improves velocity estimates by canceling imageinduced variability in the secondorder component. Finally, we leveraged maximum entropy distributions to develop a method for generating synthetic images with precisely controlled contrast statistics. This method revealed that the skewness of natural images allows the fly’s sensitivity to triplet spatiotemporal correlations to improve its canonical motion estimates.
Results
The structure of natural scenes induces variability in secondorder motion estimates
To evaluate how canonical motion detectors performed with natural scene inputs, we simulated responses of the HassensteinReichardt Correlator (HRC) to rigidly translating natural scenes. The HRC exemplifies canonical motion detectors, which rely exclusively on pairwise spatiotemporal correlations to estimate motion (Adelson and Bergen, 1985; Hassenstein and Reichardt, 1956) (Figure 1A). It can be equivalently written as a motion energy model (Adelson and Bergen, 1985). We used a database of natural, panoramic photographs to create naturalistic motion stimuli (Meyer et al., 2014). In particular, we first converted the photographs’ luminance signals into local contrast signals (Figure 1B, Figure 1—figure supplement 1). We then rigidly translated these natural images at various horizontal velocities to simulate fullfield motion signals (Badwan et al., 2019; Dror et al., 2001; Fitzgerald and Clark, 2015; Leonhardt et al., 2016). This rigid translation of images mimics the motion produced by an animal’s pure rotation, during which visual objects all move at the same rotational velocity and occlusion does not change over time. Real motion through an environment generates more complex signals than this, but rigid translations are straightforward to compute and rotational visual stimuli are known to induce the rotational optomotor response that we focus on in this manuscript.
The spatiotemporal contrast signals from these (image, velocity) pairs were used as inputs to the HRC model, and we evaluated the model’s output for fixed image velocities across different scenes (Figure 1C, Materials and methods). The model generated a mean response that was linearly tuned for small velocities, peaked at around 130 °/s, and then decayed to zero for fast speeds (Figure 1C green line). However, we observed substantial variance about the mean response, and this variance implies that different natural scenes generated different secondorder motion estimates, even when moving at the same velocity (Figure 1C green shading). This is consistent with the finding that canonical secondorder motion detectors generate variable responses with natural scene inputs (Dror et al., 2001; Fitzgerald and Clark, 2015; Sinha et al., 2018).
Next we sought to investigate how the higher order structure of natural scenes influences the performance of the secondorder motion estimates. Though canonical motion detectors use only pairwise spatiotemporal correlations, higher order statistics of static images, such as contrast kurtosis, influence the detector’s variance (Clark et al., 2014; Fitzgerald and Clark, 2015). To demonstrate this, we generated a synthetic image set in which we preserved the secondorder statistics of natural scenes, including their spatial correlation function and contrast variance, but eliminated all higher order structure (Figure 1D, Materials and methods). When the higher order structure was eliminated, the HRC’s average tuning was unchanged, but there was a marked decrease in the variance (Figure 1C purple). This demonstrates that higher order structure in natural scenes induces variability in canonical motion estimates.
Modeling fly motion computation with second and thirdorder Volterra kernels
To investigate how real visual systems compute motion, we wanted to systematically characterize an animal’s motion computation at the algorithmic level (Marr and Poggio, 1976). Motion computation requires a nonlinear transformation to form a motion estimate from the visual stimulus (Borst and Egelhaaf, 1989; Fitzgerald et al., 2011; Poggio and Reichardt, 1973). We approximated this nonlinear transformation using a Volterra series expansion (Marmarelis and McCann, 1973; Marmarelis, 2004; Schetzen, 1980; Wiener, 1966). Similar to the Taylor series from calculus, the Volterra series is a polynomial description of a nonlinearity, with a firstorder kernel that describes linear transformations, a secondorder kernel that captures quadratic terms, and higherorder kernels that combine to represent a wide variety of nonlinearities beyond the secondorder. However, many polynomial terms can be needed to describe some nonlinearities. For instance, the polynomial description of a compressive, saturating nonlinearity is inefficient, and it can be easier to describe such transformations using alternative nonlinear model architectures, such as linearnonlinear cascade models (Dayan and Abbott, 2001). We emphasize that the Volterra kernel description is explicitly algorithmic, as it aims to summarize the overall system processing without considering the mechanisms leading to this processing.
Volterra kernels are useful for studying visual motion processing because they allow us to rigorously group response properties by their order (Fitzgerald et al., 2011; Potters and Bialek, 1994), thereby permitting us to clearly describe both canonical and contrast polaritydependent components of the behavior. For example, the secondorder kernel is equivalent to the canonical motion detecting algorithms, as it explains the sensitivity to pairwise spatiotemporal correlations (Fitzgerald et al., 2011; SalazarGatzimas et al., 2016). Secondorder Volterra kernels, along with related spiketriggered covariance methods (Bialek and van Steveninck, 2005; Sandler and Marmarelis, 2015; Schwartz et al., 2006), have been used to model secondorder behavior and neural processing in flies and primates (Clark et al., 2011; Marmarelis and McCann, 1973; Poggio and Reichardt, 1973; Rust et al., 2005; SalazarGatzimas et al., 2016). However, the secondorder kernel cannot capture the system’s sensitivity to triplet spatiotemporal correlations. We therefore minimally extended the depth of the Volterra series expansion to include the thirdorder kernel. The thirdorder kernel directly measures sensitivities to triplet spatiotemporal correlations and probes ON/OFF asymmetries in motion processing.
Experimental measurements of Volterra kernels in fly behavior
We focused on how the fly responds to correlations between nearestneighbor pixels in the visual input, which corresponded roughly to a single ommatidium separation (Buchner, 1976) (Figure 2A). The secondorder kernel describes how the behavioral response is influenced by the product of contrasts at each pair of spatiotemporal points in the visual input (Figure 2B blue). In comparison, the thirdorder kernel describes how the response is influenced by the product of contrasts at each triplet of spatiotemporal points in the visual input (Figure 2B green). Note that triplet spatiotemporal correlations could in principle be computed across three distinct spatial locations, but our analysis focused on triplet spatiotemporal correlations distributed across two nearestneighbor pixels.
In order to extract Volterra kernels, especially higherorder ones, we needed a large amount of data. We thus developed a highthroughput setup to measure turning in walking flies in response to visual stimuli (Figure 2C) (Creamer et al., 2018; Creamer et al., 2019). In this setup, a fly’s optomotor turning response serves as a readout of its motion perception (Götz and Wenking, 1973; Hassenstein and Reichardt, 1956), which allowed us to characterize the fly’s motion computation algorithm by measuring its visuallyevoked turning response. Flies spend a large portion of their lives standing and walking on surfaces, making walking optomotor responses ethologically critical (Carey et al., 2006).
We extracted the second and thirdorder Volterrakernels with reversecorrelation methods. To do this, we presented flies with spatiotemporally uncorrelated binary stimuli on a panoramic screen around the fly, measured their turning responses, and correlated the behavior at each time to the stimuli preceding it (Materials and methods, Appendix 1, Figure 2—figure supplement 1) (Clark et al., 2011; Mano and Clark, 2017; SalazarGatzimas et al., 2016). The measured secondorder kernel showed positive and negative lobes (Figure 2D). The positive lobe below the diagonal indicates that flies turned to the right when presented with positive correlations in the rightward direction. This secondorder kernel is consistent with classical models of motion computation and with previous neural and behavioral measurements (Clark et al., 2011; Marmarelis and McCann, 1973; SalazarGatzimas et al., 2016). The measured thirdorder kernel also showed both positive and negative values (Figure 2E), and we will dissect its detailed structure later in this manuscript. However, we first set out to evaluate how the thirdorder kernel contributed to motion estimation across an ensemble of moving natural images.
The thirdorder kernel improves velocity estimation for moving natural scenes
The kernels were fit to turning behavior, so the output of the model to moving visual stimuli is the predicted optomotor turning response. Following previous work (Clark et al., 2014; Fitzgerald and Clark, 2015; Leonhardt et al., 2016; Poggio and Reichardt, 1973; Potters and Bialek, 1994; Sinha et al., 2018), we hypothesized that optomotor turning responses provide a proxy for the fly’s velocity estimate. Using the fitted behavioral model, we could thus investigate how accurately the fly’s velocity estimate tracks the true image velocity. We evaluated the fly’s motion computation performance with a simple and specific metric: when an entire natural image translates rigidly with constant velocity, how accurately does the behavioral algorithm predict the image velocity (Figure 3A)? Specifically, does the fly use its sensitivity to triplet spatiotemporal correlations to improve velocity estimation?
We sampled the velocities from a zeromean Gaussian distribution with a standard deviation of 114 °/s: this distribution roughly matched turning distributions in walking flies (DeAngelis et al., 2019; Katsov and Clandinin, 2008). Crucially, because we measured the Volterra kernels, we could separate the fly’s predicted output into two components: the canonical secondorder response, ${r}^{\left(2\right)}$, and the noncanonical thirdorder response, ${r}^{\left(3\right)}$ (Figure 3A). The secondorder response is the output from the secondorder kernel, and it describes how the fly responded to naturalistic secondorder spatiotemporal correlations in the stimulus. Similarly, ${r}^{\left(3\right)}$ is the output from the thirdorder kernel, and it describes how the fly responded to naturalistic triplet spatiotemporal correlations. This separation allowed us to ask how the pairwise and triplet correlations are individually and jointly used to estimate motion.
We quantified how well the model’s responses predicted the image velocity using the Pearson correlation coefficient (Clark et al., 2014; Fitzgerald and Clark, 2015; Leonhardt et al., 2016). This metric supposes that the model response and image velocity are linearly related, and its value summarizes intuitively the meansquarederror of the best linear fit between the model’s output and the image velocity. When the correlation coefficient has an absolute value near 1, the model closely tracks image velocity, while a value near 0 indicates no linear relationship between model and image velocity. The responses derived from the secondorder kernel, ${r}^{\left(2\right)}$, correlated positively with the true velocity (Figure 3B blue), indicating that the secondorder response matches the behavioral direction (Clark et al., 2011; Hassenstein and Reichardt, 1956; SalazarGatzimas et al., 2016). Interestingly, the isolated thirdorder response, ${r}^{\left(3\right)}$, anticorrelated with true image velocities (Figure 3B green). This means that the fly’s thirdorder response on its own would predict that the fly turns in the direction opposite to the presented motion. However, when ${r}^{\left(3\right)}$ was added to ${r}^{\left(2\right)}$, the accuracy of the full motion estimator increased by ~25% compared to ${r}^{\left(2\right)}$ alone (Figure 3B red, Figure 3C). This important result shows that the thirdorder responses improve velocity estimates only in conjunction with secondorder responses.
To understand this counterintuitive finding, it’s useful to recognize that the secondorder response is influenced by both the image velocity and the structure of the natural scene. For example, recall that the output of the HRC depended both on the velocity of motion and on the particular image that was moving (Figure 1C). Thus, one way to improve the accuracy of the response is to reduce scenedependent variability in the secondorder estimate. To investigate whether this interpretation explained the observed improvement, we calculated the residuals of the secondorder responses by subtracting the best linear fit of the image velocity and plotted them against the thirdorder responses. We found that the thirdorder signal was strongly anticorrelated with this sceneinduced residual in the secondorder response (Figure 3D). This means that the fly’s sensitivity to triplet spatiotemporal correlations indeed canceled scenedependent variability in the secondorder motion estimator to improve the accuracy of motion estimation across natural scenes.
Since the magnitude of the secondorder kernel and thirdorder kernel were each measured experimentally, our model combined ${r}^{\left(2\right)}$ and ${r}^{\left(3\right)}$ with a 1:1 ratio. Nevertheless, we were interested in whether the fly could have done better with alternate weighting coefficients, so we fit a linear regression model to reweight ${r}^{\left(2\right)}$ and ${r}^{\left(3\right)}$ to best predict image velocity. Strikingly, we found the optimized relative weighting between ${r}^{\left(2\right)}$ and ${r}^{\left(3\right)}$ was near one, and the performance of the best weighted model was only marginally better than the empirical model (Figure 3C gray). Thus, the measured second and thirdorder kernels were weighted near optimally for performance in naturalistic motion estimation.
We also wanted to understand how the improvement added by ${r}^{\left(3\right)}$ depended on the parameters of our simulation. To see how it depended on the width of the image velocity distribution, we varied the standard deviation over an order of magnitude. The improvement did not depend strongly on the variance of the velocity (Figure 3—figure supplement 1). We also asked how the contrast computation affected the performance of the measured algorithm. When we previously converted luminance into contrast signals, we computed local contrasts on a length scale of 25° (measured by fullwidthathalfmaximum), because that is the approximate spatial scale of surround inhibition measured in flies (Arenz et al., 2017; Freifeld et al., 2013; Srinivasan et al., 1982). When we swept this spatial scale from 10° to 75°, the improvement added by the thirdorder kernel first increased, peaked at around 30°, and then decreased to negative values after 40° (Figure 3—figure supplement 2A–E). When we computed the contrast over time, instead of space, we observed improvements on timescales less than 100 ms, comparable to measured timescales involved in early visual neurons that compute temporal derivatives (Behnia et al., 2014; Srinivasan et al., 1982; Yang et al., 2016) (Figure 3—figure supplement 2FG). However, the thirdorder term hurt performance when contrasts were computed on longer timescales. These results show that contrast computations influence the utility of the measured thirdorder kernel, with maximal utility occurring in a regime that approximately matches the contrast computation of the fly eye.
Visualizing the measured thirdorder kernel with impulse responses
Since the measured thirdorder Volterra kernel improved motion estimates, we wanted to characterize it in more detail. To better visualize the thirdorder kernel, we rearranged its elements in an impulse response format (Figure 4AB, Materials and methods). The impulse response of a system is its output when presented with a small and brief input, called an impulse. This impulse may consist of a change in contrast at a single point, in which case the impulse response captures the linear response of the system. Analogously, if the impulse consists of a contrast triplet over three points in space and time, then the triplet impulse response captures the system’s response to the interactions of those three points, after accounting for those responses already explained by linear or secondorder impulse responses.
Triplet correlation impulse responses are useful because they allow one to rapidly digest how different triplet correlations will affect behavior. For example, in Figure 4A, we colored three occurrences of triplets that have the same spatiotemporal structure and their corresponding triplet impulse responses. We set the origin of the impulse responses to be the most recent point in the triplet, because the system could not respond to the interaction of three points before all three points were presented. Since the spatiotemporal structures of these three triplets are the same, the three impulse responses have the same shape. Note that a negative impulse, consisting of an odd number of dark elements within the triplet, would drive turning in the opposite direction. In Figure 4B, we represented impulse responses of different triplet with colormaps and used ballstick cartoons to show the relative temporal distances between the points in each triplet. The predicted time course of the behavioral effect is easy to discern, and the kernel predicts that the behavioral consequences of triplet correlations will last almost a second. We can more compactly understand the relative magnitudes of the behavioral effects by summing the impulse responses over time (Figure 4C) (SalazarGatzimas et al., 2016). As expected, the impact of different triplet correlations varies significantly in both direction and magnitude.
Verification of the thirdorder kernel measurement
We verified the reliability of our thirdorder kernel measurement in two ways. First, we tested the statistical significance of the measured kernel directly. We extracted an ensemble of null kernels by applying the reversecorrelation analysis to the measured behavioral responses and temporallyshifted visual stimuli (Materials and methods). By comparing summed kernel elements in the empirical and null kernels, we found that many terms in the thirdorder kernel were statistically significant at the p=0.05 level (Figure 4C). Significance was especially common when the temporal distance between the points in the triplet spatiotemporal correlation was less than 0.1 s.
Second, we measured the fly’s sensitivity to triplet spatiotemporal correlations with thirdorder glider stimuli (Figure 4D, Figure 4—figure supplement 1). Thirdorder glider stimuli are binary stimuli that lack pairwise correlations and are enriched in specific triplet spatiotemporal correlations (Clark et al., 2014; Hu and Victor, 2010). We used the measured thirdorder kernel to predict responses to the glider stimuli. Most of the measured responses were quantitatively predicted by the thirdorder kernel (Figure 4D). Several gliders elicited smaller behavioral responses compared to the kernel prediction; such differences might be attributable to induced longrange spatial correlations in glider stimuli (Clark et al., 2014; Hu and Victor, 2010), which are not captured by our measured nearestneighbor kernel. Nevertheless, the successes revealed by this independent experimental test strongly suggest that we had enough statistical power to reliably fit the thirdorder kernel to the behavioral data.
The second and thirdorder kernels share temporal structure
Multiple models propose that sensitivity to pairwise and triplet spatiotemporal correlations could emerge simultaneously from the same nonlinear step in the fly brain (Fitzgerald and Clark, 2015; Leong et al., 2016; Leonhardt et al., 2016). We were thus curious whether the measured second and thirdorder kernels had a common temporal structure. To compare the secondorder and thirdorder kernels, we simplified the thirdorder kernel to a twodimensional approximation (Figure 4—figure supplement 2ABC), rearranged the secondorder kernel into the impulse response format (Figure 4—figure supplement 2D), and computed summed kernel strengths to obtain onedimensional representations for both kernels (Figure 4E). We compared the second and thirdorder kernel elements at the same temporal offsets (Figure 4E top). In the case of pairwise correlations, the temporal offset was determined by the temporal distance between the left and the right points, and in the case of triplet correlations, the temporal offset was determined by the average temporal distance between the left and right points. The summed kernel strengths showed that the secondorder and thirdorder kernels had similar sensitivities to temporal delays between the input pixels, with peak sensitivity at the shortest delays in our experiment (Figure 4E bottom). An analysis employing the singular value decomposition yielded similar results, and also showed comparable kinetics in the behavioral responses to pairwise and triplet correlations (Figure 4—figure supplement 2EFG). These similarities suggest that the second and thirdorder responses originate in common physiological processes.
Comparing the measured thirdorder kernel to optimal motion estimators
A recent theoretical study proposed several motion detectors whose parameters were optimized for velocity estimation in natural scenes (Fitzgerald and Clark, 2015) (Figure 4—figure supplement 3). In order to compare our measured thirdorder kernel to those of these optimized motion detectors, we presented stochastic binary stimuli to these detectors and extracted their thirdorder kernels using reversecorrelation. We found that the thirdorder kernels of the optimized models were usually similar to each other (Figure 4F), which is consistent with prior analyses (Fitzgerald and Clark, 2015). The measured thirdorder kernel consistently agreed with the optimized kernels in its signs, and in some cases, the kernels were also similar in magnitude. However, certain kernel elements differed markedly between the optimized models and the behaviorally measured kernel. Perhaps most noticeably, the behavioral kernel was much smaller than the optimized kernels for correlations whose spatiotemporal structure involved large delays between the points (Figure 4F, third and fifth kernel elements). Such differences between the optimized models and the measured behavior could indicate suboptimalities in the fly brain. However, they could also result from unrealistic constraints imposed on the model optimization, such as fixed temporal processing and restricted model structures (Fitzgerald and Clark, 2015; Leonhardt et al., 2016). The measured kernel therefore provides valuable new data to inform theoretical work assessing the optimality of biological motion estimators.
Positive skewness is sufficient for the thirdorder kernel to improve motion estimates
Which features of natural images allow the measured thirdorder kernel to improve motion estimates? The natural scene dataset is comprised of heterogeneous individual images (Figure 5A), so we calculated the contrast mean, variance, skewness, and kurtosis of each image individually. The variance describes the scale of the contrast variation; the skewness quantifies imbalance between contrasts above and below the mean; and the kurtosis roughly characterizes the frequency of extreme bright and dark points. Each of these statistics showed a wide distribution over the image ensemble (Figure 5—figure supplement 1ABCD). These statistics were also highly dependent on each other: a positively skewed image often had high variance and was highly kurtotic (Figure 5B, Figure 5—figure supplement 1E). These strong relationships make it difficult to isolate the effects of individual statistics within the image ensemble.
To isolate the effects of individual statistics, we therefore generated several different synthetic image datasets that have alternate contrast statistics (Materials and methods). To generate each image, we constructed a synthetic contrast distribution and sampled pixel contrasts from this distribution (Appendix 2, Figure 5—figure supplement 2A). In this way, we could manipulate the statistics of the image by constraining various statistics of the distribution. In particular, we constrained the distribution to have specific lower order moments, such as mean, variance, and skewness. A distribution is not solely determined by its lower order moments, so there can be many distributions sharing the same lower order moments (Figure 5—figure supplement 2A). Among all such distributions, we chose the most random one, known as the maximum entropy distribution (MED) (Berger et al., 1996; Jaynes, 1957; Schneidman et al., 2006; Victor and Conte, 2012). Because we can specify these lower order moments independently, we can ask whether specific statistics are sufficient to generate the improvement added by the thirdorder signal.
We began by generating two synthetic image datasets (Materials and methods). In the first dataset, we generated a synthetic image for each natural image that had the same contrast mean and variance. To do this, we first found an MED whose contrast mean and variance matched those of the natural image (Figure 5C, green). We then generated a single synthetic image by sampling from this MED (Figure 5D green). In the second dataset, we required the synthetic image to have the same contrast mean, variance, and skewness as the original image (Figure 5CD, brown). By retaining the skewness, these synthetic images retained naturalistic lightdark asymmetries. We then asked how the thirdorder response affected velocity estimates across these two synthetic image datasets. When only the mean and variance of natural scenes were retained, the thirdorder response was near zero (Figure 5—figure supplement 2A, green) and did not improve motion estimation (Figure 5E). However, when the synthetic scenes were constrained to be naturalistically lightdark asymmetric, the improvement added by the thirdorder kernel was recovered (Figure 5E), with magnitude comparable to what was observed for the natural scene dataset (Figure 3C).
Finally, we wanted to see whether the degree of skewness controlled the magnitude of the improvement. We therefore generated synthetic image datasets in which we systematically varied the image skewness (Figure 5FG). In these synthetic images, the degree of skewness determined how much the thirdorder response could improve the full motion estimate (Figure 5H). When the images were negatively skewed, the thirdorder response correlated with image velocity (Figure 5—figure supplement 3C). However, since it also positively correlated with the residual in the secondorder response, adding it to the secondorder response decreased the model’s overall performance (Figure 5—figure supplement 3CD). When the images were positively skewed, the thirdorder response became anticorrelated with the residual in the secondorder response and thus improved the overall motion estimates (Figure 5H, Figure 5—figure supplement 3CD). These synthetic image sets show that positive image skewness is sufficient for the thirdorder signal to improve motion estimates. Thus, the measured algorithm for motion estimation leverages lightdark asymmetries found in natural scenes to improve motion estimates.
Discussion
In this study, we first fit a Volterra series expansion to model the fly’s turning behavior in response to binary stochastic stimuli, and both second and thirdorder terms in the Volterra series contributed to the turning behavior. We then evaluated the model’s output when it was presented with an ensemble of rigidly translating natural scenes. There, the second and thirdorder terms of the model combined to produce outputs that better correlated with image velocities. There is no a priori reason to assume that a model fit to explain turning behavior would necessarily predict the image velocity. Therefore, these results can be taken together to motivate the hypothesis that the magnitude of the fly’s turning response is determined by an internal estimate of velocity. Furthermore, this estimate is specifically tailored for natural environments, since we found that the thirdorder kernel relies on lightdark asymmetries that are present in natural scenes but not in arbitrary images. Since skewed scenes are prevalent across natural environments, and many visual systems exhibit ONOFF asymmetric visual processing (Chichilnisky and Kalmar, 2002; Jin et al., 2011; Mazade et al., 2019; Pandarinath et al., 2010; Ratliff et al., 2010; Zaghloul et al., 2003), many animals are likely to use similar strategies for motion perception.
Direct demonstration that triplet correlations improve velocity estimation for natural images
The idea that features of the visual motion computation serve to improve performance in natural environments is conceptually appealing and theoretically powerful. For example, prior studies have found that optimized motion detectors had triplet correlation sensitivities similar to those measured from the fly visual system (Fitzgerald and Clark, 2015; Leonhardt et al., 2016). Although it is intriguing that biologically relevant response properties emerged in optimized motion detectors, the link from contrastpolarity dependent motion computation to naturalistic motion estimation remained indirect. Here we have provided the first direct demonstration that thirdorder components of the fly’s motion computation algorithm improve velocity estimation for moving natural scenes. This direct demonstration had not been possible before because prior measurements were limited to a narrow range of correlations, and it was unclear how the measured cues interacted with unmeasured components of the motion estimation algorithm. For example, here we found that the thirdorder kernel appeared counterproductive when viewed in isolation, but the way flies incorporated triplet correlations was easily interpretable via the deficits of the secondorder motion estimator. Remarkably, this problem could have persisted if motion computation needed to be understood in the context of additional visual motion cues that involve longer spatialscales or higher order nonlinearities, which have been neglected in this study. This suggests that it might be sufficient to comprehensively characterize motion computation with a few local and loworder visual cues, which is encouraging for the approach outlined here. On the other hand, a mechanistically accurate model of visual motion processing could eventually summarize the relevant cues in a more succinct and less abstract way.
Flies use triplet correlations to cancel scenedependent variability in secondorder cues
We found that the thirdorder responses in flies were anticorrelated with the natural image velocities (Figure 3BC). Nevertheless, they improved velocity estimation when added to the secondorder responses (Figure 3D). This result appears counterintuitive at first but can be understood. Spatiotemporal correlations are influenced by both the motion and local structure of the scene, but motiondriven behaviors should ignore fluctuations stemming from the scene’s structure as much as possible. Pairwise and triplet spatiotemporal correlations are related to contrast variance and skewness, respectively, which are correlated across the ensemble of natural scenes (Figure 5B). This means that fluctuations in secondorder signals tend to be accompanied by fluctuations in thirdorder signals. Therefore, with the right weighting, secondorder and thirdorder signals may collaborate to reduce the imageinduced signal fluctuations in the motion estimate (Clark et al., 2014). Indeed, we found that the thirdorder responses improved motion estimates because they helped to cancel variability in the secondorder responses induced by the structure of natural scenes. This finding highlights a generally important but underappreciated point about cue combination and population coding in neural systems. Although a neuron’s tuning is often used as a proxy for its involvement in stimulus processing, even untuned neurons can contribute productively to downstream decoding if their responses are correlated with noise in the tuned neuronal population (Zylberberg, 2017).
Large computational benefits can underlie small behavioral effects
The HRC has explained a large number of behavioral phenomena and neural responses, and it is reasonable to ask how much we have gained by extending its secondorder algorithmic description to a thirdorder one. The magnitude of behavior elicited by the thirdorder kernel is small compared to the secondorder kernel’s contribution. However, the magnitude of the behavioral effect and the magnitude of the underlying performance gain can differ significantly. For example, the largest reported turning response elicited by a thirdorder glider stimulus is less than 20% of that elicited by secondorder glider stimuli (Clark et al., 2014), yet the performance gain afforded by a motion detector designed to detect its defining thirdorder correlation exceeded 30% (Fitzgerald and Clark, 2015). Similarly, here we predict that many natural images would elicit little output from the thirdorder kernel (Figure 3B), yet thirdorder responses improved the correlation between the model output and velocity by ~25% (Figure 3C). These performance gains are modest, but they are comparable to the inarguable benefits provided by spatial averaging and could be ecologically relevant to the fly (Dror et al., 2001; SalazarGatzimas et al., 2018). Since we have only approximated the system with a spatially localized low order polynomial, we also expect that the improvements we observed here represent a lower bound on the total effects provided by the full mechanism underlying lightdark asymmetric motion processing. Indeed, longer range and higher order motion detection models can nearly double motion estimation accuracy while predicting realistic glider response magnitudes (Fitzgerald and Clark, 2015). It will be interesting to investigate whether mechanistically accurate models that explain the origin of the thirdorder kernel also reveal larger performance improvements.
Asymmetric ONOFF processing could affect motion processing across the animal kingdom
Here, we showed that flies systematically exploit contrast asymmetries in natural scenes to improve their visual motion estimates. This resonates with previous work showing that many visual systems process ON and OFF signals asymmetrically to improve other aspects of visual processing (Kremkow et al., 2014; Mazade et al., 2019; Pandarinath et al., 2010; Ratliff et al., 2010). Moreover, flies and vertebrates share striking anatomical and functional properties in their motion detection circuits (Borst and Helmstaedter, 2015; Clark and Demb, 2016; Sanes and Zipursky, 2010), likely because they are solving similar problems with similar constraints. We thus expect that many visual systems use ONOFF asymmetric processing to improve visual motion perception. Nevertheless, it remains unclear how similar or different the details of such strategies will be across the animal kingdom. Indeed, although both primates and insects respond to thirdorder glider stimuli, their patterns of response differ (Clark et al., 2014; Hu and Victor, 2010; Nitzany et al., 2017). ONOFF asymmetric visual processing also varies in other ways, and there is evidence that contrast adaptation in ON and OFF pathways is different between primate and salamander retinas (Chander and Chichilnisky, 2001).
Adaptations to differences in both habitat and early sensory processing could potentially explain these divergences. Here, we found positive and negative contrast skewness in different terrestrial scenes (Figure 5—figure supplement 1C), but positive skewness was most prevalent across the scenes. If certain habitats feature scenes that are predominantly negatively skewed, our work predicts that animals living in these habitats should have opposite thirdorder responses to flies (Figure 5H). More generally, natural scenes in different habitats are known to be similar in some statistics and different in others (Balboa and Grzywacz, 2003; Burkhardt et al., 2006). Interestingly, positive skewness might be particularly common in natural luminance distributions, because luminance signals are the product of many independent factors that generically combine to produce lognormal distributions (Richards, 1982). Nevertheless, the skewness level that matters for motion detecting circuits also depends on earlier processing operations in the eye (Figure 3—figure supplement 2C). In our numerical experiments, we computed local contrast signals. The spatial scales of this preprocessing influenced the statistics of the resulting contrast, which in turn influenced the performance of the motion computation. Biologically, this suggests that signal processing in early visual circuits can strongly influence how downstream circuits organize their computations (Dror et al., 2001; Fitzgerald and Clark, 2015). Alternatively, early sensory processing might be tailored to accommodate the computational requirements of downstream processing. These possibilities are not mutually exclusive, and in both cases, the early visual processing must work in concert with the downstream motion detectors to form robust and consistent perceptions.
Volterra kernels systematically characterize nonlinear motion computations
The algorithm used by the visual system to extract motion signals is a nonlinear transformation from light detection to motion estimates. There are numerous ways to characterize a nonlinear system. In many cases, visual neuroscientists have purposefully designed stimuli, such as sinusoidal gratings, plaids, and gliders, to probe specific nonlinearities in the system (Clark et al., 2014; Creamer et al., 2018; Euler et al., 2002; Fisher et al., 2015; Haag et al., 2016; Hu and Victor, 2010; Movshon et al., 1985; Rust et al., 2005; SalazarGatzimas et al., 2018; SalazarGatzimas et al., 2016). In other cases, they have used stochastic stimuli to fit simple predictive models, such as linearnonlinear models, generalized linear models, cascade models, and normalization models to capture a restricted but relatively broad set of biologically plausible nonlinearities (Dayan and Abbott, 2001; Leong et al., 2016; Maheswaranathan et al., 2018; McIntosh et al., 2016; SalazarGatzimas et al., 2016; Simoncelli and Heeger, 1998).
Here, we approximated the nonlinear system with Volterra kernels (Marmarelis and Naka, 1972; Wiener, 1966). This represents a general and systematic approach to nonlinear system identification, since (1) one need not make strong assumptions about the system to measure its kernels, (2) higher order kernels can in principle be added to characterize the system arbitrarily well, and (3) a complete set of kernels predicts the system’s output for arbitrary input signals. One major limitation of this approach is that higher order kernels become progressively more difficult to fit as the number of kernel elements increases. This makes the approach most practical when a few loworder terms already capture conceptually important variables. Here, we leveraged the fact that secondorder kernel capture the canonical models for visual motion estimation while thirdorder kernel probes ON/OFF asymmetries in motion processing. These two kernels can be related to distinct statistics of natural scenes.
Polynomial approximations to complex nonlinear systems have also been useful in other domains of neuroscience. For example, the experimental phenomenon of frequencydependent longterm potentiation can be explained by extending canonical pairwise spiketimingdependent plasticity models to include the relative timing of three spikes (Pfister and Gerstner, 2006; Sjöström et al., 2001). This makes learning sensitive to thirdorder correlations (Gjorgjieva et al., 2011). In the field of texture perception, researchers have long sought low order statistics that explain whether two patterns are texturally discriminable (Julesz, 1962; Julesz et al., 1973; Julesz et al., 1978). Similar to our findings for motion perception, both natural scene statistics and upstream visual processing play important roles (Hermundstad et al., 2014; Portilla and Simoncelli, 2000; Tkacik et al., 2010). As a final example, understanding how neural network structure impacts dynamics was aided by formally expanding the network’s connectivity matrix into loworder connectivity motifs (Hu et al., 2014; Trousdale et al., 2012). These motifs might relate to measurable properties of the neocortex (Song et al., 2005).
Velocity estimation is a useful approximation to motion computation
In this paper, we evaluated the fly’s motion computation algorithm by measuring the accuracy of velocity estimation. Prior studies have often hypothesized that velocity estimation is a key requirement of motion processing and optomotor circuitry (Clark et al., 2014; Dror et al., 2001; Fitzgerald and Clark, 2015; Fitzgerald et al., 2011; Poggio and Reichardt, 1973; Potters and Bialek, 1994). It is thus reassuring that a model that better fit optomotor behavior also predicted image velocity more accurately. Nevertheless, motion computation is involved in perceptual tasks beyond the optomotor response, such as detecting looming stimuli (Card and Dickinson, 2008; Zacarias et al., 2018). In such tasks, the goal may not be to estimate the velocity of the visual object, but spatiotemporal correlations might nevertheless be useful (Nitzany and Victor, 2014). In addition, fly motion detecting neurons respond to static sinusoids or local luminance changes without obvious relevance for motion processing (Fisher et al., 2015; Gruntman et al., 2018; SalazarGatzimas et al., 2018), which suggests that motion computation algorithms might be jointly optimized alongside the detection of other visual features. Finally, visual systems have to coordinate with motor systems to achieve accurate sensorimotor transformations, so one should take the properties of the motor system into consideration when evaluating the performance of a motion computation algorithm (Dickinson et al., 2000). Future work could consider more sophisticated evaluation metrics that better reflect the total ethological relevance of the visual environment to the fly. As we continue to distill the factors that combine to set the ultimate performance criteria, the use of velocity estimation is likely to remain a simple, useful, and insightful approximation.
Potential mechanisms underlying the measured lightdark asymmetries
Visual systems in both vertebrates and invertebrates split into ON and OFF pathways that process light and dark signals separately and asymmetrically (Balasubramanian and Sterling, 2009; Chichilnisky and Kalmar, 2002; Clark and Demb, 2016; Leonhardt et al., 2016; Ratliff et al., 2010; Ravi et al., 2018; Sagdullaev and McCall, 2005; SalazarGatzimas et al., 2018). Some of these differences could result from biological constraints. Others could be an ethologically relevant adaptation to lightdark asymmetries found in the natural world. Either way, it is difficult to extrapolate from asymmetric neuronal processing of light and dark signals to functional asymmetries in downstream processing, including behavior. In this study, we used the behavioral turning responses to measure asymmetries in the flies’ motion computation algorithm, instead of examining ON and OFF processing channels at the neuronal level. Since optomotor sensitivity to triplet spatiotemporal correlations is necessarily a functional consequence of underlying asymmetric visual signal processing, we could thus directly link lightdark asymmetries in natural scenes to the functional impact of ONOFF asymmetric neural circuitry. It is similarly important to identify additional lightdark asymmetric behaviors that can clarify the functional role of other lightdark asymmetries in visual processing.
Having established the functional relevance of ONOFF asymmetric visual processing, it is next important to find its neural implementation. Previous work has suggested that frontend nonlinearities could account for certain optomotor illusions in flies (Bülthoff and Götz, 1979), and it is conceivable that such nonlinearities could generate contrast asymmetric motion responses (Clark and Demb, 2016; Fitzgerald et al., 2011). However, several simple frontend nonlinearities can improve motion estimation without inducing the observed triplet correlation responses (Fitzgerald and Clark, 2015). Alternatively, nonlinear processing at the level of directionselective T4 and T5 neurons could also generate the asymmetries we observed here. Indeed, differentially affecting T4 and T5 activity, either through direct silencing or by manipulating upstream neurons, alters the behavioral responses of flies to triplet correlations (Clark et al., 2014; Leonhardt et al., 2016), and parallel experiments in humans similarly find that contrastasymmetric responses are mediated by neurons separately modulated by moving ON and OFF edges (Clark et al., 2014). Yet, it remains unclear whether asymmetric responses of T4 and T5 are inherited from upstream neurons. For instance, contrast adaptation could differ between the two pathways (Chichilnisky and Kalmar, 2002), and incompletely rectified inputs to T4 and T5 could generate asymmetrical responses to light and dark inputs (SalazarGatzimas et al., 2018). The weightings of T4 and T5 signals in downstream circuits could also result in contrast asymmetric phenomena. This rich landscape of possibilities motivates us to think that multiple mechanisms are likely to be involved. By measuring behavior and distilling the abstract algorithmic properties of the system, we will be able to constrain the contributions of individual circuit components without confining ourselves to an overly narrow class of mechanistic models.
Relating algorithm and implementation in fly visual motion estimation
David Marr famously asserted that neural computation needs to be understood at both the algorithmic and implementational levels (Marr and Poggio, 1976). The benefits of this dual understanding go both ways. On one hand, the brain is immensely complicated, and an algorithmic theory can provide an invaluable lens for making sense of its details. On the other hand, the nuances of neuronal implementation can lead to new algorithmic questions and mechanistically satisfying answers. Marr used the optomotor response of flies to articulate his philosophy over forty years ago, and the community is still leveraging this problem to unravel the subtle relationships between algorithm and mechanism in the brain. The HRC model provided the first algorithmic theory of fly visual motion estimation, and this model’s insights into the roles of spatial separation, differential time delays, and nonlinear signal integration have now been verified mechanistically (Arenz et al., 2017; Fisher et al., 2015; Haag et al., 2016; Leong et al., 2016; SalazarGatzimas et al., 2016; Takemura et al., 2017). They still provide the bedrock of our understanding. Yet the precise mathematical form and mechanistic origin of the nonlinearity remain controversial, with different papers pointing out compelling roles for membrane voltages, intracellular calcium signals, and ONOFF pathways (Badwan et al., 2019; Gruntman et al., 2018; Haag et al., 2016; Leong et al., 2016; Leonhardt et al., 2016; SalazarGatzimas et al., 2018; Wienecke et al., 2018). None of this complexity invalidates the core insights of the HRC, nor does the HRC’s domain of success warrant apathy toward the fundamental importance of these unexpected findings. Instead of algorithm and mechanism providing parallel or hierarchical goals, they should be treated as parts of one integrated understanding of the circuit.
Materials and methods
Fly husbandry
Request a detailed protocolFlies were grown at 20°C, 50% humidity in 12hr day/night cycles on a dextrosebased food. Flies used for the behavioral experiment were nonvirgin wildtype (D. melanogaster: WT: +; +; +) females between 24 and 72 hr old.
Psychophysics
Request a detailed protocolThe fly’s turning behavior was measured with the flyonaball rig, as described in previous studies (Clark et al., 2011; Creamer et al., 2018). The fly was tethered above a ball floating on a cushion of air. The ball served as a treadmill such that the fly could walk and turn while its position and orientation were fixed. The rotational response of the fly was the averaged rotation magnitude of the ball in 1/60s bins with an angular resolution of ~0.5°. Panoramic screens surrounded the fly, covering 270° horizontally and 106° vertically (Creamer et al., 2019). A Lightcrafter DLP (Texas Instruments, USA) projected visual stimulus to the screens with chrome green light (peak 520 nm and mean intensity of 100 cd/m^{2}). The spatial resolution of the projector was around 0.3° and the projector image was updated at 180 Hz. The rig’s temperature was 3436°.
Visual stimuli
Request a detailed protocolVisual stimuli varied along the horizontal axis in 5° pixels and were uniform along the vertical dimension. Since the panoramic screen was 270° wide, the horizontal axis was divided into $270/5=54$ pixels, so the screen was divided into 54 vertical bars.
We used two types of binary stochastic visual stimuli for kernel extraction, a threebarblock stimulus type, and a fourbarblock stimulus type. In the 3 (4)barblock stimuli, each block contained 3 (4) neighboring vertical bars that flickered white or black independently in space and time. The identical blocks then repeated periodically around the fly. Since there are 54 bars, the entire visual field was divided into 18 (14.5) blocks. Each bar updated its contrast every 1/60 second.
We used thirdorder glider stimuli (Hu and Victor, 2010) to directly measure the fly’s sensitivity to threepoint correlations. Thirdorder glider stimuli are binary patterns of black and white pixels. In each glider stimulus, one can enforce a threepoint spatiotemporal correlation. Here, we considered only threepoint spatiotemporal correlations that involved two neighboring points in space. We described the specific configuration of each glider with a fourparameter scheme, $({\mathrm{\Delta}\tau}_{31},{\mathrm{\Delta}\tau}_{21},L\backslash R,P)$. We defined the 1st point to be the more recent one of the two points sharing a spatial location, while the other point at this spatial location was defined to be the 3rd point. The final point, which was in a position adjacent to the 1st and 3rd points, was defined to be the 2nd point. The temporal interval between the 2nd point and 1st point was denoted as ${\mathrm{\Delta}\tau}_{21}$, and ${\mathrm{\Delta}\tau}_{31}$ was defined similarly. For example, ${\mathrm{\Delta}\tau}_{31}=1$ means that the 3rd point is 1 frame (16 ms) before the 1st point. Although ${\mathrm{\Delta}\tau}_{31}$ is positive by definition, ${\mathrm{\Delta}\tau}_{21}$ can be positive or negative. We used L and R to indicate whether the 1st and the 3rd points are on the left or right of the 2nd point. As detailed previously (Clark et al., 2014; Hu and Victor, 2010), in positive parity gliders ($P=+1$) one or three of these three points are white, whereas negative parity gliders ($P=1$) have one or three of the points black. We illustrated the configuration of each glider using a 'ballstick' diagram, where the xaxis represents space, the yaxis represents time, time runs downward, and the plus (minus) sign denotes the polarity of the glider (Figure 4—figure supplement 1A). Overall, we presented $52=13\times 2\times 2$ different stimuli: thirteen different temporal intervals, each with two directions and two polarities (Table 1).
HRC model
Request a detailed protocolWe constructed a classical HassensteinReichardt correlator (HRC) model (Hassenstein and Reichardt, 1956). The output ${r}_{\text{HRC}}\left(t\right)$ was defined as
where ${s}_{1}\left(t\right)$ and ${s}_{2}\left(t\right)$ denote the contrast signals from two spatial locations, * denotes convolution in time, and ${f}_{\text{HRC}}\left(t\right)$, ${g}_{\text{HRC}}\left(t\right)$ are the temporal filters of the delay line and the nondelay line. In particular,
for $t\ge 0$, ${f}_{\text{HRC}}\left(t\right)=0$ for $t<0$, and
where ${\tau}_{\text{HRC}}=20$ ms.
Modeling the fly’s motion computation algorithm with Volterra kernels
Request a detailed protocolWe approximated the fly’s motion computation algorithm with second and thirdorder Volterra kernels. We provide a detailed description of this model in Section 1 of Appendix 1. In brief, we discretize space into pixels with $\mathrm{\Delta}x=5$° resolution, discretize time into time bins with $\mathrm{\Delta}t=1/60$ s resolution, and index locations in space with an integer subscript $\xi $. We modeled the response of the fly $r\left(t\right)$ as the sum of an array of elementary motion detectors (EMDs) acting at each position in space:
where ${r}_{\xi}\left(t\right)$ denotes the response of EMD at spatial location $\xi $. That term is itself the sum of the secondorder response ${r}_{\xi}^{\left(2\right)}\left(t\right)$ and thirdorder response ${r}_{\xi}^{\left(3\right)}\left(t\right)$:
The second and thirdorder responses are defined as follows (see also Appendix 1):
where ${s}_{\xi}\left(t\right)$ and ${s}_{\xi +1}\left(t\right)$ denote the visual inputs that the EMD at spatial location $\xi $ receives. (In Figure 2B and Figure 2—figure supplement 1, we used ${s}_{L}\left(t\right)$ and ${s}_{R}\left(t\right)$ to represent ${s}_{\xi}\left(t\right)$ and ${s}_{\xi +1}\left(t\right)$). The secondorder response ${r}_{\xi}^{\left(2\right)}\left(t\right)$ is the sum of secondorder features, ${s}_{\xi}\left(t{\tau}_{1}\right){s}_{\xi +1}\left(t{\tau}_{2}\right)$, weighted by the secondorder kernel, ${K}_{LR}^{\left(2\right)}\left({\tau}_{1},{\tau}_{2}\right)$. The thirdorder response, ${r}_{\xi}^{\left(3\right)}\left(t\right)$, is the sum of thirdorder features, ${s}_{\xi}\left(t{\tau}_{1}\right){s}_{\xi +1}\left(t{\tau}_{2}\right){s}_{\xi}\left(t{\tau}_{3}\right)$, and ${s}_{\xi +1}\left(t{\tau}_{1}\right){s}_{\xi}\left(t{\tau}_{2}\right){s}_{\xi +1}\left(t{\tau}_{3}\right)$, weighted by the thirdorder kernel, ${K}_{LRL}^{\left(3\right)}\left({\tau}_{1},{\tau}_{2},{\tau}_{3}\right)$. Note that we use the notation ${K}_{LLR}^{\left(3\right)}$ in Appendix 1, which can be transformed into ${K}_{LRL}^{\left(3\right)}$ by interchanging the position of the second and the third spatial and temporal arguments. In particular, ${K}_{LRL}^{\left(3\right)}\left({\tau}_{1},{\tau}_{2},{\tau}_{3}\right)={K}_{LLR}^{\left(3\right)}\left({\tau}_{1},{\tau}_{3},{\tau}_{2}\right)$.
Measuring Volterra kernels with stochastic stimuli and reversecorrelation
Request a detailed protocolTo estimate the kernels, we presented stochastic binary stimuli to flies and reversecorrelated the corresponding response with the input. In particular, we estimated the secondorder kernel by reversecorrelating the meansubtracted turning response with the products of two points in space and time (Appendix 1),
where ${\hat{K}}_{LR3}^{\left(2\right)}\left({\tau}_{1},{\tau}_{2}\right)$ is the estimated secondorder kernel from the threebarblock stimulus and ${\hat{K}}_{LR4}^{\left(2\right)}\left({\tau}_{1},{\tau}_{2}\right)$ is the estimated secondorder kernel from the fourbarblock stimulus. In particular,
where ${r}_{\text{turn}}\left(t\right)$ is the meansubtracted response, and $\gamma}_{\text{stim}$ is the magnitude of the contrast in the binary stimulus (Appendix 1). ${s}_{1}\left(t\right),{s}_{2}\left(t\right),{s}_{3}\left(t\right)$ represent the contrasts of 3 independent bars in 3barblock stimulus, and ${s}_{\xi +1}\left(t{\tau}_{2}\right)\equiv {s}_{1}\left(t{\tau}_{2}\right)$ for $\xi =3$. Similarly, ${s}_{1}\left(t\right)$,${s}_{2}\left(t\right)$, ${s}_{3}\left(t\right)$, ${s}_{4}\left(t\right)$represent the contrasts of 4 independent bars in fourbarblock stimulus, and ${s}_{\xi +1}\left(t{\tau}_{2}\right)\equiv {s}_{1}\left(t{\tau}_{2}\right)$ for $\xi =4$. We presented threebarblock stimulus to 35 flies, and fourbarblock stimulus to 37 flies, for $T=20$ min. All fly kernel estimates were averaged to generate the final kernel estimate. Note that we enforce ${\hat{K}}_{LR}^{\left(2\right)}\left({\tau}_{1},{\tau}_{2}\right)=0$ when ${\tau}_{1}={\tau}_{2}$, because we model the visual motion estimator as a mirrorantisymmetric operator (Appendix 1).
Similarly, we estimated the thirdorder kernel by reversecorrelating the meansubtracted turning response with the products of three points in space and time (Appendix 1).
where
and ${\tau}_{1}\ne {\tau}_{3}$.
We then enforced mirror antisymmetry by,
where ${\hat{K}}_{LRsym}^{\left(2\right)}\left({\tau}_{1},{\tau}_{2}\right)$, ${\hat{K}}_{LRLsym}^{\left(3\right)}\left({\tau}_{1},{\tau}_{2},{\tau}_{3}\right)$ and ${\hat{K}}_{RLRsym}^{\left(3\right)}\left({\tau}_{1},{\tau}_{2},{\tau}_{3}\right)$ are the symmetrized kernels, and we refer to them as the mirror antisymmetric component.
We next evaluated how much variance in the fly’s turning behavior can be explained by the estimated second and thirdorder kernels. We presented the same stochastic stimulus sequence to many flies, so we averaged the turning response from different flies, denoted as ${\overline{r}}_{\text{turn}}\left(t\right)$, to estimate the true stimulusassociated turning response ${r}_{\text{stimdriven}}\left(t\right)$. We predicted the turning response ${r}_{\text{pred}}\left(t\right)$ to the same stimulus sequence,
where ${r}_{\text{pred}}^{\left(2\right)}\left(t\right)\phantom{\rule{thickmathspace}{0ex}}({r}_{\text{pred}}^{\left(3\right)}\left(t\right))$ is the predicted response from the secondorder (thirdorder) kernel. We calculated ${r}_{\text{pred}}\left(t\right)$ using only the antisymmetric component of the kernel, that is ${\hat{K}}_{LRsym}^{\left(2\right)}\left({\tau}_{1},{\tau}_{2}\right)$ and ${\hat{K}}_{LRLsym}^{\left(3\right)}\left({\tau}_{1},{\tau}_{2},{\tau}_{3}\right).$ The Pearson correlation between ${r}_{\text{pred}}\left(t\right)$ and ${\overline{r}}_{\text{turn}}\left(t\right)$ were 0.686 and 0.787 in threebar and fourbar experiments.
If fly turning responses are driven only by visual stimuli, are mirror antisymmetric, and use only second and thirdorder correlations, then the measured second and thirdorder kernel would explain all the variance in the stimulusdriven turning responses, and the Pearson correlation between ${r}_{\text{pred}}\left(t\right)$ and ${r}_{\text{stimdriven}}\left(t\right)$ should be one. However, our measured kernels only explained about half of the variance. There are several potential reasons for this. First, the turning responses of flies appeared very noisy, making it difficult to estimate the true stimulusdriven response. That is, ${\overline{r}}_{\text{turn}}\left(t\right)$ was a poor estimation of ${r}_{\text{stimdriven}}\left(t\right)$. Second, we wanted to minimally extend the canonical secondorder motion detector while being able to account for lightdark asymmetric visual processing, so we added only one more term, the thirdorder kernel. However, the fly might respond to higher order spatiotemporal correlations in visual inputs, and our model did not capture them.
Representing the second and thirdorder kernels in the impulse response format
Request a detailed protocolTo better understand and visualize the extracted kernels, we rearranged the elements in the kernels such that we could interpret kernels as the impulse response to a pair (triplet) of contrasts. This is analogous to the impulse response to a single contrast change at one point in a linear system.
Before rearrangement, the rows (columns) of the secondorder kernel represent the temporal argument ${\tau}_{1}$ (${\tau}_{2}$) in the matrix ${\hat{K}}_{LRsym}^{\left(2\right)}\left({\tau}_{1},{\tau}_{2}\right)$. After rearrangement, the rows correspond to the time since the more recent point, and the columns represent different temporal intervals between the two points, with negative intervals meaning that the right point is more recent than the left point (Figure 4—figure supplement 2D) (SalazarGatzimas et al., 2016). We denote this new format as
where $\mathrm{\Delta}{\tau}_{21}={\tau}_{2}{\tau}_{1}$ and $\tau =\mathrm{min}\left({\tau}_{1},{\tau}_{1}+\mathrm{\Delta}{\tau}_{21}\right)$. Because we have enforced mirror antisymmetry in $\hat{K}}_{LRsym}^{\left(2\right)$, the columns of ${\hat{K}}_{impulse}^{\left(2\right)}\left(\tau ,\mathrm{\Delta}{\tau}_{21}\right)$ are antisymmetric around ${\mathrm{\Delta}\tau}_{21}=0$. We interpreted the columns of ${\hat{K}}_{impulse}^{\left(2\right)}\left(\tau ,\mathrm{\Delta}{\tau}_{21}\right)$ as the impulse response of the fly to a pair of adjacent contrast changes separated by ${\mathrm{\Delta}\tau}_{21}$ in time.
Similarly, before rearrangement, the three dimensions of the thirdorder kernel represent the three temporal arguments of ${\hat{K}}_{LRLsym}^{\left(3\right)}\left({\tau}_{1},{\tau}_{2},{\tau}_{3}\right)$. Once rearranged, we define
where $\tau =min\left({\tau}_{1},{\tau}_{1}+\mathrm{\Delta}{\tau}_{21},{\tau}_{1}+\mathrm{\Delta}{\tau}_{31}\right)$, ${\mathrm{\Delta}\tau}_{21}={\tau}_{2}{\tau}_{1}$, and ${\mathrm{\Delta}\tau}_{31}={\tau}_{3}{\tau}_{1}$. Rows again represent the time since the last point, the columns represent the temporal distance between the more recent point on the left and the sole right point, and the third tensor dimension represents the temporal distance between two left points (Figure 4B). For this thirdorder kernel, we also summed along the rows for 0.75 s to define the summed kernel strength (Figure 4C),
Testing the significance of the measured thirdorder kernel with ‘null kernels’
Request a detailed protocolWe tested the significance of the measured kernel with synthetic null kernels (Figure 4C). We shifted the stimulus with 100 random temporal offsets (the offset was at least 2 seconds long), reversecorrelated these shifted stimuli with responses, and generated 100 synthetic null kernels. The 100 kernels extracted from the misaligned stimulus and response were used to test the significance of the real kernel. We calculated the summed kernel strength of these 100 null kernels, and built the null distribution of summed kernel strength and performed twotailed ztest. We tested kernel strength in the region of the kernels: ${\tau}_{3}{\tau}_{1}$ from 0 to 250 ms, and ${\tau}_{2}{\tau}_{1}$ from – 250 to 250 ms, which equaled $528=16\times 33$ kernel strengths in total. There are 43 significant (p <0.05) responses, and around 23% of the total significant points (10 in total) aggregated when ${\tau}_{1}{\tau}_{2}<83$ ms, $\left{\tau}_{3}{\tau}_{1}\right<83$ ms. Therefore, we further simplified our kernel by setting thirdorder kernel elements to zero when ${\tau}_{1}{\tau}_{2}\ge 83$ ms or $\left{\tau}_{3}{\tau}_{1}\right\ge 83$ ms, and denoted the 'cleaned' kernel as ${\hat{K}}_{LRLsymclean}^{\left(3\right)}\left({\tau}_{1},{\tau}_{2},{\tau}_{3}\right)$. To be consistent, we also set elements of the secondorder kernel to zero when ${\tau}_{1}{\tau}_{2}>83\phantom{\rule{thinmathspace}{0ex}}\phantom{\rule{thinmathspace}{0ex}}\mathrm{m}\mathrm{s}$, and denoted it as ${\hat{K}}_{LRsymclean}^{\left(2\right)}\left({\tau}_{1},{\tau}_{2}\right)$.
The exact pvalues for the summed thirdorder strength were shown for ${\tau}_{2}{\tau}_{1}=[133\text{ms},133\text{ms}]$ with 16.7 ms increment in the following table from top to bottom.
$\mathrm{\Delta}{\tau}_{31}=16$ ms  $\mathrm{\Delta}{\tau}_{31}=33.3$ ms  $\mathrm{\Delta}{\tau}_{31}=50$ ms 

0.4328  0.7506  0.3948 
0.9784  0.0514  0.1870 
0.7381  0.0587  0.0904 
0.3340  0.9650  0.5661 
0.4435  0.6801  0.6273 
0.2154  0.8081  0.8081 
0.0645  0.0942  0.0641 
0.0008  0.7077  0.0015 
<0.0001  0.3014  0.9246 
<0.0001  0.0823  0.9375 
0.8873  <0.0001  0.0149 
0.0177  0.3112  0.0001 
0.1011  0.0487  0.2435 
0.2565  0.3086  0.5983 
0.6328  0.5057  0.7153 
0.1037  0.4815  0.3233 
0.3677  0.2075  0.6670 
Comparing the measured thirdorder kernel with the glider responses
Request a detailed protocolWe measured the fly’s sensitivity to threepoint correlations using a suite of thirdorder glider stimuli. Overall, we presented $52=13\times 2\times 2$ different stimuli (See Visual stimuli in Materials and methods, Table 1, Figure 4—figure supplement 1A). Each glider stimulus elicited sustained turning responses (Clark et al., 2014), so we averaged the response over time and denote it as $r}_{\left(\mathrm{\Delta}{\tau}_{31},\mathrm{\Delta}{\tau}_{21},L/R,P\right)}^{\text{glider}$, where the subscript specifies the stimulus type. Since we assume the fly’s motion computation is mirror antisymmetric, we subtracted responses to the pairs of gliders with different directions but with the same temporal interval and polarity, and denote it as $r}_{\left(\mathrm{\Delta}{\tau}_{31},\mathrm{\Delta}{\tau}_{21},P\right)}^{\text{glider}$,
where $O$ denotes the orientation of the glider (i.e. left or right), and $O$ denotes the opposite orientation. We plotted 18 out of 26 averaged responses in Figure 4—figure supplement 1B.
In Table 1, we listed the number of flies tested for each glider (${n}_{P=1}$ is the number of flies tested with positive gliders, ${n}_{P=1}\mathrm{}$with negative gliders), and the pvalues of Student ttests, which were tested against zero response (${p}_{P=1}$is the significance level for positive gliders, ${p}_{P=1}$ for negative gliders).
The measured thirdorder kernel and the measured glider responses should both reflect the fly’s sensitivity to threepoint correlations. To test agreement between these two measurements, we used the measured thirdorder kernel to predict the fly’s responses to glider stimuli. We made the prediction by summing the ‘diagonal line’ of the thirdorder kernel. Specifically, we found the predicted response to specific thirdorder gliders by summing over all elements in the kernel with the same temporal differences as the glider:
The constant of $54\times 6$ takes into consideration the spatial summation all 54 putative EMDs and all six parts of thirdorder kernel in one EMD (Appendix 1), and ${\gamma}_{stim}=1$ is the contrast of the glider stimuli. For gliders who have two points on the right side, we used $K}_{RLRsym$ instead of $K}_{LRLsym$. Since the thirdorder kernel is agnostic to the polarity of the threepoint correlations and reflected only the average of the fly’s sensitivity to positive and negative correlations, we averaged the responses of positive and negative gliders. This neglects higherorder components of glider responses that could nevertheless be biologically meaningful.
We then compared the $r}_{\left(\mathrm{\Delta}{\tau}_{31},\mathrm{\Delta}{\tau}_{21}\right)}^{\text{gliderave}$ with $r}_{\left(\mathrm{\Delta}{\tau}_{31},\mathrm{\Delta}{\tau}_{21}\right)}^{\left(3\right)\text{pred}$ (Figure 4D).
Comparing the temporal structure of the second and thirdorder kernels
Request a detailed protocolTo compare the temporal structure of the two kernels, we first rearranged and combined the elements in the thirdorder kernel into a twodimensional representation and then rearrange the secondorder kernel in the impulse response format. Specifically, we rearranged the elements in the thirdorder kernel into ${\hat{K}}_{alignedLRL}^{\left(3\right)}\left(\tau ,\text{}\mathrm{\Delta}{\tau}_{LR},\mathrm{\Delta}{\tau}_{LL}\right)$, where $\tau $ corresponds to the time since the most recent point, ${\mathrm{\Delta}\tau}_{LR}$ is the average time difference between left points and the right point, and ${2\mathrm{\Delta}\tau}_{LL}$ is the temporal separation between the two left points (Figure 4—figure supplement 2A left). In particular,
where $\tau =min\left(\left\{{\tau}_{2}+\mathrm{\Delta}{\tau}_{LR}\mathrm{\Delta}{\tau}_{LL},{\tau}_{2},{\tau}_{2}+\mathrm{\Delta}{\tau}_{LR}+\mathrm{\Delta}{\tau}_{LL}\right\}\right)$. We similarly defined
where $\tau$ corresponds to the time since the most recent point, ${\mathrm{\Delta}\tau}_{RL}$ is the average time difference between right points and the left point, and ${2\mathrm{\Delta}\tau}_{RR}$ is the temporal separation between the two right points (Figure 4—figure supplement 2A right). Finally, we summed over the withinpoint time differences (${\mathrm{\Delta}\tau}_{RR}$ and ${\mathrm{\Delta}\tau}_{LL}$), and summed these two pieces to obtain a matrix (Figure 4—figure supplement 2B),
where $\mathrm{\Delta}t=1/60$ s.
${\hat{K}}_{align2D}^{\left(3\right)}\left(\tau ,\text{}\mathrm{\Delta}{\tau}_{RL}\right)$ has rows and columns that are conceptually comparable to those of ${\widehat{K}}_{impulse}^{\left(2\right)}\left(\tau ,{\mathrm{\Delta}\tau}_{21}\right)$, as rows represent times since the most recent point and columns describe the temporal distance between right and left points. However, in $\hat{K}}_{impulse}^{\left(2\right)$ the columns are spaced by 16.67 ms Figure 4—figure supplement 2D), whereas in $\hat{K}}_{align2D}^{\left(3\right)$ the columns are spaced by 8.33 ms (Figure 4—figure supplement 2B). This results from the fact that ${\mathrm{\Delta}\tau}_{21}$ is an integer number of frames in ${\hat{K}}_{impulse}^{\left(2\right)},$ whereas ${2\mathrm{\Delta}\tau}_{RL}$ is an integer number of frames in $\hat{K}}_{align2D}^{\left(3\right)$. We thus averaged two neighboring elements in $\hat{K}}_{align2D}^{\left(3\right)$ (Figure 4—figure supplement 2C), so that it has the same resolution as the $\hat{K}}_{impulse}^{\left(2\right)$.
We then summed both $\hat{K}}_{align2D}^{\left(3\right)$ and $\hat{K}}_{impulse}^{\left(2\right)$ in each column, and we rescaled the two summed kernels so that the norm of each was 1 (Figure 4E). In order to ease the visual comparison of the temporal structure between the two kernels, we also flipped the sign of the summed $\hat{K}}_{align2D}^{\left(3\right)$.
Comparing the second and thirdorder kernels with the singular value decomposition (SVD)
Request a detailed protocolWe factorized $\hat{K}}_{align2D}^{\left(3\right)$ and $\hat{K}}_{impulse}^{\left(2\right)$ into the products of a set of basis vectors with SVD (Figure 4—figure supplement 2EFG),
where ${U}^{\left(i\right)},{\mathrm{\Sigma}}^{\left(i\right)},{V}^{\left(i\right)},$ are the leftsingular vectors, singular values, and rightsingular vectors of the associated $i$thorder kernel. We use ${u}_{1}^{\left(2\right)},{v}_{1}^{\left(2\right)}$ (${u}_{1}^{\left(3\right)},{v}_{1}^{\left(3\right)}$) to denote the left and right singular vectors corresponding to the largest singular values. For visualization purposes, in Figure 4—figure supplement 2G, we flipped the sign of ${v}_{1}^{\left(3\right)}$ so that readers could visually compare the temporal structure of these two vectors.
Extracting kernels of various motion detectors that were optimized to predict image velocity in natural scenes
Request a detailed protocolWe characterized four other motion detectors (Figure 4—figure supplement 3) with Volterra kernels (Fitzgerald and Clark, 2015). These motion detectors have various physiological plausible structures and were optimized to predict image velocities in natural scenes. We fed the same stochastic binary stimuli sequence to these motion detectors, collected the corresponding responses, and extracted the second and thirdorder kernels using reversecorrelation. In Figure 4F, we presented the summed kernel strength of the thirdorder kernels. Note that we only presented several examples, and the spatiotemporal arguments of these examples are represented graphically.
Natural scene dataset
Request a detailed protocolWe used a natural scene dataset (Meyer et al., 2014), which contains 421 panoramic luminancecalibrated naturalistic twodimensional pictures. Each picture has $927\times 251$ pixels and subtends 360° horizontally and 97.4° vertically, so that the spatial resolution is ~0.30°/pixel. In our study, we used 1dimensional images, which were single rows from the twodimensional pictures. Therefore, there were $105671=421\times 251$ images in the dataset. We refer to the twodimensional scenes as pictures or photographs, and refer to the onedimensional slices as images.
Preprocessing photographs
Request a detailed protocolTo simulate the photoreceptors, we converted the luminance pictures into contrast pictures with a blurring step and contrast computation step.
First, to simulate the spatial resolution of the fly’s ommatidia, we blurred the original photograph (Figure 4—figure supplement 1B), denoted by $I$, with a twodimensional Gaussian filter ${f}_{\text{blur}}\left(x,y\right)$.
where $\ast$ denotes crosscorrelation. The filter extends to $\pm 3{\lambda}_{\text{blur}}$, that is $u}_{+}={v}_{+}=\left{u}_{}\right=\left{v}_{}\right=3{\lambda}_{\text{blur}$, where $\lambda}_{\text{blur}$ is related to fullwidthathalfmaximum (FWHM) by $\lambda}_{\text{blur}}=\frac{{\mathrm{F}\mathrm{W}\mathrm{H}\mathrm{M}}_{\text{blur}}}{2\sqrt{2\mathrm{ln}2}$, and we chose ${\mathrm{F}\mathrm{W}\mathrm{H}\mathrm{M}}_{\text{blur}}=5.3$° (Stavenga, 2003). The original pictures cover the full circular range horizontally, but only 97.4° vertically. When the range of the filter extended beyond the vertical boundary of the picture, we padded the picture by 'vertical reflection'. This reflection padding was also used when we calculated the local meanluminance. In Figure 1B, where we demonstrated one example picture, we did not perform this blurring step in order to preserve high spatial acuity such that it is pleasing for human eyes.
Second, we converted the luminance signals in the blurred photograph to contrast signals (Figure 1—figure supplement 1CD) (Fitzgerald and Clark, 2015),
where $c\left(x,y\right)$ is the contrast at each location $\left(x,y\right)$. $I}_{\text{mean}$ is the local mean luminance, which is the averaged luminance weighted by a twodimensional Gaussian spatial filter $f}_{\text{localmean}$. The length scale of $f}_{\text{localmean}$ can be equivalently described by $\lambda}_{\text{localmean}$ and $\mathrm{F}\mathrm{W}\mathrm{H}\mathrm{M}}_{\text{localmean}$, where $\lambda}_{\text{localmean}}=\frac{{\mathrm{F}\mathrm{W}\mathrm{H}\mathrm{M}}_{\text{localmean}}}{2\sqrt{2\mathrm{ln}2}$. We swept $\mathrm{F}\mathrm{W}\mathrm{H}{\mathrm{M}}_{\text{localmean}}$ from 10° to 75° (Figure 3—figure supplement 2ABC).
Alternatively, we also computed the local mean luminance over time instead of over space in Figure 3—figure supplement 2FG,
where ${I}_{\text{blurtime}}\left(x,y,t\right)$ is the simulated naturalistic moving luminance signal (Fitzgerald and Clark, 2015), and the local mean luminance was the averaged luminance signals over time, with the temporal filter ${f}_{\text{localmeantime}}\left(t\right)=\mathrm{exp}\left(t/{\tau}_{\text{localmean}}\right)$, where ${f}_{\text{localmeantime}}\left(t\right)$ is normalized to have a sum of 1. We swept $\tau}_{\text{localmean}$ from 10 ms to 500 ms (Figure 3—figure supplement 2FG).
Depending on the parameters in local mean computation, we had 20 natural scene datasets, including 14 datasets whose local mean luminance was computed statically and 6 dynamically. Unless specified, we used the natural scene dataset with contrast images preprocessed statically with ${\mathrm{F}\mathrm{W}\mathrm{H}\mathrm{M}}_{\text{localmean}}=25$°.
Eliminating the higher order structure of natural scene ensemble
Request a detailed protocolWe created a synthetic image dataset where we effectively preserved only the secondorder structure of natural scenes ensemble and eliminated the higherorder structure. We viewed each 1D image as a random vector determined by ${\text{P}}_{\text{natural}}\left(\mathit{X}\right),$where $\mathit{X}$ is a $n$dimensional vector ($n=927$ for 927 pixels). The covariance matrix of ${\text{P}}_{\text{natural}}\left(\mathit{X}\right)$ determines the point variance and pairwise spatial correlations in natural scenes. We intended to construct a Gaussian distribution ${\text{P}}_{\text{synthetic}}\left(\mathit{X}\right)$ such that its covariance matrix is the same as ${\text{P}}_{\text{natural}}\left(\mathit{X}\right)$. In this way, the image ensemble sampled from ${\text{P}}_{\text{synthetic}}\left(\mathit{X}\right)$ will contain the same secondorder statistics as the natural scene ensemble, but it would lack the higher order statistics present in the natural scene ensemble.
Because the pixels in images are horizontally translational invariant, the covariance matrix of ${\text{P}}_{\text{natural}}\left(\mathit{X}\right)$ is a circulant matrix and can be diagonalized by a discrete Fourier transform. Therefore, we constructed ${\text{P}}_{\text{synthetic}}\left(\mathit{X}\right)$ and generated the synthetic dataset in the frequency domain (Bialek, 2012). Operationally, we first performed a discrete Fourier transform (with fft function in Matlab v2018a, RRID:SCR_001622) on each onedimensional image in the natural scene dataset and obtained its Fourier domain representation ${\mathit{y}}^{\left(i\right)}=({y}_{{k}_{1}}^{\left(i\right)},{y}_{{k}_{2}}^{\left(i\right)},\dots ,{y}_{{k}_{n}}^{\left(i\right)})$, where $\left(i\right)$ denotes the $i$th images, and ${k}_{n}$ denotes the ${k}_{n}$th Fourier component. We then calculated the average power of frequency ${k}_{n}$, denoted as ${\sigma}_{{k}_{n}}^{2}$, where ${\sigma}_{{k}_{n}}^{2}=\frac{1}{m}\sum _{i}^{m}\left({y}_{{k}_{n}}^{\left(i\right)}\right){\left({y}_{{k}_{n}}^{\left(i\right)}\right)}^{*}$, and $*$ denotes complex conjugation. At each frequency ${k}_{n}$, we built two Gaussian distributions ${G}_{\mathrm{\Re}}\sim \mathcal{\mathcal{N}}\left(0,\text{}\frac{{\sigma}_{{k}_{n}}^{2}}{2}\right)$ and ${G}_{\mathrm{\Im}}\sim \mathcal{\mathcal{N}}\left(0,\text{}\frac{{\sigma}_{{k}_{n}}^{2}}{2}\right)$. To generate one synthetic image, we sampled two real numbers from these two distributions as the real and imaginary part of Fourier component of the synthetic image at frequency ${k}_{n}$. In the end, we performed an inverse Fourier transform to the sampled Fourier components to gain a synthetic image in the spatial domain. In total, we generated 1000 highorderstructureeliminated synthetic images, and refer to this synthetic image dataset to as the dataset in which higher order structure was eliminated.
Computing and manipulating statistics of individual onedimensional images
Request a detailed protocolThe natural scene dataset was comprised of an ensemble of heterogeneous images, and the statistics of different images can be drastically different from each other. Thus, we considered each onedimensional image to have its own contrast distribution, ${P}_{\text{pixel}}^{\left(i\right)}\left(X\right)$, where $i$ indexes the image and the contrast of each pixel in an image as an independent sample of the random variable $X$. For each natural image, we computed its sample mean, sample variance, sample skewness, and sample kurtosis of pixels, and show the histogram of these statistics (Figure 5—figure supplement 1 and Figure 5B).
We generated $14=2+10+2$ synthetic image datasets to mimic various statistical properties of natural scenes (Table 2). These 14 datasets differ in three parameters: (1) the contrast range; (2) whether the contrast skewness was constrained; and (3) the specific value of the imposed skewness when the skewness was constrained. We conceptually justify this image synthesis method in Appendix 2, and here we focus on methodological details. For every image in the natural scene dataset, we generated a corresponding image in each synthetic image dataset. This involved two steps.
In step one, we determined all relevant image statistics and generated the corresponding maximum entropy distribution (MED). Operationally, for each individual image, we found its contrast range, $\left[{c}_{min},{c}_{max}\right],$ the largest contrast magnitude, $\delta c=max\left({c}_{min},{c}_{max}\right),$ and the sample mean, ${c}_{\mu}$. We then derived the contrast range specified in Table 2, binned the range into $N$discrete levels, calculated the contrast frequency at each level to estimate ${P}_{\text{pixel}}^{\left(i\right)}\left(X\right),$ and estimated the contrast variance ${c}_{{\sigma}^{2}}$ and skewness ${c}_{\gamma}$ from this estimated distribution. Finally, we solved the MED (Appendix 2) with the constrained statistics specified in Table 2. In the 'imposed skewness' column, 'NA' means that the skewness was not constrained in the MED.
In step two, we generated the synthetic image. The solved MED captures the pixel statistics but does not capture any spatial correlations between pixels. Therefore, we decided to interpolate between sampled pixels to coarsely mimic the spatial correlations. Operationally, for an individual image, we calculated its spatial correlation function and found the cutoff distance, $\mathrm{\Delta}{x}_{\mathrm{\alpha}},$ where the correlation falls below $\alpha =0.2$. We then sampled contrast values independently from the solved MED and placed these contrasts $\mathrm{\Delta}{x}_{\alpha}$ pixels apart. Finally, we upsampled the lowresolution image to a highresolution image using the resample function in Matlab v2018a (RRID:SCR_001622).
Note that MED3 through MED7 are theoretically related to MED8 through MED12 by contrast inversion. We thus generated synthetic datasets MED8 to MED12 by simply inverting the contrast of images in dataset MED3 to MED7.
Simulating naturalistic motion with natural scenes
Request a detailed protocolTo simulate the fullfield motion signals induced by selfrotation in the natural environment, we rigidly translated images at various horizontal velocities (Fitzgerald and Clark, 2015). For each trial, we randomly chose one 1dimensional image from the dataset. We had 35 image datasets in total, including 20 natural scene datasets preprocessed with different parameters and 15 synthetic image datasets, we therefore built 35 naturalistic motion datasets, where all images were sampled from a particular image dataset in each motion dataset. We sampled the velocity in two ways. First, we sampled it from a Gaussian distribution with zero mean and standard deviation of 114 °/s. This standard deviation is the measured standard deviation of the spontaneous rotational turning speed of freely walking flies. In Figure 3—figure supplement 1, we varied the standard deviation of the Gaussian distribution from 32 °/s to 512 °/s. Second, we selected an image velocity at discrete values ranging from 0 °/s to 1000 °/s with a 10 °/s interval. Given an imagevelocity pair, we rigidly move this image with this velocity for one second. The temporal resolution is 60 Hz.
To eliminate any potential leftright asymmetry in naturalistic motion datasets, if we moved an image rightward at a certain speed, we always simulated a paired trial in which the same image was flipped horizontally and moved leftward with the same speed.
Calculating the responses of different motion detectors to naturalistic motion signals
Request a detailed protocolOur study concerns two motion detectors: the HRC and the measured kernels. For the HRC, we created an array of 54 overlapping HRC elementary motion detectors which extended 270° horizontally. We calculated the instantaneous HRCs’ outputs at the end of each trial and averaged them across space to get the model’s output $r}_{\text{HRC}$. For the measured kernels, we calculated the instantaneous output of the kernels at the end of each trial, including the secondorder response, ${r}^{\left(2\right)},$ the thirdorder response, ${r}^{\left(3\right)}$, and the total response, $r={r}^{\left(2\right)}+{r}^{\left(3\right)}$.
In Table 3, we listed the simulation parameters for each figure, including the motion detector, the image dataset, the velocity distribution, and the number of trials (imagevelocity pairs). If the dataset is natural scenes, we also listed its preprocessing parameter $\text{FWHM}}_{\text{localmean}$. If the velocity distribution is Gaussian, we listed its standard deviation.
For example, in Figure 3, we predicted the motion estimates of the measured kernel to the naturalistic motion. The naturalistic motion was created with images sampled from the natural scene dataset that has preprocessing parameter ${\text{FWHM}}_{\text{localmean}}=25$°, and created with velocity sampled from the Gaussian distribution that has standard deviation of 114 °/s. We simulated 10,000 independent trials, and had 20,000 trials after enforcing leftright symmetry.
Evaluating the performance of a motion detector
Request a detailed protocolWe denote the true velocity of the image as $v}_{\text{img}$ and the response of a motion detector as $v}_{\text{est}$. We assessed the accuracy of the motion estimation using two metrics. First, we measured the Pearson correlation $\rho $ between $v}_{\text{img}$ and $v}_{\text{est}$,
where the variances and covariances are evaluated across any of the naturalistic motion datasets introduced above. To estimate the uncertainty in $\rho $ induced by finite sample sizes, we randomly separated all independently simulated trials into 10 groups, calculated the Pearson correlation for each group, and estimated the SEM of the Pearson correlation across the groups. Second, when the image velocity was sampled at discrete values, we measured the variance of $v}_{\text{est}$ conditional on each possible image velocity, $\text{var}\left({v}_{\text{est}}{v}_{\text{img}}={v}_{0}\right)$.
Evaluating the improvement added by the thirdorder response
Request a detailed protocolTo evaluate how much improvement was added by the thirdorder response to the secondorder response, we calculated the relative Pearson correlations: $\mathrm{i}\mathrm{m}\mathrm{p}\mathrm{r}\mathrm{o}\mathrm{v}\mathrm{e}\mathrm{m}\mathrm{e}\mathrm{n}\mathrm{t}=\frac{{\rho}^{(2+3)}{\rho}^{\left(2\right)}}{{\rho}^{\left(2\right)}}$. As in Evaluating the performance of a motion detector, to estimate the uncertainty induced by finite sample sizes, we separated all trials into 10 groups, calculated the Pearson correlation for each group, calculated the improvements in each group, and estimated the SEM of the improvements across the groups.
Assessing the empirical weighting of the secondorder and thirdorder responses
Request a detailed protocolWe modeled the image velocity as a linear combination of the secondorder and thirdorder responses
and estimated the optimal weighting coefficients, $\left({\hat{\alpha}}^{(2)},{\hat{\alpha}}^{\left(3\right)}\right)$, using ordinary least square regression. We calculated $\rho}_{\text{best}}=\frac{\text{cov}\left({r}_{\text{best}},{v}_{\text{img}}\right)}{\sqrt{\text{var}\left({r}_{\text{best}}\right)}\sqrt{\text{var}\left({v}_{\text{img}}\right)}$, where $r}_{\text{best}}={\hat{\alpha}}^{\left(2\right)}{r}^{\left(2\right)}+{\hat{\alpha}}^{\left(3\right)}{r}^{\left(3\right)$. We computed the relative weighting coefficient as $w={\widehat{\alpha}}^{\left(2\right)}/{\widehat{\alpha}}^{\left(3\right)}$.
Calculating the residual of secondorder response
Request a detailed protocolThe secondorder response can be viewed as a function of the image velocity, as well as noise that depends on image structure:
We estimated the noise term as
where $\hat{\beta}}^{\left(2\right)$ minimized the squared residual and $\hat{\u03f5}}^{\left(2\right)$ denotes the estimated residual (noise).
Appendix 1
Modeling the fly's optomotor turning response with Volterra kernels
The visual motion system of the fly can generally be considered as a nonlinear mapping from spatiotemporal visual stimuli, denoted $s(x,t)$, to behavioral turning responses, denoted $r\left(t\right)$. In general, any timeinvariant nonlinear system can be written with a Volterra series,
where each ${H}^{\left(n\right)}$ is a convolutional operator,
${\eta}_{i}$ are spatial coordinates, ${\tau}_{i}$ are temporal coordinates, and ${K}^{\left(n\right)}$ is termed the $n$thorder Volterra kernel (Marmarelis and McCann, 1973; Schetzen, 1980). The system is fully specified by these kernels. In this section, we describe how we use this framework to model the fly’s visual system. We make several simplifying assumptions about the fly’s visual system.
Assumption 1. The system is thirdorder. This implies, ${K}^{\left(n\right)}\left({\eta}_{1},\dots ,{\eta}_{n},{\tau}_{1},\dots ,{\tau}_{n}\right)=0$ for $n>3.$ This choice aims to account for the system’s potential ability to asymmetrically process lightdark signals while minimally extending the canonical secondorder algorithm.
Assumption 2. The system is spatiallyinvariant. This implies that
Assumption 3. The system consists of a spatial array of elementary motion detectors (EMDs), and each EMD responds to inputs that are close in space. Since the spatial distance between neighboring ommatidia in fruit fly is around 5°, we discretize space into pixels with $\mathrm{\Delta}x=5\xb0$ resolution, and we use integer indices to denote the relative spatial location. We further assume that the fly EMD does not respond to interactions between points that are spaced more than $\mathrm{\Delta}x$ apart,
This modeling simplification is frequently made in fly neuroscience (Buchner, 1976), but modern receptive field measurements in individual motion detectors in Drosophila suggest that they integrate over a wider field of view (Leong et al., 2016; SalazarGatzimas et al., 2016) .
Assumption 4. The system is mirror antisymmetric. This means that if $r\left(t\right)$ is the system’s response to stimulus $s\left(x,t\right)$, then the system will respond to $s(x,t)$ with response $r\left(t\right)$. This is an important assumption for motion processing systems, and we will discuss its implications for Volterra kernels later in this section.
Given the first three assumptions, we model the fly’s turning response $r\left(t\right)$ as
where
The response $r\left(t\right)$ is the sum of responses contributed by the $i$thorder kernel at different spatial locations $\xi$, denoted with ${r}_{\xi}^{\left(i\right)}\left(t\right)$. ${K}^{\left(1\right)}\left({\tau}_{1}\right),\text{}{K}^{\left(2\right)}\left({\eta}_{1},{\eta}_{2},{\tau}_{1},{\tau}_{2}\right),\text{}\mathrm{a}\mathrm{n}\mathrm{d}\text{}{K}^{\left(3\right)}\left({\eta}_{1},{\eta}_{2},{\eta}_{3},{\tau}_{1},{\tau}_{2},{\tau}_{3}\right)$ approximates how each EMD maps the inputs $s\left({x}_{\xi},t\right)$ and $s\left({x}_{\xi}+\mathrm{\Delta}x,t\right)$ into its output. ${N}_{EMD}=54$ is the number of putative EMD units.
Next, we will simplify the notations for ${K}^{\left(2\right)}\left({\eta}_{1},{\eta}_{2},{\tau}_{1},{\tau}_{2}\right)$ and ${K}^{\left(3\right)}\left({\eta}_{1},{\eta}_{2},{\eta}_{3},{\tau}_{1},{\tau}_{2},{\tau}_{3}\right)$ , and discuss symmetries in these kernels.
The secondorder kernel ${K}^{\left(2\right)}\left({\eta}_{1},{\eta}_{2},{\tau}_{1},{\tau}_{2}\right)$ is a fourdimensional tensor, with 2 dimensions in time and 2 dimensions in space. Since $\left{\eta}_{1}{\eta}_{2}\right\le \mathrm{\Delta}x$ , the values that can be taken are limited 2 discrete points, and all possible combinations of ($\eta}_{1},{\eta}_{2$) are $(L,L)$, $(L,R)$, $(R,L)$ and $(R,L$ ), where $L(R)$ denotes left (right). We replace the spatial arguments with subscripts, and the 4dimensional ${K}^{\left(2\right)}\left({\eta}_{1},{\eta}_{2},{\tau}_{1},{\tau}_{2}\right)$ can be rewritten with four 2dimensional kernels,
In ${K}_{{\eta}_{1},{\eta}_{2}}^{\left(2\right)}\left({\tau}_{1},{\tau}_{2}\right),$ the first (second) temporal argument is related to the spatial location represented by the first (second) subscript. For example, ${K}_{LR}^{\left(2\right)}\left({\tau}_{1},{\tau}_{2}\right)$ means ${K}^{\left(2\right)}\left({\eta}_{1}=L,{\eta}_{2}=R,{\tau}_{1},{\tau}_{2}\right)$.
There are three types of symmetries in these ${K}_{{\eta}_{1},{\eta}_{2}}^{\left(2\right)}\left({\tau}_{1},{\tau}_{2}\right)$. First, since the system is translationallyinvariant, ${K}_{LL}^{\left(2\right)}\left({\tau}_{1},{\tau}_{2}\right)={K}_{RR}^{\left(2\right)}\left({\tau}_{1},{\tau}_{2}\right)$. Second, ${K}_{LR}^{\left(2\right)}\left({\tau}_{1},{\tau}_{2}\right)$ describes how the system responds to ${s}_{L}\left(t{\tau}_{1}\right){s}_{R}\left(t{\tau}_{2}\right)$, and ${K}_{RL}^{\left(2\right)}\left({\tau}_{2},{\tau}_{1}\right)$ describes how the system responds to ${s}_{R}\left(t{\tau}_{2}\right){s}_{L}\left(t{\tau}_{1}\right)$.
Because multiplication operator is commutative, that is $\text{}{s}_{L}\left(t{\tau}_{1}\right){s}_{R}\left(t{\tau}_{2}\right)={s}_{R}\left(t{\tau}_{2}\right){s}_{L}\left(t{\tau}_{1}\right)$, one should have ${K}_{LR}^{\left(2\right)}\left({\tau}_{1},{\tau}_{2}\right)={K}_{RL}^{\left(2\right)}\left({\tau}_{2},{\tau}_{1}\right)$.
Therefore, the secondorder response of an EMD can be simplified as,
where we replace the spatial arguments in stimulus with discrete subscripts $\xi$ as well.
Third, to derive the consequences of mirror antisymmetry, note that the secondorder kernel is locally sensitive to two discrete points in space, $s\left(x,t\right)=\left[{s}_{L}\left(t\right),\text{}{s}_{R}\left(t\right)\right]$. Mirrorantisymmetry states that if the system’s response to ${s}_{L}\left(t\right)=a\left(t\right)$ and ${s}_{R}\left(t\right)=b\left(t\right)$ is ${s}_{R}\left(t\right)=b\left(t\right)$, then the system’s response to the reflected stimulus, ${s}_{L}\left(t\right)=b\left(t\right)$ and ${s}_{R}\left(t\right)=a\left(t\right)$, is $r\left(t\right)$, where $a\left(t\right)$ and $b\left(t\right)$ are arbitrary functions of time. The system’s response when ${s}_{L}\left(t\right)=a\left(t\right)$ and ${s}_{R}\left(t\right)=b\left(t\right)$ is
and the response when ${s}_{L}\left(t\right)=b\left(t\right)$ and ${s}_{R}\left(t\right)=a\left(t\right)$ is
If ${r}_{\xi ,ba}^{\left(2\right)}\left(t\right)={r}_{\xi ,ab}^{\left(2\right)}\left(t\right)$ for any $a\left(t\right)$ and $b\left(t\right)$, then one must have
and
Thus, the general expression for the secondorder response of the EMD further simplifies to
We analyzed the thirdorder kernel in a similar manner. It is a 6dimensional tensor, with 3 dimensions in time and 3 dimensions in space. Its spatial arguments are also limited to two points, and all possible combinations of $\left({\eta}_{1},\text{}{\eta}_{2},\text{}{\eta}_{3}\right)$ are $(L,L,L)$, $(R,R,R)$, $(L,L,R)$, $(L,R,L)$, $(R,L,L)$, $(R,R,L)$, $\left(R,R,L\right)$ and $(L,R,R)$. We replace the spatial arguments with subscripts, and represent the 6dimensional thirdorder kernel with eight 3dimensional kernels,
As was the case of the secondorder kernel, in ${K}_{{\eta}_{1},{\eta}_{2},{\eta}_{3}}^{\left(3\right)}\left({\tau}_{1},{\tau}_{2},{\tau}_{3}\right)$ the first (second, third) temporal argument is related to the spatial location denoted with the first (second, third) subscript. For example, ${K}_{LLR}^{\left(3\right)}\left({\tau}_{1},{\tau}_{2},{\tau}_{3}\right)={K}^{\left(3\right)}\left({\eta}_{1}=L,{\eta}_{2}=L,{\eta}_{3}=R,{\tau}_{1},{\tau}_{2},{\tau}_{3}\right)$.
As before, these eight blocks are redundant. From spatialinvariance and the commutative property of multiplication, one finds
For each 3D tensor, at least two out of the three spatial arguments are the same, so there is symmetry within the tensor. For example,
and
Following this logic, the thirdorder response of an EMD can be written as
One can again apply the definition of mirror antisymmetry to find
Thus, with mirror antisymmetry, the thirdorder response of an EMD can be simplified as,
We have shown how mirror antisymmetry manifest in the second and thirdorder kernels. In a similar way, for the zero and firstorder kernel, mirror antisymmetry implies that,
Intuitively, $r}_{0$ term is the response of the EMD when there is no visual input. A reasonable motion detector should not be biased to turn left or right when there is no visual input. The firstorder kernel describes how the response is influenced by stimulus at a single spatial location, which should not give any motion information, thus ${K}^{\left(1\right)}\left({\tau}_{1}\right)=0$.
In summary, in our model, the turning response is
where ${r}_{\text{}\xi \text{}}\left(t\right)$ is the response of each EMD at spatial location $\xi$. In the experiments, we discretized time into bins of size $\mathrm{\Delta}t=1/60$ seconds, so the integrals become summations, and we write the local responses as
Appendix 2
Manipulating image statistics with maximum entropy distributions
To investigate how the skewness of the image contrast distribution influences motion detection, we want to develop a method for generating synthetic images with specified skewness values. In Section 1, we formalize this image synthesis problem and provide a rationale for the maximum entropy method. In Section 2, we derive formulae for finding maximum entropy distributions (MED) with constrained lowerorder moments. In Section 3, we discuss how the contrast range sets implicit constraints on the MED.
1. Motivation
We consider each 1dimensional natural image to have its own contrast distribution, ${P}_{pixel}^{\left(i\right)}\left(X\right)$, such that the contrast of each pixel in the $i}^{th$ image is an independent sample from ${P}_{pixel}^{\left(i\right)}\left(X\right)$. We would like to generate a synthetic contrast distribution, ${P}_{syn}^{\left(i\right)}\left(X\right)$, such that ${P}_{syn}^{\left(i\right)}\left(X\right)$ shares certain loworder contrast statistics with ${P}_{pixel}^{\left(i\right)}\left(X\right)$. More specifically, we want ${P}_{syn}^{\left(i\right)}\left(X\right)$ and ${P}_{pixel}^{\left(i\right)}\left(X\right)$ to have matched means, variances, and/or skewness levels. However, it’s important to recognize that ${P}_{syn}^{\left(i\right)}\left(X\right)$ is ambiguously determined by its lowerorder moments, because many distributions share finite set of moments. Nevertheless, among all of these qualified distributions, the distribution with maximal entropy is unique. The maximum entropy requirement thus implicitly specifies all unconstrained statistics. Maximum entropy distributions are also conceptually appealing because they have the least statistical structure consistent with the set of chosen constraints. Here we use maximum entropy distributions to generate synthetic images with specific statistics by sampling pixel contrasts from ${P}_{syn}^{\left(i\right)}\left(X\right)$. By manipulating the statistics that define ${P}_{syn}^{\left(i\right)}\left(X\right)$, we are able to manipulate the contrast statistics of the synthetic image.
2. Solving maximum entropy distribution by minimizing the free energy function
To illustrate the structure of the maximum entropy distribution (MED) with constrained moments, we concretely consider the example when the MED is constrained to have a specific mean, variance, and skewness. We refer to this distribution as the meanvarianceandskewness constrained MED. Since constraining the mean, variance, and skewness is equivalent to constraining the first, second and third moments, we equivalently refer to this distribution as the thirdorder MED.
The entropy of a random variable $X$ is
where $P\left(X\right)$ is the probability distribution of $X$, and $\text{E}\left[\right]$ is the expectation operator over $P\left(X\right)$. In particular, $\mathrm{E}\left[\right]=\int dxP\left(X=x\right)\left[\right]$. By definition, the thirdorder MED is the probability distribution, ${P}^{\ast}\left(X\right)$, such that
subject to,
where $\mu}_{1$, $\mu}_{2$ and $\mu}_{3$ are the first, second, and third moments. To optimize for ${P}^{\ast}\left(X\right)$, one can introduce Lagrange multipliers, $\lambda}_{0},{\lambda}_{1},{\lambda}_{2$ and $\lambda}_{3$, to enforce the normalization of the probability distribution and the three moment constraints. The problem is thus transformed into extremizing
Note that $L\left(P\left(X\right),\lambda \right)$ is a functional of $P\left(X\right)$, and a necessary condition for $P\left(X\right)$ to extremize $L$ is
This implies that thirdorder MED has the form
and the Lagrange multipliers must be solved to satisfy $\int dxP\left(X=x\right)=1,$ and $\text{E}\left[{X}^{i}\right]={\mu}_{i},\text{}i=1,2,3.$
An alternative to solving the nonlinear constraintsatisfaction equations for the Lagrange multipliers is to find ${\mathit{\lambda}}^{\ast}=\left({\lambda}_{1}^{\ast},{\lambda}_{2}^{\ast},{\lambda}_{3}^{\ast}\right)$ such that
where
Here $Q\left(x,\mathit{\lambda}\right)$ can be thought of an unnormalized probability distribution for $X$, $\mathrm{Z}\left(\mathit{\lambda}\right)$ is the normalizing factor, and $F\left(\mathit{\lambda}\right)$ is the log of this normalizing factor. Readers familiar with statistical mechanics will recognize $F\left(\mathit{\lambda}\right)$ as the free energy function and $\mathrm{Z}\left(\mathit{\lambda}\right)$ as the partition function. It turns out that the thirdorder MED is
This distribution manifestly has the right functional form, so to prove that it is the thirdorder MED we merely need to show that all constraints are satisfied by the distribution. Indeed, the constraint that $\int dxP\left(X=x\right)=1$ is trivially satisfied because the partition function $\mathrm{Z}\left({\mathit{\lambda}}^{\mathbf{\ast}}\right)$ normalizes $Q\left(x,{\mathit{\lambda}}^{\mathbf{\ast}}\right)$. Since $\mathit{\lambda}}^{\mathbf{\ast}$ minimizes the free energy, we also know
This derivative is easily evaluated as
where ${E}_{\lambda}\left[\right]$ is the expectation operator over ${P}_{\lambda}\left(x\right)=Q\left(x,\mathit{\lambda}\right)/\mathrm{Z}\left(\mathit{\lambda}\right)$. Consequently,
and minimizing the free energy provides parameters that guarantee that the constraints are satisfied.
This free energy formulation is computationally convenient because $F\left(\mathit{\lambda}\right)$ is a convex function that can be easily minimized using powerful techniques from convex optimization. To see this, note that the matrix of second derivatives is,
We showed above that
Similarly, we next note that
Therefore, these two terms together imply that
Since the second derivative of $F\left(\mathit{\lambda}\right)$ is a covariance matrix for all values of $\mathit{\lambda}$, and all covariance matrices are positive semidefinite, this implies $F\left(\mathit{\lambda}\right)$ is a convex function.
Similarly, one can find the maximum entropy distribution (MED) with constrained mean and variance by minimizing another free energy function:
where
and the maximum entropy distribution will be
We refer this distribution as the meanandvarianceconstrained MED or secondorder MED.
In practice, we solved each MED by numerically minimizing its associated free energy function with the fminunc function in MATLAB v2018a.
3. Contrast bounds set implicit constraints on the MED
In the previous section, we emphasized that the free energy, $F$, depends on the Lagrange multipliers parameters $\text{}\mathit{\lambda}$, because optimizing over these parameters allowed us to find the corresponding MED. However, the free energy also depends on the constrained moments, $\mathit{\mu}$, which entered the formulae as a set of fixed parameters. Moreover, the free energy depends on the state space of the random variable, which is the set of values that $X$ can take. For continuous variables, this manifests as the integral bound, $\left[a,b\right]$, in the partition function, $\mathrm{Z}\left(\mathit{\lambda}\right)={\int}_{a}^{b}dxQ\left(x,\text{}\mathit{\lambda}\right)$. Therefore, we generally expect
Since $X$ represents the pixel contrast, we refer to $\left[a,b\right]$ as the contrast range and $a$ or $b$ separately as contrast bounds. In this section, we discuss in detail how $\mathit{\lambda}}^{\mathbf{\ast}$ depends on $\left[a,b\right]$. Since $\left[a,b\right]$ influences the MED, we say $\left[a,b\right]$ sets an implicit constraint on the MED.
A natural choice of the contrast range is the entire real line. For example, when one constrains the mean and variance of a distribution and defines $X$ on the entire real line, then the MED is the Gaussian distribution. Moreover, by rewriting the probability density of the Gaussian
in a slightly different form
we may identify the Lagrange multipliers that minimize the free energy as, $\lambda}_{1}^{\ast}=\frac{\mu}{{\sigma}^{2}$, $\lambda}_{2}^{\ast}=\frac{1}{2{\sigma}^{2}$.
In the Gaussian distribution, the probability density drops to zero very fast as $x$ goes to positive or negative infinity, because $\lambda}_{2}{x}^{2$ is a large negative number in either case. However, when one wants to find an MED whose highestorder constraint is an oddorder moment, a finite contrast bound becomes necessary. This is because the sign of $\lambda}_{2n+1}{x}^{2n+1$ depends on the sign of $x$ for any natural number $n$, thereby causing the probability density to approach zero on one half of the real line and to explode on the other half. To illustrate this point concretely, consider the firstorder MED that has a constrained mean. The firstorder MED must take the form
In this distribution, $\mathrm{exp}\left({\lambda}_{1}\left(x{\mu}_{1}\right)\right)$ blows up when $x$ either goes to $+\mathrm{\infty}$ or $\mathrm{\infty}$ (unless ${\lambda}_{1}=0$). In any case, no normalizing factor $Z$ could satisfy ${\int}_{\mathrm{\infty}}^{+\mathrm{\infty}}\frac{1}{Z}\mathrm{exp}\left({\lambda}_{1}\left(x{\mu}_{1}\right)\right)=1$. Therefore, the firstorder MED is generally illdefined on the real line.
We thus sought to compute the dependence of the firstorder MED on its contrast range. The parameters in the MED minimize the free energy
Without loss of generality, we can shift the $x$axis such that the contrast range is symmetric around 0. Setting ${x}^{\mathrm{\prime}}=x\frac{1}{2}\left(b+a\right)$, we find
where ${b}^{\mathrm{\prime}}=\frac{1}{2}\left(ba\right)$ and ${\mu}^{\mathrm{\prime}}={\mu}_{1}\frac{1}{2}\left(b+a\right)$.
This optimization problem for the firstorder MED can be solved analytically. First, note that the partition function is
This partition function is continuous at $\lambda =0$. To see this, note that
where we used L’Hôpital’s rule to evaluate the limit. Therefore the free energy,
is also continuous at $\lambda =0$. Furthermore, the partition function is differentiable at $\lambda =0$. In particular,
where we again used L’Hôpital’s rule. Consequently, the derivative of the free energy at $\lambda =0$ is
The derivative for $\lambda \ne 0$ is straightforward to evaluate, and we find
Therefore, the final formula is
We determine $\lambda}^{\ast$ by setting this expression equal to zero.
Most simply, if ${\mu}^{\mathrm{\prime}}=0$, then ${\lambda}^{\ast}=0$ minimizes the free energy, and the MED is a uniform distribution. This result is intuitive. Without an explicit mean constraint, the zerothorder MED with a finite contrast bound is a uniform distribution with zero mean, and the requirement that the distribution have zero mean is already satisfied.
If ${\mu}^{\mathrm{\prime}}\ne 0$, we need to solve
We rearrange this expression to find
where $\mathrm{coth}\left(x\right)$ is the hyperbolic cotangent function. If we define $f\left(x\right)=\mathrm{coth}\left(x\right)\frac{1}{x},$ then we can rewrite the above equation as
and we get
From the above equation, we observe that $\lambda}^{\ast$ depends only on $\frac{{\mu}^{\mathrm{\prime}}}{{b}^{\mathrm{\prime}}}$ and $\frac{1}{{b}^{\mathrm{\prime}}}$. From Appendix 2—figure 1A, we see that $\lambda}^{\ast$ monotonically increases with $\frac{{\mu}^{\mathrm{\prime}}}{{b}^{\mathrm{\prime}}}$. Intuitively, $\frac{{\mu}^{\mathrm{\prime}}}{{b}^{\mathrm{\prime}}}$ sets the relative scale of $\mu}^{\mathrm{\prime}$ in the contrast range and the degree of asymmetry in the distribution. For example, when $\mu}^{\mathrm{\prime}$ is zero, we know that $\lambda}^{\ast$ is zero and the MED is perfectly symmetric. As $\mu}^{\mathrm{\prime}$ deviates from 0, $\lambda}^{\ast$ deviates from zero and the distribution distorts asymmetrically from the uniform distribution to satisfy the nonzero mean. As the ratio of $\mu}^{\mathrm{\prime}$ and $b}^{\mathrm{\prime}$ grows larger, $\lambda}^{\ast$ has to deviate more from 0, and the distribution becomes increasingly asymmetric. When $\frac{{\mu}^{\mathrm{\prime}}}{{b}^{\mathrm{\prime}}}$ is small, $\lambda}^{\ast$ is roughly linear in $\frac{{\mu}^{\mathrm{\prime}}}{{b}^{\mathrm{\prime}}}$. Also note that there is a simple proportionality between $\lambda}^{\ast$ and $\frac{1}{{b}^{\mathrm{\prime}}}$. This dependence is easily understood by the requirement that ${\lambda}^{\ast}x$ is dimensionless.
The meanandvarianceconstrained MED also depends on the finite contrast range. To simplify the discussion, we again set the zero point of xaxis such that the contrast range is symmetric around 0. Thus, our task is to find $\lambda}_{1},{\lambda}_{2$ that minimize the free energy function,
where $\mu}_{1$ and $\mu}_{2$ are the shifted firstorder and secondorder moments imposed on the MED, and $[b,b]$ is the shifted contrast range. As was the case for the firstorder MED, when ${\mu}_{1}=0$, the distribution is symmetric and ${\lambda}_{1}^{\ast}=0$. Furthermore, as the contrast bound becomes large, the distribution approaches the Gaussian distribution. However, when ${\mu}_{1}\ne 0$, the distribution has to distort asymmetrically to achieve the nonzero mean, with large distortions occurring when the mean is within a few standard deviations of the boundary (Appendix 2—figure 1B). As a result, even though we did not require the distribution to have a specific skewness, the secondorder MED can have nonzero skewness (Appendix 2—figure 1C).
These mathematical considerations lead to the design of our numerical methods. Operationally, for each individual image, we found its contrast range, $\left[{c}_{min},{c}_{max}\right],$ and the largest contrast magnitude, $\delta c=max\left({c}_{min},{c}_{max}\right)$. Then we constructed a symmetric contrast range around the constrained mean $\mu}_{1$, and computed the MED on $\left[{\mu}_{1}\delta c,{\mu}_{1}+\delta c\right]$ . This avoided inducing nonzero skewness in the meanandvarianceconstrained MED. For consistency, we used the same symmetric contrast range for the meanvarianceandskewnessconstrained MED. Since the contrast bound implicitly influences the shape of the MED and different individual images have different contrast ranges, this method takes guidance from natural scenes and uses heterogenous contrast ranges that match the contrast range of each original natural scene. However, this heterogeneity introduces variability across images that could potentially impact the estimation accuracy we observed. Therefore, we also simulated secondary datasets where we used a common symmetric bound for all synthetic images. We chose this bound such that we could successfully solve the MED for the mean, variance, and skewness levels of most natural images.
References

1
Spatiotemporal energy models for the perception of motionJournal of the Optical Society of America A 2:284–299.https://doi.org/10.1364/JOSAA.2.000284
 2
 3

4
Receptive fields and functional architecture in the retinaThe Journal of Physiology 587:2753–2767.https://doi.org/10.1113/jphysiol.2009.170704
 5
 6

7
A maximum entropy approach to natural language processingComputational Linguistics 22:39–71.
 8
 9

10
Principles of visual motion detectionTrends in Neurosciences 12:297–306.https://doi.org/10.1016/01662236(89)900106

11
Common circuit design in fly and mammalian motion visionNature Neuroscience 18:1067–1076.https://doi.org/10.1038/nn.4050

12
Elementary movement detectors in an insect visual systemBiological Cybernetics 24:85–101.https://doi.org/10.1007/BF00360648
 13
 14

15
Visually mediated motor planning in the escape response of DrosophilaCurrent Biology 18:1300–1307.https://doi.org/10.1016/j.cub.2008.07.094
 16

17
Adaptation to temporal contrast in primate and salamander retinaThe Journal of Neuroscience 21:9904–9916.https://doi.org/10.1523/JNEUROSCI.212409904.2001

18
Functional asymmetries in ON and OFF ganglion cells of primate retinaThe Journal of Neuroscience 22:2737–2747.https://doi.org/10.1523/JNEUROSCI.220702737.2002
 19
 20

21
Parallel computations in insect and mammalian visual motion processingCurrent Biology 26:R1062–R1072.https://doi.org/10.1016/j.cub.2016.08.003
 22

23
A flexible geometry for panoramic visual and optogenetic stimulation during behavior and physiologyJournal of Neuroscience Methods 323:48–55.https://doi.org/10.1016/j.jneumeth.2019.05.005

24
Theoretical Neuroscience: Computational and Mathematical Modeling of Neural SystemsMIT Press.
 25
 26

27
Accuracy of velocity estimation by Reichardt correlatorsJournal of the Optical Society of America A 18:241–252.https://doi.org/10.1364/JOSAA.18.000241

28
DirectionSelective calcium signals in starburst cell dendritesInvestigative Ophthalmology & Visual Science 43:4763.https://doi.org/10.1038/nature00931
 29
 30
 31
 32

33
Visual perception and the statistical properties of natural scenesAnnual Review of Psychology 59:167–192.https://doi.org/10.1146/annurev.psych.58.110405.085632
 34

35
Visual control of locomotion in the walking fruitfly DrosophilaJournal of Comparative Physiology 85:235–266.https://doi.org/10.1007/BF00694232
 36
 37

38
Systemtheoretische analyse der zeit, Reihenfolgen und vorzeichenauswertung bei der bewegungsperzeption des rüsselkäfers ChlorophanusZeitschrift Für Naturforschung B 11:513–524.https://doi.org/10.1515/znb195691004
 39

40
Local paths to global coherence: cutting networks down to sizePhysical Review E 89:032802.https://doi.org/10.1103/PhysRevE.89.032802
 41

42
Information theory and statistical mechanicsPhysical Review 106:620–630.https://doi.org/10.1103/PhysRev.106.620

43
Faster thalamocortical processing for dark than light visual targetsJournal of Neuroscience 31:17471–17479.https://doi.org/10.1523/JNEUROSCI.245611.2011

44
Visual pattern discriminationIEEE Transactions on Information Theory 8:84–92.https://doi.org/10.1109/TIT.1962.1057698
 45

46
Visual discrimination of textures with identical thirdorder statisticsBiological Cybernetics 31:137–140.https://doi.org/10.1007/BF00336998
 47
 48
 49
 50

51
Inferring hidden structure in multilayered neural circuitsPLOS Computational Biology 14:e1006291.https://doi.org/10.1371/journal.pcbi.1006291
 52
 53
 54
 55

56
From understanding computation to understanding neural circuitryNeurosciences Research Program Bulletin, 15.
 57

58
Deep learning models of the retinal response to natural scenesAdvances in Neural Information Processing Systems 29:1369–1377.
 59

60
The analysis of moving visual patternsIn: C Chagas, R Gattass, C Gross, editors. Pattern Recognition Mechanisms. SpringerVerlag Berlin Heidelberg. pp. 117–151.

61
Perceptual interaction of local motion signalsJournal of Vision 16:22.https://doi.org/10.1167/16.14.22
 62

63
The statistics of local motion signals in naturalistic moviesJournal of Vision 14:10.https://doi.org/10.1167/14.4.10
 64

65
Symmetry breakdown in the ON and OFF pathways of the retina at night: functional implicationsJournal of Neuroscience 30:10006–10014.https://doi.org/10.1523/JNEUROSCI.561609.2010

66
Triplets of spikes in a model of spike timingdependent plasticityJournal of Neuroscience 26:9673–9682.https://doi.org/10.1523/JNEUROSCI.142506.2006

67
Considerations on models of movement detectionKybernetik 13:223–227.https://doi.org/10.1007/BF00274887

68
A parametric texture model based on joint statistics of complex wavelet coefficientsInternational Journal of Computer Vision 40:49–70.https://doi.org/10.1023/A:1026553619983

69
Statistical mechanics and visual signal processingJournal De Physique 4:1755.https://doi.org/10.1051/jp1:1994219
 70

71
PathwaySpecific asymmetries between ON and OFF visual signalsThe Journal of Neuroscience 38:9728–9740.https://doi.org/10.1523/JNEUROSCI.200818.2018

72
Lightness scale from image intensity distributionsApplied Optics 21:2569–2582.https://doi.org/10.1364/AO.21.002569
 73
 74
 75
 76

77
Optimal prediction in the retina and natural motion statisticsJournal of Statistical Physics 162:1309–1323.https://doi.org/10.1007/s109550151439y
 78
 79
 80
 81
 82

83
A model of neuronal responses in visual area MTVision Research 38:743–761.https://doi.org/10.1016/S00426989(97)001831

84
Natural image statistics and neural representationAnnual Review of Neuroscience 24:1193–1216.https://doi.org/10.1146/annurev.neuro.24.1.1193
 85
 86
 87
 88

89
Angular and spectral sensitivity of fly photoreceptors II. Dependence on facet lens Fnumber and rhabdomere type in DrosophilaJournal of Comparative Physiology A 189:189–202.https://doi.org/10.1007/s0035900303906
 90
 91

92
Impact of network structure and cellular response on spike time correlationsPLOS Computational Biology 8:e1002408.https://doi.org/10.1371/journal.pcbi.1002408

93
Local image statistics: maximumentropy constructions and perceptual salienceJournal of the Optical Society of America A 29:1313–1345.https://doi.org/10.1364/JOSAA.29.001313
 94
 95
 96
 97

98
Different circuits for ON and OFF retinal ganglion cells cause different contrast sensitivitiesThe Journal of Neuroscience 23:2645–2654.https://doi.org/10.1523/JNEUROSCI.230702645.2003
 99
Decision letter

Stephanie PalmerReviewing Editor; University of Chicago, United States

Ronald L CalabreseSenior Editor; Emory University, United States

Fred RiekeReviewer; University of Washington, United States
In the interests of transparency, eLife publishes the most substantive revision requests and the accompanying author responses.
Thank you for submitting your article "Measured perceptual nonlinearities show how ONOFF asymmetric processing improves motion estimation in natural scenes" for consideration by eLife. Your article has been reviewed by three peer reviewers, one of whom is a member of our Board of Reviewing Editors, and the evaluation has been overseen by Ronald Calabrese as the Senior Editor. The following individual involved in review of your submission has agreed to reveal their identity: Fred Rieke (Reviewer #2).
The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission.
Summary:
This paper takes a close look into the natural scene statistics that influence behavior. In particular, the work investigates the importance of second and thirdorder stimulus correlations on motion sensing in flies. Secondorder correlations have been emphasized in past work, and a few past studies have suggested that thirdorder correlations also contribute, but this work provides the first comprehensive treatment of the issue. The paper presents an analysis of fly behavioral responses to moving patterns based on rigid rotations of static natural scenes, either with fully natural statistics, or modified parametric stimuli that retain the skewness in the contrast distribution of natural scenes. Analysis of behavioral responses to these stimulus variants indicates that the fly uses third order computations effectively to counter some of the systematic (and wellknown) variations introduced by the secondorder computations prescribed by canonical motion detection models. Both the experiments and the analysis are presented clearly, and the authors make a convincing case for their conclusion that the thirdorder term improves velocity estimation in natural scenes. The computational strategies discovered here likely extend to other visual systems. The stimulus construction algorithm developed here could also be used for other model systems and analyses.
Essential revisions:
1) As pointed out in the Discussion section, the Volterra kernel approximation represents a general and systematic approach to the identification of nonlinear systems (with practical caveats, as also mentioned). But there is also a more fundamental issue: Volterra kernels are essentially high dimensional polynomial approximations, and as such, although they are general, they are extremely inefficient in describing compressive nonlinearities (and biological systems are essentially always compressive in their typical range of operation). Motion estimation by correlation is a secondorder Volterra operation, and therefore diverges quadratically. To counter that divergence, the system must somehow normalize the response to the contrast variance, such as in contrast normalization. And this is what seems to happen in biological motion detectors: It is well known that motion responses saturate as a function of contrast.
In this light, the finding that the thirdorder Volterra kernel improves velocity estimates can then be simply seen as a corollary of the fact that the thirdorder kernel is essentially a "fitting parameter" which is necessarily better than a fit to secondorder. Alternatively, as noted in the paper, the thirdorder behavior could be due to more or less fixed asymmetries in processing, or even due to nonlinear behavior in early vision. Something along these lines was proposed by Bulthoff and Goetz (Bulthoff and Goetz, 1979) to explain a motion illusion in human and fly vision (and that should be cited in this paper).
These points should be addressed fully in the text of the paper, both in the Results and in the Discussion.
2) Response predictions and quantification. The kernel approach estimates behavioral responses, but the predictions are then compared directly to image velocity. Implicit in this (at least not clearly stated in the text of the manuscript) is that the behavior is linearly related to the image velocity – or at least that the two can be swapped in the analysis. It would be good to clarify this issue somewhere around the second paragraph of the subsection “Thirdorder kernel improve velocity estimation in moving natural scenes” where this analysis is described. At present, the text could be misread to indicate that image velocity is being estimated directly.
3) In the Discussion section:
 The interpretation of the main result (that the thirdorder term improves velocity estimation in natural scenes) feels a bit narrow; the authors should remedy that in the Discussion section.
 Ties to neural data need to be made a bit more extensive and detailed. Is there evidence to suggest that the kinds of optimal temporal filters derived here are found in the fly brain? Perhaps suggest a few other places to look, beyond what's already discussed (measuring T4/T5 kernels).
 Comment a bit more on the limitations/advantages of considering only rigid rotations applied to static natural or naturalistic scene stats. Perhaps also comment on using walking versus flight, and note that flies spend most of their time walking.
 Comment a bit more on the potential behavioral relevance of the magnitude of the effect of the improvement from thirdorder filter input to secondorder processing.
 How prevalent is positive skewness in different natural scenes? Are there scenes/environments where one should expect more or less skewness? Would visual systems that operate in specialized niches be expected to have different nonlinearities in their motion processing as a result, or will all visual systems require some amount of compensation in their secondorder motion detectors?
Title revisions:
Replace "perceptual" with "behavioral"
Suggested stylistic revisions:
The papers dives into contrastpolarity sensitivity right away, before sufficiently motivating the research approach as a whole. It also seems like this feature might require a more intuitive figure to explain the effect to the uninitiated. Perhaps a few more words about the behavioral advantages of this level of careful motion dissection would be useful.
In the last paragraph of the Introduction, it would be good to broaden the perspective a bit on how filters are usually computed in behavioral and neural recording experiments, how this relates to other filterfinding techniques, and what the thirdorder kernels might reveal. For a broad audience, it might also be good to remind the reader what a Volterra expansion is and add a note about adding kernels to the expansion versus static nonlinearities, along the lines discussed above.
Here and there the paper reads a bit too much like a travel journal – a few examples:
 The paragraphs in the subsection “The structure of natural scenes induces noise in secondorder motion estimates” all have something like "We did something…Then we observed something, etc."
 The legend to Figure 3 is full of "we did this or that".
Consider revising the text to make the language just a bit more formal and deliberate in guiding the reader through the major scientific findings in the paper, by motivating and previewing the steps taken in this careful and thorough work.
There are a number of typos that need to be corrected, through a close reading of the text.
https://doi.org/10.7554/eLife.47579.sa1Author response
Essential revisions:
1) As pointed out in the Discussion section, the Volterra kernel approximation represents a general and systematic approach to the identification of nonlinear systems (with practical caveats, as also mentioned). But there is also a more fundamental issue: Volterra kernels are essentially high dimensional polynomial approximations, and as such, although they are general, they are extremely inefficient in describing compressive nonlinearities (and biological systems are essentially always compressive in their typical range of operation). Motion estimation by correlation is a secondorder Volterra operation, and therefore diverges quadratically. To counter that divergence, the system must somehow normalize the response to the contrast variance, such as in contrast normalization. And this is what seems to happen in biological motion detectors: It is well known that motion responses saturate as a function of contrast.
We agree that a major problem of quadratic motion estimators is that they are too responsive to large contrast stimuli. We also agree that contrast normalization is one popular approach to ameliorate this issue. We have added a citation to Simoncelli and Heeger to point out how contrast normalization has to been applied to visual motion processing.
Our current approach leverages a simple algorithmic description to summarize multiple incompletely known nonlinear processing mechanisms, and remains agnostic to their origins in the circuit. The fly community is currently dissecting the biophysics of visual circuits in significant detail, and it is now clear that many distinct nonlinear mechanisms contribute to visual motion computation. As such, we agree that we’re studying phenomena that could conceivably relate to contrast normalization, but we think it is premature to conclude that contrast normalization is the origin of the effects we observe here.
Furthermore, typical contrast normalization methods generate contrastsymmetric nonlinear processing, for instance by normalizing by the variance of the inputs or by passing the (symmetric) motion signal itself through a normalizing or saturating process. In comparison to these forms of contrastsymmetric processing, the thirdorder kernel captured the contrastasymmetric nonlinearity in the circuit, which is qualitatively different from standard contrast normalization.
We have addressed these issues through several changes in the manuscript. We made three changes to the Results section where the Volterra series expansion first appeared. First, we explicitly made the connection between the Volterra series and a polynomial expansion.
“Similar to the Taylor series from calculus, the Volterra series is a polynomial description of a nonlinearity, with a firstorder kernel that describes linear transformations, a secondorder kernel that captures quadratic terms, and higherorder kernels that combine to represent a wide variety of nonlinearities beyond the secondorder.”
We then mention that Volterra series might converge inefficiently to the true mechanistically accurate nonlinearity and mentioned that there are convenient alternatives to our approach. Updated text:
“However, many polynomial terms can be needed to describe some nonlinearities. For instance, the polynomial description of a compressive, saturating nonlinearity is inefficient, and it can be easier to describe such transformations using alternative nonlinear model architectures, such as linearnonlinear cascade models (Dayan et al., 2001).”
Finally, we pointed out that this expansion should be taken at the algorithmic, rather than the mechanistic level:
“We emphasize that the Volterra kernel description is explicitly algorithmic, as it aims to summarize the overall system processing without considering the mechanisms leading to this processing.”
This latter point permitted us to get into more details in the Discussion section. In particular, we added asymmetric contrast adaptation in the ONOFF pathway as a potential nonlinearity that could lead to lightdark asymmetric motion processing. The text there was updated as follows:
“Yet it remains unclear whether asymmetric responses of T4 and T5 are inherited from upstream neurons. For instance, contrast adaptation could differ between the two pathways (Chichilnisky and Kalmar, 2002).”
In this light, the finding that the thirdorder Volterra kernel improves velocity estimates can then be simply seen as a corollary of the fact that the thirdorder kernel is essentially a "fitting parameter" which is necessarily better than a fit to secondorder.
We do not think that this assertion is correct. It is true that one will better approximate a system by adding more fitting parameters, and we indeed better predicted optomotor turning responses with the added thirdorder kernel. However, we fitted a model to explain optomotor turning behavior, but we evaluated the model’s performance on a different task, i.e., velocity estimation. There is no guarantee that the system responsible for optomotor behavior is also optimized for velocity estimation, nor that including a thirdorder Volterra term in the fit would improve velocity estimates in natural scenes.
When we evaluated the model fitted to optomotor behavior with the task of velocity estimation in natural scenes, we had three implicit hypotheses. 1. We hypothesize that the visual system is optimized for velocity estimation. 2. We hypothesize that the turning response is a readout of that velocity estimate. 3. We hypothesize that the system is tailored specifically to the idiosyncrasies of natural scenes, rather than working for arbitrary image ensembles. The results are consistent with our hypotheses, and our finding that the thirdorder fit does improve natural scene performance is an exciting nontrivial result of this paper. It is not a corollary to fitting a model with more parameters.
We first added an introductory Figure 3 paragraph in the Results section to explicitly state our assumption that optomotor behavior is a proxy for the fly’s internal estimate of the image velocity:
“The kernels were fit to turning behavior, so the output of the model to moving visual stimuli is the predicted optomotor turning response. […] Using the fitted behavioral model, we could thus investigate how accurately the fly’s velocity estimate tracks the true image velocity.”
We then elaborated our results of improving velocity estimation with the thirdorder kernel in the Discussion section:
“In this study, we first fit a Volterra series expansion to model the fly’s turning behavior in response to binary stochastic stimuli, and both second and thirdorder terms in the Volterra series contributed to the turning behavior. […] Therefore, these results can be taken together to motivate the hypothesis that the magnitude of the fly’s turning response is determined by an internal estimate of velocity.”
Alternatively, as noted in the paper, the thirdorder behavior could be due to more or less fixed asymmetries in processing, or even due to nonlinear behavior in early vision. Something along these lines was proposed by Bulthoff and Goetz (Bulthoff and Goetz, 1979) to explain a motion illusion in human and fly vision (and that should be cited in this paper).
We agree that it’s interesting to ask how nonlinear circuits could develop sensitivity to thirdorder motion cues. In previous work, we have investigated a variety of circuit nonlinearities that could give rise to thirdorder motion processing (Fitzgerald and Clark, 2015), including contrast asymmetric frontend nonlinearities of the type suggested by Bulthoff and Goetz. Thank you for pointing us to this reference. We have now cited it. Our text around this issue now reads:
“Previous work has suggested that frontend nonlinearities could account for certain optomotor illusions in flies (Bülthoff and Götz, 1979), and it is conceivable that such nonlinearities could generate contrast asymmetric motion responses (Clark and Demb, 2016; Fitzgerald et al., 2011). […] Alternatively, nonlinear processing at the level of directionselective T4 and T5 neurons could also generate the asymmetries we observed here.”
These points should be addressed fully in the text of the paper, both in the Results and in the Discussion.
We hope that you will find that these modifications have improved the paper and have fully addressed the concerns of the reviewers.
2) Response predictions and quantification. The kernel approach estimates behavioral responses, but the predictions are then compared directly to image velocity. Implicit in this (at least not clearly stated in the text of the manuscript) is that the behavior is linearly related to the image velocity – or at least that the two can be swapped in the analysis. It would be good to clarify this issue somewhere around the second paragraph of the subsection “Thirdorder kernel improve velocity estimation in moving natural scenes” where this analysis is described. At present, the text could be misread to indicate that image velocity is being estimated directly.
Thank you for pointing out that this issue was unclear in our manuscript. First, in line with comment 1, we extended the introduction to Figure 3 to state explicitly that our model fitted behavioral responses but predicted image velocity. Second, we have added a few sentences to clarify the role of the linear assumption in our chosen accuracy metric. In particular, before presenting the results of Figure 3C, we now say:
“We quantified how well the model’s responses predicted the image velocity using the Pearson correlation coefficient. This metric supposes that the model response and image velocity are linearly related, and its value summarizes intuitively the meansquarederror of the best linear fit between the model’s output and the image velocity.”
3) In the Discussion section:
 The interpretation of the main result (that the thirdorder term improves velocity estimation in natural scenes) feels a bit narrow; the authors should remedy that in the Discussion section.
Thank you for these suggestions. To address this comment, we first added a summary Discussion paragraph to broaden our interpretation of the main result:
“In this study, we first fit a Volterra series expansion to model the fly’s turning behavior in response to binary stochastic stimuli, and both second and thirdorder terms in the Volterra series contributed to the turning behavior. […] Since skewed scenes are prevalent across natural environments, and many visual systems exhibit ONOFF asymmetric visual processing (Chichilnisky and Kalmar, 2002; Jin et al., 2011; Mazade et al., 2019; Pandarinath et al., 2010; Ratliff et al., 2010; Zaghloul et al., 2003), many animals are likely to use similar strategies for motion perception.”
We also related our work to other fields. We first discussed how the methodology of Volterra series expansion is used in other fields of neuroscience:
“Polynomial approximations to complex nonlinear systems have also been useful in other domains of neuroscience. […] These motifs might relate to measurable properties of the neocortex (Song et al., 2005).”
We then discussed how ONOFF asymmetric processing in other visual systems is connected to natural scenes statistics:
“This resonates with previous work showing that many visual systems process ON and OFF signals asymmetrically to improve other aspects of visual processing (Kremkow et al., 2014; Mazade et al., 2019; Pandarinath et al., 2010; Ratliff et al., 2010). […] ONOFF asymmetric visual processing also varies in other ways, and there is evidence that contrast adaptation in ON and OFF pathways is different between primate and salamander retinas (Chander and Chichilnisky, 2001).”
In the end, we now discuss why modeling behavior at algorithmic level is beneficial:
“David Marr famously asserted that neural computation needs to be understood at both the algorithmic and implementational levels (Marr and Poggio, 1976). […] Instead of algorithm and mechanism providing parallel or hierarchical goals, they should be treated as parts of one integrated understanding of the circuit.”
 Ties to neural data need to be made a bit more extensive and detailed. Is there evidence to suggest that the kinds of optimal temporal filters derived here are found in the fly brain? Perhaps suggest a few other places to look, beyond what's already discussed (measuring T4/T5 kernels).
To tie the measured behavioral kernel to neural data, we have added a Discussion section on potential mechanisms for the measured asymmetries. We first point out that asymmetric frontend filtering could, but would not necessarily, result in the measured thirdorder kernel, then discuss how asymmetric processing could originate upstream of T4/T5, in T4/T5, or differential weighting of T4 and T5 outputs downstream.
“Indeed, differentially affecting T4 and T5 activity, either through direct silencing or by manipulating upstream neurons, alters the behavioral responses of flies to triplet correlations (Clark et al., 2014; Leonhardt et al., 2016), and parallel experiments in humans similarly find that contrastasymmetric responses are mediated by neurons separately modulated by moving ON and OFF edges (Clark et al., 2014). […] By measuring behavior and distilling the abstract algorithmic properties of the system, we will be able to constrain the contributions of individual circuit components without confining ourselves to an overly narrow class of mechanistic models.”
 Comment a bit more on the limitations/advantages of considering only rigid rotations applied to static natural or naturalistic scene stats. Perhaps also comment on using walking versus flight, and note that flies spend most of their time walking.
Thank you for this suggestion. We now comment on the limitations and advantages of these rigid rotations when we introduce them in Figure 1:
“This rigid translation of images mimics the motion produced by an animal’s pure rotation, during which visual objects all move at the same rotational velocity and occlusion does not change over time. Real motion through an environment generates more complex signals than this, but rigid translations are straightforward to compute and rotational visual stimuli are known to induce the rotational optomotor response that we focus on in this manuscript.”
We also note at the outset that Drosophila spends a large fraction of its life walking:
“Flies spend a large portion of their lives standing and walking on surfaces, making walking optomotor responses ethologically critical (Carey et al., 2006).”
 Comment a bit more on the potential behavioral relevance of the magnitude of the effect of the improvement from thirdorder filter input to secondorder processing.
We added a Discussion section to address this suggestion. We started by comparing the behavioral magnitude in thirdorder glider stimuli and the computational benefits of the thirdorder kernel, we then compared the magnitude of the improvement added by the thirdorder kernel with other mechanisms such as spatial averaging, and we finally pointed out that the thirdorder kernel may underestimate the full improvement provided by its underlying mechanisms. This new section reads as follows:
“The HRC has explained a large number of phenomena and neural responses, and it is reasonable to ask how much we’ve gained by extending its secondorder algorithmic description of fly behavior to a thirdorder one. […] It will be interesting to investigate whether mechanistically accurate models that explain the origin of the thirdorder kernel also reveal larger performance improvements.”
 How prevalent is positive skewness in different natural scenes? Are there scenes/environments where one should expect more or less skewness? Would visual systems that operate in specialized niches be expected to have different nonlinearities in their motion processing as a result, or will all visual systems require some amount of compensation in their secondorder motion detectors?
Thank you for these interesting questions. There are studies showing that contrast adaptation in ON and OFF pathways in primates differs from those pathways in salamanders; that could be the result of different visual environments. We now added a section that discusses natural scene statistics in different habitats, and how natural environments would combine with early processing to shape the strategy of motion computation. This new section reads as follows:
“Adaptations to differences in both habitat and early sensory processing could potentially explain these divergences. […] These possibilities are not mutually exclusive, and in both cases, the early visual processing must work in concert with the downstream motion detectors to form robust and consistent perceptions.”
Title revisions:
Replace "perceptual" with "behavioral"
Thank you for this suggestion. We had many discussions about how well ‘perceptual’ fit here. We have now revised our title to read: “Asymmetric ONOFF processing of visual motion cancels variability induced by the structure of natural scenes”.
Suggested stylistic revisions:
The papers dives into contrastpolarity sensitivity right away, before sufficiently motivating the research approach as a whole. It also seems like this feature might require a more intuitive figure to explain the effect to the uninitiated. Perhaps a few more words about the behavioral advantages of this level of careful motion dissection would be useful.
Thank you for this suggestion. We now spend time in the Introduction motivating the systems analysis and studying motion computation in terms of spatiotemporal correlations before we introduce contrastpolarity sensitivity:
“Higherorder correlations could also contribute to motion computation, and Bayes optimal visual motion estimators can be written as a sum of terms specialized for detecting different correlation types (Poggio and Reichardt, 1973; Potters and Bialek, 1994).[…] This sensitivity to triplet spatiotemporal correlations shows that motion perception incorporates cues neglected by canonical motion detectors.”
In the last paragraph of the Introduction, it would be good to broaden the perspective a bit on how filters are usually computed in behavioral and neural recording experiments, how this relates to other filterfinding techniques, and what the thirdorder kernels might reveal. For a broad audience, it might also be good to remind the reader what a Volterra expansion is and add a note about adding kernels to the expansion versus static nonlinearities, along the lines discussed above.
Thank you for this suggestion. In line with previous comments, we now added a brief introduction to the Volterra expansions in the Introduction Section. In line with Essential Revisions 1, in the Results section, we added more detailed explanation for Volterra series expansion, and relate it to other modeling techniques for nonlinear system. We also added prose to motivate why we choose Volterra expansion in the study and what the thirdorder kernel might reveal:
“Volterra kernels are useful for studying visual motion processing because they allow us to rigorously group response properties by their order (Fitzgerald et al., 2011; Potters and Bialek, 1994), thereby permitting us to clearly describe both canonical and contrast polaritydependent components of the behavior. […] The thirdorder kernel directly measures sensitivities to triplet spatiotemporal correlations and probes ON/OFF asymmetries in motion processing.”
Here and there the paper reads a bit too much like a travel journal – a few examples:
 The paragraphs in the subsection “The structure of natural scenes induces noise in secondorder motion estimates” all have something like "We did something…Then we observed something, etc."
Thank you. We have revised to add motivation throughout.
 The legend to Figure 3 is full of "we did this or that".
The legend has been updated to avoid this. Thank you.
Consider revising the text to make the language just a bit more formal and deliberate in guiding the reader through the major scientific findings in the paper, by motivating and previewing the steps taken in this careful and thorough work.
We have revised the text throughout to eliminate some of this informality and to better motivate the analyses we performed.
There are a number of typos that need to be corrected, through a close reading of the text.
As we revised and reread the manuscript, we have corrected typos we found. We hope we caught them all.
https://doi.org/10.7554/eLife.47579.sa2Article and author information
Author details
Funding
National Institutes of Health (R01EY026555)
 Juyue Chen
 Damon A Clark
National Science Foundation (IOS1558103)
 Juyue Chen
 Damon A Clark
Chicago Community Trust (Searle Scholar Award)
 Holly B Mandel
 Damon A Clark
Howard Hughes Medical Institute
 James E Fitzgerald
Alfred P. Sloan Foundation (Research Fellowship)
 Damon A Clark
The Swartz Foundation
 James E Fitzgerald
The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.
Acknowledgements
The authors thanks Ann Hermundstad and Emily Cooper for comments on the manuscript. JEF was supported by the Swartz Foundation and the Howard Hughes Medical Institute. DAC and this research were supported by NIH R01EY026555, NIH P30EY026878, NSF IOS1558103, a Searle Scholar Award, a Sloan Fellowship in Neuroscience, the Smith Family Foundation, and the E Matilda Ziegler Foundation.
Senior Editor
 Ronald L Calabrese, Emory University, United States
Reviewing Editor
 Stephanie Palmer, University of Chicago, United States
Reviewer
 Fred Rieke, University of Washington, United States
Publication history
 Received: April 10, 2019
 Accepted: October 12, 2019
 Accepted Manuscript published: October 15, 2019 (version 1)
 Version of Record published: November 29, 2019 (version 2)
 Version of Record updated: December 5, 2019 (version 3)
Copyright
© 2019, Chen et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.
Metrics

 1,288
 Page views

 234
 Downloads

 5
 Citations
Article citation count generated by polling the highest count across the following sources: Crossref, PubMed Central, Scopus.